Additional Instances Cannot Be Launched: A Technical Perspective

Fair warning: I’m about to go full geek on you guys here. I don’t work for Blizzard; otherwise I don’t think I’d be able to make a post like this. Not that I’m giving away anything here – this is all highly speculative – but I can happily say that when I’m not writing, I work on the management team of an IT department at a multi-million dollar company. We’re not huge, but because of the nature of what we do, we make an incredible investment in technology: both in new, bleeding-edge tech that really hasn’t been vetted yet (and yes, we sometimes pay for it in man-hours and product failures) and in discovering that a lot of the technology on the market simply isn’t up to the task of the work we do. Our datacenter easily has over a petabyte of data at rest, and servers with four quad-core processors and 128GB of RAM in them are a normal buy for us…and we buy them so often because our production jobs on those boxes still take days to complete.

Now we’re heavy on processing data, crunching information, re-arranging bits and bytes. We’re no Blizzard, with its round-the-clock uptime requirements (although we don’t get a weekly all-day maintenance window like they do) and 11 million customers banging away at our applications and hardware every day. But I do know a little something about capacity planning, and I do know a little something about resource management, and I think I can throw Blizzard a bone here on the whole “Additional Instances Cannot Be Launched” issue as well as help players understand why Blizzard is being so hush-hush on the nature of the problem.

The “Additional Instances” problem has resurfaced in a big way recently, with the new patch and new content pressing players to go back and experience older instances, and across a number of different WoW-related blogs and podcasts I’ve heard near-constant grumbling about it. Over on RawrCast, I’ve heard Stompalina and Haf complain about it several times. On the WoW Insider Show (the podcast of WoW.com), I heard Michael Sacco (who used to work for Blizzard and is under NDA) explain that Blizzard is taking it seriously and doing some cool stuff to try and fix it. That got me thinking: “What kind of cool stuff is my company doing to optimize our datacenter? Maybe we’re similar to Blizzard?”

That’s not to say there isn’t something to complain about. It’s downright frustrating not being able to play the content you pay to play. At its heart, I think it’s a hardware problem, and here’s what I can infer (leaving open the possibility, by the way, that I’m completely wrong).

Each “instance” is essentially a virtual environment dedicated to the players inside it. The world acts and reacts only to their actions, and it’s isolated from the outside world aside from certain signals like chat. This means it’s highly likely that each time a group runs into an instance, a Blizzard VMware server cluster somewhere (if that’s even the virtualization product they’re using – it could be Xen or Citrix or anything else) spins up a virtual machine (VM) specifically for that group of players. It runs in memory because it’s dedicated to them and doesn’t need a ton of resources, and because it isn’t meant to be persistent, it’s likely never written to disk.
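
To make that concrete, here’s a minimal sketch of what such an orchestration layer could look like. To be clear, this is purely illustrative: InstanceFarm, launch_instance(), and the capacity cap are names and numbers I’ve invented, not anything from Blizzard.

```python
# Purely hypothetical sketch of a per-group instance farm. Every name here
# (InstanceFarm, launch_instance, max_running_vms) is invented for illustration.

class AdditionalInstancesCannotBeLaunched(Exception):
    """Raised when the farm has no capacity left for another instance VM."""

class InstanceFarm:
    def __init__(self, max_running_vms):
        self.max_running_vms = max_running_vms  # assumed hard capacity cap
        self.running = {}  # group_id -> in-memory instance state

    def launch_instance(self, group_id, dungeon):
        # Each group gets its own isolated, in-memory environment.
        if len(self.running) >= self.max_running_vms:
            raise AdditionalInstancesCannotBeLaunched(dungeon)
        vm = {"dungeon": dungeon, "bosses_down": [], "raid_id": None}
        self.running[group_id] = vm
        return vm
```

From the player’s point of view, that exception is the error message this whole post is about.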

That is, until the players down a boss and are saved to the instance. Then that version of the VM (the instance cleared up to and including that boss) is saved off to disk, and the players get a unique identifier for that instance (their raid ID) that records where they left off the last time they were in there. This is how instances lock you out and remember where you were if you leave and come back. The tie-in to boss kills also explains why the trash comes back if you leave without downing the boss that “owns” it and return later. The VM keeps some state in memory, and after a certain idle period it saves your ID along with the state of your VM off to disk and shuts itself down, freeing up memory to launch additional instances. And while the VM is running in memory, it’s important to note that it behaves just like a physical server – it’s transparent to users like you and me that it’s virtual.
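
Continuing the same invented sketch, the save-on-boss-kill and idle-shutdown behavior might look something like this. Again, the file layout, timeout, and function names are my assumptions, not Blizzard’s actual design.

```python
import json
import os
import time
import uuid

SAVE_DIR = "/tmp/instance_saves"     # assumed checkpoint location, invented
IDLE_TIMEOUT_SECONDS = 30 * 60       # assumed idle window before shutdown
os.makedirs(SAVE_DIR, exist_ok=True)

def save_on_boss_kill(farm, group_id, boss):
    # Checkpoint the VM state to disk the moment a boss dies.
    vm = farm.running[group_id]
    vm["bosses_down"].append(boss)   # trash "owned" by this boss stays cleared
    if vm["raid_id"] is None:
        vm["raid_id"] = uuid.uuid4().hex  # the players' unique raid ID
    with open(os.path.join(SAVE_DIR, vm["raid_id"] + ".json"), "w") as f:
        json.dump(vm, f)             # saved up to and including this boss
    return vm["raid_id"]

def shut_down_idle(farm, last_activity):
    # Idle VMs keep only their persisted state; the in-memory copy is
    # dropped to free capacity for new launches.
    now = time.time()
    for group_id, ts in list(last_activity.items()):
        if now - ts > IDLE_TIMEOUT_SECONDS and group_id in farm.running:
            del farm.running[group_id]
```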

All those rules around how instances behave, how they’re saved, how they lock you out so you can’t just keep running them over and over (aside from normals, which are presumably light enough that people can run them frequently and free up resources quickly), and why Blizzard is willing to give you LONGER lockouts but very rarely shorter ones, and almost never removes them? Those rules have just as much to do with the infrastructure that manages those instances as they do with the gameplay.

It’s likely that even the realms are virtualized or clustered somehow – frankly, it’s unlikely that each realm is a single server, especially considering how continents can crash without bringing a whole realm down. Most players have seen times when “Outland goes down” and every guildmate who happens to be out there is suddenly disconnected. Realms themselves are likely server clusters, each machine responsible for its own part of the world, working in close conjunction. The instances are likely virtual server farms either running on that hardware or hanging off of it somehow.
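
As a toy illustration of that guess (the zone-to-node layout below is entirely made up), a realm cluster might map zones to nodes something like this:

```python
# Invented layout: one cluster node per chunk of the world. If node-03 dies,
# only the players standing in Outland drop, and the realm stays up.
REALM_CLUSTER = {
    "eastern_kingdoms": "node-01",
    "kalimdor": "node-02",
    "outland": "node-03",
    "northrend": "node-04",
}

def disconnected_players(failed_node, player_zones):
    # player_zones: player name -> zone key from REALM_CLUSTER
    return [p for p, zone in player_zones.items()
            if REALM_CLUSTER[zone] == failed_node]
```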

So now we have a rough map of the hardware behind a “realm”: likely a cluster of “world servers,” plus either another cluster or a link to a separate farm of “instance servers” that’s probably a high-performance VM farm. With that, we can dig into why players keep running into problems launching additional instances.

Blizzard’s first priority is naturally their world servers. If queues form or the world servers start performing poorly, they likely upgrade those first to allow for extra capacity and more users connecting to them. If the VM farm that makes up the instances starts to have capacity problems – hello, “Additional Instances Cannot Be Launched” – you can add more servers to meet that demand as well, but a few limiting factors make tossing more hardware at a virtualized farm more difficult. Depending on what virtualization tool you use, you have to be picky about the hardware you add: if you want redundancy and failover capability, you need to add like hardware, and you always need to leave enough spare resources in the farm that if an element of it goes down, you can duplicate it and bring it back up on the available resources quickly.

In the IT world, this usually means running your farm in an “n+1 configuration”: for every “n” physical servers doing work in your farm, you have “+1” standing by at the ready for failover, in case one of those “n” takes a dive. So if I have a farm consisting of five really beefy physical servers, each capable of hosting hundreds of instances, I’m only really using four of them, because the fifth is standing by to help me recover in case one goes down.
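
The capacity math is simple but worth spelling out with made-up numbers:

```python
# n+1 arithmetic with invented figures: five hosts, ~200 instance VMs each.
hosts = 5
instances_per_host = 200              # assumed per-host density
usable_hosts = hosts - 1              # one host held back for failover
print(usable_hosts * instances_per_host)  # 800 usable instances, not 1,000
```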

Now, pretend you’re Blizzard, and you bought that hardware perhaps three to five years ago. In server years, that’s an eternity. Your back-end hardware is likely end-of-life according to your manufacturers, and maintenance costs on it are probably mounting. (At my company, when we buy a server, we buy a 3-year maintenance agreement that gives us parts and service from the manufacturer. After that, the same level of maintenance costs more, and the cost mounts each year we renew it, usually going up at least 20% per year!) That old hardware is also taking a toll on your technical operations folks: it’s slow to patch, underpowered for new tools, and it under-performs with each new revision and update to the game. So with each passing patch and expansion, Blizzard needs to squeeze more performance, more graphics, and more optimized code out of their developers, and then run it all on hardware that’s aging.
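
To put numbers on that renewal spiral (the starting figure is invented; the 20% escalation is the rate I quoted from my own shop):

```python
# Maintenance renewal cost after the bundled 3-year term, rising ~20%/year.
cost = 10_000.0                        # invented year-3 contract price
for year in range(4, 8):
    cost *= 1.20
    print(f"year {year}: ${cost:,.0f}")
# year 4: $12,000 ... year 7: $20,736 – more than double in four renewals
```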

The hardware that makes up the realms has already been upgraded over time – we’ve had some pretty major downtime on the realms in the past (I think everyone remembers) when Blizzard has done sweeping upgrades to the infrastructure. But what about the instances? I think the whole “Additional Instances Cannot Be Launched” error is an indicator that Blizzard’s virtual server farm is starting to get old, and possibly even running on older virtualization software. They can’t just toss on more servers, because anything they buy now will be mismatched against what’s already in the farm. Adding one server now may be the equivalent of adding three servers a year ago, which sounds like a good thing, but it’s a manageability and redundancy nightmare – assuming mixed hardware is even supported by their virtualization software, whatever they’re using.

The game offers more instances now than it ever has before, and it has more players trying to get into those instances than ever before. So what is Blizzard doing to fix it that Sacco would describe as “really cool shit”?

My speculation: a next-generation virtual server farm, likely running new virtualization software on brand-new hardware – most likely high-performance blade servers. Even minor players in the blade server market make blades – I mean servers the size of your Mac Mini – that sport Intel i7-based quad-core processors and up to 144GB of RAM. If they wanted to go AMD (which they probably don’t, considering the i7 beats the hell out of the newest 6-core Opterons, and besides, virtualization is more RAM-dependent than CPU-dependent), they could get beefy blades with four 6-core chips (that’s 24 cores!) and 256GB of RAM in them. Slap 8 of those AMDs or 12 of those Intels in a blade chassis and you’re kicking ass and taking names: performing drastically better than the existing environment and likely reducing costs in datacenter space, power, cooling, and physical stand-alone servers – if stand-alone servers were Blizzard’s strategy before; they could already be using blade technology and simply be planning an upgrade.
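
Chassis-level totals for those two hypothetical builds (specs as quoted above, not vendor-confirmed):

```python
# Totals per fully loaded chassis, using the figures from the paragraph above.
intel_blade = {"cores": 4, "ram_gb": 144}    # quad-core i7 blade
amd_blade = {"cores": 24, "ram_gb": 256}     # four 6-core Opterons

for name, blade, per_chassis in [("Intel", intel_blade, 12),
                                 ("AMD", amd_blade, 8)]:
    print(f"{name}: {per_chassis * blade['cores']} cores, "
          f"{per_chassis * blade['ram_gb']} GB RAM per chassis")
# Intel: 48 cores, 1728 GB RAM; AMD: 192 cores, 2048 GB RAM
```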

Blizzard probably did this already with the realms – remember that massive downtime they took about a year ago that seemed to drag on perpetually longer than planned? They thought they would have it done Tuesday and Wednesday, but somehow it wasn’t until the weekend that the realms were stable. That maintenance had all the markers of a physical-to-virtual migration: file copy operations that took forever, and P2V migrations that weren’t quite perfect and required intermittent restarts and hotfixes (likely applying optimized game code, even drivers and firmware for the new hardware – things that would require a quick reboot but nothing more).

If you go by the Guild Wars model, where players host “instanced worlds” when they group up with other players (or snag a bunch of NPCs) and decide to go hunting, you can assume that the average “instance” requires minimal resources – perhaps the equivalent of a reasonably powered desktop computer: a single-core processor (dual if you’re lucky) and a gig or two of RAM. (Remember, even in Guild Wars, hosting the instance for your party didn’t completely eat your machine; it just took a bite out of it.) Now think of how many fractions of processors and single gigs of RAM you could carve out of an 8-core Intel i7 with 144GB of RAM, versus a five-year-old 4-core (two dual-cores) Opteron box that maxed out at 32GB of RAM. If I’m right and Blizzard’s problem is hardware-related, they have a pretty hot solution lined up, and because of the size, scale, and amount of testing needed to pull it off, it would absolutely qualify as “cool shit” – and it would take a LOT of time, energy, and testing to put into place.
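
Run the density numbers from that comparison (the per-instance footprint is my guess from the figures above, and it assumes RAM, not CPU, is the binding constraint):

```python
# Instances per host if RAM is the limiting resource, using invented needs.
ram_per_instance_gb = 2                 # guessed per-instance footprint
old_host_ram_gb = 32                    # five-year-old Opteron box, maxed out
new_host_ram_gb = 144                   # hypothetical i7 blade

print(old_host_ram_gb // ram_per_instance_gb)   # ~16 instances per old host
print(new_host_ram_gb // ram_per_instance_gb)   # ~72 instances per new host
```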

All this, and I haven’t even touched the software side of things, which is obviously already scalable given how much the game has grown over its lifespan. A similar capacity problem could very well have been the root of the authentication issues we had a few months ago, and since the authentication servers are likely an easier pool to manage, an upgrade of this type to their capacity may already have resolved it.

So the next time you’re really irritated by the additional instances error, try to remember a little of my theory and tell it to your guildmates. We players never really get a glimpse of what’s running behind the scenes in the datacenters and co-los that Blizzard likely uses around the world to host its servers, but I have a feeling this is what they’re looking at doing for their instances. After all, it’s what my company is thinking about doing to expand and improve its own virtualization efforts. Virtualization is seriously the next big initiative for enterprise IT around the world, both because it offers cost and manageability savings and because it’s dramatically friendlier on power and cooling. For Blizzard, which likely spends millions every year on technology maintenance, repairs, bandwidth, parts, and services, it could represent a huge boon.

2 Comments so far

  1. Killer (unregistered) on August 28th, 2009 @ 1:56 pm

    I screenshotted my 1001 attempts at getting into Scholomance last night. That was between 6-9pm server time. Night before last I tried 525 times and gave up at 1:00am server time. I came back to Scholomance at midnight last night and got right in, then went to Stratholme and got right into that at 1:00am.

    The late runs are killing me; my butt is dragging at work. It’s so frustrating that I have to play at 1am to actually get into an instance.

    Blizzard may have aging software and hardware that breaks/overloads but one thing that seems to always work is their billing/payment center. How about that.


  2. Tracey (unregistered) on August 30th, 2009 @ 4:41 am

    “Blizzard may have aging software and hardware that breaks/overloads but one thing that seems to always work is their billing/payment center. How about that.”

    Amen


