For last week's article about server stability, I asked some questions of Andrew Mann, an old friend whose job it is (well, among many of his jobs) to keep the servers running at a major MMO studio. Of course, he gave me so much information that our chat became a feature of its own. I must state up front that the material in quotes is from Andrew. The material not in quotes consists of my own thoughts and observations. You'd think that would go without saying, wouldn't you? Ha! But that's a different topic entirely. Read on for an old insider's look at stability bugs and more.
The first thing I asked him was to give me a little behind-the-scenes peek at how stability problems get solved.
He said, "Stability problems start out as bugs, but turn into prioritization. When you find a bug that affects gameplay, you look at the level of impact on the players - how often, how many players, and how badly their experience is impacted. You compare that against an educated guess by the engineers as to how long it'll take to fix.
"That tells you how much of your engineering resources you can afford to dedicate to the issue. There's an upper limit as well. Unlike development, where you can develop multiple features in parallel, you can only really have one team working on a bug. If you put too many people on the bug, they won't be able to coordinate their efforts, and you'll have a lot of wasted work."
And speaking of debugging: "There also aren't a lot of people in the world that are good at debugging. The more complex the system gets, the harder it is to find someone that can figure out bugs in it. And if you can't reproduce the bug in your test environment, it's even worse. You have to work on a guess about what could have caused the varied reports of symptoms. For example, before the region crashed, all the monsters went crazy. Every time the region crashed, Bob was just about to kill the third head of the hydra with his spectral pet that was buffed from a dead player. Etc."
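For readers who think better in code, here is a minimal sketch of that kind of guesswork: tally the conditions players mention across crash reports and see which ones keep showing up. Everything here is my own assumption for illustration. The tag names, the `common_factors` helper, and the 80% cutoff are hypothetical, and this is not Andrew's studio's tooling.

```python
from collections import Counter

def common_factors(reports, min_share=0.8):
    """Return the tags that appear in at least min_share of the reports."""
    if not reports:
        return []
    # Count each tag once per report, however often it was mentioned.
    counts = Counter(tag for report in reports for tag in set(report))
    threshold = min_share * len(reports)
    return sorted(tag for tag, n in counts.items() if n >= threshold)

# Hypothetical player reports, each reduced to a set of observed conditions.
reports = [
    {"hydra_fight", "spectral_pet", "pet_buffed_by_dead_player", "monsters_erratic"},
    {"hydra_fight", "spectral_pet", "pet_buffed_by_dead_player"},
    {"hydra_fight", "spectral_pet", "pet_buffed_by_dead_player", "full_raid"},
    {"monsters_erratic", "spectral_pet", "hydra_fight"},
]

suspects = common_factors(reports)
```

A shared condition is only a lead, not a cause. It tells the engineer where to start trying to reproduce the crash.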
Because these bugs can be so obscure, finding them is insanely difficult. A brand-new MMO has millions of lines of code, and programmers in crunch mode aren't exactly famous for their documentation. Finding these bugs in games with a few years under their collective belts is even harder. Every new programmer to pass through the company leaves his own stamp on things. Every programmer who leaves the company takes his knowledge with him, and that includes his knowledge of how he did what and why. So, at what point is an MMO such a mass of spaghetti code that it's a miracle it runs at all?
Andrew is too modest to claim to have an answer that applies to the entire MMO industry, but he did say he was starting to think the answer relates to the number of coders who have passed through the company.
"Programmers always think most other programmers' code is ugly," he said. "Code is written in the way we think - programmers as a profession are like a crowd of self-taught writers. Courses focus on the vocabulary and sentence structure, but consider style as something not worth teaching or considering. So when new programmers come in, they look at what everyone else wrote, and they decide that they don't like most of it. It doesn't make any sense to them, it's hard to read, etc. And they either isolate their work from it, or rewrite it."
One solution is management: "If you have a consistent lead programmer that's basically forcing people to write code in a certain way or leave the company, then your code at least looks consistent. That means that someone new coming in can either say 'Wow, I really hate this,' or 'I can deal with this.'"
Professional management in a young industry is a challenge, of course, but it's doable. If the company is plagued by heavy turnover among the leads (for reasons both the studio's fault and not), though, no consistent standard ever takes hold. Maximum effectiveness calls for the same lead to run the show.
He points out: "When you have five lead programmers in three years, it's not so clear at the end. Almost everyone hates some things - different things - and almost everyone is okay with some things."
One thing everyone is okay with is making the game stable. But making repairs isn't a matter of unplugging a widget from a socket and replacing it with a new widget. It's more like repairing a smashed rear fender on a car, for a customer who just wants to get back out on the road ten minutes ago.
"[Problem solving is] also a matter of intelligently deciding when to hack a solution in, and when to redo part of the design. I used to think that all engineers really wanted to *redo* the design when structural bugs came up, and they only grudgingly backed down when they realized that the effort required is unrealistic. I've met some engineers, though, that seem to prefer to hack around everything. So the code they write tends to have hack on top of hack - which can end up slowing development as much as rewriting things constantly."
I can smell the comments from the peanut gallery in response to that one. Let me attempt to forestall them by saying that an MMO studio tends to judge its employees primarily by results. If a solution to a problem with the live game is produced quickly (especially if the problem is preventing the paying customers from playing, and they're raising hell over the delay), the only question will be "does it work?" If the answer is yes, the employee gets approval. Sure, if there's time to pull out the dent, fill it in with Bondo, sand, repaint, wax, and cure, any decent programmer will want to. But even the best often have to settle for yanking the metal away from the tire and spray-painting the exposed metal. It makes more work down the road, but it does get the car on that road.