Friday with Means - June 26th, 2009 - The dead awaken and consume coffee.
This update was the perfect time to release a /facepalm animation. Most of the team is slowly recovering from being awake for almost 48 hours straight.
There were a great many things in the update that we thought might destroy the universe.
A new client RDB for everyone that allows us to add more files and process faster.
Instanced cities for everyone.
Instanced playfields finally being displayed with the correct lighting and effects they were always meant to have.
Messing with the City Advantages code.
New apartments accessed in a new way.
New Hostile Target nano window....
Any of these could have done it...just as easily, in the past, it could have been the social window...the mechs...the battlestations...or the raid system...or the new instancing system itself. So many opportunities to break the entire game...and force the rollback we have managed to avoid for about 6 years...since 12.6.
What actually did it?
The new system for cooldowns for nanos. In the past, if we wanted an action to be unavailable for use (beyond the recharge/global cooldown of everything), we were forced to depend on silly blocker nanos..."You cannot cast this while XXXXX is running." The new cooldowns meant we could dispense with the silly blocker nanos and properly display (on the shortcut bar even!) the amount of time until these actions were available again. This is where things went "bad".
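For anyone curious what "a cooldown instead of a blocker nano" looks like in principle, here is a minimal sketch. It is not the actual AO code; `CooldownTracker`, its method names and the injectable clock are all made up for illustration. The idea is simply to store the time at which each action becomes usable again, which also gives you the remaining time to show on a shortcut bar.

```python
import time


class CooldownTracker:
    """Hypothetical per-character cooldown store (illustration only)."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the behaviour can be checked without waiting.
        self._clock = clock
        # Maps an action name to the clock time at which it is usable again.
        self._ready_at = {}

    def start(self, action, duration):
        """Put an action on cooldown for `duration` seconds from now."""
        self._ready_at[action] = self._clock() + duration

    def remaining(self, action):
        """Seconds until the action is usable again (0.0 if it is ready)."""
        if action not in self._ready_at:
            return 0.0
        return max(0.0, self._ready_at[action] - self._clock())

    def is_ready(self, action):
        return self.remaining(action) == 0.0
```

With something like this, `remaining("some_nano")` is the number a UI could display, and no extra "blocker" effect has to be running on the character.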
"Crashing" or exiting the game very rapidly would break these cooldowns and corrupt your character information. Anything contained in your character after this cooldown information was corrupted...this resulted in your research and perks effectively vanishing. Importantly your "Save" position also became corrupt...anyone who died in this state would then crash the entire server...not surprisingly breaking other characters on the same server who were also in this cooldown state. One broken character would then create more crashes...then more broken characters as a result...and from there it began to spread across all dimensions like a virus. Doctors had the longest cooldown of this kind in common use...so they were immediately the first players who were susceptible to this issue...and in the end what allowed us to locate the problem and eliminate it in the update we released at 5am on Thursday. The sheer number of characters affected by this issue left us no choice but to roll back the entire day...with no readily available way of detecting this kind of corruption in the database it was far preferable to months of crashes as these characters appeared on the servers and hurt them again....and other characters by crash issues that will always persist (ie: lost items due to server crashes).
I am very sorry it took as long as it did to locate the issue...all the posts from the community were instrumental in helping us locate it as quickly as we did...the sheer horrible random nature of this issue (in terms of the symptoms of vanishing perks and research) made it very hard to identify. We are very lucky to have the talented and dedicated coders, designers and operational support teams that we have; otherwise this issue could have been plaguing us for a much longer time.
The population on Test was also incredibly helpful in helping us put an end to this. It is also not fair to say that anything done or not done on Test could have prevented this issue. Only a server crash during a precise 10 second window would have produced this result...and that is not the fault of the Test population...they were extremely helpful in making sure the other significantly risky improvements to AO in this update ran as smoothly as they did, and we are very grateful for their help. Thank you, Nirvelle, for lending us your account! Nothing short of a huge live environment was going to expose this weakness...while I am sorry it happened at all, I do think that it could not have been repaired as quickly or as well as it was in any other conditions.
Stepping into the AO universe and making the kinds of changes we have been making for the past year is always horribly risky. I also like to think we are really beginning to get somewhere...and despite the loss of 11 hours this week I feel like we are starting to make good progress. Again, I am sorry that any of the pain and suffering we normally deal with when combating the old issues of AO spilled out onto our players. I hope we can enjoy another 6 years before this kind of issue gets us again.
I am going to choose to see the positive aspects of what we are all achieving together in making AO better rather than dwelling on this incident and letting it scare us away from making the dangerous changes that are necessary to make AO the game it always should have been. I love AO now...but it can be so much better as long as we never give up. We'll get there.
This week despite disasters:
Macrosun still found time this week to start work on an in-game mail system.
Genele continued work on the 150-200 new instance.
More balancing work and meetings abounded...we are just getting started.
Thank you to all of you who provide the support and inspiration to the whole team to make the super-human effort required to run AO worth every grey hair. Making us laugh at 4am wasn't easy...but you managed it. Thanks again.
Have a great Anarchy Online Birthday! Thank you all for helping us make it to an amazing 8 years. I'm looking forward to the next 8!
Colin "Means" Cragg
This Friday with Means can also be found in the Anarchy Online Forums.