A couple of weeks ago Second Life was subject to a Grid Roll, on a Friday of all days. Friday is not a good day for doing work like this. For a start, if it all goes wrong, people might have to work on Saturday, and people generally don’t want to work on a Saturday when they could be at home relaxing.
However, the reasons for the Grid Roll and the inevitable disruption it caused have been carefully explained by April Linden in a blog post: Why the Friday Grid Roll? Why indeed. It’s a good question and April provides a good answer. The brief answer is that there was a bit of a security storm, which needed to be fixed quickly, but which also needed to be handled quietly so that the Grid Roll could be completed with enough care and attention that it didn’t cause further disruption.
The post also gives us a useful insight into how different teams at Linden Lab work together. The Security team handed the issue over to the Operations Team, who submitted the fixes to ensure the security issue was dealt with. However, at that stage the fixes weren’t live.
The Development and Release teams then did their stuff and pulled the fixes into the server image, but Linden Lab weren’t finished yet.
Before the fixes were unleashed upon an unsuspecting virtual world public, the Quality Assurance (QA) team checked that the fixes weren’t causing chaos for Second Life. Once they were happy, the rollout started, but only to the Release Channels at first, to make sure that the fixes really weren’t going to cause chaos.
Once Linden Lab were happy that chaos really was not going to ensue, they rolled the fixes out to the rest of the grid, and the day for that rollout was Friday. This of course did cause disruption, but with the issue being a rather important security issue, Linden Lab really didn’t have the luxury of waiting for the weekend to be over before rolling this out. The basic timeline was:
- Tuesday – Issue Identified
- Tuesday Night / Early hours of Wednesday – Operations Team provide fixes.
- Wednesday PM – Development & Release Team implement fixes into Server code. Security Team confirm fix works.
- Wednesday Middle Of Night – QA Team confirm that Second Life is performing fine with fixes running.
- Thursday AM – Fixed Code rolled out to Release Channels and monitored for rest of the day.
- Friday – Linden Lab are happy and roll out the code to the whole grid.
The blog post explains why there was some urgency behind this roll and why communication wasn’t as open as people may have wished:
We took Thursday to watch the RC channels and make sure they were still performing well, and then went ahead and rolled the security update to the rest of the grid on Friday.
Just to make it clear, we saw no evidence that there was any attempt to use this security issue against Second Life. It was our mission to make sure it stayed that way!
The reason there was little notice for the roll on Thursday is two fold. First, we were moving very quickly, and second because the roll was to mitigate a security issue, we didn’t want to tip our hand and show what was going on until after the issue had been fully resolved.
I find blog posts like this informative and interesting. This is also the sort of thing I sometimes encounter in my day job, so I have great sympathy with the Linden Lab teams.
April has explained the situation in a relatively easy-to-follow fashion, and it’s good to see that Linden Lab, even when faced with a security issue, don’t panic … well, maybe they had a little bit of panic once they realised what they had to do, but it was well managed.