Linden Lab have blogged about the reasons for the maintenance this week: The Hardware Issues Behind Recent Region Restarts. The problem, the blog explains, was down to a hardware failure which first manifested itself in July, when it took down four of Linden Lab’s new generation hosts. At the time Linden Lab put it down to one of those things that happen. I’m not surprised; that would be my first port of call too, because you don’t want to think it may be a nasty hardware fault.
However, in early October the same thing happened with another four hosts, which would have raised suspicions. A fortnight later the same fault manifested itself with another four hosts, at which point Linden Lab fully realised that this wasn’t just one of those things. Linden Lab’s blog post is very transparent about what happened here and also gives us some insight into how things work server side:
Each host lives inside a chassis along with three other hosts. These four hosts all share a common backplane that provides the hosts with power, networking and storage. The failures were traced to an overheating and subsequent failure of a component on these backplanes.
After exhaustive investigation with our vendor, the root cause of the failures turned out to be a hardware defect in a backplane component. We arranged an on-site visit by our vendor to locate, identify, and replace the affected backplanes. Members of our operations team have been working this week with our vendor in our datacentre to inspect every potentially affected system and replace the defective component to prevent any more failures.
Now the question some may have is “Why didn’t Linden Lab explain this at the start of the week?” The answer, I suspect, is that Linden Lab wanted to be sure that the issue was being fixed during this maintenance window before informing their customers of the details.
Linden Lab are confident that no more failures of this type will happen in the future, which is a step further than I’d have gone in such a message, but I admire their confidence:
The region restarts that some of you have experienced this week were an unfortunate side-effect of this critical maintenance work. We have done our best to keep these restarts to a minimum as we understand just how disruptive a region restart can be. The affected machines have been repaired, and returned to service and we are confident that no more failures of this type will occur in the future. Thank you all for your patience and understanding as we have proceeded through the extended maintenance window this week.
[Struck through: However it should be noted that the Grid Status page has not yet been updated to inform us that maintenance is complete.] The Grid Status page now informs us that the maintenance is complete!
The excellent Grid Status page should get greater visibility on the main Second Life website. There’s a lot of good information there that could alleviate many a frustrating time for residents. Linden Lab have communicated the issues very well, and I very much welcome the much improved communication that Ebbe Altberg promised.
Now let’s hope that this sort of transparent communication continues.
The answer, I suspect, was LL trying to hide the real problems, which they’ve shown to be their pattern in the past.
Try and be more honest.
What do you think the real reasons are? I believe LL here, and I say this as someone who has been in a similar situation due to hardware vendor failure.
I’ve been in a position where LL have explained an issue and it has pissed me off greatly. I’m talking about the OpenSpace fiasco here, and I will never ever forgive LL for that, but on this issue I believe them. I don’t see spin here.
It’s hard for me to believe someone is telling the truth when they have a history of lying in the past.
It’s the old, “Fool me once, shame on you. Fool me twice, shame on me.” theorem which serves most intelligent people well.
If you don’t see “spin”, well, that’s your choice.
I do not agree, I think it is a clear and desirable one.
Yes, I know that many may think that what they are doing is shoving more regions onto the same servers to make some bucks at the cost of the CDN, pipeline and ultimately… us, the customers.
I still hope not, and that Linden Lab just prevented something from happening that would be terrible for them: a major failure of the service (OSgrid has been down for over 4 months already due to a hardware failure and still hasn’t recovered, but that is a non-profit organization, not a business).
No, Gridstatus does not report the maintenance to have been completed. You are looking at the wrong entry. http://status.secondlifegrid.net/2014/11/14/post2417/
Ah, good spot. That explains why I started to write that the Grid Status page was not updated and then put a line through it: I was reading a different scheduled maintenance update.