Update: The WSJ posted a more detailed article elaborating on concerns among analysts over whether RIM was in over their head, and how the repeated outages would affect their reputation with business customers. While interesting, note that the analysts quoted focus on the possibility that “RIM has inadequate backup systems”. I think the problem is more fundamental than that … read on for some thoughts as to the real culprit(s).
As has been widely reported (Another Black Eye for Blackberry – WSJ, Blackberry Network Recovering After Major Outage – DataCenter Knowledge, New Delay Hits Blackberrys – NY Times, …) Blackberry users (I am not one) appear to have suffered an extended outage (8 hours+) yesterday, perhaps extending through the evening. Reports of the locations affected vary, but they appear to have included the US and possibly other areas around the world.
What I thought was interesting was a tidbit in an official statement from RIM (emphasis mine):
“(The) root cause is currently under review, but based on preliminary analysis, it currently appears that the issue stemmed from a flaw in two recently released versions of BlackBerry Messenger (versions 22.214.171.124 and 126.96.36.199) that caused an unanticipated database issue within the BlackBerry infrastructure,” the company said.
So reading between the lines, a new version of a mobile app was released that started putting additional pressure on one or more key databases inside the core service. That problem continued unabated through another release of the mobile app, while the key database(s) continued to be affected, each and every day, until finally outages began to appear (again).
On the positive side, RIM has in many ways operated one of the first SaaS offerings for many years – the Blackberry messaging service – in a way that is curiously low-profile yet crucial to many. Low-profile in the sense that when people talk about SaaS almost no one mentions RIM; crucial in the sense that so many subscribers rely on it to deliver their basic messaging.
It’ll be interesting to follow this a little further, but given the age of the Blackberry software infrastructure it wouldn’t be surprising at all to find relational databases (or perhaps some other centralized data store) here and there throughout the message flow, perhaps only in some sort of meta role, yet apparently still crucial enough that their failure can take out the entire service.
In any case, when all is said and done my guess is that this will simply be another case of traditional databases defining the scaling horizon of a service, limiting it in ways that are (these days) entirely unnecessary.
While the details on how traditional databases limit the scaling horizon of cloud-based systems require much more discussion than is reasonable for this post, just keep one thought in mind: any dependency that is not intrinsic to the actual problem, to the data itself, must be eliminated.
As much as I’d like to think that this will be the last Mad Kitty of 2009, experience indicates that this will probably not be so … unfortunately.