Update: The WSJ posted a more detailed article elaborating on analysts’ concerns over whether RIM is in over its head, and how the repeated outages will affect its reputation with business customers. While interesting, note that the analysts quoted focus on the possibility that “RIM has inadequate backup systems”. I think the problem is more fundamental than that … read on for some thoughts as to the real culprit(s).
As has been widely reported (Another Black Eye for Blackberry – WSJ, Blackberry Network Recovering After Major Outage – DataCenter Knowledge, New Delay Hits Blackberrys – NY Times, …) Blackberry users (I am not one) appear to have suffered an extended outage (8+ hours) yesterday, perhaps extending through the evening. Reports of the locations affected vary, but they appear to have included the US and possibly other areas around the world.
What I thought was interesting was a tidbit in an official statement from RIM (emphasis mine):
“(The) root cause is currently under review, but based on preliminary analysis, it currently appears that the issue stemmed from a flaw in two recently released versions of BlackBerry Messenger (versions 22.214.171.124 and 126.96.36.199) that caused an unanticipated database issue within the BlackBerry infrastructure,” the company said.
So reading between the lines: a new version of a mobile app was released that started putting additional pressure on one or more key databases inside the core service. That pressure continued unabated through yet another release of the mobile app, hammering the key database(s) day after day, until outages finally began to appear (again).
On the positive side, RIM has in many ways operated one of the first SaaS offerings – the Blackberry messaging service – for many years now, in a way that is curiously low-profile yet crucial to many at the same time. Low-profile in the sense that when people talk about SaaS almost no one mentions RIM; crucial in the sense that so many subscribers rely on RIM to deliver their basic messaging.
It’ll be interesting to follow this a little further, but given the age of the Blackberry software infrastructure it wouldn’t be surprising at all to find relational databases (or perhaps some other centralized data store) here and there throughout the message flow. Perhaps only in some sort of metadata role, yet apparently still crucial enough that their failure can take out the entire service.
In any case, when all is said and done my guess is that this will simply be another case of traditional databases defining the scaling horizon of a service, limiting it in ways that are (these days) entirely unnecessary.
While the details on how traditional databases limit the scaling horizon of cloud-based systems require much more discussion than is reasonable for this post, just keep one thought in mind: any dependency that is not intrinsic to the actual problem, to the data itself, must be eliminated.
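To make that one thought a bit more concrete, here’s a minimal sketch (entirely my own illustration, with made-up names, and certainly not a description of RIM’s actual infrastructure) of what it looks like to keep a non-intrinsic dependency – the archival database write – out of the message path:

```python
import queue
import threading

# Hypothetical sketch: deliver messages directly, and archive to the data
# store off the critical path. A slow or failing database degrades
# archiving, not delivery.

archive_queue: "queue.Queue[str]" = queue.Queue()

class Recipient:
    def __init__(self, name: str):
        self.name = name
        self.inbox: list[str] = []

    def push(self, message: str) -> None:
        self.inbox.append(message)  # stands in for a network push to a device

def deliver(message: str, recipients: list[Recipient]) -> None:
    for r in recipients:            # the hot path touches no database
        r.push(message)
    archive_queue.put(message)      # archival is best-effort, asynchronous

def archiver(insert_fn) -> None:
    while True:                     # background drain into whatever store you like
        insert_fn(archive_queue.get())

# Daemon thread as a stand-in for a real archival writer.
threading.Thread(target=archiver, args=(print,), daemon=True).start()
deliver("hello", [Recipient("a"), Recipient("b")])
```

If the database slows down or falls over, the archive queue backs up, but delivery keeps flowing – which is the whole point.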
As much as I’d like to think that this will be the last Mad Kitty of 2009, experience indicates that this will probably not be so … unfortunately.
Two of the application areas for which cloud computing holds the most promise are the related fields of intelligence and military applications.
Even if you are not already intimately familiar with the types of computing problems that dominate these application areas, it’s easy enough to see how cloud computing – and of course I mean all sorts of clouds, with a particular emphasis on private clouds – can help.
After all, the very attributes of clouds that are so attractive to startups and enterprises alike – ease of scale, flexibility, low cost, and more – have tremendous appeal for intelligence and military applications as well.
A Military Perspective
I was recently interviewed for a story appearing in the current issue of Military Information Technology, entitled “Computing in the Clouds”. The story covers a number of cloud initiatives, with a focus on some things that are working and challenges that are looming.
Here is a cool quote from the story:
Appistry offers a linchpin technology for cloud computing, called the Enterprise Application Fabric, a cloud application platform for developing and managing large-scale, self-healing cloud applications rapidly on commodity hardware.
Why Is Appistry a “Linchpin Technology”?
In this quote the story captures precisely one of the concerns of both those pioneering and those contemplating cloud applications in the military and intelligence communities – sure, the inherent scale and flexibility are great, but what about the complexity?
Speaking from the IC side of the house, streaming full-motion video from a Predator UAV or a satellite image are huge files to deal with in terms of storage, processing and transport to a soldier in motion…
However, a disadvantage is the added complexity of virtualization, which is inherent in cloud architecture (em. added). “When we virtualize in a cloud, it is more difficult to unwind the problem should it arise. As virtualization increases, logical complexity grows,” Pierce pointed out.
- Ken Pierce, DIA-DS/C4ISR
He went on to say that his organization is already well-positioned to handle the added complexity – but what else can he really say?
The Real Value of a Cloud Application Platform
It is precisely in aggressively taking out complexity – both operational and developmental – while maintaining all of the goodness of clouds that this emerging thing the industry has begun calling a cloud application platform delivers the goods.
As you might expect, Appistry EAF as it exists today makes an excellent cloud application platform, and stuff that we’re hard at work on – even as we speak – will expand that lead.
And that is why Appistry is becoming a “linchpin technology”.
(with apologies to the good Dr. Seuss on the title – sorry, I just couldn’t help it)
The participants included:
- Robert X. Cringely, computer guy & moderator
- Anwar Ghuloum, Intel
- Charles E. Leiserson, Cilk Arts & MIT
- Dan Reed, Microsoft
- Marc Snir, University of Illinois at Urbana-Champaign
and me (go ahead and give me a hard time, I can take it).
The range of discussion was interesting, since the panel included perspectives more rooted in multi-core (Ghuloum and Leiserson), mass o’ machines (Reed, Snir), and a more uniform view across both broad classes (me, though I think that view may be shared by some of the other folks as well, at least a bit).
In addition, the panel was a mix of research and practical applications, which probably tended to color much of the discussion. All in all it made for cool (and hopefully not too boring for the audience!) conversation.
This is one panel that I probably would have much rather had in private, definitely accompanied by some really good adult beverages, but unfortunately we were constrained to an hour on a stage … and (at least for me) that hour passed by pretty fast.
An hour was probably enough time to begin to name a couple of the larger issues, but definitely not time enough for too much more.
Still, it did get me to thinking a bit …
A Few of the Issues
There were (and are) many more of course, but here are a few of the more dominant themes and issues that we discussed …
Market Pull. Whether it’s the inability of the processor manufacturers to build individually faster cores at any price we could stomach (hence the advent of multi-core), or the advent of practical clouds (both public and private) opening up the prospect of deploying REALLY BIG apps on LOTS of VMs, the market is clearly demanding new solutions for creating parallelized apps. No question about it.
Complexity is Bad. There was a general agreement that complexity is, well … complex and generally toxic to effective development of parallel apps. Some folks had more of a stomach for complexity than others, but all in all many of the efforts are trying to fundamentally simplify the developer’s task.
Need for New Abstractions. The Complexity Problem is not going to be solved by wishful thinking alone, no matter what Oprah says (sound bite alert). Hence everything from functional languages like F#, Erlang, and Scala, to frameworks like map-reduce, to data-driven reliable service abstractions like our own application fabric is in play as a way to simplify.
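As a toy illustration of the kind of simplification these abstractions buy (my own example, not anyone’s product): with a map-reduce-style word count, the developer writes two small side-effect-free functions and lets the framework – here nothing fancier than a process pool – worry about where and how the map phase runs in parallel.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

# Toy map-reduce: the programmer supplies a map function and a reduce
# function; scheduling of the parallel map phase is someone else's problem.

def map_phase(document: str) -> Counter:
    return Counter(document.split())

def reduce_phase(a: Counter, b: Counter) -> Counter:
    return a + b

if __name__ == "__main__":
    docs = ["the cloud is big", "the cloud is elastic", "fabrics hide the plumbing"]
    with Pool() as pool:
        counts = reduce(reduce_phase, pool.map(map_phase, docs), Counter())
    print(counts.most_common(3))
```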
Uncovering Inherent Data Orthogonality. I’ve gradually come to the opinion that some very high percentage of the apparent data dependencies that are anathema to effective parallel processing are not truly in the original problem. Rather, they are false dependencies, ones that we have inflicted on ourselves for no particularly good reason other than the tools, methodologies, or just bad habits that we bring to bear on our work.
(btw, don’t press me on a precise percentage or I’ll be forced to make something up here)
We’ve seen this with customers, and the more I look at new problems and how they are solved in most enterprises today, the more I see a big, massive goo of false dependencies.
Fix those, and we have a crack at effective parallelization in many cases.
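Here’s a contrived example of the distinction (again, mine, purely for illustration): the first version threads every record through a single running accumulator, which looks like a dependency but isn’t one the data demands; expose the records’ independence and the same work parallelizes trivially.

```python
from multiprocessing import Pool

def score(record: dict) -> float:
    return record["amount"] * record["rate"]   # each record is independent

# False dependency: the shared running total forces strictly serial
# execution, even though nothing in the data requires it.
def process_serial(records: list[dict]) -> float:
    total = 0.0
    for rec in records:
        total += score(rec)        # the accumulator is the only "dependency"
    return total

# The intrinsic dependency is just "sum the independent scores",
# which parallelizes trivially.
def process_parallel(records: list[dict]) -> float:
    with Pool() as pool:
        return sum(pool.map(score, records))

if __name__ == "__main__":
    data = [{"amount": 100.0, "rate": 0.03}, {"amount": 250.0, "rate": 0.01}]
    assert abs(process_serial(data) - process_parallel(data)) < 1e-9
```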
Where This is Going
I am very optimistic about progress in helping developers create actual parallel applications that can be used in the enterprise, in production solving problems about which people actually care.
The population of these well-done apps is going to be increasing dramatically in the months and years to come, which is a good thing … a very good thing.
The timing couldn’t be better … in truth, I don’t think we really have much of a choice.
We are very active in this space, and I have a particular interest in the “false dependency” problem. I’m sure I’ll be posting more on this in the future.
The comment thread on that post has been almost comical. Some folks are wondering if this is a political decision, some think the DB isn’t optimized, some people think it’s Ruby’s fault, some people are appealing to Twitter to not abandon the RoR community.
This is just nuts, and everybody should just calm down.
The Simple Truth
The application in question (Twitter) is fundamentally a messaging problem, and a modest one at that. Putting a DB into the flow IS THE PROBLEM.
Twitter has been asking a DB to act like a router, something which it is pretty bad at doing.
Why be surprised when it doesn’t work?
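To caricature the point in code (this is my own toy schema, not Twitter’s actual implementation): when the database is the router, every message is an INSERT and every follower’s refresh is another SELECT, so the load scales with followers times poll rate rather than with the messages themselves.

```python
import sqlite3

# Caricature of a database used as a message router: posting is an INSERT,
# and every follower's timeline is a polling SELECT. The database repeats
# per follower, per poll, the fan-out work a router would do once in memory.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, author TEXT, body TEXT)")

def post(author: str, body: str) -> None:
    db.execute("INSERT INTO messages (author, body) VALUES (?, ?)", (author, body))

def poll_timeline(following: list[str], since_id: int) -> list[tuple]:
    # Runs once per follower, per refresh -- cost grows with followers * poll
    # rate, independent of how many new messages actually exist.
    placeholders = ",".join("?" * len(following))
    q = f"SELECT id, author, body FROM messages WHERE id > ? AND author IN ({placeholders})"
    return db.execute(q, [since_id, *following]).fetchall()

post("alice", "140 characters or less")
print(poll_timeline(["alice"], 0))
```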
In the past week-plus, the whole business about Twitter scalability and reliability came to a head.
Yet, despite infrastructure that is visibly “hitting the wall”, it now appears that the company is attracting interest in a funding round at a decent valuation (it may even have signed one, but more on that later).
How is this possible?
I think the answer comes down to two things:
- Building a scalable micro-blogging site is not that hard
- There’s hoped-for value in all that traffic
Building a Scalable Micro-Blogging Site
If you’re starting from a clean sheet, the answer is that it’s not very hard to make a scalable micro-blogging platform. Contrary to some recent comments, the solution is not rocket science.
The key is simple (there’s a rough sketch in code just after this list):
- take the database out of the flow of messages. Of course, you still write to the db as it’s able to keep up (for archival purposes), but that’s about it.
- create objects that stand in for each subscriber, whether follower, followee, or both
- have them interact over a simple pub / sub model, reliable in-memory space, or both
- wrap all this in our application fabric to handle organization, reliability, and operations as you scale
- deploy on commodity (in-house or cloud)
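Here’s a rough, single-process sketch of that shape (hypothetical names, no fabric, no persistence – just the outline of the idea): subscribers live in memory, a broker does the pub/sub fan-out, and the database is reduced to an asynchronous archival sink.

```python
from collections import defaultdict
from queue import Queue

# Rough sketch only: in-memory subscriber objects, pub/sub fan-out, and the
# database demoted to an archival queue drained by some writer elsewhere.

archive: "Queue[tuple[str, str]]" = Queue()

class Subscriber:
    def __init__(self, handle: str):
        self.handle = handle
        self.timeline: list[tuple[str, str]] = []

    def receive(self, author: str, body: str) -> None:
        self.timeline.append((author, body))

class Broker:
    """Stands in for the pub/sub layer; in a real deployment this is where an
    application fabric would handle partitioning, reliability, and scale-out."""
    def __init__(self) -> None:
        self.followers: dict[str, list[Subscriber]] = defaultdict(list)

    def follow(self, author: str, sub: Subscriber) -> None:
        self.followers[author].append(sub)

    def publish(self, author: str, body: str) -> None:
        body = body[:140]                      # the 140-character cap keeps payloads tiny
        for sub in self.followers[author]:     # fan-out happens in memory
            sub.receive(author, body)
        archive.put((author, body))            # DB write is off the critical path

broker = Broker()
bob = Subscriber("bob")
broker.follow("alice", bob)
broker.publish("alice", "hello, world")
print(bob.timeline)
```

The Broker is the piece you’d hand to an application fabric to partition, replicate, and spread across commodity boxes.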
Of course this becomes pretty hard if you’re committed to Ruby on Rails, which is tightly tied to a database. Terrific for some stuff, not so good for high-volume messaging apps.
Are there other approaches? Sure, but nothing this conceptually simple, easy to implement, cheap to deploy, and brain-dead simple to operate.
What’s a Twitter to do?
Arrington posted some speculative usage numbers today that are useful in validating this approach. Remember that all of this is greatly aided by two simple facts:
- twitter messages are limited to 140 characters
- delivery expectations for SMS etc. are modest, at best
So the messages are easy to deliver, and forgiving with regard to when they get delivered … this is really fairly straightforward.
Of course, I’m sure the first order of business for the new technical folks is to stabilize the existing platform … then get to work on something that can be counted on for 10x, 100x, 1000x this amount of traffic.
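Just to put rough numbers on it – and to be clear, these are round figures I made up for illustration, not Arrington’s numbers – the arithmetic is forgiving:

```python
# Back-of-envelope only; the traffic figures below are hypothetical round
# numbers chosen for illustration, not anyone's actual data.
messages_per_day = 1_000_000
avg_followers    = 20
bytes_per_msg    = 140

deliveries_per_sec = messages_per_day * avg_followers / 86_400
bandwidth_bytes_s  = deliveries_per_sec * bytes_per_msg

print(f"{deliveries_per_sec:,.0f} deliveries/sec")   # ~231
print(f"{bandwidth_bytes_s/1024:,.1f} KiB/sec")      # ~32 KiB/sec of payload
```

Even with the fan-out, the payload bandwidth is tiny; the hard part is organization and reliability, not raw throughput.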
The very fact that Twitter is able to raise a round (at a decent valuation) despite the obvious problems is a vote from the venture community that the business will be worth building (traffic is decent, it is hoped to grow significantly, will be of some TBD value to someone), and that building a real infrastructure is eminently doable.
Give us a call!
Yesterday we talked about whether Twitter really ever needs to be reliable … some said yes, others contended that it’s not necessary.
It’s been bugging me for a while that something this popular … and Twitter is certainly that … just keels over as often as it does.
Anyhow, the whole argument turned into a bona-fide debacle this morning when GroupTweet (a relatively new feature that seems to have been confusing) was at the heart of disclosing private messages (DMs in tweet-speak) to tons of folks.
So now it looks like Blaine Cook is out as chief architect, and Michael Arrington is calling it the end of amateur hour. That’s probably a bit harsh, because my (limited) interactions with Cook have been pretty decent.
Btw, the comment thread on that last post is going crazy. My favorite so far is a short video comment from Loren Feldman (warning … his language is a bit over the top, but you do know where he stands!). Btw, check here if the first link to the video doesn’t work.
Having said that, we just have to build apps that act like real, grown-up (and you can call that boring if you want) apps … taking care of the data entrusted to them, working as expected, and working when we need them to work.
So I’m thinking that the answer to yesterday’s question is … YES. Twitter does need to figure out how to be reliable … and secure, scalable, and all the rest.
This is exactly the point that I’ve been making for a while … why build to POC quality when it’s now possible to ensure reliability, scalability, and so forth from the beginning?
People are Still People
I don’t care if this is Web 2.0, Enterprise 2.0, or Web 10,000,000,000.0 … consumer or enterprise … people are still people. They still care about their privacy, the reliability of stuff that they come to rely on, basic stuff like that. No free pass.
Even consumer-oriented web 2.0 apps need to ensure this, from the beginning.
Pretending that innovation in communication, biz, or technology somehow exempts us from the basics of social interaction is just … well, it’s just wrong.
This lesson is for our whole industry. Those who learn it will prosper, those who don’t …
A few outages ago I wondered aloud whether Twitter was taking the whole business of failure somewhat casually (triggered by some comments Blaine Cook made at SXSW).
Blaine replied with some great points, including:
For the record, saying that the press surrounding the downtimes was a plus was a joke. Downtime is never good, and you should do everything you can to avoid it. However, it’s a misrepresentation to say that you can build something successful without any downtime.
Our foremost concern has been and will continue to be ensuring a stable platform; we’ve been working hard on numerous fronts, and that work is paying off. Bad press is horrible, and I’ll be the first to take pleasure in never again seeing a “can Twitter scale?” story.
I believe him, and have quite a bit of empathy for the position he’s in. (I have a friend who always used to talk about “high class problems” and “low class problems” … Blaine and the other Tweetsters have a high class problem, but that’s a post for another day)
There was one point that he made that I fundamentally disagree with, however:
Scaling is a commitment, and one you should only make once you’re sure about an idea.
Yes, it is most definitely a commitment, but my contention is that we’re fast entering a time when scalability can be built in from the beginning with little to no additional effort.
This past weekend there was a bunch more instability. It looks like they were putting in some more caching to take the heat off of the data tier, and things went wacky.
Now Robert Scoble is making the case that Twitter is leaving the door wide-open for Friendfeed … (on Twitter of course, though my friend read it on Friendfeed!). Check out this short burst:
Michael Arrington thinks that the mass of the Twitter community makes this concern moot … basically, he contends that Twitter no longer needs to be reliable.
The Days of the Free Pass on Reliability are Over
Basically, I think that 1) the days of web2 services getting a free pass on reliability are rapidly passing, and are probably already over, and 2) it’s a shame to see stuff go whump when it’s sooo unnecessary.
As for the days of the free pass being over, check out Dennis Howlett’s (zdnet) comments on the most recent outage … he’s generally making the case that Twitter itself is really a POC for some better service yet to come, something more suitable to much larger markets.
Could that V1 service be Friendfeed? Maybe. Of course, it’s too early to write off Twitter entirely … they’ve also hired their own scaling cavalry (including the ever-helpful Google expat!), so maybe they’ll catch a second wind before the whole sector passes them by.
Build to Scale … From the Beginning
Back to Blaine’s comments. I can completely understand the notion of building a proof of concept … besides, in the web 2.0 world it’s long been accepted practice to throw something out there, and only build to scale when you figure out whether anyone cares.
That makes a lot of sense when building apps to scale is so freakin’ hard. BUT … easing that pain is precisely the point of stuff like our app fabric.
That is why it is my core contention that the ability to scale and be reliable, even for the most trivial services, is going to become the price of entry very soon (if it has not already become so).
I’d like to propose a simple thought experiment. Consider this question:
What if computing is free?
While we’re at it, assume that scale is always sufficient for the problem at hand, latency is acceptable, your applications always work, and that operations are cheap enough to be in the noise.
What’s the Point?
The point of this is simple enough. One answer to this thought experiment was Google … and that worked out pretty well.
Google would not be possible without commodity infrastructure, and apps that assume that they have (more or less) free, unlimited, access to that infrastructure.
Same for most of web 2.0 – after all, most bigger sites are (very loosely) built around some of the same principles. While there are some notable exceptions (eBay) and many fundamental differences exist, the common meta-trend is that commodity is the right choice for the biggest, gnarliest, most demanding applications.
Now for the Enterprise
Yet that thought has not really begun to penetrate most enterprises. Kind-of-commodity may be OK in a fairly stateless web tier, and perhaps for some occasional modeling or research apps, but elsewhere the closest you’ll find are racks of expensive, heavily managed blade farms.
Those blade farms may help with operations, but since those farms are normally driven from the operations side of the enterprise, they don’t mean much to the apps. Consequently, these farms haven’t done much for scale for most apps.
Plus they’re still expensive.
Of course, they ARE most definitely commodity when compared with the Z-class mainframes that still dominate the batch settlement / customer service operations that are so prevalent in enterprises the world over.
A Financial Services Example
We have a financial services customer who decided to instantiate this thought experiment – they’ve implemented their settlement infrastructure on commodity. Commodity organized by an application fabric (ours!), so that it is reliable, arbitrarily scalable, and very cheap to operate.
The results? They’re matching industry norms for settlement performance on Z-class mainframes with a handful of commodity boxes … and they can keep scaling for a few hundred bucks at a time. Plus it’s reliable, and never gets more expensive to operate.
That will change their industry.
Back to the Thought Experiment
Over the past couple of years I keep running into organization after organization whose existing operations are built on the constraints of expensive, heavy, traditional computing. Constrained by state, constrained by the data tier, constrained by I/O, constrained by budgets … but mostly constrained by human nature, by organizational inertia, by just thinking about the problem the way it’s always been thought about.
Whole industries, for that matter.
Time to change that – ask yourself, what if computing is free?
In my earlier post I commented on my own little experiment about Web 2.0 infrastructure’s ability to handle even modest-interest events.
Well, my initial verdict was that most players had fallen flat on their faces … badly.
Of course, the Techcrunch post somehow overlooks the equivalent failures over at Crunchgear – imagine that!
On the negative side these failures show just how immature these architectures still are … of course, on the positive side it absolutely demonstrates the widespread need for something better … something like our EAF!
I was experimenting a bit with different ways to track the Macworld keynote from my cozy office. I figured this would be a good nano-metric of how far we’ve matured web 2.0 scaling techniques, particularly when focusing on delivering an event experience. This is a perfect example of a specialized community – bigger than some, smaller than many.
Sorry to say that every venue that was directly trying to cover the keynote gets a mad kitty. I was hoping for better.
Starting with a sort of “irritated kitty” were engadget and gizmodo … both of their live posts timed out quite a bit at first, then settled down and responded ok, albeit slowly.
Tried following the Twitter feed of TUAW, but Twitter’s website started timing out a few minutes before the start of the keynote. It kept timing out for another five or ten minutes, so I gave up. Btw, I don’t know about their SMS distribution, since I chose to turn it off for the time being (I do have biz to do!).
[update] Twitter died hard during the keynote.
[update] Crunchgear was apparently pretty hosed during the event.
Ran into “stevenotelive.com” and tried that … had it up for a half hour before the start of the event, during which they ran a continuous loop boasting about their “ridiculous bandwidth”. They had a little counter that was over a thousand 15 minutes before the start of the event, and then …
They were right, it was pretty ridiculous. So ridiculous that when the event started it just blew its guts all over the floor. I was never able to get a connection, much less sustain any streaming whatsoever.
Apple didn’t even try to broadcast the event, and the promised iPhone updates were not available when Jobs said that they were.
But probably the biggest mad kitty of all goes to Randy Newman, who apparently compared the US and President Bush to Hitler and Stalin. The US has zillions of flaws, but that’s not even rational. I think this dude’s been on the road (or something else) WAY too long. Time to retire, Randy.
For an event that is small in absolute terms but big by tech-world standards, not much worked very well. As far as I’m aware there weren’t any live streaming systems that worked at all. The text-based wide-distribution stuff (I focused on Twitter) didn’t make it. The blog-based stuff worked intermittently, and the iPhone update server reported errors for a while.
Of course, this is not a comprehensive view … just a little core-sample from our own industry’s back yard. I think the right answer is that our industry still has a long way to go to handle modest special-interest scale, without even beginning to deal with truly society-wide scale.
First things first, I suppose – let’s start using more app fabrics to make the basics work better!
[update] Apple software update does appear to be creaking along now, a half hour after the keynote ended.
[update 2] The Macbook Air looks pretty cool though, I must admit!
[update 3] Engadget sort of owns up to their outages … nobody else does so far.
[update 4] Crunchgear posts their own mea culpa.
[update 5] In a big bit of irony, Techcrunch bags on Twitter, but gives Crunchgear a pass.