A while back there was a flurry of activity around a startup proposing floating datacenters. At the time I thought it was kind of a dumb idea … and then came another post with a more on-point headline:
“Floating Data Centers Miss the Point, Add a Bunch of Risk, and Will Keep You Up At Night; On the Other Hand, Deploying Your Applications on a Cloud of Commodity Computers With Appistry’s Application Fabric Will Deliver the Goods” (note: slight edit)
At first glance this appears to be a more far-reaching version of the floating data center notion, adding an interesting (though still fairly conceptual) energy-generation angle.
Self-generated power ups the potential benefits (beyond cooling and portability) a big step … yet the two biggest hurdles remain.
Hurdles That Remain
Leaving expense aside, how to connect sufficient bandwidth to a floating data center remains an enormous challenge – whether wireless or wired, it’s just going to be difficult.
Second, this really will need to be at least as sturdy as The Unsinkable Molly Brown if it’s going to have any value beyond conceptual banter.
Whether from terrorists, storms, or just inexplicable mistakes, the prospect of all those computers ending up wet, wet, wet is a sobering one, indeed.
A Big Idea
Bottom line, I think this is a “big idea” that will pop back up from time to time, and will probably even have some very flashy demos and prototypes.
But … and this is a very big but … I think its days as a practical alternative for hosting stuff that we really care about are still a long way off … if ever.
Update 1: This story also showed up on Slashdot, with a decent discussion following.
Self-healing is at the core of much of what makes application fabrics work – whether you deploy on commodity gear in your own shop or on a cloud.
It just has to be a non-event when stuff breaks – both for the work in progress, and for the “structural integrity” of the app fabric itself. Ensuring both enables the app fabric to provide the simple abstraction of a reliable, uber-scalable computing surface.
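To make “non-event” a bit more concrete, here’s a toy supervision loop in Python – my sketch, assuming nothing about our fabric’s actual internals – that recovers in-flight work and replaces a dead worker without the rest of the app noticing:

```python
import queue
import threading
import time

# Toy self-healing sketch (not our fabric's real implementation):
# if a worker dies mid-task, the supervisor re-queues the lost work
# and starts a replacement, so the failure is a non-event for the app.

tasks = queue.Queue()
in_flight = {}            # worker name -> task currently being processed
lock = threading.Lock()

def worker(name, crash_once=False):
    while True:
        task = tasks.get()
        with lock:
            in_flight[name] = task
        if crash_once:
            return        # simulate a hardware fault: the thread just dies
        time.sleep(0.01)  # pretend to do the work
        with lock:
            del in_flight[name]
        tasks.task_done()

def supervise(name, crash_once=False):
    t = threading.Thread(target=worker, args=(name, crash_once), daemon=True)
    t.start()
    while True:
        t.join(timeout=0.1)
        if not t.is_alive():                  # worker died: heal
            with lock:
                lost = in_flight.pop(name, None)
            if lost is not None:
                tasks.put(lost)               # the work in progress survives
                tasks.task_done()             # settle the dead worker's get()
            t = threading.Thread(target=worker, args=(name,), daemon=True)
            t.start()                         # replacement preserves capacity

for i in range(10):
    tasks.put(f"task-{i}")
threading.Thread(target=supervise, args=("w1", True), daemon=True).start()
threading.Thread(target=supervise, args=("w2",), daemon=True).start()
tasks.join()   # returns once all ten tasks complete, crash and all
print("all work finished despite the failure")
```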
At any rate, I was reminded of this a couple of weeks ago when this video first made the rounds… cool stuff.
I particularly like it because it’s a great illustration of the value of simple goals & simple organizational rules, both in theory & in practice.
Take a few minutes to watch it, if you have not already … and ponder how cool it is when application developers can just assume that their underlying infrastructure behaves as resiliently as do these robot blocks.
Great work guys … keep it up!
In the past week+ the whole business about Twitter scalability & reliability came to a head.
Yet, despite infrastructure that is visibly “hitting the wall”, it now appears that the company is attracting interest in a funding round at a decent valuation (it may even have signed one, but more on that later).
How is this possible?
I think the answer comes down to two things:
- Building a scalable micro-blogging site is not that hard
- There’s hoped-for value in all that traffic
Building a Scalable Micro-Blogging Site
If you’re starting from a clean sheet, it’s not very hard to build a scalable micro-blogging platform. Contrary to some recent comments, the solution is not rocket science.
The key is simple:
- take the database out of the message flow. Of course, you still write to the db as it’s able to keep up (for archival purposes), but that’s about it
- create objects that stand in for each subscriber, whether follower, followee, or both
- have them interact over a simple pub/sub model, a reliable in-memory space, or both
- wrap all this in our application fabric to handle organization, reliability, and operations as you scale
- deploy on commodity gear (in-house or cloud)
Of course this becomes pretty hard if you’re committed to Ruby on Rails, which is tightly tied to a database. Terrific for some stuff, not so good for high-volume messaging apps.
Are there other approaches? Sure, but nothing this conceptually simple, easy to implement, cheap to deploy, and brain-dead simple to operate.
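To make the recipe above concrete, here’s a minimal sketch in Python – a toy, assuming nothing about Twitter’s actual internals or our fabric’s API – with subscriber objects, in-memory fan-out, and the database write pushed off the critical path:

```python
import queue
import threading
from collections import defaultdict

# Toy micro-blogging flow (illustration only): messages fan out over an
# in-memory pub/sub layer, and the database is archival, not in the loop.

class Subscriber:
    """Stands in for one user; holds that user's timeline in memory."""
    def __init__(self, name):
        self.name = name
        self.inbox = []

    def deliver(self, author, text):
        self.inbox.append((author, text))

followers = defaultdict(list)   # author -> Subscriber objects
archive_queue = queue.Queue()   # db writes drain here, off the hot path

def archiver():
    while True:
        author, text = archive_queue.get()
        # db.insert(author, text)  <- archival write, as the db keeps up
        archive_queue.task_done()

threading.Thread(target=archiver, daemon=True).start()

def follow(subscriber, author):
    followers[author].append(subscriber)

def publish(author, text):
    assert len(text) <= 140, "tweets are capped at 140 characters"
    for sub in followers[author]:       # in-memory fan-out, no db here
        sub.deliver(author, text)
    archive_queue.put((author, text))   # write-behind for archival

alice, bob = Subscriber("alice"), Subscriber("bob")
follow(alice, "carol")
follow(bob, "carol")
publish("carol", "hello, world")
print(alice.inbox)   # [('carol', 'hello, world')]
```

The whole trick is the write-behind queue: followers see the message immediately, and the archive catches up whenever it can.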
What’s a Twitter to do?
Arrington posted some speculative usage numbers today that are useful in validating this approach. Remember that all of this is greatly aided by two simple facts:
- twitter messages are limited to 140 characters
- delivery expectations for SMS etc. are modest, at best
So: small, easy-to-deliver messages, and forgiving expectations about when they get delivered … this is really fairly straightforward.
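A quick back-of-the-envelope (with a made-up volume, since I won’t repeat Arrington’s figures here): even at ten million messages a day, 140 bytes apiece comes to roughly 1.4 GB of message payload per day – a trivial amount for an in-memory fan-out layer to move.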
Of course, I’m sure the first order of business for the new technical folks is to stabilize the existing platform … then get to work on something that can be counted on for 10x, 100x, 1000x this amount of traffic.
The very fact that Twitter is able to raise a round (at a decent valuation) despite the obvious problems is a vote from the venture community that the business is worth building (traffic is decent, it is expected to grow significantly, and it will be of some TBD value to someone), and that building a real infrastructure is eminently doable.
Give us a call!
Yesterday we talked about whether Twitter ever really needs to be reliable … some said yes, others contended that it’s not necessary.
It’s been bugging me for a while that something this popular … and Twitter is so … just keels over as often as it does.
Anyhow, the whole argument turned into a bona-fide debacle this morning when GroupTweet (a relatively new feature that seems to have been confusing) was at the heart of disclosing private messages (DMs in tweet-speak) to tons of folks.
So now it looks like Blaine Cook is out as chief architect, and Michael Arrington is calling it the end of amateur hour. That’s probably a bit harsh, because my (limited) interactions with Cook have been pretty decent.
Btw, the comment thread on that last post is going crazy. My favorite so far is a short video comment from one Loren Feldman (warning … his language is a bit over the top, but you do know where he stands!). Btw, check here if the first link to the video doesn’t work.
Having said that, we just have to build apps that act like real, grown-up (and you can call that boring if you want) apps … taking care of the data entrusted to them, working as expected, and working when we need them to work.
So I’m thinking that the answer to yesterday’s question is … YES. Twitter does need to figure out how to be reliable … and secure, scalable, and all the rest.
This is exactly the point that I’ve been making for awhile … why build to POC quality when it’s now possible to ensure reliability, scalability, and so forth from the beginning?
People are Still People
I don’t care if this is Web 2.0, Enterprise 2.0, or Web 10,000,000,000.0 … consumer or enterprise … people are still people. They still care about their privacy, the reliability of stuff that they come to rely on, basic stuff like that. No free pass.
Even consumer-oriented web 2.0 apps need to ensure this, from the beginning.
Pretending that innovation in communication, biz, or technology somehow exempts us from the basics of social interaction is just … well, it’s just wrong.
This lesson is for our whole industry. Those who learn it will prosper, those who don’t …
A few outages ago I wondered aloud whether Twitter was taking the whole business of failure somewhat casually (triggered by some comments Blaine Cook made at SXSW).
Blaine replied with some great points, including
For the record, saying that the press surrounding the downtimes was a plus was a joke. Downtime is never good, and you should do everything you can to avoid it. However, it’s a misrepresentation to say that you can build something successful without any downtime.
Our foremost concern has been and will continue to be ensuring a stable platform; we’ve been working hard on numerous fronts, and that work is paying off. Bad press is horrible, and I’ll be the first to take pleasure in never again seeing a “can Twitter scale?” story.
I believe him, and have quite a bit of empathy for the position he’s in. (I have a friend who always used to talk about “high class problems” and “low class problems” … Blaine and the other Tweetsters have a high class problem, but that’s a post for another day)
There was one point that he made that I fundamentally disagree with, however:
Scaling is a commitment, and one you should only make once you’re sure about an idea.
Yes, it is most definitely a commitment, but my contention is that we’re fast entering the time when it can be built in from the beginning with little to no additional effort.
This past weekend there was a bunch more instability. Looks like they were putting in some more caching to take the heat off the data tier, and things went wacky.
Now Robert Scoble is making the case that Twitter is leaving the door wide open for Friendfeed … (on Twitter of course, though my friend read it on Friendfeed!). Check out this short burst:
Michael Arrington thinks that the mass of the Twitter community makes this concern moot … basically, he contends that Twitter no longer needs to be reliable.
The Days of the Free Pass on Reliability are Over
Basically, I think that 1) the days of web2 services getting a free pass on reliability are rapidly passing, and are probably already over, and 2) it’s a shame to see stuff go whump when it’s sooo unnecessary.
As for the days of the free pass being over, check out Dennis Howlett’s (zdnet) comments on the most recent outage … he’s generally making the case that Twitter itself is really a POC for some better service yet to come, something more suitable to much larger markets.
Could that V1 service be Friendfeed? Maybe. Of course, it’s too early to write off Twitter entirely … they’ve also hired their own scaling cavalry (including the ever-helpful Google expat!), so maybe they’ll catch a second wind before the whole sector passes them by.
Build to Scale … From the Beginning
Back to Blaine’s comments. I can completely understand the notion of building a proof of concept … besides, in the web 2.0 world it’s long been accepted practice to throw something out there, and only build to scale when you figure out whether anyone cares.
That makes a lot of sense when building apps to scale is so freakin’ hard. BUT … easing that pain is precisely the point of stuff like our app fabric.
That is why my core contention is that the ability to scale and be reliable, even for the most trivial services, is going to become the price of entry very soon (if it has not already).
Having a bit of “down time” with fam and friends has been great … hope yours has been at least as good.
Seems like about a zillion years since I was buying student tickets for games at Mizzou (the University of Missouri), but like many folks I’ve continued to follow their teams over the years. The only real problem with that: for most of the past 30 years they’ve been pretty bad … well, actually, that would be understating things … a lot!
But all that’s changing now, and for the first time in my lifetime I got to see Mizzou play a football game in January. With polite apologies to any of you Arkansas fans out there, today’s 35-7 win over Arkansas was awesome.
I mention all of this to give you a bit of background on the rest of this post.
In a recent post I introduced the “mad kitty”, which I will use on any post discussing reliability problems, failures, over-promised and under-delivered features and services, and any other part of our industry that should be done better.
I will continue to use the kitty as events warrant. If everyone were doing the best possible, then I’d back off … would it really make sense to get mad at a cow for not being a very good conversationalist?
But the simple reality is that the days when it was OK for a SaaS offering to go down for “system maintenance”, or for a suddenly hot web2 site to crash under load, are way, way behind us. Besides, it’s simply not necessary anymore … not with the availability of intrinsically scalable and reliable technologies like our app fabric, at least.
So I will continue to comment on outages, particularly when the organization in question really should know better. And the kitty will mark those posts.
Turns out that kitty has a bit of a heritage.
As much as I like Mizzou, I have a couple of brothers whose fandom almost makes me look completely indifferent by comparison. Both of them have always been part of communities like tigerboard.
In any case, for as long as I can remember Mizzou had a “paw” logo. Simple, looked great, easily recognizable. There was only one problem: it looked a little too much like Clemson’s. Never mind that theirs was orange, turned sideways, and generally looked pretty different – Clemson’s attorneys complained and Mizzou blinked.
So time for a new logo.
Having nothing better to do (or at least deciding to do nothing better), the fan community speculated intensely about the new logo. Lots of arguing, opinions, people getting mad, blah blah blah.
Then one of my brothers decided to tell people that he’d gotten an advance copy of the new logo. When people challenged him to produce it, he took about five minutes in a drawing program and came up with the mad kitty.
People went nuts. I mean, there were a bunch of people who were really mad.
All sorts of words went flying – “this is stupid”, “my kid could’ve done better” – and far, far worse. It’s almost like these guys didn’t have proper upbringings or something.
After a while my brother let on that he might have been speculating just a wee bit. In fact, it may be that the university had gotten a bona fide designer to come up with a real logo (which they did, of course).
When you think about it, there are some times when the fact that these communities of interest are fairly virtual can be pretty handy.
This was one of those times for my brother.
Turns out that kitty has had a bit of a life since then. Mad Kitty logo gear has sold a bit on CafePress, and he continues to show up in all sorts of places in and around Mizzou-land.
Kitty even ended up doing a bit of traveling. When my other brother was deployed to Afghanistan and Iraq post-9/11, he took a series of photos entitled “Kitty goes to war”. I’ll probably post some of those every once in awhile.
This all turned into a bit of fun for our family, and eventually for a bunch of other Mizzou fans as well. I hope that you’ve also had plenty of things that you could laugh about over this holiday season.
Here’s to a great 2008!
Ok, well the broken stuff for today (so far) includes Digg and Yahoo Small Business.
Digg appears to be suffering a partial failure, which is probably the most common kind in the SaaS and enterprise worlds (as I discussed yesterday). While much of the mainstream functionality seems to continue working, the stuff that differentiates Digg (adds its personality, so to speak) is missing in action.
Interestingly enough, as I talked about a few months ago, these are precisely the sort of functions that Digg’s DBAs are proud of working hard to suppress. I understand that these can be difficult to scale (at least in old-school db-centric implementations, that is!), but the business needs them to add value.
And now these value-added functions are gone.
The good news? It could be worse – at least Digg is doing this to themselves. Yahoo, on the other hand …
Yahoo Acts Like a …
On Monday at 6:00AM PT, the systems that power our merchant stores experienced outages, and shoppers of those stores were met with either error messages or they were unable to complete the checkout process …
These issues lasted until about 1:00PM PT
Other than that things were fine.
The Good News
One bit of good news is that both companies have talked a bit about the problems. Perhaps it’s because the problems themselves affect such a high percentage of their customer bases – they’re just too prevalent to ignore.
The Bad News
Are you kidding me?!??!?!? Both of these SaaS offerings are broken. While Digg is sort of working (in a limp-home kind of way), the poor unfortunate merchants who believed and relied upon the promises of Yahoo Small Business have just suffered major losses.
Not even an SLA would do much good here. Check out some of the comments to Riley’s fessin’ up:
This outage cost us big time in terms of money, our time and customer goodwill … Yahoo! should immediately come up with a plan to compensate merchants for this disruption of service on the most highly publicized day of online shopping.
Just telling us the time line of what happened isn’t very useful. We already know that as we watched it happen and suffered the lost business because of it.
Please also give me a good reason(s) why I shouldn’t switch to a different shopping cart provider at my earliest opportunity.
this was catastrophic…
Why is there no redundancy? I have lost faith.
it will take a class action suit for it to be addressed unfortunately
The justifiable outrage goes on and on and on. After all, what’s a merchant going to get back … part of a day’s hosting fee? As if that would compensate for half a day’s lost sales during the make-or-break time of the year for most merchants!
This is Getting Ridiculous
As an industry we just have to do better … customers have a right to expect better, and we must deliver. Talking about reliability and SLAs is simply not enough … we need to get it done.
Btw, this business of broken SaaS offerings is becoming such a common occurrence that I’ve added a new category to this blog, for your convenience. All posts will be marked with the “mad kitty” (I’ll explain later), & I think I’ll probably go ahead and add a tab as well, because I’m pretty sure this topic is simply not going away for the foreseeable future (unfortunately).
As I have commented before, I am a big fan of Google’s basic approach to scalable computing. There is much to like – capital-E Enormous scale on commodity gear, rapid deployment of applications, and so on.
Yet it is by no means perfect.
In particular, there is a chronic level of failure in (at least) some of the flagship services that should not be acceptable in any modern day offering, least of all something which is a standard part of many people’s workflows.
For example, about a week ago I (an involuntary testing army of one!) had a whole day discombobulated by a series of failures in Google apps. Now today Gmail is down (at least for me), and has been for over an hour. I was in the middle of sending an email when it quit, dropping to an error screen that times out, goes to a “waiting to retry” message, occasionally shows a different error, and then goes back to waiting. Waiting, but not working.
It’s important to note that Google Reader (my current favorite rss reader) and Google Groups (likewise) have both been working all morning, at least for me.
Chronic Partial Failures are Typical
I have no idea how widespread either the Gmail outages of today or the rolling app outages of last week are/were. Even worse, I don’t think our industry even has a good way to measure this phenomenon.
It’s really similar to a problem the power utility industry faces. Everybody pays attention to the widespread outages – for example, last winter my family, along with more than 500,000 of our closest friends, were without power for four days after an ice storm … that was a startling, yet beautifully surreal experience in itself … perhaps a post for another day! – but much less is said about the far more common partial failures.
These partial failures affect some customers for part of the time, perhaps for one operation that didn’t work out, or perhaps stretching out for hours or even days. Unfortunately, I think this type of partial failure is typical of any type of scaled-out system.
It’s Not OK
All too often these sorts of failures (ESPECIALLY the partial ones) are waved off with cavalier comments of “typical”, or “what can you expect for free”, or some other such garbage.
That might have been OK when this was all new, and everybody was just thinking about how cool it was to have an (actually) usable service out there in the cloud, and wasn’t this all just great.
That was yesterday and this is today. It really isn’t OK for SaaS services to work sometimes and not work others … even the free ones.
And what about the truly enterprise applications?
What Can We Do?
This is precisely why we have been working relentlessly for the past six years to create a simple computing world that simply works. Ensuring reliability and scale at the architectural level (without requiring developers and operations folks to do a bunch of stuff each time) is absolutely essential to raising the bar on what we can all expect from scaled-out systems.
Do we need better metrics? Yes. Do we need SLAs with teeth? Absolutely. But we need more, much more – we need to deploy true fabric-like architectures, especially those suitable for the enterprise, and we need them now.
Quick update: in the time it took me to write this post, Gmail came back up. At one level that’s good, but at another it’s not – not if the problem is ignored as if it doesn’t exist.
Update #2 – three hours later – broken again! (this time harder) Groups and Reader still working mostly OK, though some transitory weirdness in Reader.
Just ran across a very good post by Robin Harris from the misty dawn of time (last summer), stemming from the Google Scalability conference. Why should we care how Google scales? As Robin points out,
They roll out new applications for millions of users with surprising speed, especially compared to corporate IT. They build data centers with hundreds of thousands of servers – and millions of disk drives – and run it all on free software.
Costly corporate kit, like RAID arrays and 15k FC drives, aren’t used. Yet they do more work in an hour than most companies do in a year.
Google’s IT capabilities are a modern wonder of the world. Underneath the complexity though are just three simple rules. Rules that no enterprise data center (EDC) would ever think of following.
What are Google’s three rules?
- Cheap (use commodity everywhere)
- Embrace failure
- Architect for scale
It is very interesting to consider how these three principles interact. For example, admitting that stuff breaks – and making sure that isn’t a problem – addresses a big concern about using commodity equipment. “Architecting for scale” answers the other big commodity-gear question: can I solve big enough problems?
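Here’s a toy illustration of “embrace failure” in Python – my sketch, not Google’s code – where chunks of work are simply retried when a flaky commodity worker drops them:

```python
import random

# "Embrace failure" in miniature (illustration only, not Google's code):
# assume cheap workers will die, and just retry the chunks they drop.

def flaky_worker(chunk):
    if random.random() < 0.2:     # a commodity box fails 20% of the time
        raise RuntimeError("worker lost")
    return sum(chunk)             # the actual work: a partial sum

def run_with_retries(chunks, attempts=10):
    total = 0
    for i, chunk in enumerate(chunks):
        for _ in range(attempts):      # failure is expected; retry elsewhere
            try:
                total += flaky_worker(chunk)
                break
            except RuntimeError:
                continue               # "reassign to another machine"
        else:
            raise RuntimeError(f"chunk {i} failed {attempts} straight times")
    return total

data = list(range(1_000_000))
chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
print(run_with_retries(chunks))        # correct answer despite the failures
```

Because failure is assumed and handled, using cheap gear stops being scary – and adding more chunks (and more cheap workers) is exactly how you scale.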
In any case, what is the net effect for Google? Continuing with Robin’s post:
This is more than first-mover advantage. The faster they can grow, the greater their cost advantage over smaller, less nimble competitors. Their ROI brings them cheap capital, which increases their ability to invest in new businesses and more capacity. The higher their volumes, the cheaper growth becomes. A perfect storm.
All of this work is done by hordes of very smart folks at Google. Yet for all of the advantages, there are real limitations: well-known reliability problems, as well as complicated operations. Harris also points out that
Google’s purpose-built infrastructure is also relatively inflexible: they can’t just paste on (acid) transaction processing.
That’s where application fabrics come in. To this potent set of rules we add one or two of our own, absolutely necessary to make commoditization practical for the enterprise. What are the additional rules?
One of the often-repeated baseball truisms is that “you can never have too much pitching”. Even if you don’t know anything about baseball, you can tell this is true just by searching on that phrase and seeing what comes up. Go ahead: I’ve made it easy!
(for the non-baseball folks out there: Bob Gibson is one of the absolute all-time greats, a pitcher’s pitcher … every baseball team that ever was or ever will be would love to have Mr. Gibson on their team)
Simplicity Really Matters
In the world of scalable applications there is a rule above all rules – simplicity really matters. Or in tribute to the tattered, yet still great game of baseball, “you can never have too much simplicity”.
You can say this many different ways, but the reality is that in order to really build scalable systems we must strive for the simplest abstractions possible.
For a minute I thought I was reading one of our new marketing pieces (I wasn’t) … Nikita Ivanov seems to be all over the “scalability simplified” theme. I agree with his basic point, of course, but there’s more to the story.
Making It Real
Even Ivanov’s jab at Nati Shalom illustrated an underlying reality, ignored all too often – enabling a simple world can be complicated. Any complexity needs to support an elegantly simple abstraction, such as the one we present; the problems arise when that complexity is exposed, as it is in the vast majority of computing architectures.
In any case, just arguing for development simplicity (while commendable) isn’t enough. After all, somebody has to deploy and operate what you build.
The Whole Story
So yes we must deliver simplicity to the developer … that is a key for enabling scalable applications. But don’t forget the other two legs to this stool:
- Operational Simplicity. The biggest fabrics (or grids) absolutely must be at least as simple to operate as a single server … no matter how big they get.
- Reliability. A fabric must be able to simply ensure the reliability of each operation – this is crucial for being able to rely on commodity infrastructure.
Take them together (development simplicity, reliability, and operational simplicity) and you have an approach that’s meaningful. That is exactly what people are discovering with application fabrics.
Go ‘git me some of that simplicity!