Over at the cloud computing group on Google Groups there is an interesting discussion about optimal load utilization. Along the way, Tim Freeman brought up an interesting point:
Are there hidden costs at running this high in the first place? We’ve heard the opinion from someone who is in charge of buying 100s-1000s of computers a year that commodity hardware isn’t made to run at this capacity. That you’re not getting as much value for your money over time because of far higher failure rates (i.e., that failures don’t increase linearly with utilization and that there is usually a sweet spot)
So that got me to thinking …
Heat Really Does Kill
Obviously there are many factors in the failure rates of computing equipment (spinning disks, processors, etc.), but assuming you have not-horrible power cleanliness, the #1 enemy will be heat.
Heat. Heat. Heat.
So, with that in mind, one important way stuff becomes server-grade (i.e., expensive, non-commodity gear) is by cooling better than commodity gear does. Interestingly, server-grade gear also tends to squeeze that last hard-to-obtain chunk of performance out of the components and to provide varying amounts of built-in redundancy, both of which exacerbate the heat problem considerably, which in turn demands even better heat dissipation, which requires more power, and so on.
So in that sense Tim's contact has a point: when running at full utilization, most processors throw off lots of extra heat, necessitating (at the very least) extra gear to handle it.
And there’s always the chance that the heat will be poorly dissipated, thereby resulting in increased failures … yet that does not mean that buying server-grade gear is the right way to go anymore. Far from it.
A Better Choice
A couple of choices come immediately to mind -
- use lower-power components (as in laptop-grade stuff). These will naturally generate less heat, and thereby tend to reduce their self-inflicted failure tendencies.
- run much leaner power supplies than most vendors want to ship off the shelf
There are other ideas – some interesting, some dumb – but those are a few for starters.
Is the Commodity Gear Today What We Need?
Interestingly enough, most of the stuff that folks have bought to build out grids has been server-grade in drag, more or less. Just look at the components and the power supplies – high-energy-consumption processors, big power supplies, beaucoup fans, etc. Not always, of course, but that has generally been the norm.
In fact, it's this "server in disguise" gear that passes for commodity in most enterprise data centers today … fine so far as it goes. As Cameron pointed out in the thread, you can run the current commodity gear at 100% utilization with no particular increase in failure rates. True enough, but what if we think more aggressively?
In fact, let me go so far as to suggest that if we really are able to run at 100% for months without a failure, then we’ve massively overbuilt the “commodity” gear.
Which brings us back to what becomes possible in our infrastructure as we make the transition to clouds – public or private.
This is the Key – Absolutamente Crucial!
Underlying all of the power / failure related infrastructure choices is an unspoken reality – the real key to using commodity at scale is to ensure that the application will survive the failure of individual computers / drives / switches / whatever without losing a darn thing.
Once you do that, at the application level, then you are free to experiment with different infrastructure choices to your heart's content, different utilization rates, whatever comes to mind – provided that your apps don't care.
In other words, many of the benefits that may result from cloud computing – flexibility, scalability, lower costs, reliability, and so on – are actually enabled at the application layer.
One more thing – when failure of individual computers doesn’t matter to the application then you can pick lower power stuff that is also very cheap – now you’re starting to talk about a great cloud infrastructure.
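To make that concrete, here is a minimal sketch of what "the app doesn't care" can look like in practice – not any particular product's API, just a toy illustration of writing every value to several cheap boxes so that losing any one of them loses nothing:

```python
import random

class ReplicatedStore:
    """Toy key-value store that survives the loss of individual nodes
    by writing every value to several replicas (hypothetical example)."""

    def __init__(self, nodes, replicas=3):
        self.nodes = nodes          # list of dicts standing in for cheap commodity boxes
        self.replicas = replicas

    def put(self, key, value):
        # Write to N replicas; losing any single node loses nothing.
        for node in random.sample(self.nodes, self.replicas):
            node[key] = value

    def get(self, key):
        # Any surviving replica can answer the read.
        for node in self.nodes:
            if key in node:
                return node[key]
        raise KeyError(key)

# Simulate: ten commodity nodes, one dies, the data is still there.
nodes = [dict() for _ in range(10)]
store = ReplicatedStore(nodes, replicas=3)
store.put("order:42", {"sku": "widget", "qty": 7})
nodes[3].clear()                    # a box fails; the app shouldn't care
print(store.get("order:42"))
```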
So as you carry this thinking further then you can start to imagine a much more aggressive type of commodity, one as yet unrealized.
Start thinking of bare-bones, fairly dense components that are uber-cheap … sort of a lego-block approach. Cheap as in $300-$400 cheap all-up. Perfectly suited for enterprise-grade clouds – public or private – at least those that play by these new rules.
There was (and will continue to be) quite a bit more conversation on this point – it’s one of the more interesting parts of commoditization. In any case, in a future post I’ll outline some more thoughts on the “new commodity” that I believe is fundamentally possible.
Ran across a funny pair of dialogues (from a Niraj J, obviously an app guy!) before and after the adoption of private clouds in a hypothetical enterprise. At least it would be funny if the before case wasn’t so painfully true.
Before (edited for clarity):
Applications VP – I need to unarchive some 300 GB of data and then use it for some analytics that I need to perform at least once every month.
Infrastructure Guy – 1 GB costs about $X and one LPAR with 2 CPUs is about $Y per year. You need to multiply this by 5 years to get the ROI calculation for your project.
Applications VP – Wow! Why is the cost 1.7 times Fry's?
Infrastructure Guy – Well, it is all the overhead – the company needs to pay guys like us who ensure that additional storage is installed correctly and that your group adheres to all the norms we have established.
Applications VP – OK (whatever … since I do not have any options!), when can I get it?
Infrastructure Guy – It will take 2-4 weeks after the purchase order is approved and the quote submitted.
My only comment here is that it’s probably more like 8-whenever weeks in most enterprise shops, not 2-4.
In any case, here’s after:
Applications VP – I need to unarchive some 300 GB of data and then use it for some analytics that I need to perform at least once every month.
Infrastructure Guy – Here you go, call this API for adding storage and launching an instance. You will be charged by the hour.
Applications VP – Cool, I am charged ½ of what you guys charged me earlier and I have the ability to turn off my meter when I'm done.
Infrastructure Guy – Yes, they have cut down our group, and all my buddies who did not have scripting skills have been asked to go. I guess our overhead is now 1.1X as compared to 1.7X. Besides, if you consider the savings you get by switching your computing off when you don't need it, we are probably cheaper than Fry's.
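For what it's worth, the "call this API" step really is about that simple these days. Here is a rough sketch using Amazon's Python SDK (boto3) purely as a stand-in – the bucket name, machine image, and instance type are made up, credentials are assumed to be configured, and this is obviously not the actual API our hypothetical Infrastructure Guy handed over:

```python
import boto3

# Illustrative values only: bucket name, AMI id, and instance type are invented.
s3 = boto3.client("s3")
ec2 = boto3.client("ec2")

# "Adding storage": somewhere to put the 300 GB of unarchived data.
s3.create_bucket(Bucket="analytics-archive-example")

# "Launching an instance": one pay-by-the-hour box to run the monthly analytics.
resp = ec2.run_instances(
    ImageId="ami-12345678",      # hypothetical machine image with the analytics stack
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]

# ... run the job, then turn the meter off.
ec2.terminate_instances(InstanceIds=[instance_id])
```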
On a serious note, this sort of on-demand flexibility will be just as appealing within an enterprise as it clearly has been outside.
Furthermore, I think it's likely to be just as appealing to the operations folks as it obviously will be to the applications groups, for two halves of exactly the same reason – it makes life simpler.
It’s just a matter of time.
Note that I’m not commenting here on whether the private cloud is built entirely within an enterprise, is provisioned on-demand from an external provider, or both. All of those are possible, and perhaps even likely. This is just to illustrate why people will – and do – care.
Maybe the understatement of the year comes from a commenter at datacenterknowledge, when he said
I guess I’d have a few concerns.
To say the least!
As a photo from the company's brochure shows, the plan is to have "containerized data centers" on deck, with more conventional data centers below decks. The idea is to have the ships more or less permanently moored at docks – which makes the marketing picture a bit misleading – and I suppose staying moored would be essential for both power and bandwidth reasons.
In any case, the problems here could be enormous. For starters, I can think of concerns over
- Saltwater-Induced Corrosion.
- Commercial Extortion.
- Drunken Fishermen.
About the only things these would do better are limiting physical access and (perhaps) dissipating heat. For that matter, this is really just a band-aid solution to the fundamental problems that plague data centers today – energy consumption, heat dissipation, and most often the simple need for more space.
This is no solution for the core problems – it simply masks them with a different (pardon the pun) container.
A Better Plan
The beginning of a real solution is to make the decision to go to a commodity infrastructure, then utilize an application fabric to provide scalability, reliability, and simple operations for the apps and their underlying (and now commoditized) infrastructure.
Then you can select for metrics like capacity-per-watt and/or capacity-for-the-budget, without compromising scale, reliability, or operational integrity in any manner.
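As a back-of-the-envelope illustration of what "selecting for metrics" means once reliability lives at the application layer, here's a tiny sketch – the hardware numbers are invented for the example, not measurements of any real gear:

```python
# Hypothetical hardware options; the figures are made up purely to show the comparison.
options = {
    "server-grade box": {"capacity": 100, "watts": 400, "cost": 4000},
    "commodity box":    {"capacity": 60,  "watts": 180, "cost": 700},
    "laptop-grade box": {"capacity": 35,  "watts": 60,  "cost": 400},
}

for name, o in options.items():
    per_watt = o["capacity"] / o["watts"]       # capacity-per-watt
    per_dollar = o["capacity"] / o["cost"]      # capacity-for-the-budget
    print(f"{name:18s} capacity/watt={per_watt:.2f}  capacity/$={per_dollar:.3f}")
```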
You can even deploy in a cloud if you’d like.
The point is you’ll have the choice to do what makes the most sense, with no need to pick up a bunch of additional problems from problematic data centers.
One of my favorite parts of the role that Nicholas Carr is playing as an observer of modern computing culture, and a fomenter of useful change, is not so much what he has to say – and I think he says a lot of very insightful, very useful things – but what he triggers other people to say, think, and perhaps do.
At the very least, Carr certainly makes the conversation in our industry far more interesting.
The buzz around The Big Switch started a few months back, but really kicked into high gear just before Christmas. The book was formally released today, so I look forward to reading it soon.
Bernard Golden has a good review up at cio.com. From his review:
Carr argues, computing is moving from company-based data centers to large utility computing infrastructures run by the likes of infrastructure providers (e.g., Amazon and its EC2 offering) and centralized services run by application providers (e.g., Google Applications) …
… IT organizations will be superseded by end user organizations taking computing into their own hands, aided by the availability of centralized utilities and applications …
… The second half of the book goes in a different direction, though. Having described the advantages of centralized computing, Carr begins to methodically outline its drawbacks …
From a recent Q&A with Wired (part of the book buzz) comes this quote:
Wired: When does the big switch from the desktop to the data cloud happen?
Carr: Most people are already there. Young people in particular spend way more time using so-called cloud apps — MySpace, Flickr, Gmail — than running old-fashioned programs on their hard drives. What’s amazing is that this shift from private to public software has happened without us even noticing it.
All of these are pretty good points – not only are they hard to argue with, why would you want to?
The Sound of Inevitability
There is no doubt that, for the past ten years or so, clouds have been cutting a wide swath through much of the computing that people really do. Quietly until recently, but now it's a simple, widely accepted fact of life.
The Whole Story?
Yet … this is definitely not the whole story for the enterprise.
For those applications that are clearly present in the cloud – salesforce.com being the most obvious current-day enterprise example – there’s no doubt that end user organizations, with or without the cooperation and assistance of their IT organization, will simply roll their own.
Beyond these core services, however, most apps will still be built by somebody and run somewhere. Sure, some may be standard apps that are bought and deployed in a cloud, but they may just as easily (and more likely in many cases) be composite applications built out of the best components that you can live with, wherever they're found. In the cloud, in the data center, at somebody's house for that matter.
Anyplace that can meet the scale and data-security needs of that particular app.
The point is that the stuff that runs an enterprise has two main functions – it encapsulates what that enterprise knows how to do (hopefully better than their competitors), and it enables a big chunk of that company’s competitive advantage … and this is true no matter who builds it or where it runs.
That is why it is so important to begin building and deploying apps that are truly indifferent to the number of components and locations of the physical infrastructure, that are very happy with lots of commodity computers, that can just as easily make use of cloud apps and components and proprietary apps, and in any of these combinations will simply work as intended.
If we can do this while making it much simpler to build the app – and we can (and have) – then all the better!
Ok, well the broken stuff for today (so far) includes Digg and Yahoo Small Business.
Digg appears to be a partial failure, which is probably the most common kind in the SaaS and enterprise worlds (as I discussed yesterday). While much of the mainstream functionality seems to continue working, much of the stuff that differentiates Digg (adds its personality, so to speak) is missing in action.
Interestingly enough, as I talked about a few months ago, these are precisely the sort of functions that Digg's DBAs are proud of working hard to suppress. I understand that these can be difficult to scale (at least in old-school db-centric implementations, that is!), but the business needs them to add value.
And now these value-added functions are gone.
The good news? It could be worse – at least Digg is doing this to themselves. Yahoo, on the other hand …
Yahoo Acts Like a …
On Monday at 6:00AM PT, the systems that power our merchant stores experienced outages, and shoppers of those stores were met with either error messages or they were unable to complete the checkout process …
These issues lasted until about 1:00PM PT
Other than that things were fine.
The Good News
One bit of good news is that both companies have talked a bit about the problems. Perhaps it's because the problems affect such a high percentage of their customer bases – they're just too prevalent to ignore.
The Bad News
Are you kidding me?!??!?!? Both of these SaaS offerings are broken. While Digg is sort-of-working (in a limp-home sort of way), the poor unfortunate merchants who believed and relied upon the promises of Yahoo Small Business have just suffered major losses.
Not even an SLA would do much good here. Check out some of the comments to Riley’s fessin’ up:
This outage cost us big time in terms of money, our time and customer goodwill … Yahoo! should immediately come up with a plan to compensate merchants for this disruption of service on the most highly publicized day of online shopping.
Just telling us the time line of what happened isn’t very useful. We already know that as we watched it happen and suffered the lost business because of it.
Please also give me a good reason(s) why I shouldn’t switch to a different shopping cart provider at my earliest opportunity.
this was catastrophic…
Why is there no redundancy? I have lost faith.
it will take a class action suit for it to be addressed unfortunately
The justifiable outrage goes on and on and on. After all, what’s a merchant going to get back … part of a day’s hosting fee? As if that would compensate for half a day’s lost sales during the make-or-break time of the year for most merchants!
This is Getting Ridiculous
As an industry we just have to do better … customers have a right to expect better, and we must deliver. Talking about reliability and SLAs is simply not enough … we need to get it done.
Btw, this business of broken SaaS offerings is becoming such a common occurrence that I've added a new category to this blog, for your convenience. All such posts will be marked with the "mad kitty" (I'll explain later), and I think I'll probably go ahead and add a tab as well, because I'm pretty sure this topic is simply not going away for the foreseeable future (unfortunately).
As I have commented before, I am a big fan of Google's basic approach to scalable computing. There is much to like – enormous scale on commodity gear, rapid deployment of applications, and so on.
Yet it is by no means perfect.
In particular, there is a chronic level of failure in (at least) some of the flagship services that should not be acceptable in any modern day offering, least of all something which is a standard part of many people’s workflows.
For example, about a week ago I (an involuntary testing army of one!) had a whole day discombobulated by a series of failures in Google apps. Now today Gmail is down (at least for me), and has been for over an hour. I was in the middle of sending an email when it quit, dropping into an error screen that times out, falls back to a "waiting to retry" message, occasionally shows a different error, and then returns to waiting. Waiting but not working.
It's important to note that Google Reader (my current favorite RSS reader) and Google Groups (likewise) have both been working all morning, at least for me.
Chronic Partial Failures are Typical
I have no idea how widespread either the Gmail outages today or the rolling app outages of last week are/were. Even worse, I don't think our industry even has a good way to measure this phenomenon.
It's really a similar problem to the one the power utility industry has. Everybody pays attention to the widespread outages – for example, last winter my family, along with more than 500,000 of our closest friends, was without power for four days after an ice storm … a startling, yet beautifully surreal experience in itself … perhaps a post for another day! – but the much more common partial failures barely get discussed.
These partial failures affect some customers for part of the time, perhaps for one operation that didn't work out, or perhaps stretching out for hours or even days. Unfortunately, I think this type of partial failure is typical in any type of scaled-out system.
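One way to at least start measuring the phenomenon – a sketch of the idea, not any vendor's actual metric – is to track per-customer request success rates instead of a single up/down flag, so that "some customers, some of the time" actually shows up in the numbers:

```python
from collections import defaultdict

# Toy request log: (customer, operation, succeeded) tuples, invented for illustration.
requests = [
    ("acme", "checkout", True), ("acme", "checkout", False),
    ("acme", "search", True),   ("globex", "checkout", True),
    ("globex", "search", True), ("globex", "search", True),
]

per_customer = defaultdict(lambda: [0, 0])   # customer -> [successes, total]
for customer, _op, ok in requests:
    per_customer[customer][1] += 1
    if ok:
        per_customer[customer][0] += 1

# A binary up/down view would call this period "up"; the per-customer view
# shows acme seeing a 33% failure rate while globex sees none.
for customer, (ok, total) in per_customer.items():
    print(f"{customer}: {ok}/{total} requests succeeded ({ok/total:.0%})")
```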
It’s Not OK
All too often these sorts of failures (ESPECIALLY the partial ones) are waved off with cavalier comments of “typical”, or “what can you expect for free”, or some other such garbage.
That might have been OK when this was all new, and everybody was just thinking about how cool it was to have an (actually) usable service out there in the cloud, and wasn’t this all just great.
That was yesterday and this is today. It really isn’t OK for SaaS services to work sometimes and not work others … even the free ones.
And what about the truly enterprise applications?
What Can We Do?
This is precisely why we have been working relentlessly for the past six years to create a simple computing world that simply works. Ensuring reliability and scale at the architectural level (without requiring developers and operations folks to do a bunch of stuff each time) is absolutely essential to raising the bar on what we can all expect from scaled-out systems.
Do we need better metrics? Yes. Do we need SLAs with teeth? Absolutely. But we need more, much more – we need to deploy true fabric-like architectures, especially those suitable for the enterprise, and we need them now.
Quick update: in the time that it took me to write this post Gmail came back up. At one level that's good, but at another it's not – not if the problem is ignored as if it doesn't exist.
Update #2 – three hours later – broken again! (this time harder) Groups and Reader are still working mostly OK, though with some transitory weirdness in Reader.
Just ran across a very good post by Robin Harris from the misty dawn of time (last summer) stemming from the Google Scalability conference. Why should we care how Google scales? As Robin points out,
They roll out new applications for millions of users with surprising speed, especially compared to corporate IT. They build data centers with hundreds of thousands of servers – and millions of disk drives – and run it all on free software.
Costly corporate kit, like RAID arrays and 15k FC drives, aren’t used. Yet they do more work in an hour than most companies do in a year.
Google’s IT capabilities are a modern wonder of the world. Underneath the complexity though are just three simple rules. Rules that no enterprise data center (EDC) would ever think of following.
What are Google’s three rules?
- Cheap (use commodity everywhere)
- Embrace failure
- Architect for scale
It is very interesting to consider how these three principles interact. For example, admitting that stuff breaks, and making sure that isn't a problem, takes care of a big concern about using commodity equipment. "Architecting for scale" takes care of another concern about commodity gear – can I solve big enough problems?
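As a toy illustration of how "cheap" and "embrace failure" reinforce each other, here's a sketch (invented failure rates, not Google's actual machinery) of the basic pattern: a task that dies on one commodity box simply gets rerun on another, so individual failures stop being events anyone has to care about:

```python
import random

def flaky_worker(task):
    """Stand-in for a cheap commodity node that fails a noticeable
    fraction of the time (failure rate invented for the illustration)."""
    if random.random() < 0.2:
        raise RuntimeError("node died mid-task")
    return f"result of {task}"

def run_with_retries(task, workers=5):
    # Embrace failure: keep handing the task to another box until one finishes.
    for _attempt in range(workers):
        try:
            return flaky_worker(task)
        except RuntimeError:
            continue
    raise RuntimeError(f"{task} failed on all {workers} workers")

print(run_with_retries("crawl-shard-17"))
```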
In any case, what is the net effect for Google? Continuing with Robin’s post:
This is more than first-mover advantage. The faster they can grow, the greater their cost advantage over smaller, less nimble competitors. Their ROI brings them cheap capital, which increases their ability to invest in new businesses and more capacity. The higher their volumes, the cheaper growth becomes. A perfect storm.
All of this work is done by hordes of very smart folks at Google. Yet with all of the advantages, there are many limitations. There are well-known reliability problems, as well as complicated operations. Harris also points out that
Google's purpose-built infrastructure is also relatively inflexible: they can't just paste on (ACID) transaction processing.
That’s where application fabrics come in. To this potent set of rules we add one or two of our own, absolutely necessary to make commoditization practical for the enterprise. What are the additional rules?
I wonder if there's a "rolling brownout" in Google applications today?
Earlier in the morning Google Reader (generally a really decent app to have around) was hanging, going into eternal "loading" screens.
Since all of my blog / news feeds go through Google Reader (for now), I decided to switch gears and go research something. Except that Google search was down as well.
Suspecting my own machine or net connection, I tried Gmail … and it was working fine. So were a number of non-Google services. Hmmm.
A Rolling Brownout?
A couple of hours later the tables were turned. Same machine, same browser (really crazy, exotic stuff – MacBook Pro, Firefox, etc.). Search was back up, Reader was fine, but Gmail was down – one error screen several times in a row, and then a different one.
Gmail just started working again for me, half a day after the first outages. I have no idea how widespread this is or was, whether it's really solved, or even why it happened (other than that the cause very likely lies in the Google "cloud").
But that’s not really the point.
I have adapted my daily workflow to rely (in part) on some common SaaS offerings, and right now that’s not working out too well. Maybe that’s ok for an ad-supported offering in 2007 (especially if it’s eternally beta!), but how about the enterprise?
Would this level of (un)reliability be good enough for you?
Hope not … we can most definitely do better.
The fact that SaaS vendors can offer innovative, cool new services that are easy to start using and easy to operate is a given … nobody would argue with that.
But can an enterprise trust these offerings yet?
As Larry Dignan notes in a post about Coupa’s e-procurement offering, which is going to market on top of Amazon’s EC2 and S3, enterprise expectations are typically “five 9s” … that is, uptime of 99.999%.
What are Coupa’s chances of achieving five-nines as they stand now?
As I noted yesterday, I am really happy to see that Amazon is breaking the ice by offering an SLA for S3, their cloud storage offering. There are many questions about the quality of the SLA, but at least it’s a start.
What Are the Chances?
With that in mind, let’s look back at Coupa. For Coupa to work for a customer
- S3 must be up AND
- EC2 must be up AND
- the network access to EC2 and S3 must be up AND
- the user’s network and ISP must be up AND
- Coupa’s code must be working.
So in case you're keeping score, that's at least FIVE SERVICES THAT MUST BE WORKING for Coupa to work … if any of them fails, *poof*, it's nap-time. The only thing in this chain with an SLA is the storage (S3), and its SLA is only OK.
And suppose Coupa depends on other services as well to actually perform the e-procurement function (it definitely does, btw … that's part of the main point). Well then, the chances that it's all working go down even more.
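The arithmetic behind that chain is unforgiving: the availabilities of serial dependencies multiply. Here's a quick worked example – the individual figures are purely illustrative, not measurements of any of these services:

```python
# Hypothetical availabilities for each link in the chain (illustrative only).
chain = {
    "S3": 0.999,
    "EC2": 0.999,
    "network to AWS": 0.999,
    "user's ISP": 0.999,
    "Coupa's code": 0.999,
}

combined = 1.0
for availability in chain.values():
    combined *= availability

hours_per_month = 730
downtime = (1 - combined) * hours_per_month
print(f"combined availability: {combined:.5f}")           # ~0.99501 -- nowhere near five 9s
print(f"expected downtime: {downtime:.1f} hours/month")    # ~3.6 hours
```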
What Does This Mean?
I think this has three practical ramifications.
First of all, folks who are looking for easy-to-implement, easy-to-operate functionality (the SMB target market for offerings like Coupa) will use the offerings as-is, and simply put up with the outages. And yes, that has worked for salesforce.com so far.
However, over time the drive towards higher SLAs (both marketing-commitment SLAs and actually-delivered SLAs) will be relentless.
So the second consequence will be that the raw-ingredient services, such as cloud computing and cloud storage, will begin to offer SLAs, and over time back up those SLAs with delivered reliability.
The Most Important Consequence
The third, and probably most important, consequence is that the service offerings themselves must engineer their applications to ensure their own reliability, scalability, and operational integrity. The traditional approach to doing this leads to the development complexity that plagues our industry like especially vigorous kudzu in the Deep South of the US – frustrating, hard to combat, and forever getting in the way of anything productive.
That is precisely where fabric offerings like our EAF come in. EAF assumes that everything below the application will let you down, and simply protects against that automatically. So whether a service offering is hosted in-house or hosted in a cloud, the service itself will ensure that it is available and works as expected.
In a good move for the industry Amazon finally (in the past two weeks or so) put an SLA into place for S3, the storage half of their cloud computing offering.
Yes there are caveats, and yes the service levels are not enterprise-grade, but … at least they have something. And that's a start.
Basically, if in a given month S3 is available less than 99.9% of the time, you get a 10% refund for that month; if it's available less than 99% of the time, you get 25% back. I think the downtime is cumulative, so it can be either one big outage or a bunch of smaller ones. No word on how this will be measured, or whose observations will be sufficient to call S3 "down". All worth consideration and important to customers, but at least they're taking the first steps toward the inevitable.
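Put another way – here's a sketch of the published tiers as I read them; the thresholds are Amazon's, the code is just my illustration:

```python
def s3_service_credit(uptime_fraction):
    """Service credit per the S3 SLA tiers as I read them:
    below 99.9% monthly uptime -> 10% credit, below 99% -> 25% credit."""
    if uptime_fraction < 0.99:
        return 0.25
    if uptime_fraction < 0.999:
        return 0.10
    return 0.0

hours_per_month = 730
for outage_hours in (0.5, 1, 4, 10):
    uptime = 1 - outage_hours / hours_per_month
    credit = s3_service_credit(uptime)
    print(f"{outage_hours:>4} hrs down -> uptime {uptime:.4%}, credit {credit:.0%}")
```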
Salesforce.com Holds Out
It is interesting to compare the posting of an SLA for S3 with the refusal of salesforce.com to do the same. I just find it really curious that Benioff and the rest of the folks at salesforce.com seem stuck in time on this point … almost like they'd rather paraphrase one of my favorite lines from an old Bogart movie: "SLAs? We don't need no stinkin' SLAs!"
Is a simple, toothless “Trust Me” really enough anymore? For an enterprise? For anyone?
SLAs Will Happen … Just About Everywhere
I think that SLAs are simply inevitable for enterprise SaaS offerings. While much has been written about how Google’s use of “beta” and “labs” tags have lowered service level expectations universally (and they most definitely have), I tend to think that SLAs (both marketing-promise-type-SLAs and actually-delivered-SLAs) will become key areas of differentiation – especially where there is strong competition, and double-especially where there are subscription revenues.
It’s simply inevitable. The real questions will be 1) whether to create a marketing-promise-SLA or an actually-delivered-SLA, 2) how to achieve the promised service levels, and perhaps more importantly, 3) how to achieve those in an affordable manner.
Note that a serious SLA doesn't leave much room for error. To get a quick idea, take a look at the handy "downtime conversion table" that Dan Farber posted.
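The underlying arithmetic is simple enough to sketch – nothing vendor-specific here, just converting an uptime percentage into allowable downtime:

```python
minutes_per_year = 365 * 24 * 60

for nines in ("99%", "99.9%", "99.99%", "99.999%"):
    uptime = float(nines.rstrip("%")) / 100
    allowed = (1 - uptime) * minutes_per_year
    # 99% leaves ~3.65 days/year; five 9s leaves only ~5 minutes/year.
    print(f"{nines:>8} uptime -> about {allowed:,.0f} minutes of downtime per year")
```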
App Fabrics Make Reliability Practical
My guess is that one reason Salesforce.com is dragging its feet on an SLA is that they've taken rather traditional approaches to building out their commodity infrastructure. That's too bad, really, and just soooo unnecessary these days.
Think of the freedom of knowing that your computing substrate is simple, reliable, scalable … by definition, by design … in practice. That’s the reality with application fabrics.