Robin Harris must have woken up grouchy today – he’s dumping all over cloud hysteria on this fine Monday. After throwing the obligatory it’s-all-marketing punch (the truth is that there IS a bunch of marketing, but there’s also a bunch of real substance … more on that in a minute), he gets down to business.
I am paraphrasing a bit, but here are his main points:
The only real key to Google’s low cost structure is active cluster storage – if it’s productized, anyone can be as cheap as Google (including your own datacenter).
Networks are still the thinnest resource in the computing landscape.
Consequently only low-data-rate applications are suitable for the cloud – all others will (or at least should) stay local.
Robin makes some good, albeit incomplete, points, though I'm not too sure about his conclusion. Go read his post, then let's look at his reasoning a bit at a time.
The Main Points
The only real key to Google’s low cost structure is active cluster storage - if it’s productized, anyone can approach Google’s economics (including your own datacenter).
This is probably the biggest miss – perhaps even more critical than the reliable commodity storage (which is important!) are all of the applications that natively run on commodity infrastructure. Each app generally runs as well as that particular app needs to, and runs in a way that allows for some sort of operational sanity.
Google (& Amazon & others) have built a number of frameworks to make this true for their own applications, of course. Sometimes they build these sorts of capabilities directly into the applications themselves. For everyone else, there is a clear need for platforms that reliably scale applications on commodity infrastructure – that is precisely why we built the application fabric.
Simple, coherent operational capabilities are also crucial. When a commodity infrastructure can basically run itself, it becomes a lot more attractive as a deployment option for the serious enterprise.
Networks are still the most limited resource in the computing landscape.
True beyond a shadow of a doubt! Robin makes a good point that the rate of improvement for networks lags behind other parts of computing (like his native storage land). My only caveat is that, while clearly limited, network bandwidth is just as clearly sufficient for many, many mainstream applications (particularly when structured as described below).
Consequently only low-data-rate applications are suitable for the cloud – all others will (or at least should) stay local.
I think many applications will clearly stay local – some for technical reasons, some for security, control, and/or cultural reasons, some just because.
Having said that, some data-intensive applications will still move to a cloud, provided that the data is stored near the corresponding computing elements. This alternative is already beginning to play out, as in the Amazon EC2 / S3 combo (among others). With this approach all high-bandwidth data operations are effectively local.
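To see why co-location matters, a bit of back-of-the-envelope arithmetic helps. The dataset size and link speeds below are hypothetical, but the asymmetry is the point:

```python
def transfer_hours(gigabytes: float, megabits_per_sec: float) -> float:
    """Hours needed to move a dataset at a given sustained link speed."""
    return gigabytes * 8 * 1000 / megabits_per_sec / 3600

# Hypothetical numbers: a 1 TB dataset over a 100 Mbps WAN link,
# versus a 10 Gbps link inside the provider's datacenter.
print(transfer_hours(1000, 100))     # ~22 hours across the WAN
print(transfer_hours(1000, 10_000))  # ~13 minutes when the data is local
```

Once the bulk data sits beside the compute, only the thin request / response traffic has to cross the network Robin is worried about.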
In the rush to argue for or against cloud computing, many infrastructure-centric folks are missing a couple of key considerations – namely the critical nature of the applications and the need for simple operations.
Good grid-enabled applications (and this includes the storage layer) can run on commodity infrastructure wherever it's located – in a cloud or close to home – scale as needed, be both reliable and secure, operate themselves, and be far cheaper than today's apps.
In reality the argument between clouds and grids / application fabrics can become simply a deployment decision – and that may be the best news of all.
I’d like to propose a simple thought experiment. Consider this question:
What if computing is free?
While we’re at it, assume that scale is always sufficient for the problem at hand, latency is acceptable, your applications always work, and that operations are cheap enough to be in the noise.
What’s the Point?
The point of this is simple enough. One answer to this thought experiment was Google … and that worked out pretty well.
Google would not be possible without commodity infrastructure, and apps that assume they have (more or less) free, unlimited access to that infrastructure.
Same for most of web 2.0 – after all, most bigger sites are (very loosely) built around some of the same principles. While there are some notable exceptions (eBay) and many fundamental differences exist, the common meta-trend is that commodity is the right choice for the biggest, gnarliest, most demanding applications.
Now for the Enterprise
Yet that thought has not really begun to penetrate most enterprises. Kind-of commodity may be OK in a fairly stateless web tier, and perhaps for some occasional modeling or research apps, but elsewhere the closest thing is racks of expensive, heavily managed blade farms.
Those blade farms may help with operations, but since they're normally driven from the operations side of the enterprise, they don't mean much to the apps. Consequently, they haven't done much for scale for most apps.
Plus they’re still expensive.
Of course, they ARE most definitely commodity when compared with the Z-class mainframes that still dominate the batch settlement / customer service operations that are so prevalent in enterprises the world over.
A Financial Services Example
We have a financial services customer who decided to instantiate this thought experiment – they’ve implemented their settlement infrastructure on commodity. Commodity organized by an application fabric (ours!), so that it is reliable, arbitrarily scalable, and very cheap to operate.
The results? They’re matching industry norms for settlement performance on Z-class mainframes with a handful of commodity boxes … and they can keep scaling for a few hundred bucks at a time. Plus it’s reliable, and never gets more expensive to operate.
That will change their industry.
Back to the Thought Experiment
Over the past couple of years I keep running into organization after organization whose existing operations are built on the constraints of expensive, heavy, traditional computing. Constrained by state, constrained by the data tier, constrained by I/O, constrained by budgets … but mostly constrained by human nature, by organizational inertia, by just thinking about the problem the way it's always been thought about.
Whole industries, for that matter.
Time to change that – ask yourself, what if computing is free?
I often spend time with startups that are interested in building really successful, great big and hugely profitable companies. Thinking big thoughts from the beginning …
One of the things that these guys see in the fabric is an opportunity to cleanly scale as they gain customers. Rather than living with the limitations of traditional architectures, these guys plan to get it right from the beginning.
I Mean Full Service
Now I should point out that when I say I work with guys like this, well, I mean exactly that.
In fact, I'm even hosting a company that's currently in stealth mode at my house. I don't mean near my house, I mean in my house. 24 x 7! Eat, sleep, and WORK there … one of the guys doesn't even have a car, so not much slackin' happening with that team. I even cook sometimes, which is way more than any combinator-y thingee is doing, to be sure.
They’re doing some great new stuff that I think will turn into a pretty cool platform in a very under-served market, so they’re planning to succeed.
That’s why they’re relying on Appistry to help them handle success. By making that choice they can focus on their core functionality, and we’ll ensure that their platform will scale as needed, run on commodity, be mindlessly simple to operate, and just work.
Why Did They Do This?
Pretty simple. Since EAF handles scale, they can just plan for success from day one. No need to scramble when they’re successful, no fear of driving away traffic.
Maybe Microsoft can survive their widespread Xbox Live problems over the recent holidays, but there’s not a startup on this planet or ten others that would make it. When you get your chance to impress a new visitor, your stuff better work well.
Starts With a Decision
It might have been necessary in the Mesozoic Era of Web 2.0 to just throw something together to see if anybody cared, then build it over if you found out somebody did … but to do that anymore is just dumb.
I suppose I ought to put that in more elegant terms … to plan on redoing your applications for each major surge in growth is poor business.
So plan on scaling beyond your wildest imaginations. That may seem like a Captain Obvious statement, but it can be easy to overlook. And overlooking this point can lead to BIG problems.
OK Is Good Enough
Recently I was working with another stealth-mode startup that also has some very cool new stuff. This cool new stuff has two major parts – they had already decided to use the fabric to impart all of the fabric goodness to part 1. So far so good.
Then we got to part 2, and that's where their plans still needed some work. As we went through the system architecture for that part, it became clear that most traffic eventually hit a single database.
I got pretty worked up in this conversation – made an impassioned plea for treating the database as the store of last resort, and pointed out how basically every site that has gone to serious scale has moved the database out of the transaction flow. Techniques like eventual consistency, sharding, and many other interesting approaches are absolutely required for serious scale.
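For what it's worth, the core sharding idea fits in a few lines. A minimal sketch, with hypothetical shard hostnames standing in for real database instances:

```python
import hashlib

# Hypothetical shard hosts -- in a real deployment these would be
# separate database instances, each holding a slice of the data.
SHARDS = ["db-shard-0.example.com", "db-shard-1.example.com",
          "db-shard-2.example.com", "db-shard-3.example.com"]

def shard_for(key: str) -> str:
    """Route a record to a shard by hashing its key. A stable hash
    (not Python's built-in hash()) keeps every caller in agreement."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-42"))  # the same key always lands on the same shard
```

Routing by key keeps any single database out of the path of every transaction; eventual consistency and friends then handle the cross-shard cases.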
All to no avail – they were happy with the single DB.
This really puzzled me. These are seriously smart guys about to make a really bad decision. One that everybody who's gone far down that path knows is a really bad decision. Why were they thinking this way?
The Root of the Problem
As we talked some more (and I calmed down a bunch … mostly), the reason became clear – they're not planning to need any scale in the customer-facing part of their app.
In other words, they’re not planning on world domination, sucking up every possible web visitor and turning them into a sticky customer, building the sort of traffic that drives the very real Bubble 2 economics (at least this bubble has very real revenue streams!).
So naturally they were satisfied with OK. But I doubt their future investors will be.
Moral of the Story
My guess is that upon further reflection they’ll see that making all of their apps fabric-enabled, making them capable of running on a lovely uber scalable, reliable grid of commodity boxes (your own or cloud-based) just makes sense.
It’s in their interest and it’s easy …
So my guess is that they will do what the other company (the one that I’m hosting in my house) did from day one – decide to fabric-enable everything, so that their crazy business guys can do everything possible to obtain every customer possible for as long as possible, driving as much total revenue as possible …
… and live happily ever after!
(well maybe the fabric can’t help with that last point, but it can do all the other stuff!)
Sam Charrington, colleague and friend, spent a few days at the Gartner Application Architecture, Development, and Integration Summit last week. One of the more interesting things about Gartner shows is the analyst briefings. While there is no single place that can definitively define what's going on in markets as diverse as those in which we participate, these briefings are a good place to take a snapshot of what things look like today.
Besides, the quality airport time always gives you a chance for reflection, a time to ponder where the market has been, where it is today, and where it’s going. The weather must’ve been bad, because Sam came back very reflective!
When We Started
After we’d incubated a bit, had technology in hand and had started making the rounds of prospective investors, one refrain that we heard over and over again could be summarized something like this:
Why bother? Everything about application development has been settled for good. There’s no room for any more innovation in application development and deployment.
I personally heard this too many times to count. It was almost like a technology version of the French knights in Monty Python and the Holy Grail, as in "we already got one of those!" (about a minute or so into this linked video).
While folks like Massimo Pezzini (a Gartner analyst who covers this sort of thing) and a handful of others didn’t buy into this line of thinking, they thought we might be talking about a new niche, something out at the edge. Massimo coined the term XTP (extreme transaction processing), and has been building on that theme.
At the summit last week Massimo observed that
By 2012, mounting user need for XTP applications and technology innovation will propel at least one new software vendor into leadership in the application platform market with more than 15% market share in the XTP platform segment.
In fact, he recently issued a report entitled The Birth of the Extreme Transaction-Processing Platform: Enabling Service-Oriented Architecture, Events and More. Great report, well worth the read.
The name alone tells the story … big innovation is here. Sam brought all this back from the summit, and that got me to thinking.
Why is all of this innovation happening now? That's a really, really good question. I think there are a number of factors, but two really stick out to me:
- Scale. Most of the status quo architectures just can't keep up with where conditions are driving the enterprise. Call it a fire hose, customer demand, web-scale, or competition – in fact, call it what you like – the simple reality is that in the early days of the third millennium, enterprises (both new and old) need their computing infrastructures to scale. Scale really, really well, and do it simply, reliably, and cheaply.
- Desire for Simplicity. While a certain amount of complexity may be inevitable, anyone who is deeply involved in writing, deploying, or operating applications today knows that there are just too many moving parts, they're too hard to move and arrange as needed, and they simply don't work well enough. Stuff breaks when it shouldn't, and it's hard for enterprises to keep it all running.
I think the desire for these had been forming for some time, but the rise to dominance of such high-profile players as Google, Amazon, and a few others has shown that new rules are possible. That maybe, just maybe, it might be possible for an enterprise to conceive of its application and computing infrastructures scaling as needed, working reliably, being simple to operate, and being deployable on commodity infrastructure.
While the rise of the whole virtualization industry is a partial answer to the "desire for simplicity", it's not the whole story, not by any means. In fact, it is the inability of the existing players (you know who you are!) to shake the chains that bind them that has opened the doors for new players.
This is now cold, sober reality. Simple, reliable, easy to operate scale on commodity infrastructure. Here today, in production at serious enterprises, in the core of their operations – where it matters.
I think the next six, 12, 18 months are going to be very exciting times indeed. As Massimo indicates, I too think we are contributing to the birth of a new platform.
While there may be others who eventually make it, we are driving hard to continue earning the trust of our customers, partners, and the community so that we are the first “new software vendor” who takes a “leadership position … with more than 15% market share”.
So that’s a reasonable next step, but more is definitely possible. Much more.
That question got me to thinking a bit, and as I promised in A Moore’s Law for Software – Part One, here’s a (very) modest proposal.
More on the Underlying Problem
When you really think about Moore’s Law, it’s a measure of raw capacity. Capacity, but not capability. It’s the job of software to take that raw capacity and turn it into useful capability, but save that thought for later.
In the meantime it’s easy to see how Moore’s Law operates (raw, first-order capacity doubles every 18 months) when measuring stuff like transistors on a chip (the original), disk drive capacity, or perhaps network bandwidth.
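Just to make the compounding concrete, a couple of lines of arithmetic (using the popular 18-month variant of the doubling period):

```python
def capacity(base: float, months: float, doubling_period: float = 18.0) -> float:
    """Raw capacity after `months`, doubling every `doubling_period` months."""
    return base * 2 ** (months / doubling_period)

print(capacity(1.0, 120))  # ten years at an 18-month doubling: ~100x
```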
But these three examples are not really the same anymore.
In a way they used to be more alike, back in the day when most applications generally ran on increasingly faster single processors. In other words, the increased transistor count on the processors basically translated pretty quickly into increased capacity for running applications.
But all that began to break down when the organization of those transistors became more complicated. Underlying limits on chip clock speeds constrained I/O capacity, memory bandwidth, and ultimately processor speed. To deal with that, system architects and designers introduced various forms of parallelization – pipelining, multiple data paths, caching, multi-core, multi-processor – some of which are transparent to software, most of which are not.
Making use of the forms of parallelization that require a developer to take conscious action is not so easy. In fact, the mainstreaming of multi-core chips is bringing this problem front and center …
… where it is combining with system-level architects’ desire to increase system level capacity and resiliency by making use of many individual processors, perhaps even commodity processors.
These meta-trends are combining with several others (uber-cheap hardware of all kinds, nearly unlimited network bandwidth, ubiquitous mobile devices, and more) into a “perfect storm” for application developers – where we are in danger of dying of thirst in the Garden of Eden.
In other words, we have an abundance of raw system-level capacity, and our apps still can’t scale. Which leads me to my proposal.
A Moore’s Law for Software
Software must be able to scale. Whether it's the latest web-scale social network, an enterprise ERP application, analysis for an intelligence agency, or video software on your iPhone, apps must scale, period.
Except that most of the time applications don’t do that very well.
While we all know that needs to change, and there is much attention focused on this area, I think we need to articulate this effort and give it a name, so here goes:
Software should be able to reliably make use of all raw capacity provided by the infrastructure on which it executes.
There, that's it. Perhaps we can say that software that generally meets this criterion is Blaise Capable, and maybe even give it a nice little logo to acknowledge the capability.
A Few Questions Answered
Why Blaise? I’m proposing that we name this after Blaise Pascal, a brilliant 17th-century French mathematician and philosopher. I’m hoping that he might find this little idea interesting someday.
Why a premise? Because I think we should just be able to assume that software scales. While I know that our industry is a long way from that goal, and that the stuff that stands in the way of this is legion, it is most definitely a worthy, and increasingly achievable goal.
Why not a measure of capacity? Because software is abstract, malleable stuff. There are no fundamental “units of software capacity” that span the range of applications and remain constant over time, and I do not think such a thing is even rational to consider. Let the hardware guys provide the capacity, and leave it to the software industry to transform that raw capacity into useful capabilities. Blaise’s Premise is essentially a statement that this transformation should be efficient – 100% efficient, ideally.
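One hypothetical way to put a number on that efficiency – not part of the premise itself, just an illustration:

```python
def blaise_efficiency(speedup: float, nodes: int) -> float:
    """Fraction of the raw capacity the software actually turns into
    throughput; 1.0 is the ideal 100% of the premise."""
    return speedup / nodes

# Hypothetical measurement: 12x the throughput on 16 commodity boxes.
print(blaise_efficiency(12.0, 16))  # 0.75 -- three quarters of the capacity used
```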
This is the great struggle in which we find ourselves in this industry, for many different forms of software. Those who can do this well, and who make it very easy to do for a wide range of software and system architectures, will likely prosper. Those who cannot … well, there are always some nice vineyards in Napa to explore.
At any rate, I’m open to suggestions and comment. If you like this idea, please spread it around – I think it will help our industry focus on the need to create software that scales easily and reliably.
Perhaps then we will begin to fulfill the promise that computing has contained all along.
Just ran across a very good post by Robin Harris from the misty dawn of time (last summer) stemming from the Google Scalability Conference. Why should we care how Google scales? As Robin points out,
They roll out new applications for millions of users with surprising speed, especially compared to corporate IT. They build data centers with hundreds of thousands of servers – and millions of disk drives – and run it all on free software.
Costly corporate kit, like RAID arrays and 15k FC drives, aren’t used. Yet they do more work in an hour than most companies do in a year.
Google’s IT capabilities are a modern wonder of the world. Underneath the complexity though are just three simple rules. Rules that no enterprise data center (EDC) would ever think of following.
What are Google’s three rules?
- Cheap (use commodity everywhere)
- Embrace failure
- Architect for scale
It is very interesting to consider how these three principles interact. For example, admitting that stuff breaks and making sure that that isn't a problem takes care of a big concern about using commodity equipment. So does "architecting for scale", which addresses another commodity-gear worry – can I solve big enough problems?
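To make "embrace failure" concrete: replicate the data across cheap boxes and treat a dead one as routine. A minimal sketch, with hypothetical replica names and `fetch` standing in for whatever storage call you actually use:

```python
import random

REPLICAS = ["box-a", "box-b", "box-c"]  # hypothetical commodity boxes

def read_with_failover(key, fetch):
    """Try each replica in random order until one answers, so a dead
    commodity box is routine rather than an outage."""
    for replica in random.sample(REPLICAS, len(REPLICAS)):
        try:
            return fetch(replica, key)
        except ConnectionError:
            continue  # that box is dead -- expected; move on
    raise RuntimeError(f"all replicas down for {key!r}")
```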
In any case, what is the net effect for Google? Continuing with Robin’s post:
This is more than first-mover advantage. The faster they can grow, the greater their cost advantage over smaller, less nimble competitors. Their ROI brings them cheap capital, which increases their ability to invest in new businesses and more capacity. The higher their volumes, the cheaper growth becomes. A perfect storm.
All of this work is done by hordes of very smart folks at Google. Yet even with all of those advantages, there are many limitations. There are well-known reliability problems, as well as complicated operations. Harris also points out that
Google's purpose-built infrastructure is also relatively inflexible: they can't just paste on (ACID) transaction processing.
That’s where application fabrics come in. To this potent set of rules we add one or two of our own, absolutely necessary to make commoditization practical for the enterprise. What are the additional rules?
One of the often-repeated baseball truisms is that "you can never have too much pitching". Even if you don't know anything about baseball, you can tell that this is true just by searching on that phrase and seeing what comes up. Go ahead: I've made it easy!
(for the non-baseball folks out there, Bob Gibson is one of the absolute all-time greats, a pitcher's pitcher … every baseball team that ever was or ever will be would love to have Mr. Gibson on their team)
Simplicity Really Matters
In the world of scalable applications there is a rule above all rules – simplicity really matters. Or in tribute to the tattered, yet still great game of baseball, “you can never have too much simplicity”.
You can say this many different ways, but the reality is that in order to really build scalable systems we must strive for the simplest abstractions possible.
For a minute I thought I was reading one of our new marketing pieces (I wasn't) … Nikita Ivanov seems to be all over the "scalability simplified" theme. Of course I agree with his basic point, but there's more to the story.
Making It Real
Even Ivanov's jab at Nati Shalom illustrated an underlying reality, ignored all too often – enabling a simple world can be complicated. Of course, any such complexity needs to support an elegantly simple abstraction, like the one we present. The problems arise when that complexity is exposed, as it is in the vast majority of computing architectures.
In any case, just arguing for development simplicity (while commendable) isn’t enough. After all, somebody has to deploy and operate what you build.
The Whole Story
So yes we must deliver simplicity to the developer … that is a key for enabling scalable applications. But don’t forget the other two legs to this stool:
- Operational Simplicity. The biggest fabrics (or grids) absolutely must be at least as simple to operate as a single server … no matter how big they get.
- Reliability. A fabric must be able to simply ensure the reliability of each operation – this is crucial for being able to rely on commodity infrastructure.
Take these together (development simplicity, reliability, and operational simplicity) and you have an approach that's meaningful. That is exactly what people are discovering with application fabrics.
Go ‘git me some of that simplicity!
I wonder if there’s a “rolling brownout” in google applications today?
Earlier in the morning google reader (generally a really decent app to have around) was hanging, going into eternal “loading” screens (see below).
Since all of my blog / news feeds go through google reader (for now), I decided to switch gears and go research something. Except that google search was down as well.
Suspecting my own machine or net connection I tried gmail … and it was working fine. So were a number of non-google services. Hmmm.
A Rolling Brownout?
A couple of hours later the tables were turned. Same machine, same browser (really crazy, exotic stuff – macbook pro, firefox, etc.). Search was back up, the reader was fine, but gmail was down. I got this several times in a row.
and then this:
Gmail just started working again for me, half a day after the first outages. I have no idea how widespread this was, whether it's really solved, or even why it happened (other than that it is very likely to be in the google "cloud").
But that’s not really the point.
I have adapted my daily workflow to rely (in part) on some common SaaS offerings, and right now that’s not working out too well. Maybe that’s ok for an ad-supported offering in 2007 (especially if it’s eternally beta!), but how about the enterprise?
Would this level of (un)reliability be good enough for you?
Hope not … we can most definitely do better.
He asked something very simple: How come there isn’t a Moore’s Law for software? That felt good, just writing it. So I’ll repeat it. How come there isn’t a Moore’s Law for software? The way Alan asked it, there was an underlying innuendo. That we were wrong about many things we’ve done in the past thirty years, in terms of networks, operating systems, programming languages, hardware, applications, the lot. That the way we built them was wrong, and that we continue to compound the error.
Indeed, that’s a great question. Why isn’t there a Moore’s Law for software? What would it mean to have one?
The classic Moore’s Law is pretty simple in concept, and has proven amazingly durable … namely, that transistor count for a cheap device would double every 24 months. Variations of this for aggregate computing power, storage, and a few other related areas have proven out as well.
Why Doesn’t This Apply to Software?
The original Moore's Law (in all its forms) really applies to individual "computing elements". So as long as an application is willing to live within an individual system, the software will scale with that element.
But that almost never happens, as we all well know. Why not?
First of all, developers love to use increased power to add functionality. Well sometimes for functionality, but probably 90% of the time for thick layers of eye-candy. Why do that? Pretty easy, really – customers like it. When customers like the eye-candy, they buy more software. When customers buy more software, developers and their shareholders are happier. Makes sense!
This tendency in and of itself is pervasive enough to be a very large factor. Probably the simplest example is creating a document. In the late 70s you might create a plain text file on a green screen, and if you wanted to get fancy you'd throw some formatting commands into it and *bam* you were done. That same task today might take three or four orders of magnitude more computing power. Easier? Probably. Flashier and more functional? Undoubtedly. Heavier and far more demanding? Absolutely.
Even the actual, unadorned functionality additions blur the point. How much functionality has been added? When have you "doubled" functionality? What does that even mean?
Which brings us to the second point – "software capacity" is very difficult to define, almost impossible, really. Many people have tried, and many people will continue to try. All of that effort is fine, but mostly beside the point. Even if we can't analytically measure "software capacity", we know there's a problem.
The problem shows up in a million different ways – unhappy customers, inability to grow your business, customers being dropped from websites, transactions lost … never to be found again.
Back to JP Rangaswami's thoughts on Alan Kay's challenge:
I’m particularly taken with his challenge on scale, his accusation that we don’t design things that really scale.
With processing power, storage, and even many forms of communications bandwidth marching relentlessly into new frontiers of capacity, why can’t software scale?
Simple – scaling the old-fashioned way is just hard.
The Old-Fashioned Way
The old-fashioned way is really just brute force, at least from the app developer’s perspective. Take care of your own scale. Plan for and deal with failure everywhere. Steep yourself in messages, threads, state mechanisms, and complex distributed architectures everywhere.
Mull over the failure conditions. What haven’t you thought of? What could go wrong? Patch here, patch there. Worry about race conditions, deadlocks, and just weird behavior.
Then put it into production and carry a pager.
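A taste of that brute force, in miniature. Even a toy version (a hypothetical Python sketch, with a deliberately naive retry) carries the bookkeeping – queue, lock, retry, shutdown – where the race conditions and "just weird behavior" live:

```python
import threading, queue

# The brute-force recipe in miniature: your own queue, your own worker
# threads, your own retry logic -- and a lock you must never forget.
tasks = queue.Queue()
results, lock = [], threading.Lock()

def worker():
    while True:
        job = tasks.get()
        try:
            value = job()              # might fail mid-flight; that's on you
        except Exception:
            tasks.put(job)             # naive retry: hope it was transient
        else:
            with lock:                 # forget this and you get the
                results.append(value)  # "just weird behavior" above
        tasks.task_done()

for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

tasks.put(lambda: 2 + 2)
tasks.join()    # wait for the work to drain -- then carry the pager anyway
print(results)  # [4]
```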
There's a real irony, a classic conundrum here – the exact stuff you're going to do to enable scale, to make use of all of that Moore's Law hardware, is precisely where your problems are most likely to occur.
You simply know the raw capacity is there … so how do we use it?
Up next: a simple proposal.
Michael Krigsman has been humming the "simplicity is good / complexity is … not so good" tune lately – check out this and this – and it sounds pretty good to me. While his focus is primarily on self-induced organizational complexity, I think the same exact points apply to architectural complexity.
In fact, this is how I tend to describe the abstraction presented by an application fabric:
each service and application scales as needed, always works as expected, and manages itself.
Makes sense, doesn’t it?
In practice, the only people who actually like architectural complexity either
- don’t think they have any choice,
- don’t have to code for or operate the resulting apps,
- are so far down into the bowels of the existing, uber-complex architectures that they’ve forgotten that there is a world above-ground that is their natural home, or
- are just trying to show off.
While Todd Fast of Sun makes an interesting point against a sort of "false simplicity", I think that is really a different issue and a bit of a red herring (which I'll take up again later).
For my part, I’ll choose architectural simplicity each and every time!
Note: the image is from the cover of a great book, "The Evidential Power of Beauty" by Thomas Dubay, which explores the meaning of simplicity and beauty in the physical world.