This is Part 1 of a planned three-part series that traces the evolution of the CloudIQ Platform from first idea to what it is today, then considers what it is likely to become.
Right after graduate school we spent a few years living in the foothills of a desert mountain range. I remember the first time that I hiked a narrow trail to the top of the nearest peak – standing at the bottom was rather intimidating; the circuitous ascent itself was such a tangled mixture of switchbacks, short ascents and descents, under cover and open trail that it could only be described as non-obvious (at best). However, it was only upon achieving the original goal that we gained any perspective on how, in fact, we had gotten there.
In this post I will reflect on the evolution of CloudIQ – the truly exciting (if I must say so myself!) cloud application platform that we announced a couple of weeks ago.
As some pondered the impending Y2K “crisis” and others looked for the best millennium parties, most of our founding team was deeply enmeshed in building, selling, and supporting an enterprise-grade (scalable, reliable, etc.) payment server.
Upon leaving that company I had time to reflect on my enduring frustrations -
why was it so hard to build software that we (and our customers) could rely upon?
It seemed that we were spending 60 to 70% of our engineering efforts not on core functionality, but in our best attempt to ensure that the resulting application could be relied upon.
Later on an early supporter coined the term “reliability tax” to refer to this overhead.
As I asked friends at other companies and enterprise shops, most recognized the same problem – a few argued the overhead was actually higher, most thought that the cruel irony was that it was very, very difficult to ensure true reliability for enterprise apps – but all agreed that this just didn’t seem right, not nearly 50 years after Grace Hopper did her most famous work.
With this question really bugging me, I had an opportunity to build the beginning of a digital recording studio. Using completely commodity gear – no-name, cheap – I was genuinely shocked at the results. Serious performance, cheap.
So that led to the second question -
why weren’t we using commodity stuff like this for problems that we really cared about?
The answer to this seemed easy enough – who could trust this cheap stuff? What if it broke? (and it would break).
In pondering the first question it seemed to us that the core problem for software development was one of complexity – mainstream application architectures were simply too complex, and becoming inexorably more so.
The First Idea
Then it became fairly clear – we could solve both of these problems at the same time by enabling groups of commodity boxes to work together to ensure a stable platform for applications.
But what exactly did that mean? Or more to the point, could we build it? Ever MORE to the point, once we built it, how could people use it, and for what applications would this new thing be useful?
Over the course of a few months the founding team hammered out the first answers to those questions. Throughout this process we were driven by use cases – we wanted the resulting platform to be equally adept at running anything from fine grained, transactional applications to more computationally-intense enterprise applications.
This led to much refinement of the basic idea, which evolved into a self-organizing group of commodity machines that could act like one thing, reliably execute all sorts of applications, grow (and shrink) as needed without affecting the execution of any running applications, and be very simple both to write applications for and to operate.
The Hive is Alive
We decided to call this hive computing, and on June 12, 2002 we had the first successful demonstration of a running hive. We assembled a few commodity boxes on a re-purposed kitchen rack, loaded the prototype hive software, and … it worked!
We were able to (carefully) pull a few plugs and the application kept running without missing a beat – in fact, without even losing a bit of data.
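For readers who prefer code, here is a toy sketch of that plug-pulling behavior. It is purely illustrative and is not the hive software itself: the `Worker` and `ToyHive` names, and the simple "retry on the next live machine" policy, are assumptions made up for this example.

```python
class WorkerDown(Exception):
    """Raised when a task is sent to a machine that has been unplugged."""


class Worker:
    """Hypothetical stand-in for one commodity box."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def run(self, task, arg):
        if not self.alive:
            raise WorkerDown(self.name)
        return task(arg)


class ToyHive:
    """Hypothetical group of workers that behaves like one machine."""

    def __init__(self, workers):
        self.workers = list(workers)

    def add(self, worker):
        # Growing the hive does not disturb work already in flight.
        self.workers.append(worker)

    def execute(self, task, arg):
        # Try each known worker in turn; forget any that turn out to be dead.
        for worker in list(self.workers):
            try:
                return worker.run(task, arg)
            except WorkerDown:
                self.workers.remove(worker)
        raise RuntimeError("no live workers left")


hive = ToyHive([Worker("box-1"), Worker("box-2"), Worker("box-3")])
hive.workers[0].alive = False            # pull a plug
result = hive.execute(lambda x: x * 2, 21)
```

The caller never learns that `box-1` was down; it just gets its answer from the next machine. A real system also has to protect in-flight state and data, which this sketch ignores entirely.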
Within two years we had our first paying customer (Sprint), a couple of patents filed, and a demonstration system on which we ran an eye-opening benchmark – a wall of 100 commodity computers that could legitimately double Visa’s then-current peak transaction load, for a total bill that was well under 10% of the conventional alternative.
The best part? It was arguably far more reliable as well. We were constantly amazed by the resilience and ease of use of this new type of application platform … though truth be told, we were not yet ready to use the “P” word.
In our pursuit of the possible we (the founding team) sometimes thought that economics would do all the persuading for us. Well that turned out to be sometimes yes, but mostly no.
In fact, sometimes economics actually worked against us – the combination of 90% lower costs, simpler development, easier operations, and simultaneously increased reliability and scalability simply seemed too good to be true for many people.
The fact that we required some modification of the application also made adoption more complicated. While we supported several languages and multiple operating systems (and could easily support more of each), the plain, simple truth was that you did need to modify – albeit lightly – many components in each and every application.
This raised the adoption barrier a bit higher.
Then there was a little matter of language. Without a native category to call our own, sometimes we were put into all sorts of categories – everything from grid computing to autonomic computing, with several others between.
Early on I even told some folks that we were basically building the “Borg for applications”. While hard-core geeks loved that (and usually laughed), it didn’t exactly help us build trust with the typically non-technical executives responsible for making the final purchase decisions.
Yet, It Worked … Well
Despite these go-to-market difficulties the product itself worked well – really, really well. In fact, by mid-2002 several of us became firmly convinced that beyond a shadow of a doubt there would be a time in the future – say 10 or 15 years – when most mainstream computing would be done this way.
The economics and functional advantages were simply too compelling for any other outcome.
The only real question in our minds was when and who – when this transition would begin to occur and who would help make that transition happen.
So as 2004 came to a close we pondered solutions to these issues and continued to press rapidly forward.
In Part 2 we will talk about why this worked so well, and the transition to the application fabric.
(with apologies to the good Dr. Seuss on the title – sorry, I just couldn’t help it)
The participants included
- Robert X. Cringely, computer guy & moderator
- Anwar Ghuloum, Intel
- Charles E. Leiserson, Cilk Arts & MIT
- Dan Reed, Microsoft
- Marc Snir, University of Illinois at Urbana-Champaign
and me (go ahead and give me a hard time, I can take it).
The range of discussion was interesting, since the panel included perspectives more rooted in multi-core (Ghuloum and Leiserson), mass o’ machines (Reed, Snir), and more of a uniform view of both broad classes (me, though I think that may be shared by some of the other folks as well, at least a bit).
In addition, the panel was a mix of research and practical applications, which probably tended to color much of the discussion. All in all it made for cool (and hopefully not too boring for the audience!) conversation.
This is one panel that I probably would have much rather had in private, definitely accompanied by some really good adult beverages, but unfortunately we were constrained to an hour on a stage … and (at least for me) that hour passed by pretty fast.
An hour was probably enough time to begin to name a couple of the larger issues, but definitely not time enough for too much more.
Still, it did get me to thinking a bit …
A Few of the Issues
There were (and are) many more of course, but here are a few of the more dominant themes and issues that we discussed …
Market Pull. Whether it’s the inability of the processor manufacturers to build individually faster cores at any price we could stomach (hence the advent of multi-core), or the advent of practical clouds (both public and private) opening up the prospects of deploying REALLY BIG apps on LOTS of VMs, the market is clearly demanding new solutions to creating parallelized apps. No question about it.
Complexity is Bad. There was a general agreement that complexity is, well … complex and generally toxic to effective development of parallel apps. Some folks had more of a stomach for complexity than others, but all in all many of the efforts are trying to fundamentally simplify the developer’s task.
Need for New Abstractions. The Complexity Problem is not going to be solved by wishful thinking alone, no matter what Oprah says (sound bite alert). Hence everything from new functional languages like F#, Erlang, Scala, to frameworks like map-reduce, to data-driven reliable service abstractions like our own application fabric are in play as ways to simplify.
Uncovering Inherent Data Orthogonality. I’ve gradually come to the opinion that some very high percentage of the apparent data dependencies that are anathema to effective parallel processing are not truly in the original problem. Rather, they are false dependencies, ones that we have inflicted on ourselves for no particularly good reason other than the tools, methodologies, or just bad habits that we bring to bear on our work.
(btw, don’t press me on a precise percentage or I’ll be forced to make something up here)
We’ve seen this with customers, and the more I look at new problems and how they are solved in most enterprises today, the more I see a big, massive goo of false dependencies.
Fix those, and we have a crack at effective parallelization in many cases.
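To make "false dependency" a little more concrete, here is a small sketch. Everything in it is an assumption for illustration: the transaction data, the flat 2% `fee` rule, and the function names are all hypothetical. The first version threads a running total through the loop, which makes every iteration look like it depends on the one before. The second version shows that no fee ever reads that total, so the per-transaction work can be handed out independently and the total computed at the end.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: (account, amount in cents) pairs.
TRANSACTIONS = [("a", 1000), ("b", 2500), ("a", 400), ("c", 9900)]


def fee(amount_cents):
    """Flat 2% fee, rounded down. Depends only on its own input."""
    return amount_cents * 2 // 100


def fees_sequential(transactions):
    """Serial version with a running total carried through the loop.

    The shared total makes the loop look order-dependent, but no fee
    calculation ever reads it. That is the false dependency.
    """
    running_total = 0
    results = []
    for account, amount in transactions:
        f = fee(amount)
        running_total += f
        results.append((account, f))
    return results, running_total


def fees_parallel(transactions, workers=4):
    """Same answer, with each fee computed independently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order.
        fees = list(pool.map(fee, (amount for _, amount in transactions)))
    results = [(account, f) for (account, _), f in zip(transactions, fees)]
    # The only genuinely shared step is this final reduction.
    return results, sum(fees)
```

Both functions return the same results and the same total. Threads are used here only to show the shape of the restructured code; in CPython they will not speed up CPU-bound work like this, and the same structure would map onto processes or separate machines.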
Where This is Going
I am very optimistic about progress in helping developers create actual parallel applications that can be used in the enterprise, in production, solving problems about which people actually care.
The population of these well-done apps is going to be increasing dramatically in the months and years to come, which is a good thing … a very good thing.
The timing couldn’t be better … in truth, I don’t think we really have much of a choice.
We are very active in this space, and I have a particular interest in the “false dependency” problem. I’m sure I’ll be posting more on this in the future.
Almost lost in all of the twittering about Twitter’s twubbles last week (sorry CrazyBob, couldn’t resist that one!), was the unintentionally quiet announcement of the “Spring Application Platform” (more official posts here and here from springsource).
An interesting post from Per Olesen (j2ee developer) highlighted these positives:
At a first glance, it looked to me as a lot like a server like the JBoss micro kernel architecture, which could (can) be tailored to only run the exact parts of JEE, that your application needs .. At a second glance, this is actually just a minor part … they are also using it (OSGi) as the technology for deployment units for the applications running on it … that is where I see some benefits
It looks like what others and I have been planning/hoping to do over the next few years. Most of us are looking for an OGSi based distributed platform with a commercial friendly license (EPL, BSD or Apache)
I see this as the new JVM, a module or bundle oriented runtime that’s also distributed.
I think that probably represents the POV of many java vendors.
On the other hand, Billy wonders how long the additional work can be licensed (as opposed to being treated as OSS commodity):
Spring DM is Apache licensed, I can see the extra work in the Spring server being clean roomed and made available with EPL or Apache pretty soon and this will remove value from selling the SpringSource server
Phil Zoio is concerned that this hurts the Spring framework itself:
The Spring Application Platform is the biggest announcement to come out of the Spring team for some time. It also looks like it could be a big mistake. Spring became popular in the first place as a practical, community driven solution to the real problems with Java enterprise applications, with a focus on simplicity. The latest offering seems to be moving in a rather different direction.
Truth is, I care a little bit but not a lot. To me this is a VC driven move … It is the same thing you had yesterday for free, except it is now under the GPL and a proprietary subscription license. I laugh.
Finally I am fuzzy on how this impacts their relationship with other app-servers. They are not neutral anymore.
Rod is wrong on a couple of things: I DO understand the technology enough to call it out for what it is “an emperor has no clothes” attempt to monetize his ISV base … this is almost 10 years old. What is new is the licensing gimmick … Your users are not dumb, they see right through this flat footed license change, don’t get mad and patronize them when they call you out.
Leaving the personal animosity aside that seems to mix into springsource / jboss conversations, a few points are flowing here and elsewhere:
- There’s value here, but it’ll eventually be done in a clean commodity version (for all, including ISVs).
- Legitimate concern that the Spring framework itself will get stale
- This still doesn’t help with operations, nor does it do much for reliability (a post for another day)
It’ll be interesting to see more community reaction at javaone.
Yesterday we talked about whether Twitter really ever needs to be reliable or not … some said yes, others contend that it’s not necessary.
It’s been bugging me for a while that something this popular … and Twitter is so … just keels over as often as it does.
Anyhow, the whole argument turned into a bona-fide debacle this morning when GroupTweet (a relatively new feature that seems to have been confusing) was at the heart of disclosing private messages (DMs in tweet-speak) to tons of folks.
So now it looks like Blaine Cook is out as chief architect, and Michael Arrington is calling it the end of amateur hour. That’s probably a bit harsh, because my (limited) interactions with Cook have been pretty decent.
Btw the comment thread on that last post is going crazy. My favorite so far is a short video comment from a Loren Feldman (warning … his language is a bit over the top, but you do know where he stands!) Btw, check here if the first link to the video doesn’t work.
Having said that, we just have to build apps that act like real, grown up (and you can call that boring if you want) apps … taking care of the data entrusted to them, working as expected, and working when we need them to work.
So I’m thinking that the answer to yesterday’s question is … YES. Twitter does need to figure out how to be reliable … and secure, scalable, and all the rest.
This is exactly the point that I’ve been making for a while … why build to POC quality when it’s now possible to ensure reliability, scalability, and so forth from the beginning?
People are Still People
I don’t care if this is Web 2.0, Enterprise 2.0, or Web 10,000,000,000.0 … consumer or enterprise … people are still people. They still care about their privacy, the reliability of stuff that they come to rely on, basic stuff like that. No free pass.
Even consumer oriented web 2.0 apps need to ensure this, from the beginning.
Pretending that innovation in communication, biz, or technology somehow exempts us from the basics of social interaction is just … well, it’s just wrong.
This lesson is for our whole industry. Those who learn it will prosper, those who don’t …
A few outages ago I wondered aloud whether Twitter was taking the whole business of failure somewhat casually (triggered by some comments Blaine Cook made at SXSW).
Blaine replied with some great points, including
For the record, saying that the press surrounding the downtimes was a plus was a joke. Downtime is never good, and you should do everything you can to avoid it. However, it’s a misrepresentation to say that you can build something successful without any downtime.
Our foremost concern has been and will continue to be ensuring a stable platform; we’ve been working hard on numerous fronts, and that work is paying off. Bad press is horrible, and I’ll be the first to take pleasure in never again seeing a “can Twitter scale?” story.
I believe him, and have quite a bit of empathy for the position he’s in. (I have a friend who always used to talk about “high class problems” and “low class problems” … Blaine and the other Tweetsters have a high class problem, but that’s a post for another day)
There was one point that he made that I fundamentally disagree with, however:
Scaling is a commitment, and one you should only make once you’re sure about an idea.
Yes, it is most definitely a commitment, but it is my contention that we’re fast entering the time when it can be built in from the beginning with little to no additional effort.
This past weekend there was a bunch more instability. Looks like they were putting in some more caching to take the heat off of the data tier, and things went wacky.
Now Robert Scoble is making the case that Twitter is leaving the door wide-open for Friendfeed … (on Twitter of course, though my friend read it on Friendfeed!). Check out this short burst:
Michael Arrington thinks that the mass of the Twitter community makes this concern moot … basically, he contends that Twitter no longer needs to be reliable.
The Days of the Free Pass on Reliability are Over
Basically, I think that 1) the days of web2 services getting a free pass on reliability are rapidly passing, and are probably already over, and 2) it’s a shame to see stuff go whump when it’s sooo unnecessary.
As for the days of the free pass being over, check out Dennis Howlett’s (zdnet) comments on the most recent outage … he’s generally making the case that Twitter itself is really a POC for some better service yet to come, something more suitable to much larger markets.
Could that V1 service be Friendfeed? Maybe. Of course, it’s too early to write off Twitter entirely … they’ve also hired their own scaling cavalry (including the ever-helpful Google expat!), so maybe they’ll catch a second wind before the whole sector passes them by.
Build to Scale … From the Beginning
Back to Blaine’s comments. I can completely understand the notion of building a proof of concept … besides, in the web 2.0 world it’s long been accepted practice to throw something out there, and only build to scale when you figure out whether anyone cares.
That makes a lot of sense when building apps to scale is so freakin’ hard. BUT … easing that pain is precisely the point of stuff like our app fabric.
That is why it is my core contention that the ability to scale and be reliable, even for the most trivial services, is going to become the price of entry very soon (if it has not already become so).
For the past year+ there have been many indicators that fundamental changes in the enterprise software development market are well underway. In particular, it sure seemed like the monolithic predominance of traditional JEE app servers was starting to break up.
A few weeks ago I posted about the rise of Tomcat, and talked about why it is now the leader for deployment of Spring apps. (the why is easy – it’s simple, cheap, easy to use, and works well).
Now Rod Johnson (springsource) has another interesting observation – job postings requiring Spring skills have surpassed those requiring EJB on at least one site.
Indeed.com shows that in November, 2007, Spring overtook EJB as a skills requirement for Java job listings. As of yesterday, the respective job numbers were 5710 for Spring against 5030 for EJB.
… While it’s not an apples-to-apples comparison, it is reasonable to consider Spring and EJB as alternatives for the core component model in enterprise Java applications. And it’s clear which is now in the ascendancy.
… Frankly, the EJB era was an aberration.
What This Means
EJB is often inexorably intertwined with the decision to use monolithic, heavy, traditional app servers on heavy, costly infrastructure.
The trend towards breaking this monolith up does start with the core object model, and Spring is proving fairly prominent in this role. Once this decision is made, then separate decisions can now be made about how to achieve scale, reliability, and operational integrity.
If the applications are not so demanding, then not much more than Tomcat and a few operational tools are required. Many early cloud deployments probably fit this category.
Need for Scale and More
Once the app has more demanding scale, reliability, performance, and other needs then the developer has been faced with a couple of choices. These include
- Deploy the new lightweight app on a traditional app server & infrastructure.
- Write their own state management, coordination, and operational tools to deploy on lighter infrastructure
- Pick a new approach for development and deployment facilities.
It is in this third option that most of the interesting innovation is occurring. That is where our excellent execution model (simple abstraction, multi language, multi OS, highly scalable, reliable and fast), elegant state facilities (lightweight, reliable process flows & spaces), and very simple operational model (the biggest fabric is the same thing to operate as a single server) make a lot of sense.
That all of this can occur on a truly commodity infrastructure (from Tomcat to Linux to grids of uber-cheap commodity processors) is a real bonus.
Best part? You can bring your Spring app over as-is, and gain much of the Appistry goodness from day one.
I’d like to propose a simple thought experiment. Consider this question:
What if computing is free?
While we’re at it, assume that scale is always sufficient for the problem at hand, latency is acceptable, your applications always work, and that operations are cheap enough to be in the noise.
What’s the Point?
The point of this is simple enough. One answer to this thought experiment was Google … and that worked out pretty well.
Google would not be possible without commodity infrastructure, and apps that assume that they have (more or less) free, unlimited, access to that infrastructure.
Same for most of web 2.0 – after all, most bigger sites are (very loosely) built around some of the same principles. While there are some notable exceptions (eBay) and many fundamental differences exist, the common meta-trend is that commodity is the right choice for the biggest, gnarliest, most demanding applications.
Now for the Enterprise
Yet that thought has not really begun to penetrate most enterprises. Kind-of commodity may be OK in a fairly stateless web tier, and perhaps for some occasional modeling or research apps, but elsewhere the closest are racks of expensive, heavily-managed blade farms.
Those blade farms may help with operations, but since those farms are normally driven from the operations side of the enterprise, they don’t mean much to the apps. Consequently, these farms haven’t done much for scale for most apps.
Plus they’re still expensive.
Of course, they ARE most definitely commodity when compared with the Z-class mainframes that still dominate the batch settlement / customer service operations that are so prevalent in enterprises the world over.
A Financial Services Example
We have a financial services customer who decided to instantiate this thought experiment – they’ve implemented their settlement infrastructure on commodity. Commodity organized by an application fabric (ours!), so that it is reliable, arbitrarily scalable, and very cheap to operate.
The results? They’re matching industry norms for settlement performance on Z-class mainframes with a handful of commodity boxes … and they can keep scaling for a few hundred bucks at a time. Plus it’s reliable, and never gets more expensive to operate.
That will change their industry.
Back to the Thought Experiment
Over the past couple of years I keep running into organization after organization that has existing operations built on the constraints of expensive, heavy, traditional computing. Constrained by state, constrained by the data tier, constrained by I/O, constrained by budgets … but mostly constrained by human nature, by organizational inertia, by just thinking about the problem the way it’s always been thought about.
Whole industries, for that matter.
Time to change that – ask yourself, what if computing is free?