Over the past couple of years we’ve come across application after application where the biggest block to being able to utilize a cloud (public or private) has been the relational database. This usually goes hand in hand with an inability to scale.
In a very (Monty) Python-esque manner, RDBMS proponents proclaim “I’m not dead yet”, and the thing is that they’re absolutely right … but they are beginning to ail a bit.
There are many reasons for this, but mostly the cracks in the RDBMS monolith have a very simple explanation – the relational database is only one storage abstraction – albeit a powerful and long-lived one – and is not the best choice for many common, everyday problems.
We Can Do Better
In other words, there are many common, everyday problems in which the data can be effectively stored, managed, and retrieved in other abstractions …
… perhaps abstractions that fit the problems themselves much better, and may well be far more “cloud friendly”.
If you’ll grant me that for now (and we will get back to this question in future posts), then the next question is inevitable:
for those applications where the RDBMS is not the best choice, what else should I consider?
A Great Question
As luck would have it, I think this is the real question for anyone thinking about cloud-friendly storage abstractions. Several friends (including @KentLangley and my colleague @msgroner) pointed out an excellent overview post by last.fm founder Richard Jones. From the intro:
Perhaps you’re considering using a dedicated key-value or document store instead of a traditional relational database. Reasons for this might include:
1. You’re suffering from Cloud-computing Mania.
2. You need an excuse to ‘get your Erlang on’
3. You heard CouchDB was cool.
4. You hate MySQL, and although PostgreSQL is much better, it still doesn’t have decent replication. There’s no chance you’re buying Oracle licenses.
5. Your data is stored and retrieved mainly by primary key, without complex joins.
6. You have a non-trivial amount of data, and the thought of managing lots of RDBMS shards and replication failure scenarios gives you the fear.
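To make item 5 in that list concrete, here is a minimal Python sketch (my own illustration, not from Jones’s post) of the key-value access pattern: every record lives under a primary key and comes back in a single lookup, with no joins anywhere in sight.

```python
# A minimal in-memory sketch of the key-value access pattern:
# everything is stored and retrieved by primary key, no joins.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
# The "user:42" key convention and the record shape are made up for illustration.
store.put("user:42", {"name": "Alice", "plays": ["radiohead", "lcd soundsystem"]})
record = store.get("user:42")   # the whole record comes back in one lookup
```

Real stores of this kind (Berkeley DB, Tokyo Cabinet, CouchDB, and friends) layer persistence, replication, and distribution on top, but the programming model really is about this simple.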
At any rate, the post by Jones is a good place to get started on basic non-relational name-value stores.
This whole business of storing data in cloud-scale apps has become one of my favorite areas for discussion – I can promise you more posts to come!
For the past year+ there have been many indicators that fundamental changes in the enterprise software development market are well underway. In particular, it sure seemed like the monolithic predominance of traditional JEE app servers was starting to break up.
A few weeks ago I posted about the rise of Tomcat, and talked about why it is now the leader for deployment of Spring apps. (The why is easy – it’s simple, cheap, easy to use, and works well.)
Now Rod Johnson (SpringSource) has another interesting observation – job postings requiring Spring skills have surpassed those requiring EJB on at least one site.
Indeed.com shows that in November, 2007, Spring overtook EJB as a skills requirement for Java job listings. As of yesterday, the respective job numbers were 5710 for Spring against 5030 for EJB.
… While it’s not an apples-to-apples comparison, it is reasonable to consider Spring and EJB as alternatives for the core component model in enterprise Java applications. And it’s clear which is now in the ascendancy.
… Frankly, the EJB era was an aberration.
What This Means
EJB is often inextricably intertwined with the decision to use monolithic, heavy, traditional app servers on heavy, costly infrastructure.
The trend towards breaking this monolith up does start with the core object model, and Spring is proving fairly prominent in this role. Once this decision is made, then separate decisions can now be made about how to achieve scale, reliability, and operational integrity.
If the applications are not so demanding, then not much more than Tomcat and a few operational tools are required. Many early cloud deployments probably fit this category.
Need for Scale and More
Once the app has more demanding scale, reliability, performance, and other needs, the developer is faced with a few choices. These include:
- Deploy the new lightweight app on a traditional app server and infrastructure.
- Write their own state management, coordination, and operational tools to deploy on lighter infrastructure.
- Pick a new approach to development and deployment facilities.
It is in this third option that most of the interesting innovation is occurring. That is where our excellent execution model (simple abstraction, multi language, multi OS, highly scalable, reliable and fast), elegant state facilities (lightweight, reliable process flows & spaces), and very simple operational model (the biggest fabric is the same thing to operate as a single server) make a lot of sense.
That all of this can occur on a truly commodity infrastructure (from Tomcat to Linux to grids of uber-cheap commodity processors) is a real bonus.
Best part? You can bring your Spring app over as-is, and gain much of the Appistry goodness from day one.
Last week at the HPC on Wall Street conference (it’s a really nice one-day format … hope this is a growing trend!) I helped with the keynote panel entitled What’s Hot in HPC (audio here). Along the way I mentioned in passing that MPI is dead. Dead as in assuming room temperature, dead as in let’s write an epitaph, dead as in maybe it’d be polite to start talking about something much more useful.
But wait a minute, you may ask … what about all of those apps built with MPI, what about OpenMPI, what about all of the true-blue supercomputing elite (taking their cue from Monty Python and the Holy Grail) who say that not only is it not dead, it’s not even sick yet?
The funny thing is that a panel or two later a well-meaning panelist from Intel Research (but clearly not from Intel Application Development!) said just that … that I had clearly not intended to say that MPI was dead, that surely I couldn’t possibly have meant to imply that at all. In fact, it’s not dead yet … it’s getting better.
Actually, that’s exactly what I meant to say. In case there’s any doubt, MPI IS DEAD!
How Can I Say This?
Actually, this is relatively straightforward to see. In the bad old days all true High Performance Computing was done with some combination of shared memory, pipelined processors, and lots of threads. (In the really bad old days all of this was done with hand-crafted machine code, but let’s keep this blog suitable for reading with your family!)
At some point various forms of HPC clusters and grids arose, and apps were written using messaging primitives. Since that was a little hard to do from some of the languages common in the HPC world, folks developed MPI as a friendly way to use messaging primitives.
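To give a flavor of what those primitives look like, here is a rough stand-in sketch in Python – plain threads and queues playing the role of MPI ranks and their blocking send/recv calls, since a real MPI program needs its own runtime. The shape of the code, not the libraries, is the point.

```python
# A stand-in for the send/recv messaging primitives that MPI wraps.
# Each "rank" gets an inbox queue: send = put on the peer's queue,
# receive = a blocking get from your own.
from threading import Thread
from queue import Queue

def worker(inbox, outbox):
    task = inbox.get()        # blocking receive, like MPI_Recv
    outbox.put(task * task)   # send the result back, like MPI_Send

to_worker, from_worker = Queue(), Queue()
t = Thread(target=worker, args=(to_worker, from_worker))
t.start()
to_worker.put(7)              # hand the worker its input
result = from_worker.get()    # block until the reply arrives
t.join()
```

Even in this toy form you can see the trouble brewing: every send must be matched by exactly the right receive, in the right order, on the right rank – and that coordination is the programmer’s job.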
Unfortunately, no amount of api-magic could change the simple fact that writing code to execute coherently across an arbitrary number of independent processors (with or without shared memory) is enormously difficult.
Done well, the resulting applications can run quickly and achieve results that could never be achieved with a small number of threads on a small number of processors … yet it’s so very hard to “do these well”. Hard as in takes a long time, hard as in easy to make mistakes. In fact, the skills needed to write high-quality messaging-based applications are on par with operating system development or writing your own processor microcode. Possible? Sure. Satisfying when you finally get it to work? Uh-huh. Desirable? Only if there’s no other choice.
The most irritating thing is that this bog of forgotten application architectures is precisely where HPC application development has been stuck for the past 20+ years. The second most irritating thing is that, for the vast majority of applications, it’s simply unnecessary.
Totally Unnecessary? Are You Kidding Me?
Nope, that is the simple truth.
True, there are some algorithms and applications for which developing shared, coherent state and operating on that state is best done with fine-grained messaging, perhaps augmented with some state mechanism. But most of what has been implemented in MPI has been built that way because, well, that’s the way it has always been built.
Many of the applications that I see are much easier to do by first looking for the data fissures … that is, looking for places in which the data has inherent independence. Once those fissures have been identified, just bundle the data into multiple, individually simple service requests, collect all the results, and you’re done.
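As a sketch of that approach – the work function and chunk size here are made up purely for illustration – the whole pattern fits in a few lines of Python:

```python
# The "data fissure" pattern: find independent slices of the data,
# fan each slice out as a simple, independent service request,
# then collect the results. No message matching, no shared state.
from concurrent.futures import ThreadPoolExecutor

def handle_request(chunk):
    # Stand-in for one simple, independent service request.
    return sum(x * x for x in chunk)

data = list(range(100))
# The fissures: each 25-element slice can be processed with no
# knowledge of any other slice.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(handle_request, chunks))

total = sum(partials)   # collect all the results and you're done
```

Because each request is self-contained, scaling out is just a matter of running more of them at once – which is exactly the kind of thing a fabric can do for you.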
Of course you want to put these applications on our application fabric, so that everything is scalable, reliable, easy to use and operate. You can even make use of data-grid facilities (like our process-flow and FAM mechanisms) to keep any state needed (also with full fabric attributes – scalable, reliable, easy to use and operate).
Sure, if you have one of those problems that just can’t be done any other way, then go ahead and use MPI (or its more “modern” derivative, OpenMPI), or if you have an existing MPI application, at least move it to a fabric to help with operations. But unless your goal is to show just how good your architects and developers are, for most problems you’d be far better off leaving MPI behind and moving fully into our world.
One More Thing
Application patterns like map-reduce and scatter-gather take this simplification a step further. In the future I’ll post on how our fabric is an ideal substrate for map-reduce and scatter-gather applications, but in the meantime whenever somebody asks you to use MPI … Just Say No!