MongoDB booster would prefer Cassandra, if only he could store JSON in it. Have I got a data store for you!

Interesting article at GigaOM interviewing MongoLab Founder and CEO Will Shulman. GigaOM reports:

MongoLab operates under a thesis that MongoDB is pulling away as the world’s most-popular NoSQL database not because it scales the best — it does scale, Shulman said, but he’d actually choose Cassandra if he just needed a multi-petabyte data store without much concern over queries or data structure — but because web developers are moving away from the relational format to an object-oriented format.

Interesting comment. My spin alarm went off at the fuzz-heavy phrasing “…operates under a thesis…”. I’ll buy that developers are moving away from relational and towards simpler data storage formats that are easier to use from dynamic scripting languages. But there is no evidence presented in support of the conclusion that “MongoDB is pulling away.” GigaOM just says that this is MongoLab’s “thesis”.

In any case, the opinion of Shulman that Cassandra scales much better than MongoDB leads to this question: If the key to developer adoption is providing the right data structures, then why not just build the easy-to-adopt object store on the existing proven-to-scale backend? Why build another backend if that problem has been solved by Cassandra?

Choosing to avoid this question, the creators of MongoDB have only caused people to ask it more insistently.

The combination of developer-friendly data structure and highly-scalable backend store has been done. You can get the scale of Cassandra and the ease of use of a JSON-native object store. The technology is called App Services, and it’s available from Apigee.

In fact, App Services even offers a network interface that is wire-compatible with existing MongoDB clients (somebody tell Shulman); you can keep your existing client code and just point it to App Services.

With that you can get the nice data structure and the vast scalability.

Thank you, Ed Anuff.

NoSQL is apparently NOT going to deliver World Peace

Peter Wayner at InfoWorld has articulated “Seven Hard Truths” about NoSQL technologies. Among them:

  • It’s nice to have JOINs; NoSQL has none.
  • It’s nice to have transactions
  • After 30 years of development, it seems that SQL Databases have some solid features, like the query analyzer.
  • NoSQL is like the Wild West; SQL is civilization
  • Gee, there sure are a lot of tools oriented toward SQL Databases.
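To make the first point concrete, here is a minimal sketch of what "no JOINs" tends to mean in practice: the application fetches two collections and stitches them together itself. The records and the function name join_orders_to_customers are hypothetical, made up for illustration; they do not come from Wayner's article or from any particular NoSQL product.

```python
# Hypothetical documents, as an app might fetch them from two collections.
customers = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 100, "customer_id": 1, "total": 25.0},
    {"order_id": 101, "customer_id": 2, "total": 40.0},
    {"order_id": 102, "customer_id": 1, "total": 10.0},
]


def join_orders_to_customers(orders, customers):
    """Application-side inner join of orders to customers on customer id."""
    # Index customers by id so each lookup is constant time.
    by_id = {c["id"]: c for c in customers}
    joined = []
    for order in orders:
        customer = by_id.get(order["customer_id"])
        if customer is not None:  # skip orders with no matching customer
            joined.append({
                "order_id": order["order_id"],
                "name": customer["name"],
                "total": order["total"],
            })
    return joined


rows = join_orders_to_customers(orders, customers)
```

In a SQL database the same result is one SELECT with a JOIN clause, and the query planner decides how to execute it. Here the developer owns that logic.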

Interesting synthesis, but nothing really novel here. Michael Stonebraker articulated these things in 2009, and lots of people who’ve built information-driven companies in the past 6 or 7 years on traditional SQL datastores had the same insight, even if they didn’t bother to articulate it this way.

SQL Databases work. There are lots of tools, and people know how to use and optimize them. The metaphor is well understood, the products are mature, the best practices are widely disseminated throughout the industry. None of this is true of the big NoSQL technologies.

There is value in NoSQL. Some very successful companies have considered SQL stores and dismissed them as inappropriate or inadequate to their tasks. Facebook, Google and Twitter have famously used NoSQL to accomplish things that would not be possible with technology that has evolved to serve the needs of traditional transaction-oriented enterprises.

Ironically, the shiny object that is NoSQL has now captured the attention of IT people in traditional enterprises, the very audience that the designers of NoSQL technologies ignored when producing their solutions. Does this make sense?

Yes, there’s a place for NoSQL. No, it will not take over the world, or replace the majority of enterprise data storage needs, anytime soon.  There are opportunities to take advantage of new technologies, but unless you are the next Twitter (and let’s face it, you’re not…) you probably do not need to emulate the Twitter data architecture. What you should do is combine your existing SQL data strategy with small doses of NoSQL, deployed tactically where it makes sense.


Sauce Labs explains its move from NoSQL CouchDB to old-skool MySQL

Sauce Labs has rendered a valuable service to the community by documenting the factors that went into a decision to change infrastructure – to replace CouchDB with MySQL.

Originally the company had committed to CouchDB, which is a novel NoSQL store created by Damien Katz, a former Lotus Notes developer. CouchDB is termed a “document store” and if you are a fan of REST and JSON, this is the NoSQL store for you.

Every item in CouchDB is a map of key-value pairs of arbitrarily deep nesting. Apps retrieve objects via a clean REST API, and the data is JSON. Very nice, easy to adopt, and, with the ubiquity of JSON parsers, CouchDB is easy to access from any language or platform environment. Speaking of old-school, I built a demo connecting from Classic ASP / JavaScript to CouchDB – it was very easy to produce. I also wrote small clients in PHP, C#, and Python – all of them are first-class clients in the world of CouchDB.

It really is a very enjoyable model for a developer or systems architect.
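To give a flavor of that model, here is a small Python sketch that builds the HTTP requests for storing and fetching one nested JSON document. The PUT and GET on /{db}/{docid} follow CouchDB's documented HTTP API, and 5984 is its default port. The server address, the "jobs" database, the document id and the document contents are all assumptions made up for illustration. The sketch only constructs the requests; it does not send them, so it runs without a server.

```python
import json
import urllib.request

# Assumption: a CouchDB server on localhost; 5984 is CouchDB's default port.
BASE_URL = "http://localhost:5984"


def build_put_request(db, doc_id, doc):
    """Build (but do not send) the request that stores a document."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{db}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_get_request(db, doc_id):
    """Build (but do not send) the request that fetches a document."""
    return urllib.request.Request(url=f"{BASE_URL}/{db}/{doc_id}", method="GET")


# Hypothetical document: key-value pairs, nested as deeply as you like.
job = {
    "browser": {"name": "firefox", "version": "10"},
    "tags": ["smoke", "nightly"],
    "passed": True,
}

put_req = build_put_request("jobs", "job-001", job)
get_req = build_get_request("jobs", "job-001")

# Against a live server you would send these with:
#   urllib.request.urlopen(put_req)
#   json.load(urllib.request.urlopen(get_req))
```

Nothing here needs a driver or a client library, which is the point: any platform with an HTTP client and a JSON parser can talk to the store.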

For Sauce Labs, though, the bottom line was – drumroll, please – that CouchDB was immature.  The performance was not good. Life with incremental indexes was … full of surprises.  The reliability of the underlying data manager was substandard.

Is any of this surprising?

And, MySQL is not the recognized leader in database reliability and enterprise readiness, which makes the move by Sauce Labs even more telling.

Building blocks of infrastructure earn maturity and enterprise-readiness through repeated trials.  Traditional relational data stores, even open source options, have been tested in real-world,  high-load, I-don’t-want-to-be-woken-up-at-4am scenarios. Apparently CouchDB has not.

I suspect something similar is true for other NoSQL technologies, including MongoDB, Hadoop, and Cassandra. I don’t imagine they would suffer from the same class of reliability problems reported by Sauce Labs. Instead, these pieces lack maturity and fit-and-finish in other ways. How difficult is it to partition your data? What are the performance implications of structuring a column family a certain way? What kind of network load should I expect for a given deployment architecture? These are questions that are not well understood in the NoSQL world.  Not yet.

Yes, some big companies run their businesses on Hadoop and other NoSQL products. But chances are, those companies are not like yours. They employ high-dollar experts dedicated to making those things hum. They’ve pioneered much of the expertise of using these pieces in high-stress environments, and they paid well for the privilege.

Is NoSQL ready for the enterprise?

Ready to be tested, yes. Ready to run the business?  Not yet.

In any case, it’s very valuable for the industry to get access to such public feedback. Thanks, Sauce Labs.

Why do I believe Hadoop is not yet ready for Prime Time?

I am bullish on Hadoop and other NoSQL technologies. Long-term I believe they will be instrumental in providing quantum leaps in efficiency for existing businesses. But even more, I believe that mainstream BigData will open up brand new opportunities that were simply unavailable before. Right now we focus on applying BigData to user activity and clickstream analysis. Why? Because that’s where the data is. But that condition will not persist. There will be oceans of structured and semi-structured data to analyze. The chicken-and-egg situation with the tools and the data will evolve, and brand new application scenarios will open up.

So I’m Bullish.

On the other hand I don’t think Hadoop is ready for prime time today. Why? Let me count the reasons:

  1. The Foundations are not finished. The Hadoop community is still expending significant energy laying basic foundations. Here’s a blog post from three months ago detailing the internal organization and operation of Hadoop 2.0. Look at the raft of terminology this article foists on unsuspecting Hadoop novices: Application Master, Applications Manager (different!), Node Manager, Container Launch Context, and on and on. And these are all problems that have been previously solved; we saw similar Resource Management designs with EJB containers, and before that with antecedents like Transarc’s Encina Monitor from 1992, with its node manager, container manager, nanny processes and so on. Developers (the users of Hadoop) don’t want or need to know about these details.
  2. The Rate of Change is still very high. Versioning and naming are still in high flux. 0.20? 2.0? 0.23? YARN? MRv2? You might think that version numbers are a minor detail, but until the use of terminology and version numbers converges, enterprises will have difficulty adopting. In addition, the actual application model is still in flux. For enterprise apps, change is expensive and confusing. People cannot afford to attend to all the changes in the various moving targets.
  3. The Ecosystem is nascent.  There aren’t enough companies that are oriented around making money on these technologies.  Banks – a key adopter audience – are standing back waiting for the dust to settle. Consulting shops are watching and waiting.  As a broader ecosystem of  companies pops up and invests, enterprises will find it easier to get from where they are into the land of Hadoop.


Is Amazon DynamoDB the look of the future?

Amazon is now offering a key/value data store that relies on solid-state disks for storage. DynamoDB is the name, and it is intended to complement S3 as a lower-latency store. It costs more, but offers better performance for those customers that need it.

Two things on this.

  1. The HighScalability blog calls Amazon’s use of SSD a “radical step.” That view may become antiquated rapidly. The one outlier in the datacenter of today is the use of spinning mechanical platters as a basis to store data. Think about that. There’s one kind of moving part in a datacenter – the disk. It consumes much of the power and causes much of the heat. We will see SSD replace mag disk as a mainstream storage technology, sooner than most of us think. Companies like Pure Storage will lead the way but EMC and the other big guys are not waiting to get beaten. Depending on manufacturing ramp-up, this could happen in 3-5 years. It won’t be radical. The presence of spinning platters in a datacenter will be quaint in 8 years.
  2. The exposure of the underlying storage mechanism to the developer is a distraction. I don’t want to have to choose my data programming model based on my latency requirements. I don’t want to know, or care, that the storage relies on an SSD. That Amazon exposes it today is a temporary condition, I think. The use of the lower-latency store ought to be dynamically determined by the cloud platform itself, based on provisioning configuration provided by the application owner. Amazon will just fold this into its generic store. Other cloud platform providers will follow. The flexibility and intelligence allowed in provisioning the store is the kind of thing that will provide the key differentiation among cloud platforms.
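As a thought experiment, the second point might look something like the sketch below: the application owner states a latency requirement, and the platform picks the cheapest storage tier that satisfies it. Everything here is hypothetical. The Tier type, the tier names, and the latency and cost figures are invented for illustration, and none of it reflects how Amazon or any other provider actually provisions storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    typical_latency_ms: float  # illustrative figure, not a measurement
    relative_cost: float       # illustrative figure, not a price


# Hypothetical tiers a platform might offer behind one programming model.
TIERS = [
    Tier("magnetic", typical_latency_ms=20.0, relative_cost=1.0),
    Tier("ssd", typical_latency_ms=2.0, relative_cost=4.0),
]


def choose_tier(max_latency_ms, tiers=TIERS):
    """Pick the cheapest tier whose typical latency meets the requirement."""
    candidates = [t for t in tiers if t.typical_latency_ms <= max_latency_ms]
    if not candidates:
        raise ValueError(f"no tier can meet {max_latency_ms} ms")
    return min(candidates, key=lambda t: t.relative_cost)


# The application owner supplies only the requirement, never the medium.
relaxed = choose_tier(max_latency_ms=50.0)
strict = choose_tier(max_latency_ms=5.0)
```

The application code is the same in both cases; only the provisioning configuration differs.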