Sauce Labs explains its move from NoSQL CouchDB to old-skool MySQL

Sauce Labs has rendered a valuable service to the community by documenting the factors that went into a decision to change infrastructure – to replace CouchDB with MySQL.

Originally the company had committed to CouchDB, a novel NoSQL store and an Apache project. CouchDB is termed a "document store," and if you are a fan of REST and JSON, this is the NoSQL store for you.

Every document in CouchDB is a map of key-value pairs, nested arbitrarily deep. Apps retrieve documents via a clean REST API, and the data is JSON. Very nice, easy to adopt, and, with the ubiquity of JSON parsers, CouchDB is easy to access from any language or platform. Speaking of old-school, I built a demo connecting from Classic ASP/JavaScript to CouchDB – it was very easy to produce. I also built small clients in PHP, C#, and Python – all of them first-class citizens in the world of CouchDB.
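To make that concrete, here is a minimal sketch in Python, assuming a CouchDB server on the default local port and a hypothetical database named mydb (the database name and the document fields are mine, for illustration only):

```python
import json
import urllib.request

# A minimal sketch of CouchDB's REST/JSON interface. The server URL and
# the "mydb" database are assumptions for illustration.
BASE = "http://localhost:5984/mydb"

# Create a document: POST a JSON body; CouchDB generates an id for it.
doc = {"type": "test-run", "browser": "firefox", "passed": True}
req = urllib.request.Request(
    BASE,
    data=json.dumps(doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
created = json.load(urllib.request.urlopen(req))

# Read it back: a plain GET on /mydb/<id> returns the document as JSON.
fetched = json.load(urllib.request.urlopen(f"{BASE}/{created['id']}"))
print(fetched)
```

Every interaction is just HTTP and JSON, which is exactly why clients in PHP, C#, or anything else are equally easy to write.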

It really is a very enjoyable model for a developer or systems architect.

For Sauce Labs, though, the bottom line was – drumroll, please – that CouchDB was immature.  The performance was not good. Life with incremental indexes was … full of surprises.  The reliability of the underlying data manager was substandard.

Is any of this surprising?

And MySQL is not the recognized leader in database reliability and enterprise readiness, which makes the move by Sauce Labs even more telling.

Building blocks of infrastructure earn maturity and enterprise-readiness through repeated trials.  Traditional relational data stores, even open source options, have been tested in real-world, high-load, I-don’t-want-to-be-woken-up-at-4am scenarios. Apparently CouchDB has not.

I suspect something similar is true for other NoSQL technologies, including MongoDB, Hadoop, and Cassandra. I don’t imagine they would suffer from the same class of reliability problems reported by Sauce Labs; instead, these pieces lack maturity and fit-and-finish in other ways. How difficult is it to partition your data? What are the performance implications of structuring a column family a certain way? What kind of network load should you expect for a given deployment architecture? The answers are not well understood in the NoSQL world. Not yet.

Yes, some big companies run their businesses on Hadoop and other NoSQL products. But chances are, those companies are not like yours. They employ high-dollar experts dedicated to making those things hum. They’ve pioneered much of the expertise of using these pieces in high-stress environments, and they paid well for the privilege.

Is NoSQL ready for the enterprise?

Ready to be tested, yes. Ready to run the business?  Not yet.

In any case, it’s very valuable for the industry to get access to such public feedback. Thanks, Sauce Labs.

Impressive factoids on Facebook and Hadoop

It’s common knowledge that Facebook runs Hadoop. The largest Hadoop cluster on the planet.

Here are some stats, courtesy of HighScalability, which scraped them from Twitter during the Velocity conference:

  • 6 billion mobile messages every 30 minutes
  • 3.8 trillion cache operations in 30 minutes
  • 160M newsfeed stories, 5B real-time messages, 10B profile photos, and 108B MySQL queries, all in 30 minutes

Now, some questions of interest:

  1. How close is the typical enterprise to that level of scale?
  2. How likely is it that a typical enterprise would be able to take advantage of such scale to improve their core business, assuming reasonable time and money budgets?

Let’s say you are CIO of a $500M financial services company. Let’s suppose that you make an average of $10 per business transaction – that’s 50M business transactions per year – and further suppose that each business transaction requires 24 database operations, including queries and updates.

At that rate, you’d run 50M × 24 = about 1.2B database operations … per year.

Scroll back up. What does Facebook do? 108B MySQL queries in 30 minutes. Whereas 1.2B per year works out to about 68,000 per 30 minutes. Facebook does more than 1.5 million times as many database operations as the hypothetical financial services company.
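For anyone who wants to check that arithmetic, here it is as a quick Python sketch; all the inputs come from the assumptions above:

```python
# Back-of-the-envelope scale comparison, using the assumptions above.
revenue = 500e6          # $500M annual revenue
dollars_per_txn = 10     # $10 per business transaction
db_ops_per_txn = 24      # database operations per business transaction

yearly_db_ops = (revenue / dollars_per_txn) * db_ops_per_txn  # ~1.2B/year
half_hours_per_year = 365 * 48
ops_per_half_hour = yearly_db_ops / half_hours_per_year       # ~68,000

facebook_mysql_queries = 108e9   # MySQL queries per 30 minutes, per the stats
print(f"Enterprise: {ops_per_half_hour:,.0f} database ops per 30 minutes")
print(f"Facebook:   {facebook_mysql_queries:,.0f} MySQL queries per 30 minutes")
print(f"Ratio:      {facebook_mysql_queries / ops_per_half_hour:,.0f}x")
```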

Now, let me repeat those questions:

  • If you run that hypothetical company, do you need Hadoop?
  • If you had Hadoop, would you be able to drive enough data through it to justify the effort of adoption?


Why do I believe Hadoop is not yet ready for Prime Time?

I am bullish on Hadoop and other NoSQL technologies. Long-term, I believe they will be instrumental in providing quantum leaps in efficiency for existing businesses. But even more, I believe that mainstream Big Data will open up brand-new opportunities that were simply unavailable before. Right now we focus on applying Big Data to user activity and clickstream analysis. Why? Because that’s where the data is. But that condition will not persist. There will be oceans of structured and semi-structured data to analyze. The chicken-and-egg situation with the tools and the data will evolve, and brand-new application scenarios will open up.

So I’m bullish.

On the other hand I don’t think Hadoop is ready for prime time today. Why? Let me count the reasons:

  1. The Foundations are not finished. The Hadoop community is still expending significant energy laying basic foundations. Here’s a blog post from three months ago detailing the internal organization and operation of Hadoop 2.0. Look at the raft of terminology that article foists on unsuspecting Hadoop novices: Application Master, Applications Manager (different!), Node Manager, Container Launch Context, and on and on. And these are all problems that have been solved before; we saw similar resource-management designs with EJB containers, and before that with antecedents like Transarc’s Encina Monitor from 1992, with its node manager, container manager, nanny processes, and so on. Developers (the users of Hadoop) don’t want or need to know about these details.
  2. The Rate of Change is still very high. Versioning and naming are still in high flux. 0.20? 2.0? 0.23? YARN? MRv2? You might think version numbers are a minor detail, but until the terminology and version numbers converge, enterprises will have difficulty adopting. In addition, the actual application model is still in flux. For enterprise apps, change is expensive and confusing. People cannot afford to attend to all the changes in the various moving targets.
  3. The Ecosystem is nascent. There aren’t enough companies oriented around making money on these technologies. Banks – a key adopter audience – are standing back, waiting for the dust to settle. Consulting shops are watching and waiting. As a broader ecosystem of companies pops up and invests, enterprises will find it easier to get from where they are into the land of Hadoop.


Curt Monash on the Enterprise-Readiness of Hadoop

Curt Monash, writing on The DBMS2 blog, addressed the enterprise readiness of Hadoop recently.

tl;dr:

 Hadoop is proven indeed, whether in technology, vendor support, or user success. But some particularly conservative enterprises may for a while disagree.

But is Mr Monash really of one mind on the topic? Especially considering that he began the piece with this:

Cloudera, Hortonworks, and MapR all claim, in effect, “Our version of Hadoop is enterprise-ready, unlike those other guys’.” I’m dubious.

So the answer to “is it enterprise ready?” seems to be, clearly, “Well, yes and no.”

Given my understanding of the state of the tools and technology, and of the disposition of enterprises, I believe, unlike Mr Monash, that most enterprises don’t currently have the capacity or tolerance to adopt Hadoop. It seems to me that immaturity still represents an obstacle to new Hadoop deployments.

The Hadoop vendor companies and the Hadoop community at large are addressing that. They’re building out features, and Hadoop 2.0 will bring a jump in reliability, but there is still significant work ahead before the technology becomes acceptable to the mainstream.


Does Metcalfe’s Law apply to Cloud Platforms and Big Data?

Cloud Platforms and Big Data – Have we reached a tipping point?  To think about this, I want to take a look back in history.

Metcalfe’s Law was named for Robert Metcalfe, one of the true internet pioneers, by George Gilder, in an article that appeared in a 1993 issue of Forbes magazine. It states that the value of a network increases with the square of the number of nodes. It was named in the spirit of “Moore’s Law” – the popular aphorism, attributed to Gordon Moore, that the density of transistors on a chip roughly doubles every 18 months. Moore’s Law succinctly captured why computers grew more powerful by the day.

With the success of “Moore’s Law”, people looked for other “Laws” to guide their thinking about a technology industry that seemed to grow exponentially and evolve chaotically, and “Metcalfe’s Law” was one of them. That these “laws” were not laws at all, but merely arguments, predictions, and opinions, was easily forgotten. People grabbed hold of them.

Generalizing a Specific Argument

Gilder’s full name for the “law” was “Metcalfe’s Law of the Telecosm”, and in naming it he was thinking specifically of the competition between telecommunications network standards: ATM (Asynchronous Transfer Mode) and Ethernet. Many people were convinced that ATM would eventually “win” for applications like voice, video, and data, because of its superior switching performance. Gilder did not agree. He thought Ethernet would win, because of the massive momentum behind it.

Gilder was right about that, and for the right reasons. And so Metcalfe’s Law was right! Since then, though, people have argued that Metcalfe’s Law applies equally well to any network: a network of business partners, a network of retail stores, a network of television broadcast affiliates, a “network” of tools and partners surrounding a platform. But generalizing Gilder’s specific argument this way is sloppy.

A 2006 article in IEEE Spectrum on Metcalfe’s Law says flatly that the law is “Wrong”, and explains why: not all “connections” in a network contribute equally to the value of the network. Think of Twitter – most subscribers publish very little information, and to very limited circles of friends and family. Twitter is valuable, and it grows in value as more people sign up, but Metcalfe’s Law does not provide the metaphor for valuing it. Or think of a telephone network: most people spend most of their time on the phone with around 10 people. Adding more people to that network does not increase the value of the network, for those people. Adding more people does not cause revenue to rise according to the O(n²) metric implicit in Metcalfe’s Law.

Clearly the direction of the “law” is correct – as a network grows, its value grows faster. We all feel that to be intuitively true, and so we latch on to Gilder’s aphorism as a quick way to describe it. But just as clearly, the law is wrong in general.

Alternative “Laws” also Fail

The IEEE article tries to offer other valuation formulae, suggesting that the true value is not O(n²) but O(n·log n), and specifically suggests this as a basis for valuation of markets, companies, and startups.

That suggestion is arbitrary. I find the mathematical argument presented in the IEEE article hand-wavy and unpersuasive. The bottom line is that networks are different, and there is no one law – not Metcalfe’s, nor Reed’s, nor Zipf’s, as suggested by the authors of that IEEE article – that applies generally to all of them. Metcalfe’s Law applied specifically but loosely to the economics of Ethernet, just as Moore’s Law applied specifically to transistor density. Moore’s Law was never general to any manufacturing process, nor is Metcalfe’s Law general to any network.
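For a sense of how far apart these candidate formulae really are, here is a quick illustrative Python sketch of my own. The absolute numbers are meaningless without a calibrated constant – which is rather the point:

```python
import math

# Compare the growth of two proposed network-value metrics:
# Metcalfe's O(n^2) (pairwise connections) vs the O(n·log n) alternative
# from the IEEE article. What matters is how rapidly they diverge.
for n in (1_000, 100_000, 10_000_000):
    metcalfe = n * (n - 1) / 2     # number of pairwise connections
    nlogn = n * math.log(n)        # the proposed alternative valuation
    print(f"n={n:>12,}  n(n-1)/2={metcalfe:>18,.0f}  "
          f"n*log(n)={nlogn:>14,.0f}  ratio={metcalfe / nlogn:>9,.0f}")
```

Two "laws" that disagree by a factor of hundreds of thousands at realistic network sizes cannot both be useful valuation tools, and there is no principled way to pick between them.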

Sorry, there is no “law”. One needs to understand the economic costs and potential benefits of a network, and the actual conditions in the market, in order to apply a value to that network.

Prescription: Practical Analysis

Ethernet enjoyed economic advantages in production cost and in generalized R&D. Ethernet reached an economic tipping point, and beyond that point, other factors, like the improved switching performance of alternatives, were simply not enough to overcome the existing investment in tools, technology, and understanding of Ethernet.

We all need to apply that sort of practical thinking to computing platform technology. For some time now, people have been saying that Cloud Platform technology is the Next Big Thing. There have been some skeptics, notably Larry Ellison, but even he has come around and is investing.

Cloud Platforms will “win” over existing on-premises platform options when it makes economic sense for them to do so. In practice, this means when tools for building, deploying, and managing systems in the cloud become widely available and just as good as those already in wide use for on-premises platforms.

Likewise, “Big Data” will win when it is simply better than traditional data analysis for mainstream data analysis workloads. Sure, Facebook and Yahoo use MapReduce for analysis, but, news flash: unless you have 100M users, your company is not like Facebook or Yahoo. You do not have the same analysis needs. You might want to analyze lots of data, even terabytes of it. But the big boys are doing petabytes. Chances are, you’re not like them.

This is why Microsoft’s Azure is so critical to the evolution of Cloud offerings. Microsoft brought computing to the masses, and the company understands the network effects of partners, tools providers, and developers. It’s true that Amazon has a lead in cloud-hosted platforms, and it’s true that even today startups prefer cloud to on-premises. But EC2 and S3 are still not commonly considered as general options by conservative businesses. Most banks, when revising their loan-processing systems, are not putting EC2 on their short list. Microsoft’s work in bringing cloud platforms to the masses will make a huge difference in the marketplace.

I don’t mean to predict that Microsoft will “win” over Amazon in cloud platforms; I mean only to say that Microsoft’s expanded presence in the space will legitimize Cloud and make it much more accessible. Mainstream. It remains to be seen how soon or how strongly Microsoft will push on Big Data, and whether we should expect to see the same effect there.

The Bottom Line

Robert Metcalfe, the internet pioneer, himself apparently went so far as to predict that by 2013, ATM would “prevail” in the battle of the network standards. Gilder did not subscribe to such views. He felt that Ethernet would win, and Metcalfe’s Law was why. He was right.

But applying Gilder’s reasoning blindly makes no sense. Cloud and Big Data will ultimately “win” when they mature as platforms and deliver better economic value than the existing alternatives.


Hadoop Adoption

Interesting. Michael Stonebraker, who has previously expressed skepticism regarding the industry excitement around Hadoop, has done it again.

Even at lower scale, it is extremely eco-unfriendly to waste power using an inefficient system like Hadoop.

Inefficient, he says! Pretty strong words. Stonebraker credits Hadoop for democratizing large-scale parallel processing. But he predicts that Hadoop will evolve radically to become a “true parallel” DBMS, or will be replaced. He’s correct in noting that Google has moved away from MapReduce, in part. Stonebraker describes some basic architectural elements of MapReduce that, he says, represent significant obstacles for a large proportion of real-world problems. He says that existing parallel DBMS systems have a performance advantage of one to two orders of magnitude over MapReduce. Wow.

It seems to me that, with Hadoop, companies are now exploring and exploiting the opportunity to keep and analyze massive quantities of data they had previously just discarded. If Stonebraker is right, they will try Hadoop, and then move to something else when they “hit the wall”.

I’m not so sure. The compounded results of steady development over time can bring massive improvements to any system. There is so much energy being invested in Hadoop that it would be foolhardy to discount its progress.

Companies used to “hit the wall” with simple, so-called “two-tier” RDBMS deployments. But steady development of hardware and software over time has moved that proverbial wall further and further out. JIT compilation and garbage collection used to be impractical for high-performance systems. This is no longer true. And the same will be true of any sufficiently developed technology.

As I’ve said before on this blog, I don’t think Hadoop and MapReduce are ready today for broad, mainstream use. That is as much a statement about the technology as it is about the people who are potential adopters. On the other hand, I do think these technologies hold great promise, and they can be exploited today by leading teams.

The big data genie is out of the bottle.

Big Data: Real benefits or Hype?

I’m a technologist. I believe technology, well utilized, can advance business goals. A business can derive a significant advantage from making the right technology moves, exploiting information in just the right way.

But I am a bit skeptical of the excitement in the industry around Big Data, MapReduce, and Hadoop. While Google obviously has derived great benefit from MapReduce over the years, Google is special. Most businesses do not look like Google, and do not have information management requirements similar to Google’s. Google custom-builds its own computers. At Google, the unit of computer hardware deployment is “the warehouse.”

If you underwrite insurance, or process medical records, or scan transactions for fraud, or optimize logistics, or do statistical process control, or any one of a variety of other typical business information tasks, your company is very much not like Google. If you don’t have hundreds of millions of users generating billions of transactions, then you’re not like Google, and you should not try to emulate Google’s technology strategy. Bigtable is not for you; MapReduce is not something that will give you a strategic advantage.

Big Data seems to be the industry’s next touchstone. Everyone feels the need to “check the box.” There’s lots of interest from buyers, so vendors believe they need to talk about it. The tech press, with its persistently positive view of Google, encourages this. Breathless analyst reports fuel the flames. CS programs at universities teach MapReduce in first-year courses. Devs put MapReduce on their resumes. All of this combines to produce a self-reinforcing cycle.

But for most CIOs, MapReduce is a distraction. On this point, I am persuaded by DeWitt, Stonebraker, et al. CIOs should be focusing on figuring out how to better utilize the databases they already have. Figure out cloud, and figure out how to improve management and governance of IT projects. Are you agile enough? Are you doing Scrum? Figure out what major pieces you can buy from your key technology partners.

I have read user stories of people using MapReduce to scan through log files – tens of gigabytes of log files. Seriously? Tens of gigabytes fits on a laptop hard drive. Unless you are talking about multiple terabytes of information, MapReduce is probably the wrong tool.
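For a sense of scale, a single machine chews through tens of gigabytes with a plain sequential scan. Here’s a minimal sketch; the app.log file name and the ERROR pattern are invented for illustration:

```python
import re
from collections import Counter

# Stream a large log file line by line; memory use stays constant no
# matter how big the file is. File name and pattern are hypothetical.
pattern = re.compile(r"ERROR\s+(\w+)")
error_counts = Counter()

with open("app.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            error_counts[match.group(1)] += 1

for source, count in error_counts.most_common(10):
    print(f"{source}: {count}")
```

No cluster, no job scheduler, no HDFS – a scan like this is bounded by disk throughput, not by the number of machines you can throw at it.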

If you are doing analysis of the human genome, or weather modeling, or if you work for the NSA or Baidu, then yes, you need MapReduce. Otherwise, Big Data is not yet mainstream.