Impressive factoids on Facebook and Hadoop

It’s common knowledge that Facebook runs Hadoop. The largest Hadoop cluster on the planet.

Here are some stats, courtesy of HighScalability, which scraped them from Twitter during the Velocity conference:

  • 6 billion mobile messages every 30 minutes
  • 3.8 trillion cache operations in 30 minutes
  • 160 million newsfeeds, 5 billion realtime messages, 10 billion profile pics, 108 billion MySQL queries, all in 30 minutes

Now, some questions of interest:

  1. How close is the typical enterprise to that level of scale?
  2. How likely is it that a typical enterprise would be able to take advantage of such scale to improve their core business, assuming reasonable time and money budgets?

Let’s say you are the CIO of a $500M financial services company. Suppose you make an average of $10 per business transaction, and that each business transaction requires 24 database operations, including queries and updates.

At $10 apiece, $500M in revenue means roughly 50M business transactions a year; at 24 database operations each, that’s 50M × 24 ≈ 1.2B database operations … per year.

Scroll back up. What does Facebook do? 3.8 trillion cache operations and 108 billion MySQL queries, every 30 minutes. Whereas 1.2B per year works out to about 68,000 per 30 minutes. Comparing queries to queries, Facebook handles roughly 1.6 million times the database load of the hypothetical financial services company.
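As a quick sanity check on that arithmetic, here is a minimal back-of-envelope script. The revenue, price-per-transaction, and operations-per-transaction figures are just the assumptions from the example above, and the Facebook number is the MySQL figure quoted from the HighScalability stats; none of it is real company data.

    # Back-of-envelope: database load of the hypothetical $500M company
    # versus the Facebook figures quoted above.
    MINUTES_PER_YEAR = 365 * 24 * 60

    annual_revenue_usd = 500_000_000       # hypothetical company
    revenue_per_business_txn = 10          # $10 per business transaction
    db_ops_per_business_txn = 24           # queries + updates per transaction

    business_txns_per_year = annual_revenue_usd / revenue_per_business_txn   # 50M
    db_ops_per_year = business_txns_per_year * db_ops_per_business_txn       # ~1.2B
    db_ops_per_30_min = db_ops_per_year / (MINUTES_PER_YEAR / 30)            # ~68,000

    facebook_mysql_queries_per_30_min = 108_000_000_000    # from the stats above

    print(f"Enterprise DB ops per 30 minutes: {db_ops_per_30_min:,.0f}")
    print(f"Facebook-to-enterprise ratio: "
          f"{facebook_mysql_queries_per_30_min / db_ops_per_30_min:,.0f}x")

Run it and you get roughly 68,000 operations per 30 minutes for the hypothetical company, and a ratio in the neighborhood of 1.6 million.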

Now, let me restate those questions:

  • If you run that hypothetical company, do you need Hadoop?
  • If you had Hadoop, would you be able to drive enough data through it to justify the effort of adoption?


Why do I believe Hadoop is not yet ready for Prime Time?

I am bullish on Hadoop and other NoSQL technologies. Long-term, I believe they will be instrumental in providing quantum leaps in efficiency for existing businesses. But even more, I believe that mainstream Big Data will open up brand new opportunities that were simply unavailable before. Right now we focus on applying Big Data to user activity and clickstream analysis. Why? Because that’s where the data is. But that condition will not persist. There will be oceans of structured and semi-structured data to analyze. The chicken-and-egg situation with the tools and the data will evolve, and brand new application scenarios will open up.

So I’m Bullish.

On the other hand I don’t think Hadoop is ready for prime time today. Why? Let me count the reasons:

  1. The Foundations are not finished. The Hadoop community is still expending significant energy laying basic foundations. Here’s a blog post from three months ago detailing the internal organization and operation of Hadoop 2.0. Look at the raft of terminology that article foists on unsuspecting Hadoop novices: Application Master, Applications Manager (different!), Node Manager, Container Launch Context, and on and on. And these are all problems that have been solved before; we saw similar resource-management designs with EJB containers, and before that with antecedents like Transarc’s Encina Monitor from 1992, with its node manager, container manager, nanny processes and so on. Developers (the users of Hadoop) don’t want or need to know about these details.
  2. The Rate of Change is still very high. Versioning and naming are still in high flux. 0.20? 2.0? 0.23? YARN? MRv2? You might think version numbers are a minor detail, but until the terminology and version numbers converge, enterprises will have difficulty adopting. In addition, the application model itself is still in flux. For enterprise apps, change is expensive and confusing. People cannot afford to attend to all the changes across so many moving targets.
  3. The Ecosystem is nascent. There aren’t yet enough companies oriented around making money on these technologies. Banks – a key adopter audience – are standing back, waiting for the dust to settle. Consulting shops are watching and waiting. As a broader ecosystem of companies pops up and invests, enterprises will find it easier to get from where they are into the land of Hadoop.


Curt Monash on the Enterprise-Readiness of Hadoop

Curt Monash, writing on The DBMS2 blog, addressed the enterprise readiness of Hadoop recently.

tl;dr:

 Hadoop is proven indeed, whether in technology, vendor support, or user success. But some particularly conservative enterprises may for a while disagree.

But is Mr Monash really of one mind on the topic? Especially considering that he began the piece with this:

Cloudera, Hortonworks, and MapR all claim, in effect, “Our version of Hadoop is enterprise-ready, unlike those other guys’.” I’m dubious.

So the answer to “is it enterprise ready?” seems to be, clearly, “Well, yes and no.”

Given my understanding of the state of the tools and technology, and of the disposition of enterprises, I believe, unlike Mr Monash, that most enterprises currently lack the capacity or tolerance to adopt Hadoop. It seems to me that immaturity still represents an obstacle to new Hadoop deployments.

The Hadoop vendor companies and the Hadoop community at large are addressing that. They’re building out features, and Hadoop 2.0 will bring a jump in reliability, but there is still significant work ahead before the technology becomes acceptable to the mainstream.


Hadoop Adoption

Interesting. Michael Stonebraker, who has previously expressed skepticism regarding the industry excitement around Hadoop, has done it again.

Even at lower scale, it is extremely eco-unfriendly to waste power using an inefficient system like Hadoop.

Inefficient, he says! Pretty strong words. Stonebraker credits Hadoop with democratizing large-scale parallel processing, but he predicts that Hadoop will either evolve radically into a “true parallel” DBMS or be replaced. He’s correct in noting that Google has, in part, moved away from MapReduce. Stonebraker describes some basic architectural elements of MapReduce that, he says, represent significant obstacles for a large proportion of real-world problems. He says that existing parallel DBMSs hold a performance advantage of one to two orders of magnitude over MapReduce. Wow.

It seems to me that, with Hadoop, companies are now exploring and exploiting the opportunity to keep and analyze massive quantities of data they had previously just discarded. If Stonebraker is right, they will try Hadoop, and then move to something else when they “hit the wall”.

I’m not so sure. The compounded results of steady development over time can bring massive improvements to any system. There is so much energy being invested in Hadoop that it would be foolhardy to discount its progress.

Companies used to “hit the wall” with simple, so-called “two-tier” RDBMS deployments. But steady development over time, of both hardware and software, has moved that proverbial wall further and further out. JIT compilation and garbage collection used to be impractical for high-performance systems; that is no longer true. The same goes for any sufficiently developed technology.

As I’ve said before on this blog, I don’t think Hadoop and  MapReduce are ready today for broad, mainstream use.  That is as much a statement about the technology as it is about the people who are potential adopters.  On the other hand I do think these technologies hold great promise, and they can be exploited today by leading teams.

The big data genie is out of the bottle.

Big Data: Real Benefits or Hype?

I’m a technologist. I believe technology, well utilized, can advance business goals. A business can derive a significant advantage from making the right technology moves, exploiting information in just the right way.

But I am a bit skeptical of the excitement in the industry around Big Data, MapReduce, and Hadoop. While Google obviously has derived great benefit from MapReduce over the years, Google is special. Most businesses do not look like Google, and do not have information management requirements that are similar to Google’s. Google custom-constructs their PCs. At Google, the unit of computer hardware deployment is “the warehouse.”

If you underwrite insurance, process medical records, scan transactions for fraud, optimize logistics, do statistical process control, or handle any one of a variety of other typical business information tasks, your company is very much not like Google. If you don’t have hundreds of millions of users generating billions of transactions, then you’re not like Google, and you should not try to emulate Google’s technology strategy. Bigtable is not for you, and MapReduce is not something that will give you a strategic advantage.

Big Data seems to be the industry’s next touchstone. Everyone feels they need to “check the box.” There’s lots of interest from buyers, so vendors believe they need to talk about it. The tech press, with its persistently positive view of Google, encourages this. Breathless analyst reports fuel the flames. University CS programs teach MapReduce in first-year courses. Developers put MapReduce on their resumes. All of this combines into a self-reinforcing cycle.

But for most CIOs, MapReduce is a distraction. In this view, I am persuaded by DeWitt, Stonebraker, et al. CIOs should be focusing on figuring out how to better utilize the databases they already have. Figure out cloud, and figure out how to improve management and governance of IT projects. Are you agile enough? Are you doing Scrum? Figure out what major pieces you can buy from your key technology partners.

I have read user stories of people using MapReduce to scan through log files, tens of gigabytes of log files. Seriously? Tens of gigabytes fits on a laptop hard drive, and a simple script can stream through it on a single machine; a sketch follows below. Unless you are talking about multiple terabytes of information, MapReduce is probably the wrong tool.
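To make the point concrete, here is a minimal sketch of the kind of single-machine scan that covers the “tens of gigabytes of log files” case. The log path and the search pattern are hypothetical placeholders, not anything from a real deployment.

    # Minimal single-machine log scan: count matching lines across a few
    # tens of GB of logs. No cluster needed; this streams from disk.
    import glob
    import re

    PATTERN = re.compile(r"ERROR|Exception")   # hypothetical pattern of interest
    LOG_GLOB = "/var/log/myapp/*.log"          # hypothetical log location

    total_lines = 0
    matches = 0
    for path in glob.glob(LOG_GLOB):
        with open(path, errors="replace") as f:
            for line in f:                     # reads line by line; small memory footprint
                total_lines += 1
                if PATTERN.search(line):
                    matches += 1

    print(f"{matches:,} matching lines out of {total_lines:,}")

On a single machine a scan like this is typically limited by disk throughput, which is exactly the regime where standing up a Hadoop cluster buys you very little.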

If you are doing analysis of the human genome, or weather modeling, or if you work for the NSA or Baidu, then yes, you need MapReduce. Otherwise, Big Data is not yet mainstream.