Hadoop Adoption

Interesting. Michael Stonebraker, who has previously expressed skepticism regarding the industry excitement around Hadoop, has done it again.

Even at lower scale, it is extremely eco-unfriendly to waste power using an inefficient system like Hadoop.

Inefficient, he says! Pretty strong words. Stonebraker credits Hadoop with democratizing large-scale parallel processing, but he predicts that Hadoop will either evolve radically to become a “true parallel” DBMS, or be replaced. He’s correct in noting that Google has, in part, moved away from MapReduce. Stonebraker describes some basic architectural elements of MapReduce that, he says, represent significant obstacles for a large proportion of real-world problems. He claims that existing parallel DBMS systems have a performance advantage of one to two orders of magnitude over MapReduce. Wow.

It seems to me that, with Hadoop, companies are now exploring and exploiting the opportunity to keep and analyze massive quantities of data they had previously just discarded. If Stonebraker is right, they will try Hadoop, and then move to something else when they “hit the wall”.

I’m not so sure. The compounded results of steady development over time can bring massive improvements to any system. There is so much energy being invested in Hadoop that it would be foolhardy to discount its progress.

Companies used to “hit the wall” with simple, so-called “two-tier” RDBMS deployments. But steady development over time, of both hardware and software, has moved that proverbial wall further and further out. JIT compilation and garbage collection used to be impractical for high-performance systems. That is no longer true. The same will hold for any sufficiently developed technology.

As I’ve said before on this blog, I don’t think Hadoop and MapReduce are ready today for broad, mainstream use. That is as much a statement about the technology as it is about the people who are potential adopters. On the other hand, I do think these technologies hold great promise, and they can be exploited today by leading teams.

The big data genie is out of the bottle.

Is Amazon DynamoDB the look of the future?

Amazon is now offering a key/value data store that relies on solid-state disk (SSD) for storage. DynamoDB is the name, and it is intended to complement S3 as a lower-latency store. It costs more, but offers better performance for those customers who need it.

Two things on this.

  1. The HighScalability blog calls Amazon’s use of SSD a “radical step.” That view may become antiquated rapidly. The one outlier in the datacenter of today is the use of spinning mechanical platters to store data. Think about that: there’s one kind of moving part in a datacenter – the disk – and it consumes much of the power and generates much of the heat. We will see SSD replace magnetic disk as a mainstream storage technology sooner than most of us think. Companies like Pure Storage will lead the way, but EMC and the other big players are not waiting to be beaten. Depending on manufacturing ramp-up, this could happen in 3-5 years. It won’t be radical. The presence of spinning platters in a datacenter will be quaint in 8 years.
  2. Exposing the underlying storage mechanism to the developer is a distraction. I don’t want to choose my data programming model based on my latency requirements. I don’t want to know, or care, that the storage relies on an SSD. That Amazon exposes it today is, I think, a temporary condition. The use of the lower-latency store ought to be determined dynamically by the cloud platform itself, based on provisioning configuration supplied by the application owner. Amazon will eventually fold this into its generic store, and other cloud platform providers will follow. The flexibility and intelligence allowed in provisioning the store is the kind of thing that will provide the key differentiation among cloud platforms.

What does it mean? Microsoft upping efforts on Infrastructure-as-a-service

Wired is reporting a rumor that Microsoft will launch a new Infrastructure-as-a-service offering in June to compete with Amazon EC2.

What Does it Mean?

I have no idea whether the “rumor” is true, or even what it really means. I speculate that the bottom line is that we’ll be able to upload arbitrary VHDs to Azure. Right now Microsoft allows people to upload VHDs that run Windows Server 2008.  With this change they may support “anything”.  Because it’s a virtual hard drive, and the creator of that hard drive has full control over what goes into it, that means an Azure customer will be able to provision VMs in the Microsoft cloud that run any OS, including Linux. This would also represent a departure from the stateless model that Windows Azure currently supports for the VM role. It means that VHDs running in the Windows Azure cloud will be able to save local state across stop/restart.

Should we be Surprised?

Is this revolutionary?  Windows Azure already offers compute nodes; it’s beta today but it’s there, and billable.  So there is some degree of Infrastructure-as-a-service capability today.

For my purposes “infrastructure as a service”  implies raw compute and storage, which is something like Amazon’s EC2 and S3. A “platform as a service” walks up the stack a little, and offers some additional facilities for use in applications. This might include application management and monitoring, enhancements to the storage model or service, messaging, access control, and so on. All of those are general-purpose things, usable in a large variety of applications, and we’d say they are “higher level” than storage and compute. In fact those services are built upon the compute+storage baseline.

For generations in the software business, Microsoft has been a major provider of platforms. With its launch in 1990, Windows itself was arguably the first broadly adopted “application platform”. Since the early 90’s, specialization and evolution have resulted in a proliferation of platforms in the industry – we have client platforms, server platforms (expanding to include the hypervisor), web platforms (IIS+ASP.NET, Apache+PHP), data platforms, mobile platforms, and so on. And beyond app platforms, since Dynamics, Microsoft has also been in the business of offering applications, and it’s here we see the fractal nature of the space. The applications can act as platforms for a particular set of extensions. In any case, it’s clear that Microsoft has offerings in all those spaces, and more.

Beneath the applications like Dynamics, and beneath the traditional application platforms like Windows + SQL Server + IIS + .NET, Microsoft has continued to deliver the foundational infrastructure, specifically to enable other alternative platforms. Oracle RDBMS and Tomcat running on Windows is a great example of what I mean here. Sure, Microsoft would like to entice customers to adopt the entirety of their higher-level platforms, but the company is willing to make money by supplying lower-level infrastructure for alternative platforms.

Considering that history, the rumor that Microsoft is “upping efforts on infrastructure as a service” should not be surprising.  Microsoft has long provided offerings at many levels of “the stack”.  I expect that customers have clearly told Microsoft they want to run VHDs, just like on EC2, and Microsoft is responding to that.  Not everyone will want this; most people who want this will also want higher-level services.  I still believe strongly in the value of higher-level cloud-based platforms.

Platform differentiation in the Age of Clouds

It used to be that differentiation in server platforms was dominated by the hardware. There were real, though fluctuating and short-lived, performance differences between Sun’s Sparc, HP’s PA-RISC, IBM’s RIOS and Intel’s x86. But for the moment, the industry has found an answer to the hardware question; servers use x64.

With standard high-volume servers, the next dominant factor for differentiation was the application programming model. We had a parade of players like CORBA, COM, Java, EJB, J2EE, and .NET. More recently we have PHP, node.js, Ruby, and Python. The competition in that space has not settled on a single, decisive winner, and in my judgment, that is not likely to happen. Multiple viable options will remain, and the options that enjoy relatively more success share some common attributes: ease of programming (e.g., building an ASP.NET service or a PHP page) is favored over raw performance (building an ISAPI extension or an Apache module in C/C++), and flexibility of the model (JSP/Tomcat/RESTlets) is favored over more heavily prescriptive metaphors (J2EE). I expect the many options in the server platform space to continue; the low cost of developing and extending these platform alternatives means there is no natural economic pressure toward convergence, as there was in server hardware, where R&D costs are relatively high.

Every option in the space will offer its own unique combination of strengths, and enterprises will choose among them. One company might prefer strong support for running REST services, while another might prefer the application metaphor of Ruby on Rails. Competition will continue.

But programmer-oriented features will not be the key differentiator in the world of cloud-hosted platforms. Instead, I expect operational and deployment issues to dominate:

  • How reliable is the service?
  • How difficult is it to provision a new batch of servers?
  • How flexible is the hosting model? Sometimes I want raw VMs, sometimes I want higher-level abstractions. I might want to manage a “farm” of servers at a time, or even better, I might want to manage my application without regard for how many VMs back it.
  • How extensive are the complementary services, like access control, messaging, data, analysis, and so on?
  • What kind of operational data do I get out of that farm of servers? I want to see usage statistics and patterns of user activity.

It won’t be ease of development that wins the day.

Amazon has been very disruptive with AWS, and Microsoft is warming to the competition. This is all good news for the industry. It means more choices, better options, and lower costs, all of which promote innovation.

HTTP apps? REST? JSON? XML? AJAX? Fiddler is invaluable

For developers, having access to the proper tools, and knowing how to use them, is invaluable. For any sort of communications application, I find Fiddler2 to be indispensable. It is billed as an “HTTP Debugging Proxy”, but ignore that – the main point is that it lets a developer or network engineer see the HTTP request and response messages, which means you can get a real understanding of what’s happening. It’s Wireshark for HTTP.

As an added bonus, it sniffs SSL traffic, and can transparently display JSON or XML in a human-friendly manner.

The name Fiddler refers to the ability to “fiddle” with requests as they go out, or responses as they arrive. This can be really helpful in trying what-if scenarios.

Free, too.  Thanks to EricLaw for building it.

I want to point out: this tool is not commercial, there’s no training course for it, there’s no vendor pushing it. It illustrates that developers need to train themselves, to stay current and productive. They need to keep their eyes open and add to their skills continuously, in order to stay valuable to their employers.

Apigee’s Best Practices for REST API design

I just read Apigee’s paper on pragmatic RESTful API design.

Very sensible, practical guidance. Good stuff for organizations confronting the REST phenomenon. There are obviously many REST-based interfaces out there; Facebook, Google, Digg, Reddit, and LinkedIn are just a few of the more visible services – mostly social networks – that support REST. But of course there is real value for enterprises in exposing resources the same way. Wouldn’t it be nice if public records were exposed by your municipal government via REST? How many times have you wanted the data from a hosted app – what we used to call an “application service provider” – in a machine-comprehensible format, instead of in an HTML page?

It’s worth examining the results the pioneers have achieved, to benefit from their experience.

As pioneers rushing to market, the designers of these early social network APIs may have sacrificed some quality in design, for speed of delivery.  Understandable. Apigee’s paper critiques some of those designs, and describes some of the rough edges. It’s like sitting in on a design review – and it’s an excellent way to learn.

Once you “get” REST, it all makes sense. It falls into place and the design principles and guidance offered by Apigee will seem like second nature. But for those grappling with a novel problem, it’s good to have a firm foundation from which to start.
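To give a flavor of what “getting” REST looks like in practice, here is the kind of contrast the paper draws, in the spirit of its /dogs running example (the URLs are illustrative sketches, not a prescription):

```
Pragmatic REST: nouns name resources, HTTP verbs carry the actions
  GET    /dogs          list the dogs
  POST   /dogs          create a new dog
  GET    /dogs/1234     fetch dog 1234
  PUT    /dogs/1234     update dog 1234
  DELETE /dogs/1234     delete dog 1234

The RPC-ish style it critiques: verbs baked into ever-multiplying URLs
  GET /getAllDogs
  GET /createDog?name=rex
  GET /deleteDog?id=1234
```

Two resource URLs and four verbs replace an open-ended list of method names – that economy is most of what the design guidance buys you.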


What drives the demand for continuous change?

Lately, it seems, no system is ever “finished”.  You are only running “this week’s build”.  And this is how we want it!  What drives the demand for continuous evolution of information systems?

In my opinion, it’s the possibilities: the possibility for interconnections among disparate systems, stakeholders, and devices. This model of extreme interconnectivity is enabled by standard protocols and data formats, and it is the single most striking change in IT from 4 years ago. There was a time when you needed to buy your CPU and your hard disk drive from the same manufacturer, or they wouldn’t work together. And can you believe we actually had vendor-specific networking technology? Does anyone remember DECnet and IBM’s Token Ring?!

Just ten years ago, Scott McNealy, then CEO of Sun Microsystems, was criticizing .NET as “Not yet” or “Dot Not”. His line was that .NET was a “lock in” strategy. Lock in!  Remember that?  Java was proposed as the way to avoid “vendor lock in”.  Does anyone really think about vendor lock-in any more?

We have come a long, long way. Rather than worrying about evading vendor leverage, CIOs are interested in proactively solving business problems, and they realize that means interconnecting disparate systems. It means buying what they can buy, building the rest, and forging as many connections as the business needs. It means relying on JSON, XML, and REST – messaging, rather than elegant distributed object models like CORBA or Java-everywhere – as the preferred way to connect systems.

The interconnectivity enabled by that practical approach is the impetus for continuous change.

Open standards and defined data formats allow the interconnections that produced the explosion in possibilities for building software systems. Any developer today can perform lookups on Google’s data services to do geolocation.  It is straightforward to use Bing maps to display a color-coded map of sales results by country, or a map of the density of clients by county.  This stuff was exotic or expensive just a few years ago, and now, because we can interconnect systems so easily, the state of the art has advanced to the point where the business demands this sort of analysis and intelligence.  Look at Tableau Software – they are a terrific example of a company exploiting this trend.

Analysis of business data, right now: when a new opportunity opens up, I want to be able to analyze it, right now.

But there is still so much more upside. Just the other day I was speaking to a sales manager who bemoaned the inability of his IT staff to produce a report he needed. But that situation is as unnecessary as it is frustrating.  Why is he relying on someone else to produce his reports? He should have access to his own business data, the way he wants it!  He should have desktop business intelligence. He shouldn’t have to wait for his monthly staff meeting to see the data.

There’s lots more to come.


Sideways Scrum

In a previous post, I talked about the change in software development approaches over the past 15 years. It has been slow, but in aggregate, the effect is striking. People are doing iterative development now, and succeeding with it. But despite the growing body of evidence in support of these approaches, in some cases it’s still difficult to get an organization to adopt them, to follow along.

There are lots of approaches to getting an organization to move – there are courses in MBA programs designed around this very problem. Given the efficiency that iterative management makes possible, managers need to take the lead.

I have a story of one such reluctant organization. This is a successful business; it’s been running for years on a business system that was created in-house. The business is going well – they weathered the 2008-10 downturn well and are now looking to grow. They’re nicely profitable and well-managed. They know they need a new business information system to support the growth they’re seeking, and they’re willing to commit significant capital investment to produce it. All sounds good, right?

The problem is this: while the business leaders are creative, energetic, and interested in aggressively pursuing the business opportunity they see, the IT and development staff in the organization are reluctant. They have done waterfalls for so long that they eat, sleep, and breathe formal requirements-analysis documents. They are accustomed to large meetings, consensus-building, and deliberative efforts. This old-school approach results, frankly, in an inability to execute. The engineering team cannot seem to make any progress on the project, and when they do make progress, they have no good way of demonstrating it to the boss, the guy who is paying for the effort. It’s frustrating, and the boss is not so sure that his investment in the new business system is being managed wisely.

Often, strategy and business leaders are inclined to delegate – to let the development teams do what they do, and that’s just what was happening in this case. They stood back, figuring that participating too actively in the development effort would be distracting.  But the business leaders were unsatisfied with the progress on the project. This was the catalyst for action.

At first they tried to persuade the dev teams to get more Scrummy, more agile. There was some grudging movement, but no real commitment. Progress was still slow. Meetings continued to be painfully large and inefficient.

Finally the boss, the CEO, decided to get tough.  He insisted on a status update meeting, every other week.  He wanted the meeting to be 30 minutes only, and he wanted to focus on measuring and demonstrating progress. He insisted that the meeting be attended by only a small number of people, and that the dev team conceive and use a clear metric for measuring its progress that they could show him during these meetings. He insisted on seeing actual working demonstrations of the application.  

None of this fit the path the dev team was intent on. But, notably, it all fits very well with Scrum. The demand for measurable progress is exactly what a burndown chart captures. The working demonstrations require a sprint-like approach. The small cross-functional team is right out of Scrum. He’s getting Scrum, sideways.

Now, this isn’t a perfect situation. It would be nice if the organization could adopt Scrum “openly”; it’s almost as if the CEO had to sneak it in sideways. But this may just work. His demand for this sort of status update, with regular progress reports and demonstrable code, has moved the dial. It may be the thing that convinces the conservative dev team that it is safe to be more agile.

It will be interesting to watch…

Twitter and OAuth from C# and .NET

In some of my previous posts I discussed Twitter and OAuth. The reason I sniffed the protocol enough to describe it here was that I wanted to build an application that posted tweets. Actually, I wanted to go one better – I wanted an app that would post tweets and images, via Twitter’s /1/statuses/update_with_media.xml API call. This was for a screen-capture tool. I wanted to be able to capture an image on the screen and “Tweet it” automatically, without opening a web browser, without “logging in”, without ever seeing the little Twitter birdy. In other words, I wanted the screen-capture application to connect programmatically with Twitter.

It shouldn’t be difficult to do, I figured, but when I started looking into it in August 2011, there were no libraries supporting Twitter’s update_with_media.xml, and the documentation from Twitter was skimpy. This is just par for the course, I think. In this fast-paced social-media world, systems don’t wait for standards or documented protocols. Systems get deployed, and the actual implementation is the documentation. We’re in a hurry here! Have at it, is the attitude, and good luck! Don’t think I’m complaining – Facebook’s acquisition of Instagram provides one billion reasons why things should work this way.

With that as the backdrop, I set about investigating. I found some existing code bases that were, how shall I put it, a little crufty. There’s a project on Google Code that held value, but I didn’t like the programming model. There are other options out there, but nothing definitive. So I started with the Google Code project and reworked it to provide a simpler, higher-level programming interface for the app.

What I wanted was a programming model something like this:

// the ideal: one extra call to generate the OAuth Authorization header
var request = (HttpWebRequest)WebRequest.Create(url);
var authzHeader = oauth.GenerateAuthzHeader(url, "POST");
request.Method = "POST";
request.Headers.Add("Authorization", authzHeader);
using (var response = (HttpWebResponse)request.GetResponse())
{
  if (response.StatusCode != HttpStatusCode.OK)
    MessageBox.Show("There's been a problem trying to tweet:", "!!!");
}

In other words, using OAuth adds one API call. Just generate the header for me! But what I found was nothing like that.

So I wrote OAuth.cs to solve that issue.  It implements the OAuth 1.0a protocol for Twitter for .NET applications, and allows desktop .NET applications to connect with Twitter very simply.

That code is available via a TweetIt app that I wrote to demonstrate the library.  It includes logic to run a web browser and programmatically do the necessary one-time copy/paste of the approval pin.  The same OAuth code is used in that screen-capture tool.

OAuth is sort of involved, with all the requirements around sorting and encoding and signing and encoding again.  There are multiple steps to the protocol, lots of artifacts like temporary tokens, access tokens, authorization verifiers, and so on.  But a programmer’s interface to the protocol need not be so complicated.
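To give a flavor of what a library like OAuth.cs has to wrap, here is a minimal sketch of the signing step – the sort-encode-sign dance at the heart of OAuth 1.0a. This is illustrative, not the library’s actual code: the parameter set is abbreviated, and a complete implementation needs stricter RFC 3986 percent-encoding than Uri.EscapeDataString provides for a few characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class OAuthSketch
{
    // OAuth 1.0a requires RFC 3986 percent-encoding; Uri.EscapeDataString
    // covers the common cases, which is enough for this sketch.
    static string Encode(string s)
    {
        return Uri.EscapeDataString(s);
    }

    public static string Sign(string method, string url,
        SortedDictionary<string, string> oauthParams,
        string consumerSecret, string tokenSecret)
    {
        // 1. Sort the parameters by name (SortedDictionary does this),
        //    encode each key and value, and join as k1=v1&k2=v2...
        var paramString = string.Join("&",
            oauthParams.Select(kv => Encode(kv.Key) + "=" + Encode(kv.Value)));

        // 2. Build the signature base string: METHOD&url&params,
        //    with the url and the parameter string encoded again.
        var baseString = method.ToUpperInvariant() + "&" +
            Encode(url) + "&" + Encode(paramString);

        // 3. HMAC-SHA1 over the base string, keyed by
        //    "consumerSecret&tokenSecret", then base64 the result.
        var key = Encode(consumerSecret) + "&" + Encode(tokenSecret);
        using (var hmac = new HMACSHA1(Encoding.ASCII.GetBytes(key)))
        {
            var hash = hmac.ComputeHash(Encoding.ASCII.GetBytes(baseString));
            return Convert.ToBase64String(hash); // becomes oauth_signature
        }
    }
}
```

The resulting string goes into the oauth_signature field of the Authorization header, alongside the nonce, timestamp, and tokens. All of that plumbing is exactly what a one-call GenerateAuthzHeader should hide.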


The Quiet Revolution in Software Development

There’s a natural human resistance to change. Everyone has it; everyone is subject to it. Some of us are more aware than others of our own tendencies to resist change unconsciously. But by and large, all of us like to minimize surprises, and like to feel that we are in control. We have enough going on, right? Especially in a work environment, where compensation is dictated by achievement and performance is judged and weighed, we don’t like to push the envelope lest we fail. We might lose that pay raise; we might even lose our jobs.

So when a new approach to project management comes along, it’s not surprising to find resistance.  It’s the conservative approach, and there’s a lot to be said for being consciously conservative in business.

On the other hand, software project management is just screaming for a new approach. The domain is novel enough that the analogues we’ve tried to apply – software as system design, software development as building architecture, distributed systems development as city planning – have always been less than satisfactory. Yes, software development is a little bit like those things, but it is a lot unlike them too. If we blindly attempt to lay models from those domains onto software development, we’ll fail.

Not only is software unique, it is also evolving rapidly. This is cliché, but the implications are sometimes overlooked. Developing a software project today is much, much different from developing a software project 15 years ago, even in the same industry. In 1997, the web was hot, and everyone wanted to figure out how to web-enable their business systems. These days, the web is the platform. Where before we were delighted to be free of green screens, now we demand integration with mobile consumer-oriented devices. Building inspectors want to bring their iPads to job sites to fill out forms, take pictures, and submit their reports over the cell network. These use cases were firmly in the realm of miracle only a few years ago. Now they are de rigueur.

And the ever-expanding list of demands – for more and more connections, more integration, front-ends, back-ends, reporting systems, feedback systems – this explosion of possibility has implications for how we execute software projects. Not only is the list expanding, but it is also ever-shifting.  This is why the building analogy fails: buildings last for years, while we design software expecting to re-design it or extend it in 4 months. We expect it!  There is a demand for constant change, a demand for more or less continuous evolution of business systems.

The waterfall – the comfortable, conservative, well-known approach where there are clear handoffs, lots of documents describing exactly what is happening when, lots of reports, formalized requirements documents, many review meetings – that model simply cannot work any longer, not with the changes in software we’ve seen. This is a model that made sense in projects where testing was expensive and slow, driven by humans. With those economics, it made sense to make sure the plan was rock solid and air tight before we took the first step.

But that model no longer serves us. There’s been a slow but undeniable revolution in software development processes, driven not by hype or vendor-manufactured demand, but by a real improvement in results. I’m talking about Scrum and agile methods: iterative approaches that favor learning as you go, with lots of automated testing driving many small corrections, rather than a rigorous, lengthy planning process upfront. Software projects that use these methods are more likely to succeed today than projects using the old-school waterfall methods, if we judge success as on-time, on-budget, and meeting requirements.

Software companies – Google, Microsoft, games companies, and other organizations that make their money mostly or wholly from software – know this. They’ve been steadily and quietly increasing their commitment to test-driven development, sprints, and Scrummy project management. This isn’t about new products – it’s about new practices.

But larger companies that aren’t in the software business – the ones that think of themselves as manufacturing companies, or financial services companies, or healthcare providers, or telecom – some of these have been slower to adopt these practices. Conservative business people run these companies and they have good reason to tread carefully.

But I’ve got news for you: Scrum is now conservative. It just works better. It’s not hard to do, though it does require some new thinking. You don’t need a squad of A players to pull this off. You don’t need to raid Microsoft’s dev teams. You can do this with competent developers and competent project managers – with the B and C people most companies in the world are stocked with. In light of this, any software project manager or CIO who prefers to lean toward waterfall methods for new development efforts is taking on unnecessary risk.

Yes, there’s a hesitancy to embrace new things when large sums of money are at stake. Rightly so.  But Agile and Scrum are no longer new.  They are no longer unproven.  You’ve been standing by the side of the pool long enough.  It’s time to jump in the water.


Big Data: Real benefits or Hype?

I’m a technologist. I believe technology, well utilized, can advance business goals. A business can derive a significant advantage from making the right technology moves, exploiting information in just the right way.

But I am a bit skeptical of the excitement in the industry around Big Data, MapReduce, and Hadoop. While Google has obviously derived great benefit from MapReduce over the years, Google is special. Most businesses do not look like Google, and do not have information-management requirements similar to Google’s. Google custom-constructs its PCs. At Google, the unit of computer hardware deployment is “the warehouse.”

If you underwrite insurance, or process medical records, or do scanning of transactions for fraud, or logistics optimization, or statistical process control, or any one of a variety of other typical business information tasks, your company is very much not like Google. If you don’t have hundreds of millions of users, generating billions of transactions, then you’re not like Google, and you should not try to emulate their technology strategy. Big Table is not for you, MapReduce is not something that will give you a strategic advantage.

Big Data seems to be the industry’s next touchstone. Everyone feels they need to “check the box.” There’s lots of interest from buyers, so vendors believe they need to talk about it. The tech press, with its persistently positive view of Google, encourages this. Breathless analyst reports fuel the flames. CS programs at universities teach MapReduce in first-year courses. Devs put MapReduce on their resumes. All of this combines to produce a self-reinforcing cycle.

But for most CIOs, MapReduce is a distraction. In this view, I am persuaded by DeWitt, Stonebraker, et al. CIOs should be focusing on figuring out how to better utilize the databases they already have. Figure out cloud, and figure out how to improve management and governance of IT projects. Are you agile enough? Are you doing Scrum? Figure out what major pieces you can buy from your key technology partners.

I have read user stories of people using MapReduce to scan through log files – tens of gigabytes of log files. Seriously? Tens of gigabytes fits on a laptop hard drive. Unless you are talking about multiple terabytes of information, MapReduce is probably the wrong tool.
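To put that in perspective, the log-file case needs nothing more than a single-machine streaming scan, something like the sketch below (the file path and search string are placeholders). Because it reads one line at a time, memory use stays flat no matter how big the file is, and a laptop-class disk chews through tens of gigabytes this way without any cluster.

```csharp
using System.IO;

class LogScan
{
    // Count lines containing a search string. StreamReader.ReadLine
    // streams the file, so memory stays constant regardless of size.
    public static int CountMatches(string path, string needle)
    {
        int count = 0;
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                if (line.Contains(needle))
                    count++;
        }
        return count;
    }
}
```

If the scan is I/O-bound, a cluster of twenty machines buys you nothing a second disk wouldn’t.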

If you are doing analysis of the human genome, or weather modelling, or if you work for NSA or Baidu, then yes, you need MapReduce.  Otherwise, Big Data is not yet mainstream.