How not to do APIs; and … My New Job

Having a quick way to get dictionary lookups of English words while using the computer has always been useful to me. I don’t like to pay the “garish dancing images” ad tax that comes with using commercial dictionary sites; or worse, the page-load-takes-10-seconds tax. I respect Merriam-Webster and appreciate all they’ve done for the English language over the years, but that doesn’t mean I am willing to put up with visual torture today. No, in fact I do not want to download a free e-book, and I do not want to sign up for online courses at Capella University. I just want a definition.

Where’s the API?

I’m a closet hacker, so I figured there had to be an API out there that would let me build something to do dictionary lookups from whatever application I was currently using.

And looking back, this is an ongoing obsession of mine, apparently. In my job with Microsoft about 10 years ago, I spent a good deal of my time writing documents in Microsoft Word. I had the same itch then, and for whatever reason I was not satisfied with the built-in dictionary. So I built an Office Research Plugin, using the Office Research Services SDK (bet you never heard of that), that would screen-scrape the Merriam-Webster website and allow MS-Word to display definitions in a “smart pane”. The guts of it were ugly, and brittle: every time M-W changed its page layout, the service would break. But it worked, and it displayed nicely in Word and other MS-Office programs.

The nice thing was that Microsoft published an interface that any research service could comply with, allowing the service to “plug in” to the Office UI. The idea behind Office Research Services was similar to the idea behind Java Portlets, or SharePoint WebParts, or even Vista Gadgets, though it was realized differently than any of those. In fact the Office Research Services model was very weird, in that it was a SOAP service, yet the payload was an encoded XML string. It wasn’t even xsd:any. The Office team failed on the data model there.
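To make that complaint concrete, here is a rough reconstruction of the shape of the payload – from memory, with hypothetical element names, not the literal wire format:

```python
# A reconstruction of the double-encoding in the Office Research
# Services protocol: the query XML traveled as an escaped string
# inside the SOAP body. Element names here are hypothetical.
from xml.sax.saxutils import escape

inner_query = "<QueryText>persnickety</QueryText>"

soap_body = (
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soap:Body><Query>"
    # XML-in-a-string, rather than a typed element or even xsd:any
    f"<queryXml>{escape(inner_query)}</queryXml>"
    "</Query></soap:Body></soap:Envelope>"
)

# The receiving service had to unescape and re-parse the inner
# document itself, defeating the point of having a typed contract.
```

Even xsd:any would at least have let the payload travel as XML, instead of as a string the service has to re-parse.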

It was also somewhat novel in that the client, in this case the MS-Office program, specified the interface over which the client and service would interact. Typically the service is the anchor point of development, and the developer of the service therefore has the prerogative to specify the service interface. For SOAP this specification was provided formally in WSDL; these days the way to connect to a REST service can be specified in WADL or plain-old documentation. The Office Research Service was different: the client specified the interface that research services needed to comply with. To me, this made perfect sense: rather than one service and many types of clients connecting to it, the Office Research Service model called for one client (Office) connecting to myriad different research services. It was logically sensible for the client to specify the interface. But I had some discussions with people who just could not accept that. I also posted a technical description showing how to build a service that complied with the contract specified by Office.

The basic model worked well enough, once the surprise of the-client-wears-the-pants-around-here wore off. But the encoded XML string was a sort of an ugly part of the design. Also, like the Portlet model, it mixed the idea of a service with UI; the service would actually return UI formatting instructions. This made sense for the specific case of an Office service, but it was wrong in the general sense. Also, there was no directory of services, no way to discover services, to play with them or try them out. Registration was done by the service itself. There are some lessons here in “How not to do APIs”.

Fast Forward 9 Years

We’ve come a long way. Or have we? Getting back to the dictionary service: Wordnik has a nice dictionary API, and it’s free! (For reasonable loads, or until the good people at Wordnik change their minds, I guess.)

It seemed that it would be easy enough to use. The API is clear; they provide a developer portal with examples of how to use it, as well as a working try-it-out console where you can specify the URLs to tickle and see the return messages. All very good for development.

But… there are some glaring problems. The Wordnik service requires registration, at which point the developer receives an API key. That in itself is standard practice. But, confoundingly, I could not find a single document describing how to use this API key in a request, nor a single programming example showing how to use it. I couldn’t even find a statement explicitly saying that authentication was required. While the use of an API key is pretty standard, the way to pass the API key is not. Google does it one way; other services do it another. The simplest way to document the authentication model is to provide a short paragraph and show a few examples. Wordnik didn’t do this. Doesn’t do this. I learned how to use the Wordnik service by googling and finding hints from 2009 on someone else’s site. That should not be acceptable for any organization offering a programmable service.
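For the record, here is the pattern I eventually pieced together, as a sketch: Wordnik’s v4 API takes the key as an api_key query parameter, if the hints I found are still accurate. The key below is obviously a placeholder.

```python
# A minimal sketch of a Wordnik v4 definition lookup, stdlib only.
# The api_key query parameter is how the key gets passed -- the very
# fact the docs never stated. The key value here is a placeholder.
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_API_KEY_HERE"  # issued at registration

def definition_url(word):
    qs = urllib.parse.urlencode({"limit": 3, "api_key": API_KEY})
    return ("https://api.wordnik.com/v4/word.json/"
            f"{urllib.parse.quote(word)}/definitions?{qs}")

def lookup(word):
    # Returns a list of definition strings for the word
    with urllib.request.urlopen(definition_url(word)) as resp:
        return [d.get("text") for d in json.load(resp)]
```

That’s it: one query parameter. A paragraph and an example like this in the developer portal would have saved me an evening.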

In the end I built the thing – it’s an emacs plugin, if you must know, using url.el and json.el to connect to Wordnik and display the definition of the word at point. But it took much longer, and involved much more frustration, than was necessary. That is not an experience any API provider should want its developers to have.

The benefits of interconnecting disparate piece-parts have long been tantalizingly obvious. You could point to v0.9 of the SOAP spec in 1999 as a key point in that particular history, but IBM made a good business of selling integration middleware (MQ, they called it) for at least a decade before that. Even so, only in the past few years have we seen the state of the art develop to enable this at internet scale. We have cheap, easy-to-use server frameworks; elastic hosting infrastructure; XML and JSON data formats; agreement on pricing models; an explosion in mobile clients.

Technologies and business needs are aligning, and this opens up new opportunities for integration. To take advantage of these opportunities, companies need the right tools, architectural guidance, and solid add-on building blocks.

I’ve recently joined Apigee, the API Company, to help companies exploit those opportunities.  I’ll say more about my job in the future. Right now, I am very excited to be in this place.


AWS “High I/O” EC2 instances

A while back I commented on Amazon’s DynamoDB and disagreed with the view that using SSD for storage was a “radical step.” In my comments, I predicted:

We will see SSD replace mag disk as a mainstream storage technology, sooner than most of us think.


Amazon will just fold [SSD] into its generic store.

Now, Amazon has announced the availability of “high I/O” instances of EC2. They offer 2 TB of SSD-backed storage local to the instance, visible to the OS as a pair of 1 TB volumes.

Was that sooner than you thought?

Next question:  which compute tasks are not well-suited to deployment on “high I/O” instances of EC2?

The only reason Amazon describes these instances as “high I/O” is that it has a ton of existing magnetic disk already deployed. We should all begin to think of SSD-backed storage as “standard”, and magnetic platters as “low I/O”. People will rapidly refuse to pay the magnetic disk tax: it’s silly to pay for CPU that sits waiting for heads to reach the right location on a spinning platter.

Going forward, the “High I/O” moniker will disappear, as it will be cheaper for Amazon to deploy and operate SSD. There may be a price premium today for “High I/O” but that is driven by temporary scarcity, not by actual operational costs.

What Amazon will do with all its magnetic drives is an open question, but be assured it will turn them off. The savings in A/C costs alone, associated with dissipating the heat generated by mechanical drives, will compel Amazon to transition rapidly to all-SSD.



HBR Blogger says Projects built on Short, Iterative cycles might just work

Harvard Business Review publishes an aggregated blog.  I like the content there, because there are varied perspectives from different writers, and the posts span a number of different topic areas. Recommended.

Better than Waterfall?

Recently, Jeff Gothelf posted an item comparing the old-school waterfall software project management approach with a novel approach based on short, iterative development cycles and lots of customer feedback.

He thinks this new, unnamed project management model might actually bring some benefits!! Hmm, I’ll have to check it out. 😉


NoSQL is apparently NOT going to deliver World Peace

Peter Wayner at InfoWorld has articulated “Seven Hard Truths” about NoSQL technologies. Among them:

  • It’s nice to have JOINs; NoSQL has none.
  • It’s nice to have transactions.
  • After 30 years of development, SQL databases have some solid features, like the query analyzer.
  • NoSQL is like the Wild West; SQL is civilization.
  • Gee, there sure are a lot of tools oriented toward SQL databases.

Interesting synthesis, but nothing really novel here. Michael Stonebraker articulated these things in 2009, and lots of people who’ve built information-driven companies on traditional SQL datastores in the past 6 or 7 years have had the same insight, even if they didn’t bother to articulate it this way.

SQL Databases work. There are lots of tools, people know how to use and optimize them. The metaphor is well understood, the products are mature, the best practices are widely disseminated throughout the industry. None of these are true with the big NoSQL technologies.

There is value in NoSQL. Some very successful companies have considered SQL stores and dismissed them as inappropriate or inadequate to their tasks. Facebook, Google and Twitter have famously used NoSQL to accomplish things that would not be possible with technology that has evolved to serve the needs of traditional transaction-oriented enterprises.

Ironically, the shiny object that is NoSQL has now captured the attention of IT people in traditional enterprises, the very audience that the designers of NoSQL technologies ignored when producing their solutions. Does this make sense?

Yes, there’s a place for NoSQL. No, it will not take over the world, or replace the majority of enterprise data storage needs, anytime soon.  There are opportunities to take advantage of new technologies, but unless you are the next Twitter (and let’s face it, you’re not…) you probably do not need to emulate the Twitter data architecture. What you should do is combine your existing SQL data strategy with small doses of NoSQL, deployed tactically where it makes sense.


PHP Makes People Sad

Just read an enjoyable rant entitled PHP: a fractal of bad design by a nerd who calls himself Eevee.

A good effort!

I also found a link to PHP Sadness there, and a bunch of other links to sites that complain about PHP.

This kind of criticism is correct, and valid, but it’s also pretty common, and low-hanging fruit.  I mean, come on.  We all know this stuff, right?  We just haven’t bothered to catalogue all the problems.

The other problem with this criticism is … reality. PHP has had these problems since forever, and if they were really so significant, no one would use it at all. So there is something of value in PHP, and some part of its design is helping people get things done.

Yes, there are a million pitfalls.  Yes, there is a lack of consistency across a broad swath of the PHP built-in libraries. But apparently the people using PHP don’t suffer all that much for it.  Lots of people use PHP to build simple systems quickly, without getting all tangled up about whether to check exceptions or not, or whether “1” is the same as 1.

If you want a beautifully designed and consistent programming language environment, PHP is not it.  Ok, then.  Move along. No one is forcing you to use PHP.


Google’s Compute Engine: do you believe it?

Google has become the latest company to offer VM hosting, joining Microsoft (Azure) and Amazon (AWS), along with all the other “traditional” hosters.

Bloomberg is expressing skepticism that Google will stick with this plan.  Who can blame them? If I were a startup, or another company considering a VM hoster decision, I’d wonder: Does Google really want to make money in this space, or is it just trying to take mindshare away from Amazon and Microsoft?

Google still makes 97% of its revenue and a similar proportion of its profit from advertising. Does cloud computing even matter to them? You might say that Amazon is similar: the company gets most of its revenue from retail operations. On the other hand, Jeff Bezos has repeatedly said he is investing in cloud compute infrastructure for the long haul, and his actions speak much louder than those words. Clearly Amazon is driving the disruption. Microsoft for its part is serious about cloud because competing IaaS threatens its existing core business. Microsoft needs to do well in cloud.

As for Google – do they even care whether they do well with their IaaS offering?

Bloomberg’s analysis resonates with me. Google has sprinkled its magic pixie dust on many vanity projects: phone OS, tablets, blogging, Picasa, web browsers, social networking, Go (the programming language). How about Sketchup? But it really doesn’t matter if any of those projects succeed. All of them added up together are still irrelevant in the shadow of Google’s Ad revenue. The executive management at Google know this, and act accordingly.

Would you bet on a horse no-one cares about?


Sauce Labs explains its move from NoSQL CouchDB to old-skool MySQL

Sauce Labs has rendered a valuable service to the community by documenting the factors that went into a decision to change infrastructure – to replace CouchDB with MySQL.

Originally the company had committed to CouchDB, a novel NoSQL store. CouchDB is termed a “document store”, and if you are a fan of REST and JSON, this is the NoSQL store for you.

Every item in CouchDB is a map of key-value pairs, nested arbitrarily deep. Apps retrieve objects via a clean REST API, and the data is JSON. Very nice, easy to adopt, and, with the ubiquity of JSON parsers, CouchDB is easy to access from any language or platform. Speaking of old-school, I built a demo connecting from Classic ASP / JavaScript to CouchDB – it was very easy to produce. I also wrote small clients in PHP, C#, and Python – all of them are first-class citizens in the world of CouchDB.
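To make that concrete, here is roughly what talking to CouchDB looks like from Python, stdlib only. The server address and database name are hypothetical; the URL conventions and JSON bodies are CouchDB’s standard HTTP API.

```python
# A minimal sketch of CouchDB's REST conventions. Assumes a server at
# localhost:5984 and a database named "words" -- both hypothetical.
import json
import urllib.parse
import urllib.request

COUCH = "http://localhost:5984"

def doc_url(db, doc_id):
    # Every document is addressable as /<db>/<doc_id>
    return f"{COUCH}/{db}/{urllib.parse.quote(doc_id, safe='')}"

def get_doc(db, doc_id):
    # GET returns the document as JSON, including its _id and _rev
    with urllib.request.urlopen(doc_url(db, doc_id)) as resp:
        return json.load(resp)

def put_doc(db, doc_id, body):
    # PUT creates or updates; CouchDB replies with {ok, id, rev}
    req = urllib.request.Request(
        doc_url(db, doc_id),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

No driver, no schema, no connection pool to configure: just HTTP verbs and JSON, which is why clients in any language are so easy to write.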

It really is a very enjoyable model for a developer or systems architect.

For Sauce Labs, though, the bottom line was – drumroll, please – that CouchDB was immature.  The performance was not good. Life with incremental indexes was … full of surprises.  The reliability of the underlying data manager was substandard.

Is any of this surprising?

And MySQL is not exactly the recognized leader in database reliability and enterprise readiness, which makes the move by Sauce Labs even more telling.

Building blocks of infrastructure earn maturity and enterprise-readiness through repeated trials.  Traditional relational data stores, even open source options, have been tested in real-world,  high-load, I-don’t-want-to-be-woken-up-at-4am scenarios. Apparently CouchDB has not.

I suspect something similar is true of other NoSQL technologies, including MongoDB, Hadoop, and Cassandra. I don’t imagine they suffer from the same class of reliability problems reported by Sauce Labs; instead, these pieces lack maturity and fit-and-finish in other ways. How difficult is it to partition your data? What are the performance implications of structuring a column family a certain way? What kind of network load should you expect for a given deployment architecture? These questions are not yet well understood in the NoSQL world.

Yes, some big companies run their businesses on Hadoop and other NoSQL products. But chances are, those companies are not like yours. They employ high-dollar experts dedicated to making those things hum. They’ve pioneered much of the expertise of using these pieces in high-stress environments, and they paid well for the privilege.

Is NoSQL ready for the enterprise?

Ready to be tested, yes. Ready to run the business?  Not yet.

In any case, it’s very valuable for the industry to get access to such public feedback. Thanks, Sauce Labs.

RedMonk’s Analysis of Microsoft Surface is Naive

Stephen O’Grady of RedMonk, an analyst firm, looked at Microsoft Surface and concluded that the business model around software is in long-term decline.

…another indication that software on a stand alone basis is a problematic revenue foundation.

Mr O’Grady’s analysis is naive. It casts software as a business in itself, rather than a tool that large companies use in support of their business strategy.

Arthur C. Clarke, the sci-fi writer, is reputed to have observed that “Any sufficiently advanced technology is indistinguishable from magic.”

In that spirit, I observe that any sufficiently advanced technology company uses a unique combination of software, hardware, and services in pursuit of its business strategy.

Mr O’Grady is hung up on how a company monetizes its intellectual property. He distinguishes Google from Microsoft on the basis of their monetization strategy: Google makes most of its revenue and profit selling ads, while for Microsoft, the revenue comes primarily from software product licenses.

It’s a naive, shallow distinction.

For a while, the “technology space” was dominated by companies that produced technology and then tried to make money directly, by selling technology – whether that was hardware or software. But things have not been so simple, for a long while.

Mr O’Grady accurately points out that early on Microsoft chose to hedge across hardware technology companies, selling software and making money regardless of who won the hardware war.

Less famously, IBM tried competing in the high-volume hardware and software arenas (PCs, OS2, Lotus, VisualAge, etc) before adopting a similar zag-while-they-zig strategy. IBM chose to focus on business services back in the 90’s, steadily exiting the PC business and other hardware businesses, so that regardless which hardware company won, and regardless which software company won, IBM could always make money selling services.

Microsoft and IBM adopted very similar strategies, though the anchors were different. They each chose one thing that would anchor the company, and they each explicitly chose to simply float above an interesting nearby competitive battle.

This is a fairly common strategy. All large companies need a strategic anchor, and each one seeks sufficient de-coupling to allow some degree of competitive independence.

  • Cisco bet the company on networking hardware that was software and hardware agnostic.
  • Oracle bet on software, and as such has acted as a key competitor to Microsoft since the inception of hostilities in 1975. Even so, Oracle has anchored in an enterprise market space, while Microsoft elected to focus on consumers (remember the vision? “A PC in every home”), and later, lower-end businesses – Windows as the LOB platform for the proverbial dentist’s office.
  • Google came onto the scene later, and seeing all the occupied territory, decided to shake things up by applying technology to a completely different space: consumer advertising. Sure, Google’s a technology company but they make no money selling technology. They use technology to sell what business wants: measurable access to consumers. Ads.
  • Apple initially tried basing its business on a strategic platform that combined hardware and software, and then using that to compete in both general spaces. It failed. Apple re-launched as a consumer products company, zigging while everyone else was zagging, and found open territory there.

Mr O’Grady seems to completely misunderstand this technology landscape. He argues that among “major technology vendors” including IBM, Apple, Google, Cisco, Oracle, Microsoft, and others, software is in declining importance. Specifically, he says:

Of these [major technology vendors], one third – Microsoft, Oracle and SAP – could plausibly be argued to be driven primarily by revenues of software sales.

This is a major whiff.  None of the “major technology” companies are pure anything. IBM is not a pure services company (nor is it a pure hardware company, in case that needed to be stated).  Oracle is not a pure software company – it makes good money on hardware and services. As I explained earlier, these companies each choose distinct ways to monetize, and not all of them have chosen software licensing as the primary anchor in the marketplace. It would make no sense for all those large companies to do so.

Mr O’Grady’s insight is that a new frontier is coming:

making money with software rather than from software.


Google has never made money directly from software; it became a juggernaut selling ads. Apple’s resurgence over the past 10 years is not based on making money from software; it sells music and hardware and apps. Since Lou Gerstner began the transformation of IBM in 1993, IBM has used “services as the spearhead” to establish long-term relationships with clients.

All of these companies rely heavily on software technology; each of them varies in how it monetizes that technology. Add Facebook to the analysis – at heart it is a company enabled and powered by software, yet it sells no software licenses.

Rip Van O’Grady is now waking up to predict a future that is already here. The future he foretells – where companies make money with software – has been happening right in front of him, for the past 20 years.

Not to mention – the market for software licenses is larger now than ever, and it continues to grow. The difference is that the winner-take-all dynamics of the early days are gone. There are lots and lots of successful businesses built around Apple’s App Store. The “long tail” of software, as it were.

Interestingly, IBM has come full circle. In the mid-90s, IBM bought Lotus for its desktop suite, including WordPro, 1-2-3, Notes, and Freelance. Not long after, though, it basically exited the market for high-volume software, mothballing those products. Even so, IBM realized that a services play opens opportunities to gain revenue, particularly in software. Clearly illustrating the importance of software in general, the proportion of revenue and profit IBM gains from software has risen from about 15% and 20% respectively 10 years ago, to around 24% and 40% today. Yes: the share of IBM’s prodigious profit from software licensing is now about 40%, after rising for 10 years straight. IBM doesn’t lead with software, but software is its engine of profit.

It’s not that “software on a standalone basis is a problematic revenue foundation,” as Mr O’Grady claims. It’s simply that every large company needs a strategic position. The days of the wild west are gone; the market has matured. Software can be a terrific revenue engine for a small company. It also works as a high-margin business for large companies, as IBM and Microsoft prove. But with margins as high as they are, companies need to invest in a defensible strategic position. Once a company exceeds a certain size, it can’t be software alone.


Impressive factoids on Facebook and Hadoop

It’s common knowledge that Facebook runs Hadoop. The largest Hadoop cluster on the planet.

Here are some stats, courtesy of HighScalability, which scraped them from twitter during the Velocity conference:

  • 6 billion mobile messages every 30 minutes
  • 3.8 trillion cache operations in 30 minutes
  • 160m newsfeeds, 5bln realtime msgs, 10bln profile pics, 108 bln queries on mysql, all in 30 minutes

Now, some questions of interest:

  1. How close is the typical enterprise to that level of scale?
  2. How likely is it that a typical enterprise would be able to take advantage of such scale to improve their core business, assuming reasonable time and money budgets?

Let’s say you are CIO of a $500M financial services company.  Let’s suppose that you make an average of $10 per business transaction; and further suppose that each business transaction requires 24 database operations, including queries and updates.

At that rate, you’d run 50M × 24 ≈ 1.2B database operations … per year.

Scroll back up. What does Facebook do? 3.8 trillion cache operations in 30 minutes, while 1.2B per year works out to about 68,000 per 30 minutes. Even comparing against just Facebook’s MySQL queries – 108 billion per 30 minutes – Facebook does over a million times as many database operations as the hypothetical financial services company.
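The back-of-the-envelope arithmetic, as a sketch in Python, using the 108-billion-MySQL-queries-per-30-minutes stat above as the comparison point:

```python
# Checking the envelope math for the hypothetical $500M financial
# services company against the Facebook MySQL stat quoted above.
revenue = 500_000_000            # $500M annual revenue
dollars_per_txn = 10             # $10 per business transaction
db_ops_per_txn = 24              # queries + updates per transaction

txns_per_year = revenue // dollars_per_txn        # 50 million
db_ops_per_year = txns_per_year * db_ops_per_txn  # 1.2 billion

half_hours_per_year = 365 * 48                    # 17,520
db_ops_per_half_hour = db_ops_per_year / half_hours_per_year  # ~68,493

facebook_mysql_per_half_hour = 108_000_000_000
ratio = facebook_mysql_per_half_hour / db_ops_per_half_hour

print(f"{db_ops_per_half_hour:,.0f} enterprise DB ops per 30 minutes")
print(f"Facebook runs roughly {ratio:,.0f}x as many MySQL queries")
```

Even granting the hypothetical company generous transaction volumes, the gap is six orders of magnitude wide.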

Now, let me repeat those questions:

  • If you run that hypothetical company, do you need Hadoop?
  • If you had Hadoop, would you be able to drive enough data through it to justify the effort of adoption?