“Hadoop people” and “RDBMS people” – including some DBAs who have contacted me recently – clearly have different ideas about what Data Integration is. And both may differ from what Ted Friedman (twitter: @ted_friedman) and I (@merv) were talking about in our Gartner research note Hadoop Is Not a Data Integration Solution, although I think the DBAs’ concept is far closer to ours.
We went to some lengths to precisely map Gartner criteria from the Magic Quadrant for Data Integration Tools (see below) to the capabilities of what most people would consider the Hadoop stack – Apache Projects that are supported in a commercial distribution. Many of those capabilities were simply absent, with nothing currently available to perform them.
Moreover, even to the degree that some pieces/projects might meet some of the needs, there is nothing that ties them together into a “solution,” which itself was a carefully chosen word. Today, with Hadoop projects in general, we very often see bespoke, self-integrated, “build it yourself and good luck operating it” structures. By contrast, solutions, including those for data integration, provide the relevant pieces coherently, in a way that ties together design, operation, optimization and governance. Even leaving aside the absence of data quality or profiling tools of any kind in today’s supported Hadoop project stack, we don’t see that coherence yet. Ted and I note in our piece that Hortonworks, for example, implicitly acknowledged the gap by bundling Talend into its distribution. Talend itself places rather well in the Gartner Magic Quadrant for DI tools.
Hadoop is very useful for a lot of things – including analytics of some kinds, ETL of some kinds, and low-cost exploitation of data that is unsuitable for persisting in RDBMSs for a variety of reasons. It’s maturing, steadily adding more capabilities, and driving an economic refactoring of data storage and processing which will result in some (increasing amounts of) data being kept there and some (increasing amounts of) processes being performed there. In Gartner’s Logical Data Warehouse model, it occupies the spot for Distributed Process use cases. The size of that part of the landscape relative to repositories and to virtualization is yet to be determined. It will take some years to sort out, and it won’t stand still.
But platforms are not solutions. Hadoop can very much be a platform on which a DI solution can be built. But A solution? Not yet. For that, talk to the folks in the MQ referenced above.
[added 2/13] Thanks for your comments and tweets – and keep them coming!