Some may argue that there are only three certainties in life: death, taxes and Hadoop. In its report on Hadoop predictions for 2015, tech analyst firm Forrester calls Hadoop a “rising star” in data analytics and names “Hadooponomics,” the economics of storing and analyzing data in Hadoop, as the trigger that feeds a robust ecosystem of new tools and new distributions. In short, the report predicts that the Hadoop ecosystem will continue to grow and become many things to many new people, from an analytics tool to an applications platform. Maybe they’re right, but that’s one story. There’s another.
This second narrative starts from our digital, distributed and connected world, one that spews innovation and spits out old technologies, where a crowd-funded smart watch project can raise more than $18M in record time and an exasperating game about a flappy bird can go from zero to 50M downloads in 28 days. In that world, it posits, open source projects form, stretch and sometimes rip apart faster than a new version of anything from Microsoft needs to be patched. It is about a technology hype machine with attention deficit disorder, where all eyes seem to shift in a blink from Hadoop MapReduce to newer open source projects like Spark, Kafka and Ceph.
This story, which is reaching critical mass, posits a world after Hadoop as we know it. It asks the solemn question: has Hadoop already peaked? One thing seems certain: the batch-processing MapReduce foundation that gave rise to today’s Hadoop ecosystem may be on its last legs. I was somewhat apprehensive about MapReduce back in 2008, when I was at eBay. Back then I exchanged emails with industry watcher Curt Monash, who wrote at the time, “eBay doesn’t love MapReduce.” At eBay, we thoroughly evaluated Hadoop and came to the clear conclusion that MapReduce is absolutely the wrong technology long-term for big data analytics. MapReduce solved problems that couldn’t be solved without a parallel system, but there were already more powerful and more mature parallel systems on the market, Teradata for example. Now the whole industry agrees. MapReduce, for us at eBay, didn’t challenge the status quo and, at best, was an incremental step in the right direction for open source technology.
What’s Spark? And Why the Hype?
Since then, the open source community has matured and taken lots of newer, larger steps, with names like Spark and Ceph. The counter-narrative to the general Hadoop hype that I mentioned focuses primarily on those two projects.
First, Spark. The hype that Hadoop MapReduce created for years is now switching fast to Spark – it is everything that MapReduce was meant to be. Make no mistake about it; Spark was built as a competitor to the Hadoop tools ecosystem. Spark has been called “the next big thing” in big data and you can see the Hadoop vendors shifting their posture to address the new kid in town.
For sure, it is immature, like MapReduce and Hadoop were a few years ago. As such, and smartly so, Databricks, the company founded by the creators of Spark, has ducked the question of whether Spark will replace Hadoop. They say that ‘we’re not going to replace Hadoop’ and ‘we’re going to run in Hadoop.’ But guess what: you don’t even need Hadoop to run Spark. In fact, some are running it in an OpenStack cluster rather than in Hadoop, and the commercial product from Databricks runs on Amazon’s AWS. The same is true for other open source projects, like the distributed messaging system Kafka. Can you run Kafka in a Hadoop cluster? Sure. But look at how many people actually run it: in something like OpenStack.
Ceph and Red Hat’s Data Management Ambitions
Which brings me to the second project that could bring about an entirely new ecosystem of big data tools and data management options. That’s Ceph. Ceph is an open source distributed storage system that also includes its own high-performance, POSIX-compatible file system, which means better compatibility with Linux and other operating systems. It also means data can be ingested, updated and deleted directly on the system.
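To make the ingest, update and delete point concrete, here is a minimal sketch of what POSIX-style file semantics allow. It uses only ordinary file calls from the Python standard library, and a temporary local directory stands in for a mounted Ceph file system path. The function name, file name and record contents are made up for illustration; nothing here is Ceph-specific API.

```python
import os
import tempfile


def posix_roundtrip(directory):
    """Create a file, update it in place, then delete it (hypothetical helper)."""
    path = os.path.join(directory, "events.log")  # illustrative file name

    # Ingest: create the file and write an initial record.
    with open(path, "wb") as f:
        f.write(b"status=PENDING\n")

    # Update: reopen the same file and overwrite bytes in the middle, in place.
    with open(path, "r+b") as f:
        f.seek(len(b"status="))
        f.write(b"DONE   ")  # same length as "PENDING", so nothing shifts

    with open(path, "rb") as f:
        updated = f.read()

    # Delete: remove the file through the normal file system interface.
    os.remove(path)
    return updated, os.path.exists(path)


# A temporary directory stands in for a mounted POSIX-compatible file system.
with tempfile.TemporaryDirectory() as scratch:
    updated, still_exists = posix_roundtrip(scratch)
```

The in-place overwrite in the middle of the file is the step worth noticing: HDFS is designed around write-once files with appends, so an application written this way would not map directly onto it.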
In April 2014, Red Hat, a standard of success in commercializing open source technologies, bought Inktank, the creators and providers of Ceph. Red Hat is a dominant force in the open source market, and it brings a whole other level of the experience and resources needed to popularize a project like Ceph. All signs point to Red Hat not stopping at big data with Ceph. It could very well start using Ceph as the standard file system of Red Hat Linux.
Read the tea leaves around the momentum of Spark and Ceph, open source tools that are clear alternatives to Hadoop MapReduce and HDFS respectively, and you start to understand this counter-narrative. For all the hype, Hadoop market growth has been relatively stagnant. InformationWeek recently wrote a story on just this topic, pointing out that machine learning and IT operations intelligence vendor Splunk leads a smallish big data (read: Hadoop) market.
The question isn’t whether some critical Hadoop components are being replaced by new open source technologies. They are, and that’s a fact. The question is which path Hadoop will ultimately take.
Will the Hadoop ecosystem, as Forrester and others suggest, grow to encompass the newer open source technologies? Or will technologies like Spark, Ceph, Kafka and others evolve into something entirely new? When you pair Ceph with OpenStack, which is itself growing fast, and take analytics to the cloud, is this where the world of big data is heading? Is that where we will see the industry in the next 5-10 years?
There are certainly a lot of “what ifs” at this point, and I don’t purport to know the answers to these questions; only time will answer them with certainty. But I do know it is important to seriously consider this counter-narrative to the deafening Hadoop MapReduce hype. Otherwise, some might find themselves stuck inside a withering ecosystem.