We’ve been looking for the elephant in the room for some time. We knew he was there, but we just couldn’t find him. It’s clear that he is now here and his name is Hortonworks. As such, we are very excited to announce today that Index Ventures has made an investment in Hortonworks.
The elephant toy - Hadoop - has become a household name in the Big Data sector these days and we’ve been tracking it for some time at Index. The Big Data world is complex and has many components, but at the core of it all is Apache Hadoop’s revolutionary compute and storage architecture for data. We think that this might be one of the most significant trends in data architecture in a decade.
To understand Apache Hadoop’s impact on compute and storage architectures, it’s important to understand where it came from and why it was needed in the first place. While applications have always produced a lot of data, we only ever stored the “valuable” data. This data was labeled and tagged and ultimately stored in neatly architected databases. Meanwhile, we simply ignored data like application logs and website logs, which offered the potential for in-depth user insights, but were too voluminous and lacked the structure necessary for data systems at the time. It’s also important to realize that this unstructured data is increasing at exponential rates. As IT infrastructures across the economy get instrumented with Internet-scale solutions, the amount of data produced increases exponentially – which further exacerbates the problem and expands the opportunity.
Google was one of the first companies to tackle the challenge of parsing through vast amounts of data in an attempt to make sense of it, all in a cost-efficient way. Not only did they have to contend with massive volumes of data, they also had to deal with large amounts of unstructured data, as websites didn’t have nice labels or tags. They came up with a concept called MapReduce, which solved both the scale problem and the cost problem for crawling and indexing unstructured data across the web.
MapReduce is the modern version of the “grid computing” that we talked about for years, but with a twist. The old way of parsing lots of data was to have a large computer (expensive) that fetched (expensively stored) unprocessed data, processed it and then returned it. MapReduce split up both the compute load and the stored data into moderately sized chunks and distributed them across a cluster of cheap commodity computers with cheap disks for processing. The processed data would then be reassembled into intelligible “knowledge” that helped users run their business. This is a big deal because it completely changes the cost and scale equation for computing on large amounts of data.
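For readers who want to see the idea in miniature, here is a rough sketch of the split, map and reduce pattern described above, written in plain Python. It is not Google’s or Hadoop’s actual code: the function names, the sample log lines and the chunk size are all our own illustrative assumptions, and the “cluster” is simulated by processing chunks one after another on a single machine.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(records, chunk_size):
    """Split the input into moderately sized chunks, one per hypothetical worker."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: emit a (token, 1) pair for every whitespace-separated token."""
    return [(token.lower(), 1) for line in chunk for token in line.split()]


def shuffle(mapped_pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: combine the values collected for each key into one total."""
    return {key: sum(values) for key, values in grouped.items()}


def token_count(records, chunk_size=2):
    chunks = split_into_chunks(records, chunk_size)
    # On a real cluster each chunk would go to a separate commodity machine;
    # in this sketch the chunks are simply processed sequentially.
    mapped = chain.from_iterable(map_chunk(chunk) for chunk in chunks)
    return reduce_counts(shuffle(mapped))


# Hypothetical web log lines, invented for this example.
logs = [
    "GET /home",
    "GET /about",
    "POST /login",
    "GET /home",
]
counts = token_count(logs)
```

Because each chunk is mapped independently of the others, the map step could be handed to as many machines as there are chunks, which is where the cost and scale advantage comes from.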
Google published a white paper on MapReduce and a few smart engineers in the open source community cooked up Hadoop. Since its original creation, the little elephant has moved in as the heart and soul of many Internet companies’ computing architecture. Yahoo was the biggest and most active user, but Facebook, Twitter, Amazon, Hulu, and many others have joined the herd. The uses are varied and unique, but all of them center on using Hadoop to get insight from data that was previously discarded.
As an open source project, Apache Hadoop has many parts and a large community of contributors that create it; however, the core Hadoop engineering team from Yahoo has been a driving force in making Hadoop what it is today. Through some stormy times at Yahoo, Eric Baldeschwieler and his colleagues stuck together as a team to work on what is becoming one of the most significant trends in computing. Particularly noteworthy is Yahoo’s live implementation, which now runs on more than 42,000 CPUs, with their largest cluster spanning 4,500 nodes. That is some serious scale.
Early this year, Yahoo decided that the Hadoop asset should form the core of a new independent company. In late June, Hortonworks was formed to pursue the full potential of Apache Hadoop. At Index, we were delighted to have one of our long-term friends – Rob Bearden – become the President of Hortonworks. Rob was the COO at both JBoss and SpringSource – two of the most successful venture-backed open source companies in the last decade. Rob’s been on the board at Pentaho and Gluster (RHAT) with us – a pair of terrific Index open source stories. His knowledge of the open source market and ability to build a world-class organization around great technologies make him the perfect complement to Eric.
Hadoop has clearly become one of the most vibrant sets of projects in the Apache universe. Open source projects thrive on the contributions made by the community, and in our experience with MySQL and others, the greater the contributions that an organization makes to a project, the more it can shape the destiny of that project. After doing our analysis, it became obvious to us that Hortonworks is the most prolific contributor to the technology and thus extremely well positioned for the coming years.
We tip our hats to our friends at Benchmark Capital who pulled off a coup by helping to create Hortonworks and bringing together the founding team there. We are privileged to be working with them on another promising project. Open source is in our blood at Index. Aside from MySQL, we have been investors in OpenX, Cloud.com, Pentaho, Trolltech, Zend – more than a dozen open source companies. We think we have a lot to add to this adventure.
While Apache Hadoop has been something of a Silicon Valley phenomenon, the trends that enterprises face everywhere are undeniable – all enterprises face a tsunami of unstructured data coming their way. Big data is a challenge and an opportunity in virtually every sector of the economy: healthcare (drug discovery, patient care), public sector (taxation, demography), retail (demand signal, supplier management), financial services (sales and trading, analytics), and so forth. Corporations in most of these sectors are struggling to find a scalable and cost-effective solution, not just to “deal with” the data, but also to transform it into a competitive advantage.
Across all of these industry verticals, big data and its ecosystem will undoubtedly be the solution. While most enterprises are still in the early stages of adopting Hadoop, it’s clear to us that if we fast-forward a few years, there will be big herds of elephants running around enterprise architectures and Hortonworks will be at the center of it all.