Making Data Accessible
I wanted to describe our latest investment as an online, multi-tenant, complex query tool for filtering and analyzing large streams of real-time, semi-structured data. Marketing stopped me. It turns out that DataSift, the investment we announced today, is all about Big Data for Social.
I like both descriptions. The buzzword bingo of Big Data for Social is light on specifics, but it is accurate. DataSift is broadly part of the Big Data trend, and while Social Data is not the only data the Company analyses, it is by far the largest data set. However, it was in the specifics of the technology, and the applications it enables, that we saw the compelling investment opportunity.
What is Big Data Really?
Big Data is a Big Tent. The phrase has at least three different meanings. Its most precise meaning is the set of specific technologies around MapReduce and Hadoop, publicized notably in the 2004 paper by Google, and brought to market by companies like Cloudera, MapR and Hortonworks. This technology, which is broadly describable as the Hadoop stack, enables a non-real-time (though that is changing) system for storing, counting and analyzing huge data sets. A second, broader use of the term has emerged to describe a whole range of newer technologies around NoSQL, HANA, and real-time analytics, with the common thread being high-volume data management technologies that move beyond the RDBMS paradigm of the last thirty years. These technologies can be either a substitute for, or a complement to, the core Hadoop stack.
Finally, the phrase "Big Data" now describes an entire movement in IT. The technology above made it easier than ever to handle enormous amounts of data, just as a range of industries, from genomics to the social web, started to generate enormous datasets. The combination has set off a mini gold rush as startups and large companies look to use data as a competitive weapon. Search for Big Data on the web and you get 2.3BN hits, and the volume of searches is growing exponentially. In the startup world, we are seeing variants of "Big Data for x industry" every day, and this article by Geoffrey Moore gives a pretty good sense of how Big Data is spreading out.
Like any trend it can be overhyped, but the idea that more data and better analysis can lead to smarter predictions is not a bad one. It received a stunning vindication last week, when Nate Silver predicted the 2012 election simply by taking existing available information and analyzing it better than anyone else. Check out the hashtag #natesilver. Big Data is now mainstream, with Nate Silver being described as the Chuck Norris of Big Data!
DataSift and Big Data
DataSift fits in that second category of Big Data above. It is not building another Hadoop stack or even a Hadoop stack in the cloud. Instead, the company has built a powerful platform and query engine that can sift through huge streams of real-time data and find specific phrases, measure sentiment or find patterns. What is compelling about the technology is that it manages to be incredibly powerful while remaining simple and accessible to use.
A user can go online, sign up, pay $100 and, using a visual interface, construct a sample query and run it against Twitter, blogs and every news source known to man. No complex analysis, no need to install software, and with the new user interface that was deployed last month, no need to learn the underlying language (although a more technical user can easily opt to look at the underlying Curated Stream Definition Language, or CSDL, to write more complex queries). Finally, the platform allows users to query unstructured enterprise data as well as third-party data.
DataSift and Social - Taking the Pulse of the Planet
Fires need fuel and Big Data engines need streams of Big Data. If you build an engine for querying large streams of real-time data, you pretty quickly want to point that engine at the mother lode of data streams, which is in social media and Twitter in particular. The Twitter stream is growing exponentially, and it has become the information pulse of the planet. Companies want to monitor that pulse for reasons that range from customer support to sentiment analysis and news tracking, but how do you look at 500 MM unfiltered tweets a day?
You don't. Through a partnership with Twitter, DataSift has full access to the Twitter firehose. A customer can write a query, run it against the entire Twitter stream and set it up so that, on an ongoing basis, the results stream in real time into a business application. The customer can filter by Klout score, geo location, gender, language, subject or any combination thereof. From 500 MM tweets the customer gets only the two, ten or one hundred tweets that matter to them.
Using DataSift's Query Builder - A Simple Example
Say you want to track every mention on Twitter of cars or any related term, and you want to limit the search to males within a certain age range, based in or around Texas. The screenshots below showcase how easy it is to build the query.
1. Selecting Tweets by Geo
2. Selecting Tweets by Age
3. Final Query with Geo, Age and Subject Filter
The query can be set to run continuously against multiple feeds, with the results pushed directly into an application that can provide user alerts, take action or simply follow what is going on. Maybe the purpose is as simple as tracking consumer sentiment, or as narrowly commercial as finding auto enthusiasts in Texas who might be willing to test drive an electric car (maybe narrow the search to Austin).
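For readers who think in code, the logic of the query above can be sketched in a few lines of Python. This is only an illustration of the filtering idea, not DataSift's CSDL or its API: the `Post` record, its field names, the list of car terms and the 25 to 40 age range are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Post:
    """Hypothetical, simplified record for one item in a social stream."""
    text: str
    gender: str
    age: int
    state: str


# Assumed subject terms; a real query would use a much longer list.
CAR_TERMS = ("car", "cars", "truck", "sedan", "suv")


def matches(post: Post) -> bool:
    """Return True when a post passes the subject, gender, age and geo filters."""
    # Lower-case each word and strip common punctuation before comparing.
    words = {w.strip(".,!?#").lower() for w in post.text.split()}
    return (
        any(term in words for term in CAR_TERMS)
        and post.gender == "male"
        and 25 <= post.age <= 40   # assumed age range
        and post.state == "TX"     # assumed encoding of "in or around Texas"
    )


def sift(stream: Iterable[Post]) -> Iterator[Post]:
    """Lazily yield only the matching posts, so it can run over an endless feed."""
    for post in stream:
        if matches(post):
            yield post


sample = [
    Post("Test drove a new sedan today!", "male", 31, "TX"),
    Post("Love my car", "female", 29, "TX"),
    Post("Best tacos in Austin", "male", 35, "TX"),
    Post("Thinking about an electric car.", "male", 52, "TX"),
    Post("My truck needs new tires", "male", 27, "CA"),
    Post("Which SUV should I buy?", "male", 38, "TX"),
]

results = list(sift(sample))
for post in results:
    print(post.text)
```

Because `sift` is a generator, the same function could sit between a live feed and an alerting application, passing along each match as it arrives rather than waiting for a batch to finish.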
DataSift's Big Opportunity
DataSift works really well with software developers and application developers of all sorts. Part of how we discovered DataSift was by seeing many of our software companies looking to partner with them to get access to the world of unstructured data. DataSift handles the backend of managing the platform, integrating the data feeds and running the queries. The application company builds the application through which the results of the query are integrated into the enterprise workflow. Sample use cases include customer support, lead generation, news tracking and financial trading. The company is also working with ERP vendors and relational BI tool vendors to enable easy integration of structured and unstructured data at the reporting tool level.
Just as Business Objects and Cognos were the query building tools of the relational database world, we believe that DataSift can be the query building tool for real-time streaming data. The way to make that happen will be to empower developers and end users to easily query all the data that is out there, and then let those developers find uses for that data, uses the Company has not even dreamed of.
We are excited to be working with @nik, @rmb and the team @DataSift.