10 Hadoop-able Problems (a summary)

So, the new company I work for, Affiliate Window, are pretty awesome. Technically, they’re not driven by what is cool, or what the latest buzzword is on The Twitter that one of the directors saw on the telebox. They do what is necessary to get the job done, using the best tools. If this requires some in-house dev, then time is found. If there’s a cool bit of tech from outside which fits the problem, then they’ll try it.

They’re also not hemmed in by the corporate, big enterprise world of “it’s the way others do it, so we should too”. They’re also good at long-term investment in their team and their tools. Plus, I get to use Ubuntu as my desktop. Rock on.

Anyway, a meeting was arranged for today where we could watch a presentation on Cloudera’s Hadoop (which you can see here at GoMeeting, although only on Windows and only after registering (great, more vendor lock-in!)). It was called ‘10 Common Hadoopable Problems’, given by Jeff Hammerbacher (their Chief Scientist no less!), and basically covered things that you can do with Hadoop (that aren’t counting words…). I thought I would summarise them here, although I’d encourage every last one of you to watch it as it’s pretty interesting.

  1. Modelling True Risk – If you think about this in the context of banks or other financial institutions (which is, well, banks) this is a really useful way of burrowing deeper into your customers. You can suck in data about their spending habits, their credit, their repayments, everything. Munge it all together and squeeze out an answer on whether to lend them more money.
  2. Customer Churn Analysis – Hadoop was used here to analyse how a telco retained customers. Again, data from many different sources, including social networks AND the calls themselves (recorded and then voice analysed, I guess) were used to work out how and why the company were losing or gaining customers.
  3. Recommendation engines – I don’t really need to explain this one, do I? Thinking about this in terms of Google, this is like the ranking algorithm: sucking in a bunch of factors like popularity, link depth, buzz on Twitter etc. and then scoring links for display in score order later.
  4. Ad Targeting – Similar to the recommendation engine, but with the added dimension of the advertiser paying a premium for better ad-space.
  5. Point Of Sale Transaction Analysis – On the face of it, this seems simple and straightforward: analysing the data that is provided by your P.O.S. device (your till). However, this could also include other factors like weather and local news, which could influence how and why consumers spend money in your store.
  6. Analysing Network Data To Predict Failure – The example given here was that of an electricity company which used smart-somethings to measure the electricity flying around their network. They could pump in past failures and current fluctuations and then pass the whole lot into a modelling engine to predict where failures would occur. It turned out that seemingly unconnected, small anomalies on the system were connected after all. This data couldn’t have been mined any other way.
  7. Threat Analysis/Fraud Detection – Another one for the financial sector and very similar to Modelling True Risk. Hadoop can be used to analyse spending habits, earnings and all sorts of other key metrics to work out whether a transaction is fraudulent. Yahoo! use Hadoop with this pattern to ascertain whether a certain piece of mail heading into Yahoo! Mail is actually spam.
  8. Trade Surveillance – Similar to Threat Analysis and Fraud Detection, but this time pointed squarely at the markets, analysing gathered historical and current live data to see if there is insider trading or money laundering afoot!
  9. Search Quality – Similar to the recommendation engine. This will analyse search attempts and then try to offer alternatives, based on data gathered and pumped into Hadoop about the links and the things people search for.
  10. Data “Sandbox” – This is probably the most ambiguous, but the most useful Hadoop-able problem. A data sandbox is just somewhere to dump data that you previously thought was too big, or useless or disparate to get any meaningful data from. Instead of just chucking it away, throw it into Hadoop (which can easily handle it) then see if there IS data you can glean from it. It’s cheap to run Hadoop and anyone can attach a datasource and push data in. It allows you to make otherwise arbitrary queries about stuff to see if it’s any use!

As you can see, most of these boil down to “Aggregate Data, Score Data, Present Score As Rank”, which, at its simplest, is what Hadoop can do. But the introduction of the idea of a Data Sandbox and the ability, using Sqoop, to push the analysed data back into a relational database (for a data warehouse for example) means that you can run Hadoop independently and prove its worth in your business very cheaply.
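
To make the “Aggregate Data, Score Data, Present Score As Rank” pattern a bit more concrete, here is a rough sketch in plain Python. It is not from the presentation and it doesn’t touch Hadoop at all; it just mimics the map, shuffle and reduce steps in memory. The feature names, the weights and every function name here are made up for illustration.

```python
from collections import defaultdict

# Hypothetical feature weights; a real job would tune or learn these.
WEIGHTS = {"popularity": 0.5, "link_depth": -0.2, "buzz": 0.3}


def map_phase(records):
    """Emit (key, (feature, value)) pairs, roughly what a mapper would do."""
    for key, feature, value in records:
        yield key, (feature, value)


def shuffle(pairs):
    """Group values by key, standing in for Hadoop's shuffle/sort step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate each feature per key, then fold them into one weighted score."""
    totals = defaultdict(float)
    for feature, value in values:
        totals[feature] += value
    score = sum(WEIGHTS.get(f, 0.0) * v for f, v in totals.items())
    return key, score


def rank(records):
    """Aggregate, score, then present the scores in rank order (highest first)."""
    grouped = shuffle(map_phase(records))
    scored = [reduce_phase(key, values) for key, values in grouped.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up records from "different sources": (item, feature, value).
    sample = [
        ("page-a", "popularity", 10),
        ("page-a", "link_depth", 5),
        ("page-b", "popularity", 4),
        ("page-b", "buzz", 20),
        ("page-a", "popularity", 2),
    ]
    for item, score in rank(sample):
        print(item, round(score, 2))
```

On a real cluster the mapper and reducer would run as separate distributed tasks over files in HDFS, and the ranked output is the sort of thing you might then push back into a relational database with Sqoop. The shape of the computation stays the same, though.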