Thursday, December 26, 2013

Hourglass: Incremental Data Processing in Hadoop

For a large-scale site such as LinkedIn, tracking metrics accurately and efficiently is an important task. For example, imagine we need a dashboard that shows the number of visitors to every page on the site over the last thirty days. To keep this dashboard up to date, we can schedule a query that runs daily and gathers the stats for the last 30 days. However, this simple implementation would be wasteful: only one day of data has changed, but we'd be reading and recomputing the stats for all 30 days.

A more efficient solution is to make the query incremental: using basic arithmetic, we can update the output from the previous day by adding and subtracting input data. This enables the job to process only the new data, significantly reducing the computational resources required. Unfortunately, although there are many benefits to the incremental approach, getting incremental jobs right is hard:

  • The job must maintain state to keep track of what has already been done, and compare this against the input to determine what to process. 
  • If the previous output is reused, then the job needs to be written to consume not just new input data, but also previous outputs.
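The add/subtract arithmetic described above can be sketched outside Hadoop as plain dictionary math. This is a minimal illustration of the technique, not the Hourglass API; the function and page names are hypothetical:

```python
from collections import Counter

def update_window_counts(prev_totals, day_entering, day_leaving):
    """Incrementally update 30-day per-page visitor counts.

    Instead of re-aggregating all 30 days, add the counts for the day
    entering the window and subtract the counts for the day leaving it.
    """
    totals = Counter(prev_totals)
    totals.update(day_entering)    # add the new day's counts
    totals.subtract(day_leaving)   # remove the day that aged out
    # Drop pages whose count fell to zero to keep the output compact.
    return {page: n for page, n in totals.items() if n != 0}

# Example: the window previously covered days 1-30; day 31 arrives
# and day 1 ages out of the window.
prev = {"/home": 100, "/jobs": 40}
day31 = {"/home": 7, "/about": 3}
day1 = {"/home": 5, "/jobs": 40}
print(update_window_counts(prev, day31, day1))
# → {'/home': 102, '/about': 3}
```

Note that the job consumes its own previous output (`prev_totals`) plus one day of new input, which is exactly why the two requirements in the list above (tracking state and reusing prior output) arise.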

Because more things can go wrong with an incremental job, you typically need to spend more time writing automated tests to make sure everything is working. To solve these problems, we are happy to announce that we have open sourced Hourglass, a framework that makes it much easier to write incremental Hadoop jobs. We are releasing Hourglass under the Apache 2.0 License as part of the DataFu project. We will be presenting our paper, "Hourglass: a Library for Incremental Processing on Hadoop," at the IEEE BigData 2013 conference on October 9th.
