Friday, January 17, 2014

Twitter Open-Sources its MapReduce Streaming Framework Summingbird

Twitter has open sourced their MapReduce streaming framework, called Summingbird. Available under the Apache 2 license, Summingbird is a large-scale data processing system enabling developers to uniformly execute code in either batch-mode (Hadoop/MapReduce-based) or stream-mode (Storm-based) or a combination thereof, called hybrid mode.

In order for Twitter to be able to keep up processing 500 millions tweets and growing, they had to find a replacement for their existing stack that required manually integrating MapReduce (Pig/Scalding) and streaming-based (Storm) code. The main motivation to create Summingbird, mentioned by the Twitter engineers, came from the realization that running a fully real-time system on Storm was difficult due to:

  • Re-computation over months of historical logs must be coordinated with Hadoop or streamed through Storm with a custom log-loading mechanism. 
  • Storm is focused on message passing and random-write databases are harder to maintain. 
This insight led to Summingbird, a flexible and general solution addressing the engineers’ practical issues with the existing approach:

  • Two sets of aggregation logic have to be kept in sync in two different systems 
  • Keys and values must be serialized consistently between each system and the client 
  • The client is responsible for reading from both datastores, performing a final aggregation and serving the combined results
Read more here

Leave a Reply

All Tech News IN © 2011 & Main Blogger .