Friday, August 29, 2014

Building a Data Pipeline That Handles Billions of Events in Real-Time

At Metamarkets, our goal is to help our clients make sense of large amounts of data in real time. Our platform ingests tens of billions of new events every day, and currently comprises trillions of aggregated events. Our real-time analytics platform has two separate yet equally important goals: interactivity (real-time queries) and data freshness (real-time ingestion).

We’ve written before about how Druid, our open-source datastore, is able to offer fast, interactive queries. In this post, we’re going to focus on the challenges around achieving data freshness. We’ll talk about the batch-oriented pipelines we started with, and how we approached building real-time pipelines with these important goals:

  • Latency: Real-time latency means being able to query events seconds after they happen. 
  • Power: Most data pipelines do not simply copy data around; they need to join, transform, and aggregate data as it flows through the system. 
  • Reliability: Queries must accurately reflect the original input data; we don’t want to drop any events or introduce any duplicates. 
  • Scalability: The system must be able to handle our current load—tens of billions of events per day—and scale well into the future.
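
To make the "Power" and "Reliability" goals concrete, here is a minimal Python sketch of aggregating a stream while ignoring duplicate deliveries. It is an illustration only, not Metamarkets' pipeline or Druid's API: the function name `aggregate`, the event fields `id`, `timestamp`, and `publisher`, and the 60-second bucket are all assumptions made for this example.

```python
from collections import defaultdict


def aggregate(events, bucket_seconds=60):
    """Roll events up into per-(time bucket, publisher) counts.

    Hypothetical event shape: {"id": str, "timestamp": int, "publisher": str}.
    Events whose id has already been seen are skipped, so a redelivered
    event does not inflate the counts.
    """
    seen_ids = set()           # ids already counted (in memory, for the sketch)
    counts = defaultdict(int)  # (bucket_start, publisher) -> event count
    for event in events:
        if event["id"] in seen_ids:
            continue  # duplicate delivery: ignore
        seen_ids.add(event["id"])
        # Truncate the timestamp to the start of its time bucket.
        bucket = event["timestamp"] - event["timestamp"] % bucket_seconds
        counts[(bucket, event["publisher"])] += 1
    return dict(counts)


events = [
    {"id": "a", "timestamp": 5, "publisher": "x"},
    {"id": "b", "timestamp": 42, "publisher": "x"},
    {"id": "a", "timestamp": 5, "publisher": "x"},  # same event delivered twice
    {"id": "c", "timestamp": 61, "publisher": "y"},
]
rollup = aggregate(events)
```

An in-memory set of ids would not hold up at tens of billions of events per day; a real system would need bounded or distributed state for deduplication, which is part of what makes the reliability and scalability goals hard to meet together.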
