Tuesday, July 21, 2015


Building operational simplicity into distributed systems, especially for nuanced behaviors, is somewhat of an art and often best achieved after gathering production experience. Apache Kafka's popularity can be attributed in large part to its design and operational simplicity. As we add more knobs and features, we try to go back and rethink ways of simplifying complex behaviors.

One of the more nuanced features in Apache Kafka is its replication protocol. Tuning Kafka replication to work automatically, for varying size workloads on a single cluster, is somewhat tricky today. One of the challenges that make this particularly difficult is knowing how to prevent replicas from jumping in and out of the in sync replica list (aka ISR). What this means from a user’s perspective is that if a producer sends a batch of messages “large enough”, then this can cause several alerts to go off on the Kafka brokers. These alerts indicate that some topics are "under replicated” which means that data is not being replicated to enough number of brokers thereby increasing the probability of data loss should those replicas fail or die. So it is important that the “under replicated” partition count be monitored closely in a Kafka cluster. However, this alert should go off only when some broker fails, slows down or pauses and not when the producer writes data of varying sizes. This unexpected behavior and the false alarms are the source of a lot of manual operational overhead and churn. In this post, I discuss the root cause of this behavior and how we arrived at the fix.

Key takeaway - for the best operational experience, express configs in terms of what the user knows, not in terms of what the user has to guess.

read more here

Leave a Reply

All Tech News IN © 2011 DheTemplate.com & Main Blogger .