Thanks to hashes the message key and modulos that over the number of partitions:
That way messages with the same key always end up on the same partition.
Note that messages are only guaranteed to be ordered within the context of a producer and partition. Records from multiple producers, or from a single producer on multiple partitions, can interleave.
Now that we know how messages are put onto topics, let’s see how they are consumed. When you start listening to a topic, by default the records from all partitions are routed to you. It’s common though, to have multiple instances of a microservice running at the same time to achieve higher throughput or availability. If they all start listening to the topic, each record gets processed by each instance, which is usually not what you want.
Consumer groups allow you to evenly divide the partitions among multiple consumers. When a microservice instance joins the consumer group, Kafka will reassign some of the partitions to it. Likewise, when an instance crashes, or leaves the group for another reason, its partitions will be assigned to other instances. Kafka makes sure the partitions are always evenly divided among the consumers in each group.
If there’s a topic where the number of records per partition are skewed, you might be in trouble. An instance might not be able to keep up, because it was assigned the partition with many records, while other instances are idle. It’s up to you to make sure that there are no partitions that have vastly more records than others.
Each consumer keeps track of which records it has processed. Since records are processed in order, a simple offset is enough. Every once in a while (5 seconds by default), a consumer will commit its offset to Kafka.
When a consumer leaves its group, its partitions are given to other consumer in the group. The new consumers will be able to start requesting records starting at the offset where the previous consumer stopped.
It is possible that a record was processed, but not yet committed. You’ll either have to start at the committed offset, or start processing new messages and skip everything that’s not yet processed. This is why Kafka can only guarantee that messages are delivered at least once, or at most once.
The analogy no longer really makes sense when we start duplicating data. With Kafka, we can process a single record multiple times. Multiple consumer groups can consume the same records. Topics can be stored with a replication factor of three for reliability. Topics can have a retention period after which records are deleted. All this is possible because data, unlike iron, can be duplicated easily.
This is a good place to end this post. We’ve covered all the major concepts of Kafka and you should have a general understanding of how Kafka works. Let’s wrap up with a short recap.
What we learned
Kafka is a distributed streaming platform that stores records in a durable way through replicating records across multiple servers. Topics consist of partitions, that store records in order. Partitioners decide which records belong on which partitions. Consumer groups are optional, and help distribute partitions among consumers for scalability. Offsets are committed as checkpoints for when consumers crash.
And that, in a nutshell, is how Kafka works.
About the author
Ruurtjan Pul (Twitter: @ruurtjan) is a data engineer at BigData Republic, a data science consultancy company in the Netherlands. We hire the best of the best in BigData Science and BigData Engineering. If you are interested in using Kafka and other data engineering tools on practical use cases, feel free to contact us at firstname.lastname@example.org.
- Apache Kafka
- Achieving stack and heap safety in recursive functions
- Feature importance — what’s in a name?
Understanding Kafka with Factorio was originally published in Hacker Noon on Medium, where people are continuing the conversation by highlighting and responding to this story.