The Essence of Kafka
2019-09-20

Kafka has been around for a while, and there are plenty of good resources for getting started. I have been impressed with the folks at Confluent (the company founded by Kafka’s original creators), who have done an excellent job of educating the developer community on Kafka. Yet in spite of the plethora of information available online, it’s easy for someone who is just getting started to miss the real essence of Kafka, especially given its overlap with similar technologies.

I have been part of discussions where engineers wanted to replace an existing point-to-point solution (implemented using an existing message queue) with Kafka, given that it offers higher throughput and scale. I have also seen Kafka being cast as a popular choice for an internal pub/sub platform or a “message bus”. While none of these descriptions of Kafka are necessarily inaccurate, I believe it’s helpful and productive to think rightly about a piece of technology - its essence and the sweet spot of problems it solves. As mentioned earlier, all of what I share here is widely available elsewhere, but I took the liberty of consolidating some thoughts, which I hope will help point Kafka newbies in the right direction when it comes to thinking about Kafka.

What exactly is Kafka?#

It is a distributed commit log offering certain message semantics and guarantees. This sounds like a mouthful, but I will do my best to break things down. Let’s start with the Log part, which is the foundational idea and an appropriate starting point to grok Kafka.

What is a (commit) Log?#

Here’s how Jay Kreps, one of the creators of Kafka, describes a log:

A log is perhaps the simplest possible storage abstraction. It is an append-only, totally-ordered sequence of records ordered by time. Records are appended to the end of the log, and reads proceed left-to-right. Each entry is assigned a unique sequential log entry number.

The concept of a log, which is not novel to the world of computing, is described in more detail, particularly with its relevance to Kafka, here. Logs are pervasive. We use them to record events, activities, facts etc. A log is a repository of chronologically ordered facts that can be replayed to reconstruct the current state. Logs by definition offer an “append-only” mode of recording information.

Kafka provides a platform to publish (to) and consume (from) one or more commit logs.
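
To make the log idea concrete, here is a minimal in-memory sketch of an append-only log and of rebuilding state by replaying it. The `AppendOnlyLog` class and the `replay_balance` helper are hypothetical names invented for illustration; this is not how Kafka stores data, only the abstraction it builds on.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Tuple


@dataclass
class AppendOnlyLog:
    """Toy in-memory log: records are only ever appended, never modified."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Append a record and return its sequential entry number (offset)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, start: int = 0) -> Iterator[Tuple[int, Any]]:
        """Yield (offset, record) pairs from `start` to the end, in order."""
        for offset in range(start, len(self._records)):
            yield offset, self._records[offset]


def replay_balance(log: AppendOnlyLog) -> int:
    """Rebuild current state (an account balance) by replaying every fact."""
    balance = 0
    for _, (kind, amount) in log.read():
        balance += amount if kind == "deposit" else -amount
    return balance


log = AppendOnlyLog()
log.append(("deposit", 100))
log.append(("withdraw", 30))
last_offset = log.append(("deposit", 5))
balance = replay_balance(log)
```

Because nothing is ever overwritten, a reader can start from any offset and always sees the same records in the same order.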

What’s the big deal about writing into and reading messages from a (log) file?#

File read/write operations are not a big deal on their own, but performing them in a distributed fashion while offering read/write guarantees at large scale is non-trivial. Jay Kreps has elaborated on the topic of logs in a distributed context.

What are the components of Kafka that enable reading and writing to a distributed log(s)?#

Publishers (Producers) write messages into Kafka, and Consumers consume them. Kafka calls Producers and Consumers Clients. Clients don’t directly perform file write and read operations; instead, they interact with Brokers (a.k.a. Kafka servers), which perform the read/write operations on behalf of the clients. The Brokers provide the central functionality in Kafka. Of the many things they do, they are responsible for orchestrating the flow of messages in the system, while ensuring messaging guarantees.

I hear things like topics and partitions, what are they?#

Think of a topic as a conduit/pipe through which messages flow. Publishers publish messages to a topic, and Consumers subscribe/listen to a topic and consume messages from it. Partitions provide the ability to further parallelize the read/write operations within a topic. You can find plenty of resources online that describe topics and partitions in more detail; here’s one and another one.
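
As a rough illustration of how a topic splits into partitions, the sketch below routes each message to a partition based on its key. The `Topic` class and `choose_partition` function are hypothetical, and `zlib.crc32` is only a stand-in hash chosen for this example; Kafka’s real clients use their own partitioning logic.

```python
import zlib
from typing import List, Tuple


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index (crc32 is an illustrative stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Topic:
    """Toy topic: a named group of independent append-only partitions."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[Tuple[str, str]]] = [
            [] for _ in range(num_partitions)
        ]

    def publish(self, key: str, value: str) -> Tuple[int, int]:
        """Append to the key's partition; return (partition, offset)."""
        partition = choose_partition(key, len(self.partitions))
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1


topic = Topic("orders", num_partitions=3)
first = topic.publish("customer-42", "order created")
second = topic.publish("customer-42", "order paid")
```

Messages that share a key land in the same partition, so each partition can be written and read independently of the others.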

What does “Message semantics” mean?#

Message semantics typically deals with how messages are published (and consumed) within the messaging platform. More detailed descriptions of this topic can be found here and here.
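
One common way to reason about delivery semantics is to ask when a consumer records its progress relative to processing a message. The toy simulation below is a sketch under that assumption; the `consume` function and its parameters are invented for illustration and are not part of any Kafka client API. It shows how committing before processing can lose a message on a crash (at-most-once), while committing after processing can repeat one (at-least-once).

```python
from typing import List


def consume(messages: List[str], commit_first: bool, crash_at: int) -> List[str]:
    """Simulate a consumer that crashes once at offset `crash_at`, between
    its commit step and its processing step, then restarts from the last
    committed offset. Returns the messages that were actually processed."""
    committed = 0      # offset the consumer resumes from after a restart
    processed: List[str] = []
    crashed = False
    offset = 0
    while offset < len(messages):
        crash_now = offset == crash_at and not crashed
        if commit_first:
            committed = offset + 1             # commit, then process
            if crash_now:                      # crash before processing
                crashed = True
                offset = committed
                continue
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])  # process, then commit
            if crash_now:                       # crash before committing
                crashed = True
                offset = committed
                continue
            committed = offset + 1
        offset += 1
    return processed


messages = ["m0", "m1", "m2"]
at_most_once = consume(messages, commit_first=True, crash_at=1)
at_least_once = consume(messages, commit_first=False, crash_at=1)
```

In this run `at_most_once` skips `m1`, while `at_least_once` processes `m1` twice.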

What about “Message ordering” and does Kafka guarantee ordering?#

Most queuing systems guarantee FIFO ordering. Well, isn’t that what a Queue is supposed to be anyway? Kafka only guarantees message ordering within a topic partition, not across the partitions of a topic. You can find more info on this topic (no pun intended) here.
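
A small sketch can show what per-partition ordering means in practice. The key-to-partition mapping and the order in which the consumer drains partitions are assumptions picked for this example.

```python
# Messages tagged with the global order in which they were sent.
sent = [("user-a", 1), ("user-b", 2), ("user-a", 3), ("user-b", 4)]

# Assumed key -> partition mapping for a hypothetical two-partition topic.
partition_of = {"user-a": 0, "user-b": 1}

partitions = [[], []]
for key, seq in sent:
    partitions[partition_of[key]].append((key, seq))

# A consumer that happens to drain partition 1 before partition 0.
consumed = partitions[1] + partitions[0]
consumed_seqs = [seq for _, seq in consumed]

# Within each key (and therefore each partition) the send order is kept.
per_key_in_order = all(
    [s for k, s in consumed if k == key]
    == sorted(s for k, s in consumed if k == key)
    for key in partition_of
)

# Across partitions, the global send order is not preserved.
globally_in_order = consumed_seqs == sorted(consumed_seqs)
```

If the relative order of a set of messages matters, they need to end up in the same partition, which is usually arranged by giving them the same key.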

What makes Kafka “distributed”?#

Although Kafka can be run in a single-broker/non-distributed mode, in order to achieve higher throughput and scale, you will need more than a single broker; rather, you will need a cluster of brokers. Brokers (in a cluster) coordinate amongst themselves to carry out some critical responsibilities - message delivery across consumers, providing message guarantees etc. More about this topic - Controller Broker

Where does ZooKeeper fit into all this?#

ZooKeeper is the underlying coordination service that enables the brokers in a cluster to work in a coordinated fashion.

Hmm, Kafka seems more complex than I thought#

Yes, the ‘distributed’ part and the capability of ‘exactly once message semantics’ are what make Kafka operationally complex. But the good news is that the constructs needed to interact with Kafka are (relatively) simple.

When should I consider Kafka vs X, Y and Z?#

Please see Kafka Vs the Rest for more info on this topic.

Where can I learn more about Kafka?#

- Take a look at my [Kafka resource recommendations](/posts/kafka-resource-recommendations.md)

Kafka Vs the Rest#

Kafka is a streaming platform#