Ce n'est pas un Kafka: Kafka is a Protocol

Apache Kafka is an aging open source project. It's time to accept that Kafka's protocol is what matters.

Mar 25, 2024

Confluent announced Tableflow at Kafka Summit this past week. Tableflow integrates Confluent Cloud with open table formats like Apache Iceberg. Confluent’s Kafka can now write to tiered storage (object stores like S3) using Parquet files and Iceberg metadata. I have been preaching about this idea for a while.

Jack Vanlightly describes Confluent Cloud’s (Kora’s) implementation:

Firstly, we can replace the native tiered storage format with Parquet files and Iceberg metadata directly. This creates a zero-copy storage representation of a stream as a single coherent dataset across Kora brokers and object storage. The Kora storage engine handles this Iceberg/Parquet storage tiering, object storage file compaction, retention, storage optimization, and schema evolution between Kafka topic schemas and Iceberg tables.

This feature, along with other developments, has really driven home how varied Kafka implementations have become. Kafka is now a protocol, not an Apache project. The Apache project is just one (aging) implementation.

Confluent, themselves, think about Kafka this way:

Just like the Apache Kafka API has evolved to be the de facto open standard for data streaming, we’re seeing Apache Iceberg evolve into the de facto open-table standard for large-scale datasets stored in lakehouses.

Confluent Cloud’s feature set has diverged so far from Apache Kafka that it’s nearly unrecognizable. Confluent has offered tiered storage since at least January 2020. Meanwhile, Apache Kafka only got tiered storage in 3.6.0, released in October 2023, and it’s not even ready for production yet: the feature is still in early access.

And now Confluent Cloud is offering Parquet and Iceberg integration. KIP-1008 is trying to add similar support for the Apache project. But the design needs a lot of work and the discussion has died. It’s unlikely Apache users will get this feature anytime soon.

Looking beyond Confluent, the ecosystem is even more diverse. WarpStream [$] has built a truly serverless implementation of Kafka that uses only object stores for persistence. Redpanda has had a C++-based Kafka implementation for years; they’re now trying to morph into a serverless platform. AutoMQ forked Kafka to add tiered storage. S2 is on the cusp of launching its write-ahead log (WAL) with Kafka protocol compatibility. Even StreamNative has embraced the Kafka protocol on top of Apache Pulsar.
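To make "protocol" concrete: what these implementations have in common is not a codebase but a binary wire format. Below is a minimal sketch, in Python, of how a client might frame an ApiVersions request (API key 18, version 0), which has a header and an empty body. The function names are my own, hypothetical helpers, not part of any Kafka client library, and nothing here opens a network connection. It also only covers the older, non-flexible request header; newer request versions add tagged fields that this sketch ignores.

```python
import struct

API_VERSIONS_KEY = 18  # API key assigned to ApiVersions in the Kafka protocol


def encode_api_versions_request(correlation_id: int, client_id: str) -> bytes:
    """Frame an ApiVersions v0 request (hypothetical helper, not a library API).

    Layout, all big-endian:
      int32 size | int16 api_key | int16 api_version | int32 correlation_id
      | int16 client_id length | client_id bytes
    ApiVersions v0 carries no request body, so the header is the whole message.
    """
    client = client_id.encode("utf-8")
    header = struct.pack(">hhih", API_VERSIONS_KEY, 0, correlation_id, len(client))
    message = header + client
    # The size prefix counts every byte after itself.
    return struct.pack(">i", len(message)) + message


def decode_request_header(frame: bytes) -> dict:
    """Parse the size prefix and request header back out of a frame."""
    (size,) = struct.unpack_from(">i", frame, 0)
    api_key, api_version, correlation_id, client_len = struct.unpack_from(
        ">hhih", frame, 4
    )
    # 4 bytes of size prefix + 10 bytes of fixed header fields = offset 14.
    client_id = frame[14:14 + client_len].decode("utf-8")
    return {
        "size": size,
        "api_key": api_key,
        "api_version": api_version,
        "correlation_id": correlation_id,
        "client_id": client_id,
    }


if __name__ == "__main__":
    frame = encode_api_versions_request(correlation_id=7, client_id="demo")
    print(frame.hex())
    print(decode_request_header(frame))
```

Any broker that claims Kafka compatibility, whatever language it is written in and however it stores data, has to accept bytes framed like this and answer in the matching response format. That is the sense in which the protocol, rather than the Apache codebase, is the common ground.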

The evolution of popular open source infrastructure into a protocol is not unique to Apache Kafka. Redis’s protocol is widely adopted, as is PostgreSQL’s (something I wrote about in Databases are Commodities. Now What?). Even closed source systems like S3 have seen their protocols adopted as the de facto standard. Successful infrastructure is destined to be a protocol.

I’m OK with this. Though I’m no longer a CFLT shareholder, I recognize the need to make money. The Apache process, too, can be slow. Businesses sometimes need to move faster. And congregating around the protocol means vendors can try different things—different manifestations of the platonic ideal that is the protocol spec.

Frankly, I don’t see a pure open source business, à la Hortonworks, as a viable model. I am not alone in this; several open source developers have recently confided the same feeling to me. That’s a post for another day.

All of this raises the question: what responsibility, if any, do companies have to their open source roots, or to the protocol itself? I don’t have a good answer. What I can offer is that clear communication is important.

Many companies find this uncomfortable. They worry—rightly so—that they’ll alienate users and suffer brand damage if they explicitly abandon their open source roots. Touching the proverbial open source third rail, so to speak. So they opt for strategic ambiguity.

Users are going to have to get comfortable inferring a company’s intentions from its actions. In the case of Confluent and Kafka, it’s pretty clear that Kafka has graduated from an Apache project to an open protocol. I’m excited about this (of course, I have a vested interest). The products we’re getting are genuine improvements with (hopefully) sustainable business models: Confluent Cloud, WarpStream [$], Redpanda, S2, and AutoMQ.

I occasionally invest in infrastructure startups. Companies that I’ve invested in are marked with a [$] in this newsletter. See my LinkedIn profile for a complete list.