You Should Be Streaming Data on S3, Neon DB Is a Masterpiece, a Reactive Edge DB, and more...

Streaming on S3 is finally here; Jack Vanlightly does a Neon tech teardown; and SKDB is the coolest database you don't know about.

Chris

Nov 28, 2023

You Should Be Streaming Data on S3

More and more people are buying into the idea that S3 is a good storage layer for streaming.

The three projects Yingjun mentions show three different architectures.

S3 as source of truth: WarpStream [$] is a stateless Kafka protocol-compatible system where S3 acts as the source of truth for all data.
Tiered storage: AutoMQ implements Kafka’s RemoteStorage interfaces to provide tiered storage on S3. Unlike WarpStream, AutoMQ still has Kafka brokers that store a small amount of data on EBS. Redpanda also supports tiered storage with a Kafka-compatible API.

Streaming data lakes: Paimon is not a Kafka-compatible messaging system. Instead, Paimon is a streaming data lake (similar to Hudi). Data is ingested via engines like Flink, Spark, or Hive (only Flink and Spark support streaming writes).
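To make the tiered storage idea concrete, here is a minimal sketch of a log that keeps recent segments "locally" and moves older ones to an object store. It is a toy model, not how AutoMQ, Redpanda, or Kafka implement it: the TieredLog class, its parameters, and the two dicts standing in for broker disk and an S3 bucket are all assumptions made for illustration.

```python
class TieredLog:
    """Toy append-only log: recent segments stay local, older ones are offloaded."""

    def __init__(self, segment_size=3, max_local_segments=1):
        self.segment_size = segment_size
        self.max_local_segments = max_local_segments
        self.local = {}   # base offset -> records (stand-in for broker disk / EBS)
        self.remote = {}  # base offset -> records (stand-in for an S3 bucket)
        self.next_offset = 0

    def _base(self, offset):
        # Every segment is identified by the offset of its first record.
        return (offset // self.segment_size) * self.segment_size

    def append(self, record):
        offset = self.next_offset
        self.local.setdefault(self._base(offset), []).append(record)
        self.next_offset += 1
        self._offload()
        return offset

    def _offload(self):
        # A new segment only appears once the previous one is full, so the
        # oldest local segment is always closed by the time it is moved.
        while len(self.local) > self.max_local_segments:
            oldest = min(self.local)
            self.remote[oldest] = self.local.pop(oldest)

    def read(self, offset):
        base = self._base(offset)
        if base in self.local:
            return "local", self.local[base][offset - base]
        return "remote", self.remote[base][offset - base]
```

In this picture, the "S3 as source of truth" design is roughly the limit where nothing durable lives in the local tier at all, so every acknowledged write has to reach the object store first.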

And Kafka finally has tiered storage in 3.6.0 (as an early-access feature). I had assumed this was already available, but it turns out Confluent was keeping it as a paid feature.

Neon DB Is a Masterpiece

Jack Vanlightly has been on a tear with serverless posts lately. His latest post is a Neon architecture teardown. I recommend reading both Jack’s post and AWS’s Aurora paper, which Neon is based on.

Neon takes PostgreSQL and replaces its storage layer with remote storage interfaces. The two components of the remote storage are a remote WAL and a page service that sits atop a BLOB store like S3.
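Here is a rough sketch of how a page service can sit on top of a WAL: instead of storing every version of a page, it keeps a base image plus the log records that touched the page, and rebuilds the page as of a requested log sequence number (LSN). This is my simplified mental model, not Neon's code. The ToyPageService name, its methods, the 8-byte page size, and the record layout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPageService:
    """Rebuilds a page at a given LSN from a base image plus WAL records."""

    base_images: dict = field(default_factory=dict)  # page_id -> (lsn, bytes)
    wal: list = field(default_factory=list)  # (lsn, page_id, offset, data), in LSN order

    def ingest(self, lsn, page_id, offset, data):
        # In a real system these records would arrive from the remote WAL service.
        self.wal.append((lsn, page_id, offset, data))

    def get_page(self, page_id, at_lsn, page_size=8):
        # Start from the base image, or an all-zero page if there is none.
        base_lsn, image = self.base_images.get(page_id, (0, bytes(page_size)))
        page = bytearray(image)
        # Replay only this page's records newer than the base image, up to at_lsn.
        for lsn, pid, offset, data in self.wal:
            if pid == page_id and base_lsn < lsn <= at_lsn:
                page[offset:offset + len(data)] = data
        return bytes(page)
```

Because a page is a function of (page_id, LSN), the same store can answer reads at older LSNs, which is part of why this layout pairs well with cheap, immutable BLOB storage.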

Neon is such an elegant project. Some notes:

Neon uses QEMU instead of microVMs for live migration.
Neon’s Postgres has been patched to allow WAL and page service calls to go over the network. These changes are not yet upstream.
The designers chose Paxos over Raft because it worked nicely with their service design (clients, safekeepers, and pageservers).

I hadn’t come across QEMU before (or DRBD, which Jack also mentions). Neon chose QEMU, a full VM, over Firecracker because they wanted live migration. Firecracker and gVisor both support only snapshot and restore.

Project Highlight: SKDB

SKDB is a new kind of database: a reactive edge database that supports materialized views, table subscriptions, and diffing. Unlike SQLite and libSQL, SKDB is built from the ground up to support such use cases.

SKDB is inspired by SQLite and supports the same subset of SQL (including transactions). What sets it apart is that it is also highly concurrent. SKDB supports processing complex queries from multiple simultaneous readers/writers without stalling other database users.
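If "reactive" is new to you, this tiny sketch shows the general shape: a materialized view that updates incrementally on every write and pushes only the change to its subscribers. It is not SKDB's API or implementation. The ReactiveSum class and its callback signature are made up for illustration.

```python
class ReactiveSum:
    """Toy materialized view: maintains SUM(amount) GROUP BY key on every write."""

    def __init__(self):
        self.totals = {}
        self.subscribers = []

    def subscribe(self, callback):
        # callback(key, old_total, new_total) is called after each insert.
        self.subscribers.append(callback)

    def insert(self, key, amount):
        old = self.totals.get(key, 0)
        new = old + amount
        self.totals[key] = new
        # Push only the diff for the affected key, not the whole view.
        for callback in self.subscribers:
            callback(key, old, new)
```

A client that subscribes gets a stream of small diffs rather than re-running the query, which is the property that makes this style attractive at the edge.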

The project is quite young (a proof of concept), but it has some really interesting building blocks: SkipStore and Skiplang. Stay tuned to learn more in my upcoming podcast interview with Julien Verlaguet, the CEO of SkipLabs.

More Awesome Infrastructure

Keep up with new infrastructure projects as they’re added to awesome-infra. New submissions are welcome!

OneTable - OneTable is an open source project that provides omni-directional interoperability between lakehouse table formats such as Apache Hudi, Apache Iceberg, and Delta Lake.

I occasionally invest in infrastructure startups. Companies that I’ve invested in are marked with a [$] in this newsletter. See my LinkedIn profile for a complete list.