Databases Are Falling Apart: Database Disassembly and Its Implications
Why are engineers taking databases apart and putting them back together, again?
A recent trend in the database world is to break databases into their constituent components. Each component is offered as a standalone library so infrastructure engineers can assemble them into new databases.
Most databases have the same parts: a query parser, logical and physical planners, an optimizer, a write-ahead log, a client connection protocol, and so on. Each database usually implements its own version of these components. (Storage engines are a notable exception: think MySQL’s pluggable MyISAM and InnoDB engines, or RocksDB’s widespread reuse.)
Monolithic databases make sense. Each layer integrates tightly with the layers above and below it, and abstracting those layers is hard: pluggable layers require a flexible API and support for different—often antithetical—use cases.
Yet most of these layers end up looking pretty much the same. A given database might spend the majority of its innovation tokens on just one layer, say, the optimizer. But developers have historically had to re-implement every layer because few off-the-shelf components existed. New open source projects are now building those components.
In this post, I discuss the history of database disassembly, the industry’s current state, where we’re heading, and the implications of this trend. I find it instructive to look at disassembly through the lens of two elephant-themed projects: Apache Hadoop and PostgreSQL. Though Hadoop and PostgreSQL are from different parts of the data stack, both have influenced modern disassembly efforts. Let’s start with Hadoop.
Hadoop’s Influence on Disassembly
Eighteen years ago, Hadoop broke the data warehouse into compute, data, and control planes—a paradigm that persists to this day.
The compute plane runs calculations (initially MapReduce). The data plane provides storage (initially HDFS). The control plane coordinates the deployment and execution of compute jobs (initially Hadoop’s JobTracker, later YARN).
Separate planes—often on separate computers—create boundaries where protocols and APIs can develop. Such boundaries are a prerequisite for a disassembled database.
The next step happened with storage formats. Hadoop users quickly found they needed to settle on a file format for the data they wrote to HDFS. Many started with CSV but soon discovered that text parsing was slow. Early attempts at optimized storage formats included Twitter’s Elephant Bird and LinkedIn’s Voldemort storage format. Apache Avro, Apache ORC, and Apache Parquet later saw widespread adoption, with Parquet ultimately winning (for now, anyway).
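For a concrete sense of what an optimized storage format buys you, here’s a minimal sketch that writes a small table to Parquet using the Rust arrow and parquet crates; the schema, data, and file name are made up for illustration.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A tiny two-column table held in memory as Arrow arrays.
    let schema = Arc::new(Schema::new(vec![
        Field::new("user_id", DataType::Int64, false),
        Field::new("page", DataType::Utf8, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int64Array::from(vec![1_i64, 2, 3])) as ArrayRef,
            Arc::new(StringArray::from(vec!["/home", "/about", "/home"])) as ArrayRef,
        ],
    )?;

    // Write the batch as a columnar Parquet file instead of CSV text.
    let file = File::create("pageviews.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```

Columnar formats like this let engines read only the columns a query touches and encode each column efficiently, which is exactly why they won out over CSV on HDFS.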
Further disassembly was spawned by Apache Hive and Apache Pig, which were built atop MapReduce. These query engines converted text-based queries (SQL or Pig Latin) into MapReduce jobs to be run on Hadoop.
We’re now starting to see the contours of a disassembled database: a query engine (Hive/Pig) with a parser, a query plan, and an optimizer, which sits atop a query runtime (MapReduce). The query runtime reads from a data plane (HDFS) in an optimized storage format (Parquet).
This architecture is where we find ourselves today. Hive and Pig have been replaced by Presto, Apache Spark, and Trino. HDFS has been replaced by cloud object stores like S3 and GCS. Parquet lives on, though it’s used with Apache Iceberg or Delta Lake now. YARN is still in widespread use, but Kubernetes and its operators now dominate the control plane. Yet the architecture endures.
Disassembling the Query Engine
Current query engines like Trino are built as fully integrated systems with a parser, logical/physical query plans, an optimizer, an execution engine, and a runtime. Engineers are now ripping these apart.
Language frontends parse query text (SQL) and convert it into an intermediate representation (IR) such as a concrete syntax tree (CST), an abstract syntax tree (AST), or a logical or physical query plan. These intermediate representations are much easier for code to work with than the raw query text.
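For a minimal illustration of this frontend layer, here’s a sketch using the sqlparser crate, the Rust SQL parser that DataFusion builds on; the query and table are made up.

```rust
// Requires the `sqlparser` crate.
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::Parser;

fn main() {
    // In a real engine, this text arrives from a client over the wire.
    let sql = "SELECT region, SUM(amount) FROM orders GROUP BY region";

    // Parse the text into an AST, the IR the rest of the engine works with.
    let statements = Parser::parse_sql(&GenericDialect {}, sql).expect("invalid SQL");

    // One `Statement` per SQL statement; downstream code walks or rewrites this tree.
    println!("{:#?}", statements[0]);
}
```

From here, a planner lowers the AST into a logical plan that the optimizer can work on.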
Once the query engine has an intermediate representation, an optimizer replaces nodes in the IR with more efficient equivalents. Trino’s optimizer, for example, might decide that a hash join is the best strategy for a particular join.
After a query plan is optimized, it’s handed to the execution engine. The engine converts the plan into tasks (like MapReduce tasks), which are then executed by the execution runtime (MapReduce, Flink, Spark, and so on).
In practice, these layers are fuzzy. Optimization might happen at several layers, engines and runtimes might be combined, and a query engine might have only a logical plan or only a physical plan rather than both. Still, the model above is a useful starting point.
Developers are now building composable libraries for each layer of the query engine. Every query engine that integrates these libraries benefits from the same optimization and feature work. And new databases can be quickly assembled to address new use cases as they arise, vector search being one recent example.
New open source projects now exist at every layer. SQL still dominates the frontend (despite the hate), but Malloy, PRQL, and SaneQL are projects worth looking at. Substrait has sprung up to provide an intermediate representation, and CMU-DB’s optd project is a composable optimizer. Further down the stack, Meta has open sourced Velox, an execution engine that’s now integrated into Presto and Spark. The runtime layer is by far the most mature, with options like Spark and Flink.
And then there’s Apache Arrow’s DataFusion subproject (soon to be a top-level Apache project). Unlike the projects above, which focus on a single layer, DataFusion does everything. You can use it as an integrated query engine like Trino or DuckDB, or you can use it as a library for any one of the layers above. It’s a complete database toolkit.
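To give a feel for both modes of use, here’s a minimal sketch that runs DataFusion as an integrated SQL engine and then peels the layers apart; it assumes a local orders.parquet file and the tokio runtime, and the exact API surface shifts a bit between DataFusion releases.

```rust
use datafusion::physical_plan::collect;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Integrated engine: register a Parquet file and run SQL against it.
    let ctx = SessionContext::new();
    ctx.register_parquet("orders", "orders.parquet", ParquetReadOptions::default())
        .await?;
    let df = ctx
        .sql("SELECT region, SUM(amount) AS total FROM orders GROUP BY region")
        .await?;

    // Library use: pull out the individual layers yourself.
    let logical = df.into_optimized_plan()?; // optimized logical plan (the IR)
    println!("{}", logical.display_indent());
    let physical = ctx.state().create_physical_plan(&logical).await?; // physical plan
    let batches = collect(physical, ctx.task_ctx()).await?; // execute on the runtime
    println!("{} record batches returned", batches.len());
    Ok(())
}
```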
It’s no surprise that many new databases are using DataFusion; it’s flexible and—I’m told—very easy to integrate. Adopters include GlareDB, Lance, ROAPI, Cube, InfluxDB, and dozens of others.
What about PostgreSQL?
While Hadoop has driven disassembly in data warehousing, PostgreSQL has done the same for relational databases (RDBMSs) and hybrid transactional/analytical processing (HTAP). Its robust storage layer, extension API, simple architecture, and open source Berkeley development model make it a shoo-in for disassembly.
PostgreSQL add-on projects run the gamut from SQL extensions to custom replication and storage schemes. Its protocol and syntax are also widely adopted, but it’s PostgreSQL’s storage layer—its data plane—that’s most significant to this discussion.
PostgreSQL’s storage layer is much more robust than Hadoop’s; it includes a write-ahead log (WAL), a vacuum process, and transactional guarantees. These features are helpful when building production databases that act as the source of truth for data (as opposed to data warehouses).
Many extensions add custom formats and indexes to PostgreSQL’s storage layer that are optimized for vector search, text search, graph queries, geospatial queries, online analytical processing (OLAP) queries, and a lot more.
Developers are treating PostgreSQL almost like DuckDB—an integrated query engine with a pluggable storage layer. Unlike DuckDB, though, PostgreSQL’s storage layer is far more mature (DuckDB deliberately doesn’t even document its storage format), and extensions automatically inherit all of the transactional goodness that PostgreSQL provides.
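As a small illustration, here’s a sketch that exercises one such storage-layer extension (pgvector) from Rust via the postgres crate; the connection string, table, and data are made up, and it assumes pgvector is installed on the server.

```rust
// Requires the `postgres` crate and a server with the pgvector extension available.
use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    let mut client = Client::connect("host=localhost user=postgres dbname=demo", NoTls)?;

    // The extension plugs a new column type and index method into PG's storage layer.
    client.batch_execute(
        "CREATE EXTENSION IF NOT EXISTS vector;
         CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3));
         CREATE INDEX IF NOT EXISTS items_hnsw ON items USING hnsw (embedding vector_l2_ops);",
    )?;

    // Writes go through PostgreSQL's normal transactional path, WAL and all.
    let mut tx = client.transaction()?;
    tx.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')", &[])?;
    tx.commit()?;

    // Nearest-neighbor query answered by the extension's index.
    for row in client.query("SELECT id FROM items ORDER BY embedding <-> '[3,3,3]' LIMIT 5", &[])? {
        let id: i64 = row.get(0);
        println!("nearest id: {id}");
    }
    Ok(())
}
```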
Projects like Neon have taken the storage layer a step further by ripping apart PostgreSQL’s internals. They’ve modified PostgreSQL to make its write-ahead log pluggable, and Neon provides a remote implementation that pairs a Paxos-based WAL with tiered object storage.
Neon’s architecture is exemplary. I believe this is the most significant part of PostgreSQL’s disassembly, and it will be influential. Much as with query engines, we’ll see projects that provide read and write-through caches, consensus-based WALs and key-value stores, and tiered object storage. I’m not alone in this belief; Is Scalable OLTP in the Cloud a Solved Problem? describes exactly how this will look.
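To make that seam concrete, here’s a purely hypothetical sketch of what a pluggable WAL boundary could look like. This is not Neon’s actual interface, just an illustration of the kind of API such projects carve out; a local-disk WAL, a Paxos-replicated WAL, or an object-store-backed WAL could all live behind it.

```rust
/// Log sequence number.
pub type Lsn = u64;

/// A hypothetical pluggable WAL boundary (illustrative only, not Neon's real API).
pub trait WriteAheadLog {
    /// Durably append a record and return its LSN.
    fn append(&mut self, record: &[u8]) -> std::io::Result<Lsn>;

    /// Block until everything up to and including `lsn` is durable on the backend.
    fn sync(&mut self, lsn: Lsn) -> std::io::Result<()>;

    /// Read records at or after `lsn`, e.g. for crash recovery or feeding replicas.
    fn read_from(&self, lsn: Lsn) -> std::io::Result<Vec<(Lsn, Vec<u8>)>>;
}
```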
The Implications of Disassembly
Disassembly will impact the entire data ecosystem, from data warehouses to OLTP, HTAP, multi-model databases, and even streaming.
Data warehouses will become increasingly undifferentiated and commoditized. Jordan Tigani, MotherDuck’s CEO, recently wrote Perf is not enough, arguing that database performance will converge over time and that databases will compete on features and developer experience. I believe this commoditization will extend to features, too. The projects in the previous section commoditize performance, but they also make it easier to add, copy, or share new features. CMU-DB’s Meta Velox presentation reflects on this as well.
Meanwhile, new components in the storage layer will enable OLTP systems to finally achieve the dream: low cost, low latency, high throughput, multi-region, fully transactional databases. Current NewSQL systems tick only a subset of those boxes. As the distributed WAL + S3 architecture gets commoditized (thanks to the projects listed in the previous section), costs should plummet. Neon is my favorite example; TiKV’s S3 integration is another.
The final shoe to drop will be when OLAP data warehouses and OLTP databases are unified into HTAP and multi-model systems. As OLTP systems integrate with object stores, two new architectures surface:
1. OLTP systems can persist data in both row-based and columnar formats.
2. Separate OLTP and OLAP systems can interface via loosely coupled standards in the object store.
Both make it cheap and simple to serve row-based production queries and column-based warehouse queries alike. The first architecture lends itself to integrated systems that service both OLTP and OLAP workloads. Thomas Neumann et al.’s HyPer project is one example of this trend. SingleStore is another; Cloud-Native Transactions and Analytics in SingleStore is an excellent read.
The second architecture would see a plethora of query engines for different workloads. Each query engine would run off the same shared storage, using storage formats optimized for its use case. Such loosely coupled systems will depend on open formats like Parquet, Iceberg, and Delta Lake to make integration possible.
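Here’s a rough sketch of that second architecture from one query engine’s point of view, using DataFusion with the object_store crate; the bucket and path are made up, credentials are assumed to come from the environment, and the registration API differs slightly across DataFusion versions.

```rust
use std::sync::Arc;

use datafusion::prelude::*;
use object_store::aws::AmazonS3Builder;
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Shared storage: an S3 bucket that other engines (Spark, Trino, ...) also read.
    let s3 = AmazonS3Builder::from_env()
        .with_bucket_name("analytics-bucket")
        .build()?;
    let ctx = SessionContext::new();
    ctx.register_object_store(&Url::parse("s3://analytics-bucket")?, Arc::new(s3));

    // Any engine that speaks Parquet and object storage can query the same files.
    ctx.register_parquet(
        "orders",
        "s3://analytics-bucket/warehouse/orders/",
        ParquetReadOptions::default(),
    )
    .await?;
    let df = ctx.sql("SELECT region, COUNT(*) FROM orders GROUP BY region").await?;
    df.show().await?;
    Ok(())
}
```

Table formats like Iceberg and Delta Lake layer transactional metadata on top of the same object-store files, which is what makes this loose coupling workable across engines.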
A world centered around object stores is a serious threat to Apache Kafka. Kafka’s primary use case is data integration—moving data between systems. But if all of your data is on an object store from start (OLTP) to finish (OLAP, search, graph, etc.), what use have you for Kafka? Streaming systems will continue to be useful for sub-second use cases, but that’s a smaller market than data integration.
To stay relevant in data integration, Kafka should morph into a real-time ingestion system for object stores. This will require first-class integration with table formats like Apache Iceberg and Delta Lake. KIP-1009 hints at this with Kafka and Parquet integration (though I believe there are better designs). WarpStream [$] is a trailblazer in this space, Confluent’s Kora is trending this way, and Kafka finally supports tiered object storage.
I’ve come this far without mentioning SQLite, the most successful database of all. SQLite, too, is being disassembled, and it’s having a significant impact on edge databases. Object storage, local storage, and caches are really just storage tiers. Developers are extending this tiering into the client side with SQLite and Conflict-Free Replicated Data Types (CRDT). Others are starting from scratch, leveraging libraries like DataFusion. Fly (via Litestream and LiteFS), SQLite Cloud, Turso, SKDB, and Ditto are but a few of the many projects that are making edge databases workable.
All of this change is going to take time, but the future looks bright. A world with purpose-built, low latency, high throughput, multi-region, multi-model, transactional databases is coming; it’s just a matter of time. This is great news.
Support this newsletter by purchasing The Missing README: A Guide for the New Software Engineer for yourself or gifting it to someone.
I occasionally invest in infrastructure startups. Companies that I’ve invested in are marked with a [$] in this newsletter. See my LinkedIn profile for a complete list.