Databases Should Speak Substrait

It's silly to have applications generate text-based SQL; they should be allowed to pass query plans to the database.

Feb 12, 2024

High winds during the bay area storm last week woke me up at 3am. While drifting back to sleep, I had an idea: databases should accept substrait query plans directly, rather than forcing users to provide text-based SQL queries.

Application code currently use object-relation mappings (ORMs) and fluent query builders to generate text SQL. The text query is sent to the database over the wire. Databases then parse the text using a language frontend. The frontend converts the query into an intermediate representation (IR). Velox's introduction video has a nice visualization of this:

SQL is built for humans. Machines are much better at generating structured messages. This is (one of) the reasons that developers use ORMs and query builders instead of concatenating text strings. You get type safety, testing, and other niceties.

Under the hood, query builders are all generating text SQL right now. I’ve always found this strange.

ORMs and query builders should be able to go straight to the IR (ORM → IR → wire). This would open up a few new possibilities:

Databases could support multiple query languages with different frontends (SQL, PRQL, Malloy, LLM-based plaintext queries, and so on)
Text parsing overhead would be eliminated. NOTE: Intuitively, it seems that encoding and parsing the IR should be faster than text parsing. IRs might be a larger payload than text-based SQL, though. It’s hard for me to be certain about performance tradeoffs between parsing and network.
Programmatic database access would become a first-class citizen. Fluent queries, ORMs, and DataFrame APIs could all directly interface with the database without having to use text-based SQL queries.

The tools already exist to do this. Substrait defines a standard set of compute operations. Applications can define logical relations like joins, filters, and sorts. Databases should accept substrait queries over the wire, and client libraries (ORMs, query builders, and DataFrame APIs) should generate and send substrait queries.

It appears that this is already happening in the data warehousing. Kudu supports adapters to generate queries. Spark connect translates DataFrame API calls directly into logical parse plans.

My favorite example is DuckDB, which has a substrait extension to interoperate with substrait. You can generate substrait query plans using `generate_substrait`, or you can also pass in query plans using `CALL from_substrait()`.

Surprisingly, I haven’t found any support for substrait in the OLTP space. (TiDB does have an open Github issue.) It makes sense to use substrait query plans for DataFrame APIs, but I expected ORMs and fluent queries to be a more natural starting point. Imagine if PostgreSQL had a `from_substrait()` extension. ORMs and fluent query builders like Prisma and Hibernate could generate substrait query plans and send them straight to PostgreSQL. Someone should build this.