Data Lakehouse Catalog Reality Check
Databricks and Snowflake are talking a big game. So far, they've given us empty GitHub repositories and rewrites.
I began my series on data lakehouse catalogs just last week, and the news keeps rolling in. This week, Fivetran announced their managed data lake service, and Onehouse announced $35 million in Series B funding.
In this post, I want to take a look at Unity and Polaris, the open source catalogs that Databricks and Snowflake recently announced. Both vendors launched their catalogs to great fanfare. Unfortunately, the reality doesn’t yet seem to live up to the marketing.
Before continuing, readers should know that I owned Tabular shares. I don’t have any particular visibility into Databricks’s or Tabular’s product strategy, and I’ve tried to be as fair as possible in this post.
Snowflake has produced a great product landing page and blog post. I naively assumed that the project was, in fact, released, since they link to a GitHub repo. A friend recently asked me, “Yes, but have you looked at the code?” This is the code.
On further inspection, Snowflake’s blog post does mention that developers should watch the GitHub repository to be notified when the code is released. I missed this. But I expected a bit more here, especially given the amount of marketing copy they’ve invested in.
Very quickly afterwards, Databricks announced Unity. Again, I was quite excited.
My excitement has waned. As it turns out, Unity is not Unity. Databricks open sourced an API-compatible rewrite of their product. And it sounds like it’s missing quite a bit. Sem Sinchenko breaks down the features in “Unitycatalog: the first look”:
At the time of this writing, Unitycatalog looks more like a proof of concept or MVP than a production-ready solution. There are no audit capabilities, no external RDBMS persistent storage support. All ML/AI governance features are currently missing. Big questions were raised about the lack of support for hive-style partitioning.
Shortly after all of this, I came across unity-rs. Yes, we’ve now got another Unity rewrite, this time in Rust. The project explains why it needs to exist. I don’t have a strong position on their points, but their server code caught my eye.
Deja vu. Out of all of these announcements, all this marketing, and all this noise, we’ve gotten one partial rewrite, one empty GitHub project, and one hello-world Rust file.
All of this bothers me a lot less than it used to. I believe unity-rs will get written, that Polaris will be released, and that Databricks will invest in Unity. In Databricks’s defense, it’s often hard to extricate internal projects. A rewrite is often the right move.
But it’s pretty strange to watch all of this play out. It appears the vendors have gotten ahead of themselves. I’m not even sure they all understand why they’re open sourcing these projects. I’d love to see someone write down their data lakehouse strategy, or at least do a better job of communicating what the end state of all this is. Right now it looks like flailing.
Book
Support this newsletter by purchasing The Missing README: A Guide for the New Software Engineer for yourself or gifting it to someone.
Disclaimer
I occasionally invest in infrastructure startups. Companies that I’ve invested in are marked with a ﹩ in this newsletter. See my LinkedIn profile for a complete list.