<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Materialized View]]></title><description><![CDATA[Software infrastructure hot takes, projects, papers, developer interviews, and deep dives. Brought to you by Chris Riccomini.]]></description><link>https://materializedview.io</link><image><url>https://substackcdn.com/image/fetch/$s_!U8M8!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a9aa647-ffea-4b83-8a65-2f854d4e5de3_720x720.png</url><title>Materialized View</title><link>https://materializedview.io</link></image><generator>Substack</generator><lastBuildDate>Tue, 21 Apr 2026 12:05:41 GMT</lastBuildDate><atom:link href="https://materializedview.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Chris Riccomini]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[materializedview@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[materializedview@substack.com]]></itunes:email><itunes:name><![CDATA[Chris]]></itunes:name></itunes:owner><itunes:author><![CDATA[Chris]]></itunes:author><googleplay:owner><![CDATA[materializedview@substack.com]]></googleplay:owner><googleplay:email><![CDATA[materializedview@substack.com]]></googleplay:email><googleplay:author><![CDATA[Chris]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[This MCP Server Could Have Been a JSON File]]></title><description><![CDATA[There's a lot of buzz around MCP. 
I'm not convinced it needs to exist.]]></description><link>https://materializedview.io/p/mcp-server-could-have-been-json-file</link><guid isPermaLink="false">https://materializedview.io/p/mcp-server-could-have-been-json-file</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Thu, 11 Sep 2025 10:03:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ea4e5f0e-c391-4d40-88ab-9c6b23c1ff23_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://modelcontextprotocol.io">Model context protocol (MCP)</a> servers are very buzzy right now. The idea is simple: teach <a href="https://en.wikipedia.org/wiki/Large_language_model">large language models (LLMs)</a> how to interact with other software systems. In doing so, LLMs can learn from and affect the real world; they can call a web service to make a phone call, invoke a <a href="https://en.wikipedia.org/wiki/Command-line_interface">command line interface (CLI)</a> tool to add an item to a grocery list in a reminder app, and so on. To make such calls, LLMs must know what software they can call and how to call it. This is the problem that MCP solves: it informs LLMs of available software, teaches them how to use it, and provides an avenue through which they can call it.</p><p>Developers write MCP servers that provide <em>resources</em>, <em>prompts</em>, and <em>tools</em> to the LLM. 
The MCP site <a href="https://modelcontextprotocol.io/docs/learn/server-concepts">discusses these concepts in detail</a>, but the <a href="https://modelcontextprotocol.io/docs/develop/build-server#core-mcp-concepts">Core MCP Concepts</a> section provides a summary:</p><blockquote><ol><li><p><strong><a href="https://modelcontextprotocol.io/docs/learn/server-concepts#resources">Resources</a></strong>: File-like data that can be read by clients (like API responses or file contents)</p></li><li><p><strong><a href="https://modelcontextprotocol.io/docs/learn/server-concepts#tools">Tools</a></strong>: Functions that can be called by the LLM (with user approval)</p></li><li><p><strong><a href="https://modelcontextprotocol.io/docs/learn/server-concepts#prompts">Prompts</a></strong>: Pre-written templates that help users accomplish specific tasks</p></li></ol></blockquote><p>These categories are arbitrary and confusing. At first blush, it seems resources are read-only and tools are write-only. But the MCP server documentation uses <a href="https://modelcontextprotocol.io/docs/learn/server-concepts#how-tools-work">searchFlights</a> as its tool example&#8212;a read-only operation. Even more baffling, it later shows flight searching as a resource. Here&#8217;s the tool definition:</p><pre><code>{
  name: "searchFlights",
  description: "Search for available flights",
  inputSchema: {
    type: "object",
    properties: {
      origin: { type: "string", description: "Departure city" },
      destination: { type: "string", description: "Arrival city" },
      date: { type: "string", format: "date", description: "Travel date" }
    },
    required: ["origin", "destination", "date"]
  }
}</code></pre><p>And here&#8217;s their resource definition:</p><pre><code>{
  "uriTemplate": "travel://flights/{origin}/{destination}",
  "name": "flight-search",
  "title": "Flight Search",
  "description": "Search available flights between cities",
  "mimeType": "application/json"
}</code></pre><p>Prompts are simply <a href="https://modelcontextprotocol.io/docs/learn/server-concepts#example%3A-streamlined-workflows">static JSON definitions</a> that describe potential user activities.</p><p>The whole protocol feels off to me. Prompts are just static documentation, resources are static URL definitions, and tools look like <a href="https://en.wikipedia.org/wiki/Remote_procedure_call">remote procedure call (RPC)</a> definitions. I asked <a href="https://openai.com/index/introducing-gpt-5/">ChatGPT 5 Thinking</a> to convert the <code>searchFlights</code> tool definition to an <a href="https://swagger.io/specification/">OpenAPI definition</a> (an actual RPC definition). Unsurprisingly, it worked just fine:</p><pre><code>openapi: 3.0.0
info:
  title: Flights API
  version: "1.0.0"
paths:
  /searchFlights:
    get:
      operationId: searchFlights
      summary: Search for available flights
      description: Search for available flights
      parameters:
        - in: query
          name: origin
          required: true
          description: Departure city
          schema:
            type: string
        - in: query
          name: destination
          required: true
          description: Arrival city
          schema:
            type: string
        - in: query
          name: date
          required: true
          description: Travel date
          schema:
            type: string
            format: date
      responses:
        "200":
          description: Search results
          content:
            application/json:
              schema:
                type: array
                items:
                  type: object</code></pre><p>This raises the question: why do we need MCP for tool definitions? We already have <a href="https://www.openapis.org/">OpenAPI</a>, <a href="https://grpc.io/">gRPC</a>, and CLIs. ChatGPT understands OpenAPI definitions. Millions of web services already provide OpenAPI definitions, too. And ChatGPT is actually better at CLIs than most humans&#8212;just watch <a href="https://openai.com/codex/">Codex</a> fly through sed, awk, and grep commands. A friend recently informed me that they successfully taught ChatGPT to use <a href="https://github.com/tmux/tmux/wiki">tmux</a>. <a href="https://mariozechner.at/posts/2025-08-15-mcp-vs-cli/">MCP vs CLI: Benchmarking Tools for Coding Agents</a> reflects this sentiment:</p><blockquote><p>MCP vs CLI truly is a wash: Both <code>terminalcp</code> MCP and CLI versions achieved 100% success rates. The MCP version was 23% faster (51m vs 66m) and 2.5% cheaper ($19.45 vs $19.95).</p></blockquote><p>I&#8217;ve seen <a href="https://bsky.app/profile/did:plc:bjaceaxgoto5p7gor2rcov3g/post/3lxdmbcfsx226?ref_src=embed">several arguments</a> made to justify MCP&#8217;s existence:</p><ol><li><p>LLMs have a limited context window. OpenAPI documentation takes up too much context space.</p></li><li><p>Many services are not well documented; they don&#8217;t come with an API spec.</p></li><li><p>LLMs need a way to discover what tools are available to them.</p></li></ol><p>I&#8217;m skeptical. Perhaps MCP does allow us to squeeze a few more tools into the context window. Perhaps it does let models run a little faster. OpenAPI documents are indeed somewhat verbose. But for how long will this matter?</p><p>Last year we were talking about 1 million token models. Now, we have <a href="https://x.com/OpenRouterAI/status/1964128504670540264">2 million token models</a> in <a href="https://openrouter.ai/">OpenRouter</a>. Smaller models, fine-tuning, and other research also continue apace. 
Viewed through this lens, MCP seems like a temporary kludge.</p><p>The second argument, that many services are poorly documented, holds for internal enterprise services. <a href="https://www.linkedin.com/in/jakemannix">Jake Mannix</a> makes this point in <a href="https://bsky.app/profile/yetanotheruseless.com/post/3lxdmbcpuac26">his recent thread</a>. I hear that MCP is most widely adopted in this segment.</p><p>Customer-facing SaaS APIs are a different story. Most <a href="https://docs.github.com/en/rest?apiVersion=2022-11-28">are</a> <a href="https://docs.stripe.com/api">remarkably</a> <a href="https://plaid.com/docs/api/">well</a> <a href="https://auth0.com/docs/api/authentication">documented</a>. Some are complex, but these are often <a href="https://www.sciencedirect.com/topics/computer-science/inherent-complexity">inherent complexities</a> (I can attest to this after working with <a href="https://developer.wepay.com/api/">WePay&#8217;s API</a>).</p><p>Moreover, the argument seems to be that the same people who wrote the bad OpenAPI specs are going to write good MCP specs. I don&#8217;t buy it. 
(And again, even if they <em>could</em> do this, why not write the endpoint as an OpenAPI endpoint or a CLI tool?)</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HJgE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faded91f2-44a4-4e2b-a7f6-cb171417897f_684x196.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HJgE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faded91f2-44a4-4e2b-a7f6-cb171417897f_684x196.png 424w, https://substackcdn.com/image/fetch/$s_!HJgE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faded91f2-44a4-4e2b-a7f6-cb171417897f_684x196.png 848w, https://substackcdn.com/image/fetch/$s_!HJgE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faded91f2-44a4-4e2b-a7f6-cb171417897f_684x196.png 1272w, https://substackcdn.com/image/fetch/$s_!HJgE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faded91f2-44a4-4e2b-a7f6-cb171417897f_684x196.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HJgE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faded91f2-44a4-4e2b-a7f6-cb171417897f_684x196.png" width="684" height="196" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aded91f2-44a4-4e2b-a7f6-cb171417897f_684x196.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:196,&quot;width&quot;:684,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43375,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://materializedview.io/i/172000158?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faded91f2-44a4-4e2b-a7f6-cb171417897f_684x196.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HJgE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faded91f2-44a4-4e2b-a7f6-cb171417897f_684x196.png 424w, https://substackcdn.com/image/fetch/$s_!HJgE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faded91f2-44a4-4e2b-a7f6-cb171417897f_684x196.png 848w, https://substackcdn.com/image/fetch/$s_!HJgE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faded91f2-44a4-4e2b-a7f6-cb171417897f_684x196.png 1272w, https://substackcdn.com/image/fetch/$s_!HJgE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faded91f2-44a4-4e2b-a7f6-cb171417897f_684x196.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://x.com/ianlivingstone/status/1965768133752262863">View Post</a></figcaption></figure></div><p>As for discovery, I accept that LLMs need to know where tools are 
and how to use them. But this is static content. We already have <a href="https://agents.md/">AGENTS.md</a>, <a href="http://.github/instructions">.github/instructions</a>, <a href="https://swagger.io/specification/#openapi-description-structure">openapi.json</a>, and so on. </p><p><a href="https://x.com/sriramsubram/status/1960366044209467875">Developers</a> <a href="https://x.com/spencershum/status/1960379872490234342">are</a> <a href="https://x.com/kylemathews/status/1960365038918656093">waking</a> up to this. <a href="https://getbruin.com/">Bruin</a>&#65129; is using MCP <a href="https://x.com/burakkarakann/status/1960391792202772956">just to expose documentation</a>; their tool calls are done through normal CLI commands. <a href="https://www.donobu.com/">Donobu</a>&#65129; simply provides an OpenAPI spec. Both solutions work. Meanwhile, MCP&#8217;s Google trend looks bleak.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMux!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7b8f6f-0253-49e7-82c7-23309e34ad02_1142x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMux!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7b8f6f-0253-49e7-82c7-23309e34ad02_1142x704.png 424w, https://substackcdn.com/image/fetch/$s_!cMux!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7b8f6f-0253-49e7-82c7-23309e34ad02_1142x704.png 848w, https://substackcdn.com/image/fetch/$s_!cMux!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7b8f6f-0253-49e7-82c7-23309e34ad02_1142x704.png 
1272w, https://substackcdn.com/image/fetch/$s_!cMux!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7b8f6f-0253-49e7-82c7-23309e34ad02_1142x704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMux!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7b8f6f-0253-49e7-82c7-23309e34ad02_1142x704.png" width="1142" height="704" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd7b8f6f-0253-49e7-82c7-23309e34ad02_1142x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:1142,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50041,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://materializedview.io/i/172000158?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7b8f6f-0253-49e7-82c7-23309e34ad02_1142x704.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cMux!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7b8f6f-0253-49e7-82c7-23309e34ad02_1142x704.png 424w, https://substackcdn.com/image/fetch/$s_!cMux!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7b8f6f-0253-49e7-82c7-23309e34ad02_1142x704.png 848w, 
https://substackcdn.com/image/fetch/$s_!cMux!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7b8f6f-0253-49e7-82c7-23309e34ad02_1142x704.png 1272w, https://substackcdn.com/image/fetch/$s_!cMux!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7b8f6f-0253-49e7-82c7-23309e34ad02_1142x704.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://trends.google.com/trends/explore?geo=US&amp;q=mcp&amp;hl=en">Google Trends &#8220;mcp&#8221; Interest 
Over Time</a></figcaption></figure></div><p>I don&#8217;t blame the MCP authors for this mess. Things happen fast in the AI ecosystem. MCP is a victim of its own success. <a href="https://worksonmymachine.ai/p/mcp-an-accidentally-universal-plugin">MCP: An (Accidentally) Universal Plugin System</a> does a good job explaining the situation.</p><p>We need to take a step back and think about what we&#8217;re trying to accomplish. There&#8217;s no law that LLMs need a new protocol to interact with software. In most cases, we have what we need. Where we don&#8217;t, we should write CLIs, web services, and documentation using existing standards.</p><blockquote><p>My takeaway? Maybe instead of arguing about MCP vs CLI, we should start building better tools. The protocol is just plumbing. What matters is whether your tool helps or hinders the agent's ability to complete tasks.</p><p>&#8212;Mario Zechner, <a href="https://mariozechner.at/posts/2025-08-15-mcp-vs-cli/">MCP vs CLI: Benchmarking Tools for Coding Agents</a></p></blockquote><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, 
https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, 
https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. 
See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[DataFrames, Multi-Engine Queries, and Xorq With Hussain Sultan]]></title><description><![CDATA[Hussain Sultan, the co-founder of Xorq, discusses the broken DataFrame ecosystem and how Xorq is fixing it.]]></description><link>https://materializedview.io/p/dataframes-multi-engine-queries-and</link><guid isPermaLink="false">https://materializedview.io/p/dataframes-multi-engine-queries-and</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Tue, 02 Sep 2025 11:07:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c7811e74-5d18-4858-b886-71a37b0e5e16_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my <a href="https://materializedview.io/p/kafka-end-of-beginning">previous post</a>, I mentioned that <a href="https://martin.kleppmann.com/">Martin</a> and I were working on <a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/">Designing Data Intensive Applications</a>&#8217;s batch chapter. The chapter is now available in pre-release on <a href="https://www.oreilly.com/publisher/safari-books-online/">Safari Online</a>, and (among other things) includes a section on <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html">DataFrames</a>.</p><p>While working on the DataFrame section, I was surprised to find how bifurcated the ecosystem remains. On one side, you have single-node libraries like R and <a href="https://pandas.pydata.org/">Pandas</a>. 
On the other side, you have bolt-on Pandas-compatible APIs for distributed execution engines and cloud data warehouses such as <a href="https://spark.apache.org/">Apache Spark</a>&#8217;s <a href="https://spark.apache.org/docs/latest/api/python/index.html">PySpark</a> and <a href="https://www.snowflake.com/en/">Snowflake</a>&#8217;s <a href="https://www.snowflake.com/en/product/features/snowpark/">Snowpark</a>. It seems to me that a more unified approach is needed.</p><p>This post is an interview with <a href="https://www.linkedin.com/in/hussainsultan/">Hussain Sultan</a>, the co-founder and CEO of <a href="https://www.xorq.dev/">Xorq</a>&#65129;. We discuss the fragmented DataFrame ecosystem and how Xorq bridges the gap with its multi-engine DataFrame library built on <a href="https://ibis-project.org/">Ibis</a>, <a href="https://datafusion.apache.org/">Apache DataFusion</a>, and <a href="https://arrow.apache.org/">Apache Arrow</a>. Hussain has a diverse background that spans electrical engineering, <a href="https://en.wikipedia.org/wiki/Digital_signal_processing">digital signal processing (DSP)</a>, data science, machine learning, and field engineering. I&#8217;m so excited about what Hussain and Xorq are doing that I&#8217;ve joined their most recent funding round.</p><div><hr></div><p><em><strong>C.R.: Thanks for taking the time to talk. I'm interested in <a href="https://www.linkedin.com/in/hussainsultan/">your career path</a>. I noticed you started as a hardware engineer before falling into the data space. How did you go from a background in digital signal processing to founding Xorq?</strong></em></p><p>H.S.: Appreciate you asking&#8212;this has definitely not been a straight line. I&#8217;ve always been a builder. I had early access to a computer, picked up a <a href="https://en.wikipedia.org/wiki/Visual_Basic_for_Applications">Visual Basic for Applications (VBA)</a> book before I knew what code really was, and hacked on WordPress sites in college. 
But knowing I could always fall back on software gave me the confidence to dive into other domains&#8212;hardware, embedded systems, DSP&#8212;where the problems felt more physical, more constrained.<br><br>Electrical engineering gave me the tools to think in systems. I worked on turbine blade monitoring using <a href="https://www.ni.com/en/shop/labview.html">LabVIEW</a> and <a href="https://en.wikipedia.org/wiki/Field-programmable_gate_array">field-programmable gate arrays (FPGAs)</a>&#8212;very visual, very declarative. It was kind of like building a <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">directed acyclic graph (DAG)</a>, just with wires. That led me into DSP, and eventually <a href="https://www.mayoclinic.org/tests-procedures/eeg/about/pac-20393875">electroencephalogram (EEG)</a> denoising in grad school. EEGs were fascinating because they were a deeply human signal&#8212;full of noise, uncertainty, context. But in <a href="https://www.mathworks.com/products/matlab.html">MATLAB</a> everything felt so perfect, so sandboxed. I wanted data that was messier, more grounded in the real world, perhaps even reflecting macro behavior.<br><br>That&#8217;s what pulled me toward other types of data. I got fascinated by the idea that how people transact and spend money reveals something real about human behavior&#8212;at scale. Purchase behavior is just another noisy signal, but at a macro level. So I went deep into consumer models, credit risk, customer behavior, etc.<br><br>That&#8217;s when I ran into real-world noise: <a href="http://sas.com/">SAS jobs</a>, VBA macros, SQL you couldn&#8217;t trace, pipelines nobody owned. The business logic&#8212;<a href="https://www.investopedia.com/terms/n/npv.asp">net present value (NPV)</a> models, <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method">Monte Carlo</a> simulations, decision rules&#8212;was always trapped in brittle systems. 
And I knew there had to be a better way.</p><p>I turned to open source Python out of necessity: Pandas, <a href="https://www.dask.org/">Dask</a>, and Arrow. When I found the <a href="https://pydata.org/">PyData</a> community, it felt like <a href="https://wordpress.com/">WordPress</a> again: smart people solving hard problems together, in the open.<br><br>Over the years, I took that mindset into big tech, avionics systems, and consulting for banks trying to modernize. We saved millions just by building reusable, inspectable pipelines. But I kept running into the same issue: compute wasn&#8217;t composable, but was rapidly being commoditized by things like <a href="https://duckdb.org/">DuckDB</a> and DataFusion. You couldn&#8217;t trace them, reuse them, or move them across engines&#8230;easily. It wasn&#8217;t an infra problem&#8212;it was an abstraction problem. Data has structure. Compute didn&#8217;t.<br><br>That&#8217;s what led to Xorq. With Arrow, Ibis, and DataFusion, the primitives were finally ready. Xorq is a compute catalog: a place to put the logic itself. Expressions become portable, observable, and reusable. Multi-engine. Storage-agnostic. Versioned by default.</p><p>So no, this wasn&#8217;t some grand plan. I just kept chasing messy signals, building tools to make sense of them, and leaning on the communities that made that work possible.</p><p><em><strong>C.R.: What does the compute logic in Xorq&#8217;s &#8220;compute catalog&#8221; look like? Are we talking about operators that can be run in a physical query plan executor&#8212;things like a &#8220;sort-merge join&#8221; operator? Or are these more bespoke pieces of logic? Some examples might help me understand more.</strong></em></p><p>H.S.: Short answer: we don't deal with physical operators&#8212;you won&#8217;t see &#8220;sort-merge join&#8221; in our plan. 
What we store is a <em>relational + functional + semantic plan</em> that&#8217;s declarative, portable, and able to run across different engines.</p><p>You can think of this compute logic as a &#8220;high-level plan&#8221; that stitches together many logical plans for engines running in different places&#8212;DuckDB, DataFusion, Pandas, and so on. The bespoke logic shows up as Python <a href="https://datafusion.apache.org/python/user-guide/common-operations/udf-and-udfa.html">UDFs and UDAFs</a>, and we leaned on DataFusion because it treats Arrow-based functions as first-class. At this level we also track tags, caching, and pipeline operators that move data between engines. Once the plan is handed to a backend for execution, that&#8217;s when the physical choices&#8212;hash vs. sort-merge join, vectorization, parallelism&#8212;get made. We generally leave that up to the engine. Instead, what you&#8217;ll often see in our plans are things like the <a href="https://docs.xorq.dev/how_to/into_backend_caching.html#backend-switching-with-into_backend">into_backend</a> operator, which moves data from one engine to another, and <a href="https://docs.xorq.dev/reference/flight_udxf.html">flight_udxf</a> nodes for arbitrary computations.</p><p>Since multi-engine workloads heavily lean on deterministic caching, we build a stable hash of the plan that can be used with a cache operator. If two subgraphs are the same, they map to the same cache key and return identical results, regardless of which engine runs them.</p><p>A concrete example: take a simple <a href="https://scikit-learn.org/">scikit-learn</a> pipeline. We lift it into Xorq&#8217;s <a href="https://en.wikipedia.org/wiki/Intermediate_representation">intermediate representation (IR)</a>, split it into train/test, add cache boundaries, and tag each step with metadata (step name, parameters, features, target). 
Under the hood, <em><a href="https://scikit-learn.org/stable/glossary.html#term-fit">fit</a></em> lowers to a UDAF that produces state (weights, scalers, encoders), and <em><a href="https://scikit-learn.org/stable/search.html?q=apply">apply</a></em> lowers to a scalar UDF that consumes that state row by row (normalize, lookup, predict). Xorq uses <a href="https://ibis-project.org/">Ibis</a> heavily for its expression system, and it&#8217;s the de facto user-facing IR.</p><p>So in this plan, the &#8220;predicted&#8221; column is literally a scalar UDF parameterized by the fitted model executed in the DataFusion engine. Reads flow through DuckDB, ML bits get routed to DataFusion and UDFs, and from the developer&#8217;s side it&#8217;s just Python (`expr.py` with `.fit(...)`, `.predict(...)`, tagging, and caching). Xorq then emits a build directory so the whole thing can be reproduced.</p><p>Because the entire workflow is captured in this relational algebra, we can lower ordinary Python pipelines into SQL engines, or extend them with arbitrary compute via `flight_udxf`. That&#8217;s what makes it a &#8220;high-level plan&#8221;&#8212;broader than a database logical plan, and able to tie together engines outside the SQL world (vector databases, graph systems like KuzuDB, etc.), all coupled with a user-facing expression system in Ibis.</p><p>A <a href="https://asciinema.org/a/733941">quick recording</a> is available if you&#8217;d like to see it in action; <a href="https://gist.github.com/hussainsultan/d39f4dfeb49ef7aeaa6c503aede03892">here&#8217;s a gist</a> with the complete build outputs, which might help readers understand it more concretely.</p><p><em><strong>C.R.: Wow, OK, this is pretty slick. You mentioned that you re-use identical results if two subgraphs are the same. Does this work across queries, workflows, or even users as well? 
Also, how do you handle cache invalidation, where a subsequent query might want fresher data than the cache contains?</strong></em></p><p>H.S.: It does&#8212;across queries, across workflows, and, within a workspace and its permissions, across users. The way we make that safe and predictable is by tackling the old, thorny problem of &#8220;naming things.&#8221; If you can name a computation deterministically, you can look it up later and recover its full lineage to the source. This isn&#8217;t a new idea; you see versions of it in content&#8209;addressed storage and in Nix.</p><p>My co&#8209;founder, <a href="https://www.linkedin.com/in/dan-lovell-9226554/">Dan</a>, and I have lived through this pain for years. Every time we&#8217;d run a new experiment, we&#8217;d save the training dataset and the splits, and then we were left juggling dataset names and&#8212;even harder&#8212;deciding when to invalidate the cache. Nix gave us a clean mental model: it hashes, builds deterministically, and leverages binary caches. If two people produce the same build, Nix can tell you whether a valid artifact already exists or whether it needs to be rebuilt.</p><p>Xorq applies that principle to data compute. We derive a cache key for any expression by hashing the contents of the expression itself along with the relevant source descriptors. Depending on the caching strategy, we either incorporate a change signal&#8212;like a table&#8217;s last&#8209;modified time&#8212;or we pin to a snapshot. If the upstream table changes and you&#8217;re in a &#8220;follow the source&#8221; mode, the stamp changes, the hash flips, and the cache is naturally bypassed so the computation is re&#8209;materialized. If you&#8217;re in snapshot mode, the cache key stays stable until you explicitly ask for something fresher.</p><p>To make reuse work across users without surprises, the hash has to be stable. 
That means we strip away environment&#8209;specific noise and avoid anything brittle like absolute local file paths. The result is a content&#8209;addressed artifact that&#8217;s consistent regardless of who computes it, so long as they&#8217;re looking at the same logical expression and the same view of the source.</p><p>Because "cache" is just another node in the relational graph, you can attach it wherever it makes sense. You might cache a heavy transformation to accelerate iteration but choose not to cache a subsequent training step. The choice is surgical and composable.</p><p>An added benefit of stably hashing every node is discovery. We can compare expression nodes across different graphs&#8212;even across teams&#8212;and detect shared subgraphs. That lets us surface high&#8209;value candidates for caching and reuse across an organization, not just within a single project.</p><p>On freshness guarantees, especially for near&#8209;real&#8209;time features like &#8220;activity_last_5m,&#8221; we combine strategies. We can materialize the feature in snapshot mode first&#8212;so we&#8217;re not chasing source modification times constantly&#8212;and then apply a time&#8209;to&#8209;live at retrieval. When the TTL expires, we recompute. That pattern will feel familiar to anyone who&#8217;s worked with feature stores for ML workloads.</p><p>So the short version is: reuse works across queries, workflows, and (with proper access controls) users, because everything is deterministically named. Invalidation is built into that identity&#8212;either via source&#8209;change stamps for automatic freshness or via TTL&#8209;driven policies when you want precise control.</p><p><em><strong>C.R.: Hah, you walked right into my next question! I am curious if you&#8217;ve considered how streaming and realtime data fit into Xorq. 
Many other processing frameworks (Spark, <a href="https://flink.apache.org/">Apache Flink</a>, <a href="https://www.feldera.com/">Feldera</a>) are working towards a unified API for both modalities. It sounds like Xorq is pretty batch-focused. Are there any plans to add streaming or &#8220;micro-batch&#8221; support?</strong></em></p><p>H.S.: You&#8217;re right&#8212;most of our current focus is on batch engines. I usually describe Xorq&#8217;s architecture as out-of-core; that is, we stream Arrow <a href="https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html">RecordBatch</a> fragments between operators instead of materializing everything up front. That lets us handle datasets larger than memory while still feeding engines directly. The key distinction here is bounded vs. unbounded. Out-of-core workloads are bounded&#8212;they terminate&#8212;whereas micro-batch or real-time streaming engines operate over unbounded streams. DataFusion, our embedded engine, has also been a solid building block for unified models and even some emerging streaming use cases.</p><p>We&#8217;ve been actively experimenting with streaming engines like Flink because our expression system, through Ibis, can already describe watermark- and time-aware operations. The real difference comes at the source: if you attach a bounded table, you&#8217;re still in a mini-batch world; if you attach an unbounded source like Kafka or CDC, the exact same expression can be evaluated incrementally by the engine. Official Flink support is on our mid-term roadmap, but even in these early experiments it&#8217;s clear how naturally our model extends to streaming SQL. Shipping it would be an indicator that we feel confident with batch use cases and are ready to tackle streaming/real-time engines. 
I am also super psyched about engines like <a href="https://www.feldera.com/">Feldera</a>, since it takes the SQL that we know from batch cases and makes it incremental without changes. A unified API for both modalities is most welcome!</p><p>What&#8217;s exciting is that the same caching and reuse story still holds. We can incorporate Flink&#8217;s operator state into our hashing model, which makes streaming computations just as stable and discoverable as batch ones.</p><p>This ties directly into how we think about freshness guarantees in research vs. production. In a research notebook, you might snapshot a feature like &#8220;activity_last_5m,&#8221; attach a TTL, and let the cache refresh when it expires. It&#8217;s reproducible and easy to debug. But in production, you want that feature continuously updated without extra orchestration.</p><p>And that&#8217;s really the arc: in research, you rely on explicit snapshots and cache nodes; in production, you lean on streaming engines and policies that manage freshness automatically.</p><p><em><strong>C.R.: Similarly, how do you see Xorq fitting in with AI training and inference workloads?</strong></em></p><p>H.S.: Xorq works great with AI/ML training and inference, but for tabular data. We model opaque compute with a <a href="https://docs.xorq.dev/reference/flight_udxf.html">flight_udxf</a> node, executed as an <a href="https://arrow.apache.org/docs/format/Flight.html">Apache Arrow Flight</a> <a href="https://arrow.apache.org/docs/python/generated/pyarrow.flight.FlightClient.html#pyarrow.flight.FlightClient.do_exchange">do_exchange</a> endpoint. This allows us to remain in a relational authoring surface&#8212;DataFrames and SQL&#8212;while offloading to tensor runtimes as opaque operations, all without leaving the plan.</p><p>Apache Arrow provides the type system and transport layer. 
Crossing the boundary is efficient: Arrow arrays convert to <a href="https://github.com/dmlc/dlpack">DLPack</a> and back with minimal overhead. Because training and inference typically run in mini-batches, our out-of-core streaming backbone remains unchanged, and relational engines compose seamlessly with arbitrary compute.</p><p>For scikit-learn&#8211;style machine learning, our goal was to unify training and inference into a single relational graph instead of two ad-hoc scripts. We achieve this with UDF/UDAF machinery in DataFusion: &#8220;fit&#8221; lowers to an aggregate UDF that emits state (weights, scalers, vocabularies) as bytes or artifact references; &#8220;apply&#8221; lowers to a UDF that consumes that state row-wise for predictions. Because DataFusion operates directly on Arrow RecordBatches, the pattern is fast in practice. And we can make these UDFs portable with Arrow Flight.</p><p>However, one key lesson is that DataFrame APIs excel for feature engineering and early-stage preprocessing, but late-stage preprocessing nearly always shifts to tensor operations; think <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html">nn.Module</a>/<a href="https://docs.pytorch.org/tensordict/stable/index.html">TensorDict</a>, dense normalization, tokenization, etc. The next step is to make this &#8220;last mile&#8221; feel like <a href="https://pytorch.org/">PyTorch</a>, instead of DataFrame-ey, while keeping the entire workflow declarative, optimizable, and reproducible. This will also unlock new workloads with unstructured data, videos, images, etc.</p><p>Long-term, we are inspired by the <a href="https://www.vldb.org/pvldb/vol16/p3461-jungmair.pdf">Declarative Sub&#8209;Operators paper</a>: expose a lower&#8209;level, declarative layer beneath relational algebra so we can express things like tensor-prep. 
I am excited for the declarative future, where Tensor and DataFrame APIs can be used interchangeably.</p><p><em><strong>C.R.: You&#8217;ve described so many directions that Xorq can go in. There are clearly many years of work ahead; I look forward to seeing it evolve. In the meantime, it seems like a pretty solid batch multi-engine system. I really appreciate you taking the time to talk with me. I&#8217;ll give you the final word. Anything you&#8217;d like to add? Where should folks go if they&#8217;re interested in learning more?</strong></em></p><p>H.S.: It&#8217;s a big vision, made possible only by standing on the shoulders of Apache Arrow, Ibis, and DataFusion. Across the ecosystem, teams are building semantic layers, generating request-time dynamic SQL, and with Xorq, making ML accessible to data and analytics engineers. In the multi-engine batch processing world, our early design partners report speeding up dynamic <a href="https://www.getdbt.com/">dbt</a>-style workloads by 5-15x versus Jinja-based SQL model generation, cutting infrastructure costs by &gt;10x compared with Snowpark, while providing end-to-end column-level lineage and greater transparency. This fits since our target audience tends to be multi-stakeholder organizations that span tech, business, and data. This means we need to thread value across various stakeholders: cost of ownership, time to value, and policy-based governance.</p><p>Our catalog-first approach makes reuse, optimization, and governance first-class, while staying minimally invasive to how teams already run on SQL engines. We are super keen to get feedback from the community on the &#8220;compute catalog&#8221; concept. 
If this resonates, kick the tires and tell us what&#8217;s missing:</p><ul><li><p>GitHub: <a href="https://github.com/xorq-labs/xorq">https://github.com/xorq-labs/xorq</a></p></li><li><p>Docs: <a href="https://docs.xorq.dev">https://docs.xorq.dev</a></p></li><li><p>Website: <a href="https://xorq.dev">https://xorq.dev</a></p></li></ul><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[I'm Still Here! Let's Catch Up.]]></title><description><![CDATA[The latest on Designing Data-Intensive Applications, SlateDB, AI, Materialized View Capital, and forthcoming newsletter posts.]]></description><link>https://materializedview.io/p/im-still-here-lets-catch-up</link><guid isPermaLink="false">https://materializedview.io/p/im-still-here-lets-catch-up</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Wed, 27 Aug 2025 18:26:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f2d43011-1222-4362-8061-1f2dedaa78e8_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>My apologies for the lack of posts recently. I&#8217;ve been hard at work on a few things over the summer, which has left little time for this newsletter. I&#8217;m getting my bearings now. More posts are back on the docket. 
My first post&#8212;this one&#8212;will be a quick catch-up.</p><p><a href="https://martin.kleppmann.com/">Martin</a> and I have been working on <a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/">Designing Data-Intensive Applications</a>&#8217;s batch and streaming chapters for its second edition. The batch chapter has been published to Safari Online as an early release; it required a full rewrite. The streaming chapter is still a work-in-progress. <a href="https://materializedview.io/p/kafka-end-of-beginning">Kafka: The End of the Beginning</a>, my previous post on this newsletter, reflected on how stagnant the streaming ecosystem has been over the past 10-15 years. The updates to the streaming chapter reflect this; they&#8217;ll be more minimal. I do plan to add a section on <a href="https://wiki.postgresql.org/wiki/Incremental_View_Maintenance">incremental view maintenance (IVM)</a>, which I&#8217;ll base off of my IVM newsletter post.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;4f92f30d-a3a6-40b0-a947-e24a0c0bc092&quot;,&quot;caption&quot;:&quot;Incremental view maintenance has been a hot topic lately. Materialize has been around for a while, but newcomers like PostgreSQL&#8217;s (semi-working) pg_ivm extension, Feldera, Epsio, Bytewax, and many others are starting to make noise. 
In the data warehousing space,&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Everything You Need to Know About Incremental View Maintenance&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:69592459,&quot;name&quot;:&quot;Chris&quot;,&quot;bio&quot;:&quot;Writer of books (themissingreadme.com), code (slatedb.io), checks (materializedview.capital), and newsletters (materializedview.io)&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2e6fe04-e3a2-4cf7-a89f-8499175bced5_404x404.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-18T18:28:00.654Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7af30c05-2222-4c48-b90b-2aa67575e950_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://materializedview.io/p/everything-to-know-incremental-view-maintenance&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161506403,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:24,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Materialized View&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!U8M8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a9aa647-ffea-4b83-8a65-2f854d4e5de3_720x720.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>Meanwhile, <a href="https://slatedb.io">SlateDB</a> work continues apace. 
<a href="https://flaneur2020.github.io/">Li Yazhou</a> has <a href="https://github.com/slatedb/slatedb/issues/489">added</a> <a href="https://wiki.postgresql.org/wiki/SSI">serializable snapshot isolation (SSI)</a> support. He&#8217;s in the process of <a href="https://github.com/slatedb/slatedb/issues/816">adding transactions</a>. The RFC is <a href="https://github.com/slatedb/slatedb/blob/main/rfcs/0011-transaction.md">here</a> if you&#8217;d like to learn more. We also have both <a href="https://github.com/slatedb/slatedb/tree/main/slatedb-py">Python</a> and <a href="https://github.com/slatedb/slatedb/tree/main/slatedb-go">Go</a> bindings now. I have been focusing on refactoring and stability; I added a basic <a href="https://github.com/slatedb/slatedb/tree/main/slatedb-dst">deterministic simulation tester</a> recently. <a href="https://www.linkedin.com/in/sujeet-sawala-8b70a9144">Sujeet Sawala</a> is <a href="https://github.com/slatedb/slatedb/pull/695">working on an RFC</a> to persist compaction progress. We&#8217;re starting to see some <a href="https://github.com/slatedb/slatedb?tab=readme-ov-file#adopters">real adoption</a>. More projects are launching in the near future, too.</p><p>The biggest news with SlateDB, though, is <a href="https://github.com/Barre">Pierre Barre</a>&#8217;s <a href="https://www.zerofs.net/">ZeroFS</a> project. ZeroFS provides <a href="https://www.zerofs.net/nfs-access">network filesystem (NFS)</a>, <a href="https://www.zerofs.net/nbd-devices">network block device (NBD)</a>, and <a href="https://www.zerofs.net/9p-access">Plan 9 Filesystem Protocol (9P)</a> implementations on top of SlateDB. The filesystem is also <a href="https://github.com/Barre/ZeroFS/actions/workflows/pjdfstest.yml">POSIX compliant</a>&#8212;no small feat. 
On the performance front, check out Pierre&#8217;s <a href="https://www.zerofs.net/zerofs-vs-aws-efs">AWS EFS</a>, <a href="https://www.zerofs.net/zerofs-vs-mountpoint-s3">AWS Mountpoint-S3</a>, <a href="https://www.zerofs.net/zerofs-vs-juicefs">JuiceFS</a>, and <a href="https://www.zerofs.net/zerofs-vs-azure-files">Azure Files</a> benchmarks.</p><p>ZeroFS is a young project, but I&#8217;m very excited about it. SlateDB&#8217;s <a href="https://github.com/slatedb/slatedb/blob/main/rfcs/0004-checkpoints.md">branching and forking</a> features mean ZeroFS will be able to provide zero-copy filesystem forking&#8212;an important feature for AI and many other use cases.</p><p>Speaking of AI, I&#8217;m still getting my bearings with it. I&#8217;ve been reluctant to post about the topic because I don&#8217;t feel I&#8217;m an expert in the subject. (Then again, it&#8217;s so new that very few are.) I use coding agents constantly, though. As a user, I&#8217;ve begun to form some opinions around <a href="https://modelcontextprotocol.io/docs/getting-started/intro">model context protocol (MCP)</a>, agent adoption in the enterprise, and its impact on developer tooling and infrastructure. I plan to write more on AI in the near future.</p><p>I&#8217;ve continued to invest in startups throughout the summer. <a href="https://materializedview.capital/">Materialized View Capital</a> is now 75% deployed and will be fully deployed ~18 months from its inception. One usually targets a 3-year deployment, but an 18-month deployment for a smaller fund like MVC is not unheard of. 
I&#8217;m quite pleased with our portfolio, which includes <a href="https://bauplanlabs.com/">Bauplan</a>, <a href="https://dosu.dev/">Dosu</a>, <a href="https://www.fiveonefour.com/">Fiveonefour</a>, <a href="https://withgauge.com/">Gauge</a>, <a href="https://loopholelabs.io/">Loophole Labs</a>, <a href="https://paradedb.com/">ParadeDB</a>, <a href="https://reboot.dev/">Reboot</a>, <a href="https://signadot.com/">Signadot</a>, <a href="https://spiraldb.com/">Spiral</a>, <a href="https://tigrisdata.com/">Tigris</a>, <a href="https://tensorlake.ai/">Tensorlake</a>, and <a href="https://materializedview.capital/">many more</a>.</p><p>Starting a fund has been rewarding. I plan to take a few months off after the fund is deployed. I&#8217;d like to evaluate what&#8217;s next for my startup investing adventure.</p><p>And that pretty much sums it up! I&#8217;m sure I&#8217;ve missed a few things. Let me know if there&#8217;s something specific that&#8217;s worth noting. In the meantime, expect an interview next week with <a href="https://xorq.dev/">Xorq</a>&#8217;s&#65129; CEO, <a href="https://www.linkedin.com/in/hussainsultan/">Hussain Sultan</a>.</p><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[Kafka: The End of the Beginning]]></title><description><![CDATA[A decade of focus on adoption has paid off. 
Now it's time to innovate.]]></description><link>https://materializedview.io/p/kafka-end-of-beginning</link><guid isPermaLink="false">https://materializedview.io/p/kafka-end-of-beginning</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Fri, 30 May 2025 10:04:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/09554463-2619-45fd-addc-7586a2eccdd9_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You&#8217;ve probably noticed fewer posts recently. I am spending time working on the batch processing chapter in <a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/">Designing Data-Intensive Applications (2nd edition)</a>. It&#8217;s striking how much batch processing has changed since the book was written. The first edition was very focused on MapReduce. A lot has changed over the past 10-15 years: cloud data warehouses, data lakes, lakehouses, data catalogs, new storage formats, <a href="https://spark.apache.org/">Spark</a>, <a href="https://duckdb.org">DuckDB</a>, <a href="https://datafusion.apache.org/">DataFusion</a>, DataFrames, composable data systems, and more.</p><p>The pace of change in the batch processing ecosystem stands in stark contrast to its streaming counterpart. <a href="https://kafka.apache.org/">Apache Kafka</a> remains the near-universal solution for streaming data, while <a href="https://flink.apache.org/">Apache Flink</a> remains the dominant stream processing system. Both projects were started well over 10 years ago. Flink 1.0 was released in 2016, while Kafka 0.9 (widely adopted in production) was released in 2015. Even <a href="https://debezium.io/">Debezium</a> started around 10 years ago; we adopted 0.2 at WePay in 2016.</p><p>Undeniably, there has been some innovation. 
The separation of control, data, and compute planes has led to object store-backed streaming systems like <a href="https://www.warpstream.com/">WarpStream</a>&#65129;, <a href="https://buf.build/product/bufstream">Bufstream</a>, and <a href="https://www.redpanda.com/">Redpanda</a>. The object store transition also led to streaming lakehouse integrations like <a href="https://www.confluent.io/blog/introducing-tableflow/">Tableflow</a>. And <a href="https://en.wikipedia.org/wiki/Frank_McSherry">Frank McSherry</a> et al.&#8217;s work on timely and differential dataflow is nothing short of brilliant.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;915dc9e6-d7b3-482a-a0fb-54b799605ff3&quot;,&quot;caption&quot;:&quot;Incremental view maintenance has been a hot topic lately. Materialize has been around for a while, but newcomers like PostgreSQL&#8217;s (semi-working) pg_ivm extension, Feldera, Epsio, Bytewax, and many others are starting to make noise. In the data warehousing space,&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Everything You Need to Know About Incremental View Maintenance&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:69592459,&quot;name&quot;:&quot;Chris&quot;,&quot;bio&quot;:&quot;Writer of books (themissingreadme.com), code (slatedb.io), checks (materializedview.capital), and newsletters 
(materializedview.io)&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2e6fe04-e3a2-4cf7-a89f-8499175bced5_404x404.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-18T18:28:00.654Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7af30c05-2222-4c48-b90b-2aa67575e950_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://materializedview.io/p/everything-to-know-incremental-view-maintenance&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161506403,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:21,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Materialized View&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a9aa647-ffea-4b83-8a65-2f854d4e5de3_720x720.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>Yet <a href="https://www.confluent.io/resources/ebook/i-heart-logs-event-data-stream-processing-and-data-integration/">the architecture</a> that <a href="https://confluent.io">Confluent</a> was founded on&#8212;an enterprise service bus built on Apache Kafka&#8212;remains largely the same as it was 10 years ago. Moreover, the actual technologies we&#8217;re using are the same. Much of the story of the past 10 years has been about adoption rather than innovation. We&#8217;re still using Kafka (or at least its protocol), we&#8217;re still using Flink, and we&#8217;re still using Debezium.</p><p>I&#8217;ve been thinking about this stagnation for a while. I don&#8217;t believe we&#8217;ve solved everything. 
It&#8217;s still very difficult to write and deploy stream processing jobs, for example; it honestly feels like <a href="https://materializedview.io/p/from-samza-to-flink-a-decade-of-stream">it&#8217;s gotten worse, not better</a>. I&#8217;d like to believe that these challenges are not fundamental&#8212;that there are better ways.</p><p>This malaise seems to be growing. <a href="https://www.linkedin.com/in/sap1ens">Yaroslav Tkachenko</a> <a href="https://bsky.app/profile/did:plc:3745lfdummthckfijhgqdfvr/post/3lqd4fs4sk223">posted a summary</a> of last week&#8217;s <a href="https://current.confluent.io/london">Current 2025 conference</a> in London. The post is worth a read, but his comment at the top of the post mirrored my own view:</p><blockquote><p>Finally, I feel like the data streaming industry is still in a tough spot. The growth is slow, and the sales cycles are long. One person I spoke with said that &#8220;80% of the companies in the Expo hall will be dead in two years&#8221;. I don&#8217;t want to believe them, but it might be true.</p></blockquote><p>I was speaking with someone just yesterday who said something similar: it&#8217;s the same people, same companies, and same technology. There are no new ideas and no new users. This person said they would not attend future Current conferences; it wasn&#8217;t worth it; they didn&#8217;t learn anything new.</p><p>Yaroslav&#8217;s post also says that Redpanda was banned from the conference. I don&#8217;t know the details, nor do I really care. But if true, it&#8217;s hard not to read this as a sign of scarcity and fear rather than abundance.</p><p>It&#8217;s worth reflecting on <a href="https://hadoop.apache.org/">Apache Hadoop</a>. Hadoop began at <a href="http://yahoo.com/">Yahoo!</a> in 2006. It took about 10 years for companies such as <a href="https://en.wikipedia.org/wiki/Hortonworks">Hortonworks</a> and <a href="https://www.cloudera.com/">Cloudera</a> to drive adoption through the enterprise. 
By 2016, it was widespread, but dated. This is when dataflow engines and cloud data warehouses began to replace Hadoop. Confluent was founded in 2014. Eleven years later, it feels like we might be at a similar turning point with Kafka; it&#8217;s widely adopted, but dated.</p><p>Yet, I don&#8217;t see the challengers that Hadoop saw in 2016. I suspect this is because Kafka&#8217;s protocol has become the de facto standard. <a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html">HDFS</a>, <em>including its protocol</em>, was largely replaced by object stores like S3. Kafka, on the other hand, managed to adopt object stores while keeping its protocol and client rebalancing architecture intact.</p><p>A de facto standard protocol is great for enterprises; they don&#8217;t need to migrate and vendors are all forced to compete on a commoditized storage layer. But what happens when the protocol needs to change? Entrenched vendors aren&#8217;t going to ask their customers to migrate.</p><p>Concepts like partitions and leadership are baked into Kafka&#8217;s protocol (especially its clients). Broker awareness of time&#8212;fundamental to stream processing&#8212;is largely absent. Kafka is not designed for millions (or billions) of streams. 
Gunnar Morling, of Debezium fame, has an excellent post where he asks, <a href="https://www.morling.dev/blog/what-if-we-could-rebuild-kafka-from-scratch/">&#8220;What if We Could Rebuild Kafka From Scratch?&#8221;</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4KNP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6cd122e-a795-4a0a-be6e-767187883a20_1388x468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4KNP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6cd122e-a795-4a0a-be6e-767187883a20_1388x468.png 424w, https://substackcdn.com/image/fetch/$s_!4KNP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6cd122e-a795-4a0a-be6e-767187883a20_1388x468.png 848w, https://substackcdn.com/image/fetch/$s_!4KNP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6cd122e-a795-4a0a-be6e-767187883a20_1388x468.png 1272w, https://substackcdn.com/image/fetch/$s_!4KNP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6cd122e-a795-4a0a-be6e-767187883a20_1388x468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4KNP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6cd122e-a795-4a0a-be6e-767187883a20_1388x468.png" width="1388" height="468" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6cd122e-a795-4a0a-be6e-767187883a20_1388x468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:468,&quot;width&quot;:1388,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:136129,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://materializedview.io/i/164742983?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6cd122e-a795-4a0a-be6e-767187883a20_1388x468.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4KNP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6cd122e-a795-4a0a-be6e-767187883a20_1388x468.png 424w, https://substackcdn.com/image/fetch/$s_!4KNP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6cd122e-a795-4a0a-be6e-767187883a20_1388x468.png 848w, https://substackcdn.com/image/fetch/$s_!4KNP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6cd122e-a795-4a0a-be6e-767187883a20_1388x468.png 1272w, https://substackcdn.com/image/fetch/$s_!4KNP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6cd122e-a795-4a0a-be6e-767187883a20_1388x468.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://x.com/Sirupsen/status/1927440430532088171">View Post</a></figcaption></figure></div><p>One bright spot is <a href="https://s2.dev">S2</a>. <a href="https://www.linkedin.com/in/shikhrr/">Shikhar</a> and his team seem to be rethinking what a modern, cloud-native ecosystem could look like. Notably, of all the modern streaming solutions&#8212;Redpanda, AutoMQ, WarpStream, Confluent Freight, and Bufstream&#8212;S2 stands alone as the only system that doesn&#8217;t (yet) use Kafka&#8217;s protocol; they are unconstrained.</p><p>While writing my incremental view maintenance (IVM) post, I wondered what timestamps in Kafka might look like. A month later, S2 <a href="https://s2.dev/blog/timestamping">announced their take</a>. 
In &#8220;<a href="https://s2.dev/blog/intro#what-if-streams-had-the-primacy-of-objects">What if streams had the primacy of objects?</a>,&#8221; they discuss supporting per-user streams. This is the fresh thinking we need. Unfortunately, adoption is a real challenge. The network effects of Kafka&#8217;s protocol are strong.</p><div class="bluesky-wrap outer" style="height: auto; display: flex; margin-bottom: 24px;" data-attrs="{&quot;postId&quot;:&quot;3lmujkg5wks2m&quot;,&quot;authorDid&quot;:&quot;did:plc:cwx2zxldt3uxciob3nxzhkzr&quot;,&quot;authorName&quot;:&quot;Chris&quot;,&quot;authorHandle&quot;:&quot;chris.blue&quot;,&quot;authorAvatarUrl&quot;:&quot;https://cdn.bsky.app/img/avatar/plain/did:plc:cwx2zxldt3uxciob3nxzhkzr/bafkreib756chvsuuqkhovbsjela5ysy7yue4c62jt6kbofjeeooj3u67rq@jpeg&quot;,&quot;text&quot;:&quot;After spending some time with timely dataflow, I am wondering what it would look like if Kafka integrated time semantics into their brokers. Something like maintaining a low/high watermark for each topic/partition. You configure a maximum \&quot;slack\&quot; for each topic, and broker could reject stragglers.&quot;,&quot;createdAt&quot;:&quot;2025-04-15T16:44:10.559Z&quot;,&quot;uri&quot;:&quot;at://did:plc:cwx2zxldt3uxciob3nxzhkzr/app.bsky.feed.post/3lmujkg5wks2m&quot;,&quot;imageUrls&quot;:[]}" data-component-name="BlueskyCreateBlueskyEmbed"><iframe id="bluesky-3lmujkg5wks2m" data-bluesky-id="9164570842801842" src="https://embed.bsky.app/embed/did:plc:cwx2zxldt3uxciob3nxzhkzr/app.bsky.feed.post/3lmujkg5wks2m?id=9164570842801842" width="100%" style="display: block; flex-grow: 1;" frameborder="0" scrolling="no"></iframe></div><p>Meanwhile, the stream processing side of the house seems somehow both worse and better. I am no fan of Flink, but it has won the stream processing race (outside of <a href="https://cloud.google.com/">Google Cloud</a>, which has <a href="https://cloud.google.com/products/dataflow?hl=en">Dataflow</a>). 
Unspoken amongst vendors, the true winner is really just regular old Kafka consumers and producers.</p><p>I do see some glimmers of hope. Compact libraries that incorporate differential dataflow, such as <a href="https://github.com/electric-sql/d2ts">D2</a>&#65129;, are appearing. And <a href="https://www.feldera.com/">Feldera</a> seems like a genuinely better computation layer than Flink. Such systems might become the next Spark or <a href="http://snowflake.com/">Snowflake</a> of the streaming ecosystem (and batch ecosystem, in Feldera&#8217;s case).</p><p>The next decade of stream processing could look like the last decade of batch processing: one of growth and innovation as we transition out of legacy on-prem architectures, protocols, and systems into a truly cloud-native streaming era. I hope so; it would be great for everyone.</p><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, 
https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, 
https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. 
See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[Everything You Need to Know About Incremental View Maintenance]]></title><description><![CDATA[An overview of incremental view maintenance, why it&#8217;s useful, and how it's implemented.]]></description><link>https://materializedview.io/p/everything-to-know-incremental-view-maintenance</link><guid isPermaLink="false">https://materializedview.io/p/everything-to-know-incremental-view-maintenance</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Fri, 18 Apr 2025 18:28:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7af30c05-2222-4c48-b90b-2aa67575e950_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Incremental view maintenance has been a hot topic lately. <a href="https://materialize.com/">Materialize</a> has been around for a while, but newcomers like PostgreSQL&#8217;s (semi-working) <a href="https://github.com/sraoss/pg_ivm">pg_ivm</a> extension, <a href="https://github.com/feldera/feldera">Feldera</a>, <a href="https://epsio.io">Epsio</a>, <a href="https://bytewax.io">Bytewax</a>, and many others are starting to make noise. In the data warehousing space, <a href="https://clickhouse.com">Clickhouse</a> <a href="https://clickhouse.com/docs/materialized-view/incremental-materialized-view">supports IVMs</a> and <a href="https://www.getdbt.com/">dbt</a> now has <a href="https://docs.getdbt.com/docs/build/incremental-models">incremental models</a> (though, still batch-based). On the front-end, sync engines look increasingly like IVM engines; <a href="https://github.com/rocicorp/mono">Zero</a> and <a href="https://electricsql.com">ElectricSQL</a>&#65129; come to mind. 
<a href="https://www.linkedin.com/in/samwillis">Sam Willis</a> (of ElectricSQL) <a href="https://news.ycombinator.com/item?id=40419917">teased</a> the idea of IVM on ElectricSQL and has implemented <a href="https://github.com/electric-sql/d2ts">D2TS</a>, a differential dataflow engine in TypeScript.</p><p>This post will provide a high-level overview of what incremental view maintenance is, why it&#8217;s useful, and how it can be implemented. We&#8217;ll then look briefly at three (relatively) recent research papers: <a href="https://dl.acm.org/doi/10.1145/2517349.2522738">timely dataflow</a>, <a href="https://github.com/timelydataflow/differential-dataflow/blob/master/differentialdataflow.pdf">differential dataflow</a>, and <a href="https://sigmodrecord.org/publications/sigmodRecord/2403/pdfs/20_dbsp-budiu.pdf">DBSP</a>; these papers have shaped many (but not all) of the products listed above.</p><h2>What is IVM?</h2><p><em>(If you already know what IVM is, you can skip this section.)</em></p><p>To understand incremental view maintenance, we must first understand views. In this context, a view is the result of a query over source data: it filters, aggregates, or reshapes that data in a specific way. A pivot table is a view: you have a spreadsheet with data, and you create a pivot table to filter and aggregate the data from the source spreadsheet. Similarly, a table in a database can be thought of as a spreadsheet, and a database view can be thought of as a pivot table. Views in databases are defined using SQL, rather than the UIs you&#8217;re used to in Excel and Google Sheets.</p><p>Continuing with our pivot table analogy, when a user updates data in their spreadsheet, the pivot table needs to be updated so that its filters and aggregates reflect the change. The same applies to database tables and views.</p><p>There are many approaches to view updates. The simplest way is to re-execute the query every time the view is accessed. 
In Excel, this would mean every time the pivot table is viewed, it must re-run its query on the source data; the same applies for database views.</p><p>Refreshing a view on every query can be slow and costly for large datasets, though. A spreadsheet with millions of rows might halt a user&#8217;s progress for several seconds. A database with billions (or trillions) of rows might block things for much longer. An alternative is to use materialized views. A materialized view is a cached result of the view. Rather than re-computing views on every read, they are refreshed periodically. Once the computation is complete, the new version of the view is used.</p><p>This approach&#8212;periodically reprocessing the entire dataset&#8212;trades data freshness for lower read latency. Queries on views will be faster since their data has already been computed, but the data will reflect an older version of the data from when it was last refreshed. Reprocessing the entire dataset can also be wasteful. A change to a single row in the source data requires reprocessing all the source data to generate the new materialized view. <code>SELECT COUNT(*)</code> has to recount every row to discover that only one new row was inserted.</p><p>Incremental view maintenance addresses these problems. Rather than reprocessing the entire dataset every time a change occurs, IVM reprocesses only the view data affected by the change (the delta). This approach dramatically decreases the cost of maintaining the materialized view. The DBSP paper illustrates this well:</p><blockquote><p>Informally, Q&#916; built by our algorithm, is faster than Q by a factor of O(|DB|/|&#916;DB|). In practice this may be an improvement of several orders of magnitude. For our example above |DB|&#8773;10&#8313; and |&#916;DB|&#8773;10&#178;, this can make Q&#916; <strong>10 million times</strong> faster!</p></blockquote><p>Cheap updates allow us to refresh materialized views more frequently. 
In doing so, we can keep materialized views more up to date with their source data, thereby reducing data latency. Best of all, read latency can remain largely unaffected since readers continue to query pre-computed data.</p><h2>How do IVM engines work?</h2><p>Now that we understand what IVM is, let&#8217;s discuss how IVM engines have historically worked. There are two things to consider:</p><ul><li><p>When do incremental updates occur?</p></li><li><p>How do we know what data needs to be updated?</p></li></ul><p>The answer to the first question is fairly straightforward. Incremental materialized views are computations like any other; they can run on a fixed schedule, on an ad-hoc basis, or with a trigger. In practice, the last option&#8212;a trigger&#8212;is usually used to watch for newly changed data. In a data warehouse, Airflow&#8217;s <a href="https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/deferring.html">triggers and sensors</a> can detect changes. Similarly, developers can write triggers in OLTP databases to update views whenever source tables are mutated. OLTP databases that have built-in IVM support also usually update views as source data changes. Stream processors follow the same pattern: state is updated as new events arrive.</p><p>The second question&#8212;knowing what data needs to be updated&#8212;is trickier. The most intuitive approach is to write code or ad-hoc queries to react to triggers and update data. For example, we might have an <code>orders</code> table and a <code>customer_order_totals</code> view (stored as a regular table keyed by <code>customer_id</code>). We can then <a href="https://en.wikipedia.org/wiki/Database_trigger">write a trigger</a> to update <code>customer_order_totals</code> whenever data is inserted into <code>orders</code>.</p><pre><code>CREATE OR REPLACE FUNCTION orders_insert_trigger_fn()
RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO customer_order_totals (customer_id, total_amount)
    VALUES (NEW.customer_id, NEW.amount)
    ON CONFLICT (customer_id)
    DO UPDATE SET total_amount = customer_order_totals.total_amount + EXCLUDED.total_amount;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Attach the function to the orders table. The trigger name is illustrative.
-- ON CONFLICT above requires a unique constraint on
-- customer_order_totals.customer_id.
-- EXECUTE FUNCTION needs PostgreSQL 11+; older versions use EXECUTE PROCEDURE.
CREATE TRIGGER orders_insert_trigger
AFTER INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION orders_insert_trigger_fn();</code></pre><p>Triggers and ad-hoc code work for simple cases. But complex incremental queries that use joins, window functions, or recursive algorithms are very difficult to write by hand. A more systematic approach is needed.</p><p>The systematic approach many databases have relied on uses bag algebra (<a href="https://en.wikipedia.org/wiki/Relational_algebra">relational algebra</a> extended to multisets). Users start by defining an incrementally maintained materialized view as a SQL query. The database then translates the SQL query into relational algebra operations such as select, project, union, intersect, join, difference, and so on. Once the query is in relational algebra form, there is a bunch of math showing that inserts and deletes (called deltas) from source tables can be fed into the relational expression to compute the difference that must be applied to the materialized view. PostgreSQL&#8217;s <a href="https://wiki.postgresql.org/wiki/Incremental_View_Maintenance">incremental view maintenance</a> wiki is a good starting point if you want to learn more.</p><p>Problem solved, right? Developers can continue to define materialized views in standard SQL, and databases can use fancy math to operate on just the deltas rather than recomputing everything. Unfortunately not. It turns out that the bag algebra approach does not work well for complex and computationally expensive queries, especially those with recursive or nested structures. 
While we&#8217;ve increased our expressiveness and ease of use, our IVM is no longer as cheap as before.</p><h2>A Modern Approach</h2><p>This leads us to the three papers I mentioned at the beginning of this post: <a href="https://dl.acm.org/doi/10.1145/2517349.2522738">Naiad: a timely dataflow system</a>, <a href="https://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper111.pdf">Differential dataflow</a>, and <a href="https://sigmodrecord.org/publications/sigmodRecord/2403/pdfs/20_dbsp-budiu.pdf">DBSP: Incremental Computation on Streams and Its Applications to Databases</a>. These papers collectively present a different way to build an IVM engine that is:</p><ul><li><p>Fast enough to update materialized views on every data change</p></li><li><p>Expressive enough that developers can use query languages such as SQL or <a href="https://en.wikipedia.org/wiki/Datalog">Datalog</a> to define views</p></li><li><p>Flexible enough to be used with stream processors as well as databases</p></li></ul><p><a href="https://en.wikipedia.org/wiki/Frank_McSherry">Frank McSherry</a>, one of the foremost contributors to this space, describes these innovations as a set of tools that differ in how flexible and how opinionated they are. The lowest level, and most flexible, is timely dataflow. Differential dataflow is built on top of timely dataflow, and enforces some requirements on the user in order to calculate incremental updates. DBSP enforces still more requirements on the user in exchange for a simpler implementation. Let&#8217;s look at how these three systems build on one another.</p><h3>Timely Dataflow</h3><p>Timely dataflow provides a model of time that makes it easier to do IVM computation without sacrificing expressiveness. 
Time is modeled as a vector that includes both an epoch as well as a loop counter.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pDIf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4a87c3-2104-4916-9aad-2bd3130bfbf4_780x182.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pDIf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4a87c3-2104-4916-9aad-2bd3130bfbf4_780x182.png 424w, https://substackcdn.com/image/fetch/$s_!pDIf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4a87c3-2104-4916-9aad-2bd3130bfbf4_780x182.png 848w, https://substackcdn.com/image/fetch/$s_!pDIf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4a87c3-2104-4916-9aad-2bd3130bfbf4_780x182.png 1272w, https://substackcdn.com/image/fetch/$s_!pDIf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4a87c3-2104-4916-9aad-2bd3130bfbf4_780x182.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pDIf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4a87c3-2104-4916-9aad-2bd3130bfbf4_780x182.png" width="780" height="182" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a4a87c3-2104-4916-9aad-2bd3130bfbf4_780x182.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:780,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23008,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://materializedview.io/i/161506403?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4a87c3-2104-4916-9aad-2bd3130bfbf4_780x182.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pDIf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4a87c3-2104-4916-9aad-2bd3130bfbf4_780x182.png 424w, https://substackcdn.com/image/fetch/$s_!pDIf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4a87c3-2104-4916-9aad-2bd3130bfbf4_780x182.png 848w, https://substackcdn.com/image/fetch/$s_!pDIf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4a87c3-2104-4916-9aad-2bd3130bfbf4_780x182.png 1272w, https://substackcdn.com/image/fetch/$s_!pDIf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4a87c3-2104-4916-9aad-2bd3130bfbf4_780x182.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Section 2.1 in <a href="https://dl.acm.org/doi/10.1145/2517349.2522738">Naiad: a timely dataflow system</a></figcaption></figure></div><p>As the name implies, loop counters keep 
track of what iteration a specific event is on. I won&#8217;t go into detail on why this is important (see the paper), but it means timely dataflow can support deeply nested loops very easily. Graph processing algorithms, in particular, benefit from this. Indeed, the paper spends a large amount of time talking about this use case.</p><p>The next innovation in timely dataflow is how it handles time. Traditionally, time updates (&#8220;punctuations&#8221;) are sent as special events or metadata in the data stream; this forces time to cascade sequentially through a dataflow. A task receives a new time update and forwards it downstream. Such an approach can cause bottlenecks.</p><p>Timely dataflow uses out-of-band watermark broadcasts. This sounds fancy, but it just means time is tracked outside of data streams&#8212;the data and control planes are separate. Timely dataflow has every task tell every other task in the dataflow how far it&#8217;s processed outside of the data streams. Armed with this information, tasks can get a global view of time and determine when they can make forward progress. Tasks in the same dataflow can even be processing different points in time, or multiple points in time. The takeaway is that the computation can happen more concurrently (and thus faster) than with the bag algebra approach.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>Practically speaking, the API for timely dataflow is fairly simple. It has four methods: SendBy, NotifyAt, OnRecv, OnNotify. The send/recv methods are to send and receive events, and the notify methods are for advancing time. 
I mention this because I like to think of timely dataflow a lot like Hadoop&#8217;s Map/Reduce: a powerful, but low-level framework for data processing.</p><h3>Differential Dataflow</h3><p>Differential dataflow introduces &#8220;differential computation&#8221;:</p><blockquote><p>The novelty of differential computation is twofold: first, the state of the computation varies according to a partially ordered set of versions rather than a totally ordered sequence of versions as is standard for incremental computation; and second, the set of updates required to reconstruct the state at any given version is retained in an indexed data-structure, whereas incremental systems typically consolidate each update in sequence into the &#8220;current&#8221; version of the state and then discard the update.</p></blockquote><p>The paper is very technical (I don&#8217;t understand the math in it). The gist is that it keeps track of multiple versions of data states, arranged in a partial order, rather than just a linear sequence (see the paper for examples). Timely dataflow timestamps are used to track the order. The arrangement allows the system to selectively reuse prior computations, significantly reducing the amount of work when the data is updated.</p><p>Once differential computation is defined, the paper shows (in section 4.3) that standard SQL-like operators can be built above the computation engine. The result is that developers can express incremental views using high-level query languages like SQL or Datalog. The engine handles joins, aggregations, and even recursive computations efficiently. This automation significantly simplifies building and maintaining IVM systems, especially for complex scenarios such as graph analytics or deeply nested computations. 
Continuing with the Hadoop ecosystem metaphor, you might describe this as the <a href="https://pig.apache.org/">Pig</a> or <a href="https://en.wikipedia.org/wiki/Cascading_(software)">Cascading</a> layer.</p><h3>DBSP</h3><p>I had to take one electrical engineering class in college. I remember being blown away when I learned that any circuit could be expressed using nothing but <a href="https://en.wikipedia.org/wiki/NAND_gate">NAND gates</a> (not-and&#8217;s). This property is known as <a href="https://en.wikipedia.org/wiki/NAND_gate#Functional_completeness">functional completeness</a>, a property that it shares with NOR gates.</p><blockquote><p>&#8230;any other logic function (AND, OR, etc.) can be implemented using only NAND gates. An entire processor can be created using NAND gates alone.</p></blockquote><p>I raise this idea because DBSP is based on <a href="https://en.wikipedia.org/wiki/Digital_signal_processing">digital signal processing</a> (DSP). The authors realized that incremental view maintenance looks somewhat similar to differentiation and integration in DSP. And much like the NAND gate phenomenon I learned about in college, the DBSP paper presents four operators (lift, delay, and two operators for recursive programs) that are the foundation for all relational operations in SQL. 
To get a flavor, here&#8217;s how the lift operator is depicted:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lQVb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ddf5a-c7ce-44ba-ba99-249bb0705a8f_914x78.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lQVb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ddf5a-c7ce-44ba-ba99-249bb0705a8f_914x78.png 424w, https://substackcdn.com/image/fetch/$s_!lQVb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ddf5a-c7ce-44ba-ba99-249bb0705a8f_914x78.png 848w, https://substackcdn.com/image/fetch/$s_!lQVb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ddf5a-c7ce-44ba-ba99-249bb0705a8f_914x78.png 1272w, https://substackcdn.com/image/fetch/$s_!lQVb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ddf5a-c7ce-44ba-ba99-249bb0705a8f_914x78.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lQVb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ddf5a-c7ce-44ba-ba99-249bb0705a8f_914x78.png" width="914" height="78" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e4ddf5a-c7ce-44ba-ba99-249bb0705a8f_914x78.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:78,&quot;width&quot;:914,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14133,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://materializedview.io/i/161506403?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ddf5a-c7ce-44ba-ba99-249bb0705a8f_914x78.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lQVb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ddf5a-c7ce-44ba-ba99-249bb0705a8f_914x78.png 424w, https://substackcdn.com/image/fetch/$s_!lQVb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ddf5a-c7ce-44ba-ba99-249bb0705a8f_914x78.png 848w, https://substackcdn.com/image/fetch/$s_!lQVb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ddf5a-c7ce-44ba-ba99-249bb0705a8f_914x78.png 1272w, https://substackcdn.com/image/fetch/$s_!lQVb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ddf5a-c7ce-44ba-ba99-249bb0705a8f_914x78.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This is simply a map operator. The delay operator is also simple. 
It&#8217;s represented in the paper as <em>z</em>&#8315;&#185;, and it delays output by one step.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QuLT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F453f29e1-33de-4a78-9d8b-1766d5e6c435_824x90.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QuLT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F453f29e1-33de-4a78-9d8b-1766d5e6c435_824x90.png 424w, https://substackcdn.com/image/fetch/$s_!QuLT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F453f29e1-33de-4a78-9d8b-1766d5e6c435_824x90.png 848w, https://substackcdn.com/image/fetch/$s_!QuLT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F453f29e1-33de-4a78-9d8b-1766d5e6c435_824x90.png 1272w, https://substackcdn.com/image/fetch/$s_!QuLT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F453f29e1-33de-4a78-9d8b-1766d5e6c435_824x90.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QuLT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F453f29e1-33de-4a78-9d8b-1766d5e6c435_824x90.png" width="824" height="90" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/453f29e1-33de-4a78-9d8b-1766d5e6c435_824x90.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:90,&quot;width&quot;:824,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11886,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://materializedview.io/i/161506403?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F453f29e1-33de-4a78-9d8b-1766d5e6c435_824x90.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QuLT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F453f29e1-33de-4a78-9d8b-1766d5e6c435_824x90.png 424w, https://substackcdn.com/image/fetch/$s_!QuLT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F453f29e1-33de-4a78-9d8b-1766d5e6c435_824x90.png 848w, https://substackcdn.com/image/fetch/$s_!QuLT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F453f29e1-33de-4a78-9d8b-1766d5e6c435_824x90.png 1272w, https://substackcdn.com/image/fetch/$s_!QuLT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F453f29e1-33de-4a78-9d8b-1766d5e6c435_824x90.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Using circuit design and relational algebra, the paper is able to show that arbitrary SQL queries can be translated into DBSP circuits. 
Once converted into a DBSP circuit, the paper then shows that the circuit can be converted to an incremental DBSP circuit. This is a really powerful idea. DBSP can convert any batch-based SQL to an incremental query.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mOZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf3ccef-9bd3-4cb6-a3a0-bf86d67fd2e9_1860x1214.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mOZT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf3ccef-9bd3-4cb6-a3a0-bf86d67fd2e9_1860x1214.png 424w, https://substackcdn.com/image/fetch/$s_!mOZT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf3ccef-9bd3-4cb6-a3a0-bf86d67fd2e9_1860x1214.png 848w, https://substackcdn.com/image/fetch/$s_!mOZT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf3ccef-9bd3-4cb6-a3a0-bf86d67fd2e9_1860x1214.png 1272w, https://substackcdn.com/image/fetch/$s_!mOZT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf3ccef-9bd3-4cb6-a3a0-bf86d67fd2e9_1860x1214.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mOZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf3ccef-9bd3-4cb6-a3a0-bf86d67fd2e9_1860x1214.png" width="1456" height="950" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/daf3ccef-9bd3-4cb6-a3a0-bf86d67fd2e9_1860x1214.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:950,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:343309,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://materializedview.io/i/161506403?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf3ccef-9bd3-4cb6-a3a0-bf86d67fd2e9_1860x1214.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mOZT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf3ccef-9bd3-4cb6-a3a0-bf86d67fd2e9_1860x1214.png 424w, https://substackcdn.com/image/fetch/$s_!mOZT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf3ccef-9bd3-4cb6-a3a0-bf86d67fd2e9_1860x1214.png 848w, https://substackcdn.com/image/fetch/$s_!mOZT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf3ccef-9bd3-4cb6-a3a0-bf86d67fd2e9_1860x1214.png 1272w, https://substackcdn.com/image/fetch/$s_!mOZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf3ccef-9bd3-4cb6-a3a0-bf86d67fd2e9_1860x1214.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>DBSP does make some tradeoffs when compared to differential dataflow. It simplifies the programming model by constraining how time and state management occur. This simplification limits some of the concurrency gains we see in timely and differential dataflow. In exchange for this, DBSP offers a simpler, more accessible framework for typical database and stream processing workloads.</p><h2>Putting it all Together</h2><p>Ultimately, the progression from timely dataflow to differential dataflow and then to DBSP shows a clear trajectory: moving from highly flexible, low-level frameworks toward more structured, easier-to-use incremental computation systems.</p><p>The ideas pioneered by these three papers underlie many of the recent incremental view maintenance implementations. 
Materialize leverages differential dataflow directly, while newer entrants like Feldera are built on DBSP. Even frontend tools are shifting toward differential-like incremental maintenance paradigms, demonstrating the broad applicability and utility of this research.</p><p>Yet, IVM engines still have a long way to go. IVM in stream processors is still a work in progress. Many databases have missing or incomplete IVM implementations. A PostgreSQL implementation would be a very big deal. Perhaps <a href="https://www.paradedb.com/">ParadeDB</a> will build this next? <a href="https://www.linkedin.com/in/nikhilbenesch/">Nikhil Benesch</a>, former CTO of Materialize, apparently contemplated this early on:</p><div class="bluesky-wrap outer" style="height: auto; display: flex; margin-bottom: 24px;" data-attrs="{&quot;postId&quot;:&quot;3lbgdjwghyj2j&quot;,&quot;authorDid&quot;:&quot;did:plc:j626gwnrbuubdncs3dmm5hzg&quot;,&quot;authorName&quot;:&quot;Nikhil Benesch&quot;,&quot;authorHandle&quot;:&quot;benesch.bsky.social&quot;,&quot;authorAvatarUrl&quot;:&quot;https://cdn.bsky.app/img/avatar/plain/did:plc:j626gwnrbuubdncs3dmm5hzg/bafkreih4xiohnilhz7jgiyqkgn5vfgxxphtalit45jrsmjgfdsgjdudpya@jpeg&quot;,&quot;text&quot;:&quot;We considered building Materialize as a Postgres extension, back when we were getting started. But stitching together Postgres&#8217;s C codebase with differential dataflow&#8217;s Rust codebase would have been a nightmare. 
I think you probably need to do a ground up rewrite of DD in C.&quot;,&quot;createdAt&quot;:&quot;2024-11-21T01:42:06.876Z&quot;,&quot;uri&quot;:&quot;at://did:plc:j626gwnrbuubdncs3dmm5hzg/app.bsky.feed.post/3lbgdjwghyj2j&quot;,&quot;imageUrls&quot;:[]}" data-component-name="BlueskyCreateBlueskyEmbed"><iframe id="bluesky-3lbgdjwghyj2j" data-bluesky-id="02192683820225594" src="https://embed.bsky.app/embed/did:plc:j626gwnrbuubdncs3dmm5hzg/app.bsky.feed.post/3lbgdjwghyj2j?id=02192683820225594" width="100%" style="display: block; flex-grow: 1;" frameborder="0" scrolling="no"></iframe></div><p>Differential dataflow is just really complicated. A few years ago, <a href="https://www.scattered-thoughts.net/">Jamie Brandon</a> asked, <a href="https://www.scattered-thoughts.net/writing/why-isnt-differential-dataflow-more-popular/">Why isn't differential dataflow more popular?</a> His post is worth reading. The first <a href="https://news.ycombinator.com/item?id=25869953">Hacker News response</a> has a ring of truth to it:</p><blockquote><p>Indirectly answering the question - I've skimmed through the git README, the abstract and all the pictures in the academic paper that it references.</p><p>I have no idea what this thing does. Can someone explain in simple terms what it does?</p></blockquote><p>But systems like those I list at the top of this post make it more accessible. Rather than complex APIs and semantics, developers can use SQL. DBSP could unlock more IVM solutions, too. 
Getting this technology into everyone&#8217;s hands will be a very big deal.</p><p><em>A special thanks to <a href="https://en.wikipedia.org/wiki/Frank_McSherry">Frank McSherry</a>, <a href="https://www.linkedin.com/in/lalith-suresh-34bb8911">Lalith Suresh</a>, and <a href="https://www.morling.dev/">Gunnar Morling</a> for feedback on early drafts.</em></p><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Frank McSherry was quick to point out that out-of-band was borrowed from <a href="https://citeseerx.ist.psu.edu/document?repid=rep1&amp;type=pdf&amp;doi=b918f77d119e874a5b3df9728b6eb9e2713e9de2">Out-of-Order Processing: A New Architecture for High-Performance Stream Systems</a>. <a href="https://www.linkedin.com/in/takidau">Tyler Akidau</a> et al. also cite the OOP paper in <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf">MillWheel: Fault-Tolerant Stream Processing at Internet Scale</a>, which influenced Google&#8217;s Dataflow and Beam products.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[AI IDEs Need Moats]]></title><description><![CDATA[VS Code eliminated the switching cost for AI IDEs. They need to build moats to survive. 
Partnering with software vendors and new open source projects could help.]]></description><link>https://materializedview.io/p/ai-ides-need-moats</link><guid isPermaLink="false">https://materializedview.io/p/ai-ides-need-moats</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Wed, 12 Mar 2025 19:25:02 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/76e9d59e-6b15-40a1-934e-4ae575e985c0_1792x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://martin.kleppmann.com/">Martin Kleppmann</a> and I gave a keynote interview for the <a href="https://www.scylladb.com/monster-scale-summit/on-demand/">Monster SCALE</a> conference today. I always enjoy <a href="https://www.scylladb.com/">ScyllaDB&#8217;s</a> conferences and this one was no exception. Check out our interview <a href="https://www.scylladb.com/tech-talk/designing-data-intensive-applications-in-2025/">here</a>. While you&#8217;re at it, watch <a href="https://www.linkedin.com/in/agavra">Almog Gavra&#8217;s</a> <a href="https://slatedb.io">SlateDB</a> talk, too!</p><div><hr></div><p>I&#8217;ve avoided talking about AI on my newsletter thus far. The space is moving so rapidly. It feels futile to try and keep up, and anything written becomes stale very quickly. Still, I have been using AI daily for the past few years, primarily for coding. First with <a href="https://chat.openai.com/">ChatGPT</a>, then with <a href="https://codeium.com/windsurf">Windsurf</a>. I&#8217;m going to break my embargo for this post to cover some recent discussions I&#8217;ve had about AI IDEs like Windsurf and <a href="https://www.cursor.com/">Cursor</a>.</p><p>For the non-developers out there, Windsurf is an AI-powered agentic <a href="https://en.wikipedia.org/wiki/Integrated_development_environment">integrated development environment</a> (IDE) akin to Cursor. In layman&#8217;s terms, these are applications that developers use to write code alongside an AI. 
Both Windsurf and Cursor are built on <a href="https://code.visualstudio.com/">Visual Studio Code</a>, and both are quite effective. I primarily use Windsurf&#8217;s AI features when writing unit tests and small scripts, be it Python, Bash, or GitHub Actions YAML. A few weeks back, I decided to check in and see if I should switch to Cursor.</p><div class="bluesky-wrap outer" style="height: auto; display: flex; margin-bottom: 24px;" data-attrs="{&quot;postId&quot;:&quot;3lj4r7uirik24&quot;,&quot;authorDid&quot;:&quot;did:plc:cwx2zxldt3uxciob3nxzhkzr&quot;,&quot;authorName&quot;:&quot;Chris&quot;,&quot;authorHandle&quot;:&quot;chris.blue&quot;,&quot;authorAvatarUrl&quot;:&quot;https://cdn.bsky.app/img/avatar/plain/did:plc:cwx2zxldt3uxciob3nxzhkzr/bafkreihjsl6ok5snljwgnm2brgptnqa6xqgyt7a3hsp7nt4bbwuym2cd2a@jpeg&quot;,&quot;text&quot;:&quot;Should I be using Windsurf or Cursor? What&#8217;s the verdict right now&#8230;&quot;,&quot;createdAt&quot;:&quot;2025-02-27T01:41:57.827Z&quot;,&quot;uri&quot;:&quot;at://did:plc:cwx2zxldt3uxciob3nxzhkzr/app.bsky.feed.post/3lj4r7uirik24&quot;,&quot;imageUrls&quot;:[]}" data-component-name="BlueskyCreateBlueskyEmbed"><iframe id="bluesky-3lj4r7uirik24" data-bluesky-id="6650406869241925" src="https://embed.bsky.app/embed/did:plc:cwx2zxldt3uxciob3nxzhkzr/app.bsky.feed.post/3lj4r7uirik24?id=6650406869241925" width="100%" style="display: block; flex-grow: 1;" frameborder="0" scrolling="no"></iframe></div><p>There was no consensus. My impression from the comments was that Windsurf was slightly better than Cursor. I decided to stay with Windsurf since I was already using it. 
Fast forward a few weeks, and <a href="https://x.com/t_blom/status/1898903125735703025">this post</a> on Twitter caught my eye:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BmRr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2267b1c-37fa-4773-b1c4-a8ff42b41a6a_1370x542.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BmRr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2267b1c-37fa-4773-b1c4-a8ff42b41a6a_1370x542.png 424w, https://substackcdn.com/image/fetch/$s_!BmRr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2267b1c-37fa-4773-b1c4-a8ff42b41a6a_1370x542.png 848w, https://substackcdn.com/image/fetch/$s_!BmRr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2267b1c-37fa-4773-b1c4-a8ff42b41a6a_1370x542.png 1272w, https://substackcdn.com/image/fetch/$s_!BmRr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2267b1c-37fa-4773-b1c4-a8ff42b41a6a_1370x542.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BmRr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2267b1c-37fa-4773-b1c4-a8ff42b41a6a_1370x542.png" width="1370" height="542" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2267b1c-37fa-4773-b1c4-a8ff42b41a6a_1370x542.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:542,&quot;width&quot;:1370,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:127400,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://materializedview.io/i/158782298?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2267b1c-37fa-4773-b1c4-a8ff42b41a6a_1370x542.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!BmRr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2267b1c-37fa-4773-b1c4-a8ff42b41a6a_1370x542.png 424w, https://substackcdn.com/image/fetch/$s_!BmRr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2267b1c-37fa-4773-b1c4-a8ff42b41a6a_1370x542.png 848w, https://substackcdn.com/image/fetch/$s_!BmRr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2267b1c-37fa-4773-b1c4-a8ff42b41a6a_1370x542.png 1272w, https://substackcdn.com/image/fetch/$s_!BmRr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2267b1c-37fa-4773-b1c4-a8ff42b41a6a_1370x542.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><a href="https://x.com/t_blom/status/1898903125735703025">View Post</a></figcaption></figure></div><p>Tom&#8217;s post validates my impression, but it also raises an interesting question about moats. As Tom says, since both Windsurf and Cursor are forks of VS Code with nearly identical interfaces, the switching cost is nearly zero. When I switched from VS Code to Windsurf, it took all of 30 minutes. I haven&#8217;t looked back. 
Switching from Windsurf to Cursor, I&#8217;m confident, would be the same.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LoxH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5141f-b013-4817-95e1-a9d55dbb685d_1370x468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LoxH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5141f-b013-4817-95e1-a9d55dbb685d_1370x468.png 424w, https://substackcdn.com/image/fetch/$s_!LoxH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5141f-b013-4817-95e1-a9d55dbb685d_1370x468.png 848w, https://substackcdn.com/image/fetch/$s_!LoxH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5141f-b013-4817-95e1-a9d55dbb685d_1370x468.png 1272w, https://substackcdn.com/image/fetch/$s_!LoxH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5141f-b013-4817-95e1-a9d55dbb685d_1370x468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LoxH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5141f-b013-4817-95e1-a9d55dbb685d_1370x468.png" width="1370" height="468" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ae5141f-b013-4817-95e1-a9d55dbb685d_1370x468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:468,&quot;width&quot;:1370,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121153,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://materializedview.io/i/158782298?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5141f-b013-4817-95e1-a9d55dbb685d_1370x468.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LoxH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5141f-b013-4817-95e1-a9d55dbb685d_1370x468.png 424w, https://substackcdn.com/image/fetch/$s_!LoxH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5141f-b013-4817-95e1-a9d55dbb685d_1370x468.png 848w, https://substackcdn.com/image/fetch/$s_!LoxH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5141f-b013-4817-95e1-a9d55dbb685d_1370x468.png 1272w, https://substackcdn.com/image/fetch/$s_!LoxH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5141f-b013-4817-95e1-a9d55dbb685d_1370x468.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption"><a href="https://x.com/Steve8708/status/1898505242758852757">View Post</a></figcaption></figure></div><p>It&#8217;s easy to think these IDEs are destined to be commodities. I don&#8217;t think so. The fact that Windsurf is unseating Cursor right now is a signal that, even with commodity LLMs, Windsurf is offering a better product than Cursor. It does make the space highly competitive, however. As Steve&#8217;s post says, the IDEs have to be more than API calls to OpenAI and Anthropic. What might this look like?</p><p>Last week, <a href="https://www.linkedin.com/in/josh-wills-13882b">Josh Wills</a> and I were discussing AI IDEs over coffee (<a href="https://jobs.ashbyhq.com/DatologyAI">he&#8217;s hiring</a> at <a href="https://www.datologyai.com/">DatologyAI</a>, by the way!). 
Josh made an off-hand comment that LLMs really struggle with <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html">dataframe</a> libraries that aren&#8217;t <a href="https://pandas.pydata.org/">Pandas</a>; they hallucinate and assume you&#8217;re using Pandas. This resonated with me. I had the exact same experience when trying to learn <a href="https://ziglang.org/">Zig</a> in 2023. The language was very new, and similar enough to <a href="https://go.dev/">Go</a> and <a href="https://www.python.org/">Python</a> that ChatGPT would intermingle the three languages together.</p><div class="bluesky-wrap outer" style="height: auto; display: flex; margin-bottom: 24px;" data-attrs="{&quot;postId&quot;:&quot;3ljsph2zsic2q&quot;,&quot;authorDid&quot;:&quot;did:plc:cwx2zxldt3uxciob3nxzhkzr&quot;,&quot;authorName&quot;:&quot;Chris&quot;,&quot;authorHandle&quot;:&quot;chris.blue&quot;,&quot;authorAvatarUrl&quot;:&quot;https://cdn.bsky.app/img/avatar/plain/did:plc:cwx2zxldt3uxciob3nxzhkzr/bafkreihjsl6ok5snljwgnm2brgptnqa6xqgyt7a3hsp7nt4bbwuym2cd2a@jpeg&quot;,&quot;text&quot;:&quot;Was talking with @spite.vc yesterday. He pointed out an interesting dynamic with LLMs and AI IDEs. The tools that the LLMs can best reason about will be more broadly adopted, which creates more training data, which improves the LLM. 
Leads to a kind of lock-in/monopoly for those tools.&quot;,&quot;createdAt&quot;:&quot;2025-03-07T19:08:46.317Z&quot;,&quot;uri&quot;:&quot;at://did:plc:cwx2zxldt3uxciob3nxzhkzr/app.bsky.feed.post/3ljsph2zsic2q&quot;,&quot;imageUrls&quot;:[]}" data-component-name="BlueskyCreateBlueskyEmbed"><iframe id="bluesky-3ljsph2zsic2q" data-bluesky-id="44234097160646035" src="https://embed.bsky.app/embed/did:plc:cwx2zxldt3uxciob3nxzhkzr/app.bsky.feed.post/3ljsph2zsic2q?id=44234097160646035" width="100%" style="display: block; flex-grow: 1;" frameborder="0" scrolling="no"></iframe></div><p>This dynamic is interesting: LLMs struggle with new infrastructure and tools. The same is true for proprietary software&#8212;both internal company code and closed source vendor software. There isn&#8217;t enough public data on the internet to train the LLMs on such software.</p><p>A poor LLM experience with new and proprietary software makes for an interesting virtuous cycle. Developers get a better experience working with legacy software that has thousands of Stack Overflow questions and GitHub repositories. A better experience increases the adoption cost for new software (or decreases the cost of using legacy software). Developers will stick with the existing software, which will give the AI IDEs and LLMs yet more data to train on. This, in turn, will further improve the developer experience for the existing software.</p><p>Proprietary vendors and new open source projects need to figure out how to break this cycle. Offering software that is far better than incumbents could justify the increased switching cost. Vendors could also build their own plugins that help with <a href="https://www.promptingguide.ai/">prompt engineering</a> and <a href="https://aws.amazon.com/what-is/retrieval-augmented-generation/">retrieval-augmented generation</a> (RAG). 
Anthropic&#8217;s <a href="https://www.anthropic.com/news/model-context-protocol">Model Context Protocol</a> (MCP), <a href="https://llmstxt.org/">llms.txt</a>, and <a href="https://github.com/wild-card-ai/agents-json">agents.json</a> standard might also be adopted. Indeed, <a href="https://www.thenile.dev/">Nile</a> &#65129; announced an <a href="https://github.com/niledatabase/nile-mcp-server">MCP server</a> for its product as I write this post.</p><p>AI IDEs like Windsurf and Cursor seem well positioned to help with the adoption problem. IDE companies could partner with vendors to integrate tightly into the IDE&#8217;s agents. The companies might also help with fine-tuned models, smaller LLMs purpose-built on curated data, custom integrations, VS Code plugins, or even preferred placement in the IDE&#8212;all for specific languages, tools, and vendors. A partnership between Windsurf and a company bringing a new software product to market would be symbiotic. Windsurf would build a better (more sticky) IDE while the company it partners with would gain a better developer experience. Windsurf might even be able to charge for such services.</p><p>AI IDE companies could extend this work to internal company software as well. This is where <a href="https://sourcegraph.com">Sourcegraph</a> (the company that makes <a href="https://sourcegraph.com/cody">Cody</a>) got its start. Internal company codebases are not typically publicly available, so LLMs know less about them. Both Cursor and Windsurf are already selling into the enterprise; they should be able to offer a wide range of tools for this use case.</p><p>And then there is <a href="https://github.com/">GitHub</a>. In the Bluesky thread above, <a href="https://bsky.app/profile/qianli.dev">Qian Li</a> <a href="https://bsky.app/profile/qianli.dev/post/3ljtbvx4x2s2g">pointed out</a> that GitHub already has access to many proprietary (private) repositories. They also offer Copilot. 
They, too, seem well positioned to compete as an AI IDE. I&#8217;m less confident in their ability to execute, however. Copilot has lagged behind its competitors despite strong competitive advantages.</p><p>So yes, AI IDEs are a very competitive space, but I believe they&#8217;ll develop moats around specialized models, partnerships, proprietary training data, and more.</p><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p><p></p>]]></content:encoded></item><item><title><![CDATA[These Aren’t the Catalogs You’re Looking For]]></title><description><![CDATA[Metadata is useful for data warehouses, streaming systems, humans, and machines. We don't need different catalogs for each use case, though. Catalog convergence is here.]]></description><link>https://materializedview.io/p/these-arent-the-catalogs-youre-looking</link><guid isPermaLink="false">https://materializedview.io/p/these-arent-the-catalogs-youre-looking</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Fri, 21 Feb 2025 00:44:24 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/166c6083-6b2d-47d5-8c20-a4fb2d925129_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Housekeeping! First up, Materialized View hit a new landmark while I wasn&#8217;t looking. 5,000 subscribers! Thanks again for all of your support.</p><p>Earlier this month, I joined <a href="https://infrapod.io">The Infra Pod</a> to chat about all things infrastructure. 
Give it a listen:</p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a6579c4c1584b205a653d5f01&quot;,&quot;title&quot;:&quot;The Infra Pod&quot;,&quot;subtitle&quot;:&quot;The Infra Pod&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/7qZ7hDmHEQdAOLaJvi7S9o&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/7qZ7hDmHEQdAOLaJvi7S9o" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><p>Finally, I re-did my personal blog, <a href="https://cnr.sh">cnr.sh</a>, as well. I am planning to do some non-infrastructure writing this year. If you&#8217;d like to keep up to date, please do follow my <a href="https://cnr.sh/posts/comparing-apache-cncf-commonhaus/">newsletter</a> or <a href="https://cnr.sh/rss.xml">RSS feed</a> there. I just published my first post, <a href="https://cnr.sh/posts/comparing-apache-cncf-commonhaus/">Comparing Apache, CNCF, and Commonhaus</a>.</p><p>The blog launch led to a couple of fun Python libraries: <a href="https://github.com/criccomini/markupdown">markupdown</a> and <a href="https://github.com/criccomini/python-docstring-markdown">python-docstring-markdown</a>. Markupdown is the static site generator I&#8217;ve always wanted. It acts more like a build system than a classical SSG. python-docstring-markdown was just scratching an itch I had: I wanted to generate a simple DOCUMENTATION.md file from Python docstrings in my Python libraries.</p><div><hr></div><p><a href="https://github.com/gabledata/recap">Recap</a> is a project I started working on a few years ago. It started as a <a href="https://cnr.sh/posts/2023-01-05-recap-for-people-who-hate-data-catalogs/">tiny data catalog for machines</a> and morphed into a type system. 
It provides a single schema definition language (SDL) that can describe database schemas, event schemas such as <a href="https://avro.apache.org/">Avro</a>, and web service schemas such as <a href="https://json-schema.org/">JSON schema</a> and <a href="https://github.com/protocolbuffers/protobuf">Protobuf</a>. The project has now been adopted by <a href="https://www.gable.ai/">Gable.ai</a>; they use it as the SDL for their data contracts.</p><p>One of Recap&#8217;s features was its <a href="https://github.com/gabledata/recap/tree/main/recap/clients">client framework</a>. I implemented readers for <a href="https://cloud.google.com/bigquery?hl=en">BigQuery</a>, <a href="https://www.confluent.io/">Confluent&#8217;s</a> <a href="https://docs.confluent.io/platform/current/schema-registry/index.html">schema registry</a>, and a slew of databases and filesystems. I began to realize that these systems were all data catalogs. Some called themselves registries, others were <em><a href="https://en.wikipedia.org/wiki/Information_schema">information_schema</a></em>, and many called themselves catalogs. But they were all handling metadata about your data; they seemed 80% the same. Yet they were being used for different things.</p><div class="bluesky-wrap outer" style="height: auto; display: flex; margin-bottom: 24px;" data-attrs="{&quot;postId&quot;:&quot;3l7k2qc4x5i2g&quot;,&quot;authorDid&quot;:&quot;did:plc:3cchvbnsd7q6ihlctq7q3dza&quot;,&quot;authorName&quot;:&quot;&quot;,&quot;authorHandle&quot;:&quot;archive.chris.blue&quot;,&quot;authorAvatarUrl&quot;:&quot;https://cdn.bsky.app/img/avatar/plain/did:plc:3cchvbnsd7q6ihlctq7q3dza/bafkreib6vqc7wdc4encwoyell3aprjj6534sopigjddbuxgl3i6m7zlvrq@jpeg&quot;,&quot;text&quot;:&quot;For your consideration. 
There are two different kinds of catalogs:\n\n- Single system data catalogs (like Iceberg Catalog, HMS)\n- Multi-system data catalogs (like Amundsen, Datahub, etc)\n\nSchema registries fall somewhere in here, too.\n\nNot a full-fledged thought. Feedback welcome. &quot;,&quot;createdAt&quot;:&quot;2023-10-23T17:30:08.000Z&quot;,&quot;uri&quot;:&quot;at://did:plc:3cchvbnsd7q6ihlctq7q3dza/app.bsky.feed.post/3l7k2qc4x5i2g&quot;,&quot;imageUrls&quot;:[&quot;https://cdn.bsky.app/img/feed_thumbnail/plain/did:plc:3cchvbnsd7q6ihlctq7q3dza/bafkreici3vk345erbk7kak2rs754gem344j52a5m65lhknor4a6mysrxre@jpeg&quot;]}" data-component-name="BlueskyCreateBlueskyEmbed"><iframe id="bluesky-3l7k2qc4x5i2g" data-bluesky-id="5097283781704969" src="https://embed.bsky.app/embed/did:plc:3cchvbnsd7q6ihlctq7q3dza/app.bsky.feed.post/3l7k2qc4x5i2g?id=5097283781704969" width="100%" style="display: block; flex-grow: 1;" frameborder="0" scrolling="no"></iframe></div><p>Registries are typically used to store event schemas for streaming systems in Avro, Protocol Buffers, and JSON Schema. Service catalogs such as <a href="https://backstage.io/docs/overview/what-is-backstage">Backstage</a> and <a href="https://www.apicur.io/registry/">Apicurio</a> serve a similar purpose for web services. <em>information_schema</em>-style metadata is used in both <a href="https://en.wikipedia.org/wiki/Online_transaction_processing">OLTP</a> and <a href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a> databases. &#8220;Traditional&#8221; data catalogs such as <a href="https://www.collibra.com/">Collibra</a>, <a href="https://atlan.com/">Atlan</a>, <a href="https://www.alation.com/">Alation</a>, <a href="https://open-metadata.org/">Open Metadata</a>, and <a href="https://datahubproject.io/">Datahub</a> began with data discovery use cases.</p><p>Each use case has a distinct yet overlapping set of requirements. 
Registries must be low latency and highly available since realtime infrastructure often queries them in production. They must also track schema compatibility to prevent breaking changes. Similarly, OLTP metadata needs to be very fast since it&#8217;s being used by query engines serving production traffic. OLAP metadata, on the other hand, need not be as low latency. Instead, tracking things like lineage, schema evolution, access controls, and sensitive information is important. Service catalogs are concerned with discovery, but also schema compatibility.</p><p>I&#8217;m not convinced that each of these use cases deserves its own distinct catalog. Each use case has separate requirements, but it seems to me that one system could service them all.</p><p>Indeed, I&#8217;m beginning to notice a lot of convergence in this space. Backstage, which started as a microservice catalog, now covers data pipelines, machine learning models, and more. Datahub, which began as a data discovery tool, <a href="https://github.com/datahub-project/datahub/issues/12133">is going to adopt Iceberg&#8217;s REST API</a>. Confluent&#8217;s <a href="https://www.confluent.io/blog/introducing-tableflow/">Tableflow</a> now speaks Iceberg&#8217;s REST API and converts schemas for their registry. <a href="https://github.com/lakekeeper/lakekeeper">LakeKeeper</a>, an Iceberg REST-compatible data catalog, has adopted <a href="https://github.com/lakekeeper/lakekeeper?tab=readme-ov-file#scope-and-features">data contract</a> features similar to those you find in Confluent&#8217;s schema registry or Gable&#8217;s data contract product. Datahub has also tried to <a href="https://datahubproject.io/docs/managed-datahub/observe/data-contract/">adopt data contracts</a>.</p><p>Catalog convergence makes sense. We spent the last 10-15 years doing data integration. 
Data pipelines are <a href="https://medium.com/@kessler.viktor/supply-chains-already-solved-this-what-metadata-can-learn-f9c3b3525721">supply chains now</a>. All of our data (and metadata) flows through OLTP, streaming systems, batch, and data warehouse systems. It no longer makes sense to have a separate catalog for each use case.</p><p>The optimist in me hopes the convergence continues; &#8220;data catalog&#8221; should mean only one thing. If Iceberg&#8217;s REST API evolves fast enough, it might become the lingua franca that we need. The realist in me knows that this is unlikely. Instead, we&#8217;ll probably end up with catalog gateways or proxies.</p><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, 
https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, 
https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[Epsio, IVMs, and Differential Dataflow With Gilad Kleinman]]></title><description><![CDATA[Gilad Kleinman, CEO of Epsio, talks with me about incremental view maintenance (IVM). 
We discuss use cases, the history of IVM, how Epsio is different, and where the field is headed.]]></description><link>https://materializedview.io/p/epsio-ivms-differential-dataflow</link><guid isPermaLink="false">https://materializedview.io/p/epsio-ivms-differential-dataflow</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Tue, 11 Feb 2025 17:03:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/27aae483-5941-48e3-bf14-e96663589372_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;m planning to spend some time with differential dataflow, timely dataflow, and incremental materialized views. As a first step, I talked with <a href="https://www.linkedin.com/in/gilad-k/">Gilad Kleinman</a>, co-founder and CEO of <a href="https://epsio.io">Epsio</a>. Epsio is an incremental view stream processor that feeds off a database&#8217;s replication feed and writes materialized data back to the DB.</p><p>Before co-founding Epsio, Gilad worked as a product manager for <a href="https://www.hpe.com/us/en/solutions/axis-security.html">Axis Security</a> and was an R&amp;D team lead for <a href="https://en.wikipedia.org/wiki/Unit_8200">Israeli Military Intelligence</a>.</p><div><hr></div><p><em><strong>C.R.: Let's start with the basics to get everyone on the same page.
Can you give me a brief definition of what streaming SQL is and what it's useful for?</strong></em></p><p>G.K.: Unlike "traditional" SQL that processes a dataset at a specific point in time, streaming SQL refers to constantly processing a stream of changes and understanding how a given change in the underlying data affected the previously outputted result&#8212;without ever re-processing the entire dataset.</p><p>For example, if you are running a query that calculates the count of items in a warehouse per category, a streaming SQL engine will output the initial result of that query and just do "plus one" to the relevant category whenever an item gets added and "minus one" when an item gets removed. Although this specific example is a fairly easy one to imagine, the same concept could be applied to much more complex queries containing many joins, subqueries, window functions, and so on.</p><p>In scenarios where an application runs a predefined set of queries on a dataset (e.g. in-app analytics, data modeling, reporting, etc.), streaming SQL can help you get faster and cheaper results by orders of magnitude compared to reprocessing the entire dataset each time you run the query.</p><p><em><strong>C.R.: There are so many streaming SQL offerings out there: <a href="https://materialize.com/">Materialize</a>, <a href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/table/sql/">Flink SQL</a>, <a href="https://www.feldera.com/">Feldera</a>, <a href="https://risingwave.com/">RisingWave</a>, and others. I tend to split solutions into one of two categories: streaming databases (RisingWave, Materialize) and stream processing systems (Flink SQL, <a href="https://www.confluent.io/blog/ksql-streaming-sql-for-apache-kafka/">kSQL</a>). Does this match your mental model? And what differentiates Epsio from these systems?</strong></em></p><p>G.K.: This model makes a lot of sense.
Stream processing systems like Flink SQL and kSQL focus on outputting changes in query results to downstream systems (e.g., Kafka), leaving users responsible for consuming and processing these changes later. On the other hand, streaming databases go a step further by materializing up-to-date query results based on the stream of changes and serving them directly to clients, much like traditional databases. Unlike stream processing systems, streaming databases often allow users to connect via SQL clients (usually PostgreSQL) to interact with the data, run ad hoc queries, create indexes, and perform other typical database operations.</p><p>At Epsio, we&#8217;re huge fans of PostgreSQL and MySQL. Our goal is to bring the benefits of streaming databases to the existing PostgreSQL/MySQL deployments that organizations already use and love. Epsio achieves this by natively consuming the CDC stream from these databases and updating result tables directly within the original database. With this architecture, creating a new &#8220;streaming query&#8221; is as simple as calling a stored procedure in your current PostgreSQL/MySQL instance and querying the resulting table&#8212;essentially transforming your existing database into a streaming database.</p><p>Since Epsio's results sit within the original database, organizations get the best of both worlds &#8212; a robust streaming SQL engine, with all the amazing features MySQL/PostgreSQL already have to offer (diverse indexing options, table partitioning, constraints, etc.) and without needing to migrate to a new database.</p><p><em><strong>C.R.: Streaming SQL without having to add a new database is very appealing. Though, it sounds like Epsio still needs to run as a separate process on a separate machine. 
How would you compare your solution to PostgreSQL&#8217;s <a href="https://github.com/sraoss/pg_ivm">pg_ivm</a> extension?</strong></em></p><p>G.K.: When we originally built Epsio, we started out by building it as a PostgreSQL extension (similar to pg_ivm). Pretty quickly, we changed the architecture to what we currently offer based on pretty strong feedback we got from initial users/customers.</p><p>Other than the fact that most managed database offerings (<a href="https://aws.amazon.com/rds/">RDS</a>, <a href="https://cloud.google.com/sql">Cloud SQL</a>, and so on) don't allow users to install unauthorized extensions, adopting new database technologies is a pretty scary endeavor. We found that asking companies to install a new extension (that could potentially crash) on their production database was a pretty big ask to make. By sitting "behind" the existing database, reading CDC logs, and writing back results to the original database, users can integrate Epsio without worrying about affecting anything other than the results tables it needs to maintain. We even actively recommend not giving Epsio permissions to anything other than that.</p><p>Additionally, in a world where PostgreSQL and MySQL don't scale out elegantly, we found that being able to offload some computations to a separate instance using Epsio was actually a benefit to many customers who were already reaching the limits of what a single PostgreSQL or MySQL instance could do.</p><p>Regarding pg_ivm &#8212; other than the mentioned points above regarding being an extension, I believe it is currently still a bit of a "batch processor in disguise" and not ripe yet for real-world use cases. Doing streaming SQL at scale for real-world use cases is no easy task, and I believe the way pg_ivm stores data (PG tables vs.
storage engine built for streaming), passes data between nodes ("static rows" and not "changes in data" that can be consolidated), and executes queries (still missing support for basic operators like OUTER JOIN) just isn't enough yet for the use cases we see companies needing.</p><p><em><strong>C.R.: The isolation Epsio offers by running on a separate instance is compelling. Ultimately, though, it has to write back into the database. In my experience, it&#8217;s possible for stream processing systems to overwhelm OLTP systems that they write into.</strong></em></p><p><em><strong>One example would be an Epsio instance that is offline for an extended period of time, then comes back and has to catch up by reading all of the CDC logs it missed. In such a scenario, I could imagine Epsio inserting data very frequently for a burst of time. How does Epsio prevent such issues?</strong></em></p><p>G.K.: One of the exciting things in streaming SQL technologies is the concept of consolidations. To avoid redundant calculations/writes back to the DB, consumed changes are "consolidated" internally before writing back to the database. This means that, for example, if you have 1M changes that affect a specific COUNT aggregation, instead of updating the result COUNT 1M times, all the changes will be consolidated into a single UPDATE to reflect the latest count after processing all the changes.
Another example is if a row gets updated 100 times &#8212; the changes will be "consolidated" into a single update of the last value of the row.</p><p>Additionally, since the engine ensures that the outputted results are transactionally correct (i.e., a single transaction in the PostgreSQL/MySQL base tables is never split across multiple transactions updating the results table), Epsio actually only uses up to a single connection per results table to update results, significantly limiting the amount of stress a single view can generate in the database (since both PostgreSQL and MySQL don't use more than one core per INSERT/DELETE operation).</p><p>Having said all the above, every piece of software has its edge cases. If you have hundreds of large views that need to be fully rewritten after a long downtime, it might still be smart to limit the number of connections the `epsio_user` (the user with limited permissions created for Epsio to have access to the database) can create on the database!</p><p><em><strong>C.R.: There seem to be a few different ways to implement incremental materialized views. Companies like <a href="https://materialize.com/">Materialize</a> and <a href="https://www.feldera.com/">Feldera</a> use <a href="https://dl.acm.org/doi/10.1145/2517349.2522738">timely dataflow</a>, <a href="https://github.com/TimelyDataflow/differential-dataflow">differential dataflow</a>, and <a href="https://www.vldb.org/pvldb/vol16/p1601-budiu.pdf">DBSP</a>. Can you give me a 10,000 foot view of these approaches, and which you chose for Epsio?</strong></em></p><p>G.K.: Sure! 
So basically, both timely/differential dataflow and DBSP are pretty awesome new theories/frameworks/libraries that allow you to create incremental materialized views with a handful of benefits that the more "legacy" (e.g., <a href="https://flink.apache.org/">Flink</a>) stream processors don't offer:</p><ul><li><p>Being highly parallel while promising high consistency - Both differential and DBSP allow you to process in parallel a stream of changes, while always ensuring the result is consistent (for example, if a user adds $100 to their bank balance and then withdraws $90 &#8212; their bank balance would never reflect the withdrawal before the addition).</p></li><li><p>Supporting "iterative" syntax - Both differential and DBSP support "iterative" syntax, meaning they can efficiently handle recursive queries or computations that depend on previous results.</p></li></ul><p>Apart from the above benefits, they offer a new (and very elegant!) way to look at changes in data and process them much more efficiently compared to more "old-school" ways.</p><p>Although not directly based on differential dataflow/DBSP, Epsio's approach is probably closer to differential dataflow, though not identical (we are disk-based and have approached a couple of things differently given the use case Epsio is trying to optimize for).</p><p><em><strong>C.R.: Users sometimes have a hard time understanding where <a href="https://wiki.postgresql.org/wiki/Incremental_View_Maintenance">incremental view maintenance</a> (IVM) fits into their stack. I most often see it used for fraud detection, where fraud detection models need up-to-date information. What other use cases and verticals do you see it being adopted in?
What problems can you solve with it that users might not be aware of?</strong></em></p><p>G.K.: Since the "Streaming SQL" world evolved out of the streaming world, most people initially associate it with "streaming" use cases like fraud/anomaly detection, notification, and so on. However, we've seen a pretty big shift in the last couple of years from those use cases to more SQL-related use cases.</p><p>Such use cases, which run complex and non-ad hoc queries on changing datasets, are common in areas like:</p><ul><li><p>Customer-facing analytics (e.g. dashboards)</p></li><li><p>Data transformations and enrichments (e.g. incremental DBT)</p></li><li><p>Real-time reporting</p></li></ul><p>In classic "streaming" use cases, the main benefit of IVM was the ease of writing SQL rather than writing custom code. In the use cases above, the benefits are more about query performance and cost&#8212;how easy it is to deliver performant, cost-effective queries. No matter how fast or efficient a traditional database is, if you are running a heavy query and most of the dataset hasn&#8217;t changed since the last run, there is a lot of wasted compute. This translates into either higher cost, higher latency, or both.</p><p>We've seen companies make customer-facing analytics 100x faster, drop infrastructure costs by an order of magnitude, and turn reports that took hours to execute into ones that complete in under a second.</p><p><em><strong>C.R.: Thanks for taking the time to talk with me. Let&#8217;s wrap things up. Where do you see IVM headed in the next few years? Any other parting thoughts? Where can readers get in touch with you?</strong></em></p><p>G.K.: Happy to!</p><p>Regarding the upcoming years in IVMs, I think exciting times are coming.
Whether it's in the use cases we earlier discussed or new use cases that emerge as IVM technologies evolve (like what the folks at ReadySet are working on with database caching), I believe IVMs offer a pretty fundamental change in how organizations can work with their data (i.e., building data structures around specific queries vs. trying to store data to be "generically" fast for any query) and that we're going to see them much more widely used!</p><p>For anybody interested in learning more about <a href="https://www.epsio.io/">Epsio</a>, you can head over to our website at <a href="http://epsio.io/">epsio.io</a> or just ping us at <a href="mailto:contact@epsio.io">contact@epsio.io</a>. We&#8217;re always happy to chat!</p>]]></content:encoded></item><item><title><![CDATA[Infrastructure Vendors Are in a Tough Spot]]></title><description><![CDATA[Cloud native, AI-enabled, post-ZIRP companies are the new apex predator.]]></description><link>https://materializedview.io/p/infrastructure-vendors-are-in-a-tough</link><guid isPermaLink="false">https://materializedview.io/p/infrastructure-vendors-are-in-a-tough</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Mon, 13 Jan 2025 18:43:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/463fbeca-ce9f-4bf9-849a-0d767b75c77f_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://www.cs.cmu.edu/~pavlo/">Andy Pavlo</a> posted his <a href="https://www.cs.cmu.edu/~pavlo/blog/2025/01/2024-databases-retrospective.html">yearly database retrospective</a> to kick off the new year. Andy covers license changes, the catalog wars, DuckDB, funding, and more. A brief comment in the acquisition section got me thinking:</p><blockquote><p><strong><a href="https://www.warpstream.com/blog/warpstream-is-dead-long-live-warpstream">Warpstream &#8594; Confluent</a><br></strong>Rewriting Kafka in golang but then making it spill to S3. 
I'm happy for the Warpstream team, but Confluent could have done this themselves.</p></blockquote><p>As Andy says, WarpStream&#65129; is a Kafka re-implementation that uses S3 rather than local disks (a <a href="https://www.warpstream.com/blog/zero-disks-is-better-for-kafka">Zero Disk Architecture</a>). The tradeoff with this design is that writes take a little bit longer to be durably sent&#8212;usually 100ms-500ms. It turns out, a little bit of extra latency is tolerable for many use cases, especially when you factor in cost savings, operational simplicity, and the ease of BYOC support.</p><p>Andy&#8217;s comment raises the question: if Confluent could build WarpStream, why didn&#8217;t they? One could offer many theories: it was cheaper to buy than to build, to eliminate competition, to capitalize on WarpStream&#8217;s branding, to acquire talent, or something else.</p><p>I believe Confluent didn&#8217;t build WarpStream because it wasn&#8217;t set up, as an organization, to build such a product. To understand why, let&#8217;s first consider what WarpStream did: they built a Kafka-compatible BYOC solution with a handful of engineers in less than 18 months. They were also in the process of rolling out <a href="https://www.warpstream.com/blog/deterministic-simulation-testing-for-our-entire-saas">WarpStream SaaS</a> and adding transaction support when they were acquired.</p><p>Next, let&#8217;s consider what percentage of total Kafka usage in the world would have been satisfied with a 100ms-500ms SaaS or BYOC Kafka offering. My guess is a significant portion; well over 50% of all usage.</p><p>This is just my guess, but it sounds right based on my experience. All of our Kafka usage at WePay would have been completely fine with a 250ms-500ms P99 latency. All of our data integration, our ETL, our service-to-service queuing would have worked just fine.
And we would have loved to be able to run Kafka as a stateless service in our own cloud on <a href="https://cloud.google.com/storage">Google Cloud Storage</a>.</p><p>Another thing to consider is revenue: WarpStream can be run very cheaply. This let WarpStream undercut Confluent on cost. They went hard on this messaging, talking about how S3 allowed them to bypass Amazon Web Services&#8217; (AWS) <a href="https://docs.aws.amazon.com/cur/latest/userguide/cur-data-transfers-charges.html">inter-availability zone costs</a>.</p><p>WarpStream was also moving into the platform layer. They launched <a href="https://www.warpstream.com/orbit-apache-kafka-migration-and-replication">Orbit</a> to compete with <a href="https://docs.confluent.io/platform/current/multi-dc-deployments/replicator/index.html">Confluent Replicator</a>, <a href="https://www.warpstream.com/managed-data-pipelines-kafka-compatible-stream-processing">Managed Data Pipelines</a> to compete with <a href="https://docs.confluent.io/platform/current/connect/index.html">Kafka Connect</a>, and a <a href="https://www.warpstream.com/blog/introducing-warpstream-byoc-schema-registry">BYOC schema registry</a> to compete with <a href="https://docs.confluent.io/platform/current/schema-registry/index.html">Confluent Schema Registry</a>.</p><p>Let&#8217;s pause to let this sink in. WarpStream was in the process of building a cheaper (and arguably better) version of Confluent&#8217;s platform with a handful of engineers that could service the majority of Confluent customers&#8217; use cases. That&#8217;s a pretty staggering statement.</p><p>To compete, Confluent would have to build WarpStream. Indeed, they tried to do this with Confluent Freight. Shortly before the WarpStream acquisition, Confluent announced <a href="https://www.confluent.io/blog/introducing-confluent-cloud-freight-clusters/">Confluent Freight</a> as a Kafka re-implementation that writes to S3, ABS, or GCS.
It&#8217;s nearly identical to WarpStream, but there&#8217;s <a href="https://www.confluent.io/blog/introducing-confluent-cloud-freight-clusters/">one key difference</a>:</p><blockquote><p>Freight clusters utilize the latest innovations in Confluent Cloud&#8217;s cloud-native engine, <a href="https://www.confluent.io/blog/cloud-native-data-streaming-kafka-engine/">Kora</a>, to deliver low cost networking by trading off ultra low latency performance.</p></blockquote><p>Freight is built on top of Confluent <a href="https://www.confluent.io/blog/cloud-native-data-streaming-kafka-engine/">Kora</a>. The very first paragraph of Kora&#8217;s announcement post says:</p><blockquote><p>Kora isn&#8217;t something you can download, or even something you could run outside our control plane and the rest of our cloud</p></blockquote><p>And therein lies a very significant difference. Freight can&#8217;t be run outside of Confluent&#8217;s cloud. WarpStream can. Why would Confluent build Freight without BYOC support? I think the answer is that it was too encumbered by its legacy tech stack, its customers (<a href="https://en.wikipedia.org/wiki/The_Innovator%27s_Dilemma">the innovator&#8217;s dilemma</a>), and its organization (<a href="https://en.wikipedia.org/wiki/Conway%27s_law">Conway&#8217;s law</a>).</p><p>Apache Kafka was built before cloud native architectures existed. It had to solve its own replication, handle its own leadership election, and manage its own storage. As with all legacy systems, this means it carries baggage with it. Many design decisions would have been made differently if it started from scratch. This is why Confluent has forked Kafka and why WarpStream wrote their implementation from scratch.</p><p>Kafka&#8217;s original design did, however, come with some flexibility that newer cloud-native designs didn&#8217;t have. Namely, it could handle low-latency use cases that systems like WarpStream couldn&#8217;t.
I&#8217;m not making the claim that it&#8217;s impossible to build a low-latency cloud-native Kafka (Confluent has done so with Kora). But I&#8217;m saying that, even if you did that, it would look different from Kafka.</p><p>Regardless, Confluent picked up some large, high-margin customers that depend on some of Kafka&#8217;s more exotic features, including its low latency support.</p><blockquote><p>[<a href="https://en.wikipedia.org/wiki/The_Innovator%27s_Dilemma">The Innovator&#8217;s Dilemma</a>] describes how large incumbent companies lose market share by listening to their customers and providing what appears to be the highest-value products, but new companies that serve low-value customers with poorly developed technology can improve that technology incrementally until it is good enough to quickly take market share from established business.</p></blockquote><p>Confluent can&#8217;t easily abandon these customers and their use cases. Such customers tend to be big name companies with large contracts (and presumably, higher margins). Confluent needs this revenue; they now have over 3,300 employees on six continents (according to <a href="https://leadiq.com/c/confluent/5a1d86f82400002400611935#:~:text=How%20many%20employees%20does%20Confluent,R.%20S.Chief%20Customer%20Officer:%20R.%20P..">leadIQ</a>). Some of Confluent&#8217;s size could be attributed to its maturing during <a href="https://en.wikipedia.org/wiki/Zero_interest-rate_policy">zero interest rate policy</a> (ZIRP) when money was cheap. Some could also be attributed to its obligation to support such use cases with more complex systems. Regardless, WarpStream was unencumbered by both.</p><p>Finally, even if Confluent decided that building Freight with BYOC support was the right thing to do, they still had to deal with entrenched teams in their organization whose raison d'&#234;tre was in direct conflict with a BYOC Freight solution. 
If Confluent could service the majority of their users with a product that looked like WarpStream, what happens to the Kora team, the legacy Kafka teams, the entire sales motion, the marketing, and so on? Such a pivot would be seismic. Many teams would need to be eliminated or reorganized.</p><p>You need not look far to see what such a change would look like. <a href="https://en.wikipedia.org/wiki/Elon_Musk">Elon Musk</a>&#8217;s shake up at <a href="https://twitter.com/">Twitter</a> is a reflection of very similar headwinds.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oPpt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e818c4-ae6c-48f2-891c-d8d8cd0e2d5b_1160x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oPpt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e818c4-ae6c-48f2-891c-d8d8cd0e2d5b_1160x376.png 424w, https://substackcdn.com/image/fetch/$s_!oPpt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e818c4-ae6c-48f2-891c-d8d8cd0e2d5b_1160x376.png 848w, https://substackcdn.com/image/fetch/$s_!oPpt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e818c4-ae6c-48f2-891c-d8d8cd0e2d5b_1160x376.png 1272w, https://substackcdn.com/image/fetch/$s_!oPpt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e818c4-ae6c-48f2-891c-d8d8cd0e2d5b_1160x376.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!oPpt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e818c4-ae6c-48f2-891c-d8d8cd0e2d5b_1160x376.png" width="1160" height="376" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4e818c4-ae6c-48f2-891c-d8d8cd0e2d5b_1160x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:376,&quot;width&quot;:1160,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:114814,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oPpt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e818c4-ae6c-48f2-891c-d8d8cd0e2d5b_1160x376.png 424w, https://substackcdn.com/image/fetch/$s_!oPpt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e818c4-ae6c-48f2-891c-d8d8cd0e2d5b_1160x376.png 848w, https://substackcdn.com/image/fetch/$s_!oPpt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e818c4-ae6c-48f2-891c-d8d8cd0e2d5b_1160x376.png 1272w, https://substackcdn.com/image/fetch/$s_!oPpt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e818c4-ae6c-48f2-891c-d8d8cd0e2d5b_1160x376.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://bsky.app/profile/chris.blue/post/3lbzliafoxs2q">View Post</a></figcaption></figure></div><p>It&#8217;s simply much easier to build very lean, very efficient companies now. When you compound cloud native efficiencies with AI-enabled developers, things get even more pronounced. New companies are adding engineers at a rate that accounts for the AI multiplier that these engineers come with. <a href="https://chatgpt.com/">ChatGPT</a>, <a href="https://www.anthropic.com/claude">Claude</a>, <a href="https://codeium.com/windsurf">Windsurf</a>, <a href="https://www.cursor.com/">Cursor</a>, and many other AI tools have improved engineering productivity.</p><p>Legacy companies built their teams before the AI shift happened.
It&#8217;s harder and more painful to turn the ship. Salesforce recently announced <a href="https://www.salesforceben.com/salesforce-will-hire-no-more-software-engineers-in-2025-says-marc-benioff/">it won&#8217;t hire engineers in 2025</a>. Meta is looking forward to <a href="https://www.reddit.com/r/cscareerquestions/comments/1hyt995/zuck_publicly_announcing_that_this_year_ai/">replacing mid-level engineers with AI</a>. <a href="https://www.sfchronicle.com/tech/article/linkedin-tech-layoffs-19942026.php">LinkedIn had layoffs</a>, and is running very lean (from what I hear). New competitors don&#8217;t have to deal with the damage that these changes cause. This has left legacy companies in a tight spot.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7_CN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6770b645-ef6d-4a97-b215-fd7de1ca2247_1156x594.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7_CN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6770b645-ef6d-4a97-b215-fd7de1ca2247_1156x594.png 424w, https://substackcdn.com/image/fetch/$s_!7_CN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6770b645-ef6d-4a97-b215-fd7de1ca2247_1156x594.png 848w, https://substackcdn.com/image/fetch/$s_!7_CN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6770b645-ef6d-4a97-b215-fd7de1ca2247_1156x594.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7_CN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6770b645-ef6d-4a97-b215-fd7de1ca2247_1156x594.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7_CN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6770b645-ef6d-4a97-b215-fd7de1ca2247_1156x594.png" width="1156" height="594" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6770b645-ef6d-4a97-b215-fd7de1ca2247_1156x594.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:594,&quot;width&quot;:1156,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:169854,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7_CN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6770b645-ef6d-4a97-b215-fd7de1ca2247_1156x594.png 424w, https://substackcdn.com/image/fetch/$s_!7_CN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6770b645-ef6d-4a97-b215-fd7de1ca2247_1156x594.png 848w, https://substackcdn.com/image/fetch/$s_!7_CN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6770b645-ef6d-4a97-b215-fd7de1ca2247_1156x594.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7_CN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6770b645-ef6d-4a97-b215-fd7de1ca2247_1156x594.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://bsky.app/profile/chris.blue/post/3lbzliagdi22q">View Post</a></figcaption></figure></div><p>Given all this, it&#8217;s no surprise that Confluent decided to build Freight on top of Kora. WarpStream fits nicely in the product offering as its BYOC option. 
But as I say in my post above, I worry this strategy is delaying the inevitable.</p><p>Kafka, itself, is commoditized. <a href="https://buf.build/product/bufstream">Bufstream</a>, <a href="https://www.automq.com/">AutoMQ</a>, WarpStream, <a href="https://www.redpanda.com/">Redpanda</a>, and <a href="https://s2.dev/">S2</a> show this. When these companies move up the platform stack, as WarpStream was doing, as Bufstream is doing with its schema registry, Confluent will be competing with much more efficient organizations.</p><p>Much of this post has been about Confluent and WarpStream, but that&#8217;s simply the angle I&#8217;ve taken. I don&#8217;t mean to pick on them. It&#8217;s easy to see this pattern across the industry. Many new startups are going after Elastic, for example. <a href="https://www.datadoghq.com/">Datadog</a>&#8217;s <a href="https://quickwit.io/">Quickwit</a> <a href="https://quickwit.io/blog/quickwit-joins-datadog">acquisition</a> is a strong signal there. Legacy infrastructure companies are going to have a tough time competing with these new cloud native, AI-enabled, post-ZIRP companies.</p><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[S3 Is the New SFTP]]></title><description><![CDATA[Customers want their data. 
A customer data lake is a great way to give it to them.]]></description><link>https://materializedview.io/p/s3-is-the-new-sftp</link><guid isPermaLink="false">https://materializedview.io/p/s3-is-the-new-sftp</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Mon, 16 Dec 2024 12:03:01 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/30a342ba-9603-404f-bcad-9d1c47a86c0c_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve come to realize that payment providers have a comparatively diverse set of data processing patterns. Fintech startups have transactional workflows that lend themselves to <a href="https://materializedview.io/p/durable-execution-justifying-the-bubble">durable execution</a>; they have low-latency processing requirements that benefit from stream processing; <a href="https://developer.squareup.com/blog/books-an-immutable-double-entry-accounting-database-service/">they</a> <a href="https://github.com/wepay/waltz">have</a> esoteric double-entry bookkeeping systems with strong <a href="https://en.wikipedia.org/wiki/ACID">ACID</a> guarantees; and they have fraud detection use cases that require machine learning (ML) and specialized graph, search, and data warehouse systems.</p><p>One of the less glamorous data processing tasks payment providers do is shuffle files between their vendors and partners. Such shuffling usually involves <a href="https://en.wikipedia.org/wiki/SSH_File_Transfer_Protocol">Secure File Transfer Protocol</a> (SFTP). Simply put, SFTP allows you to securely upload or download a file from a server. Much of the US banking system still runs on SFTP servers. Banks and credit card processors exchange files over SFTP to signal when fund transfers complete, when accounts close, and so on. These files are often written in CSV, XML, or <a href="https://en.wikipedia.org/wiki/Flat-file_database#Fixed-width_formats">fixed-length</a> file formats. 
<a href="https://engineering.gusto.com/how-ach-works-a-developer-perspective-part-1-339d3e7bea1">How ACH works: A developer perspective</a> is a good read for those who are interested.</p><p>When I joined WePay, the company was running as a PHP monolith. One of the first microservices that I built was responsible for reliably syncing files to and from our bank&#8217;s and credit card processors&#8217; SFTP servers. The service proved very useful, and was in heavy use when I left the company seven years later.</p><p>Separately, another of my teams was responsible for WePay&#8217;s data pipeline. We used Airflow to extract data from our production databases and load it into our data warehouse. We also used Kafka and Kafka Connect to stream changes from production databases into Kafka and other downstream systems.</p><p>One day, our business analytics team approached me to ask for a <a href="https://cloud.google.com/bigquery">BigQuery</a> dataset that could be shared with one of our larger customers. The sales team was trying to close a deal, and the customer wanted data access included in the contract. The team&#8217;s plan was to write <a href="https://airflow.apache.org/">Airflow</a> jobs that would output customer-specific data to tables in a BigQuery dataset and grant the customer access to query it.</p><p>I hit the roof: our data pipeline (and team) wasn&#8217;t built for this. The pipeline was stable enough to be used for internal reporting, debugging, and modeling use cases, but it was not reliable enough to expose to end customers. Our pipelines would break when upstream developers changed their schema, when Kafka Connect decided to rebalance, or when our single Airflow machine failed. We did not have enough SREs and on-call engineers to provide a pipeline suitable for end-users to depend on.</p><p>SFTP never came up when talking with the business analytics team about their use case, yet the pattern is similar. 
We were trying to share data with an external integration partner (our customer, in this case).</p><p>The customer data sharing use case stuck with me. It was not an unreasonable request. I started to notice this pattern everywhere. We received many such data requests. The fascinating thing was that different customers requested their data in different forms. Some wanted periodic batch data dumps (like the SFTP integrations our payments team worked on). Others wanted a real-time HTTP/JSON API. Still others simply wanted a button on the website that would generate and email a PDF or CSV file. Data usage varied, too: financial reconciliation, reporting, user-facing dashboards, data products, monitoring, and so on.</p><p>Despite all this demand, &#8220;reporting&#8221;, as we called it, was rarely invested in. It was a rather murky area: the reporting data was owned by the payments product managers, yet the reporting team was part of our frontend group. And of course, my team actually managed the data pipeline. There was no real owner. WePay&#8217;s reporting and data export features languished.</p><p>I spent several weeks investigating vendor solutions for this mess. I wanted a <a href="https://en.wikipedia.org/wiki/White-label_product">white-label</a> data platform that we could send our data to. The ideal product would then expose this data as a WebSocket API, HTTP/JSON API, data warehouse, CSV, SFTP, dashboard, or whatever other wacky format our customers wanted. To my dismay, I found nothing that met our needs.</p><p>I&#8217;ve never given up on this vision, and I continue to see use cases for such a platform everywhere. 
<a href="https://stripe.com/sigma">Stripe Sigma</a>, <a href="https://stripe.com/data-pipeline">Stripe Data Pipeline</a>, <a href="https://www.nango.dev/">Nango</a>, <a href="https://docs.snowflake.com/en/user-guide/cleanrooms/introduction">Snowflake Data Cleanrooms</a>, every <a href="https://hightouch.com/blog/reverse-etl">reverse ETL</a> platform, <a href="https://quill.co/">Quill</a>&#65129;; these are all parts of a whole. Customers want data from their SaaS vendors. Yet, for the past decade (longer, really), SaaS vendors have been left to roll their own solutions. That is, until now.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZnTe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea04e92b-69f5-407f-834c-d84393b7644a_1180x198.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZnTe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea04e92b-69f5-407f-834c-d84393b7644a_1180x198.png 424w, https://substackcdn.com/image/fetch/$s_!ZnTe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea04e92b-69f5-407f-834c-d84393b7644a_1180x198.png 848w, https://substackcdn.com/image/fetch/$s_!ZnTe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea04e92b-69f5-407f-834c-d84393b7644a_1180x198.png 1272w, https://substackcdn.com/image/fetch/$s_!ZnTe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea04e92b-69f5-407f-834c-d84393b7644a_1180x198.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ZnTe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea04e92b-69f5-407f-834c-d84393b7644a_1180x198.png" width="1180" height="198" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea04e92b-69f5-407f-834c-d84393b7644a_1180x198.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:198,&quot;width&quot;:1180,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49904,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZnTe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea04e92b-69f5-407f-834c-d84393b7644a_1180x198.png 424w, https://substackcdn.com/image/fetch/$s_!ZnTe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea04e92b-69f5-407f-834c-d84393b7644a_1180x198.png 848w, https://substackcdn.com/image/fetch/$s_!ZnTe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea04e92b-69f5-407f-834c-d84393b7644a_1180x198.png 1272w, https://substackcdn.com/image/fetch/$s_!ZnTe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea04e92b-69f5-407f-834c-d84393b7644a_1180x198.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://bsky.app/profile/chris.blue/post/3lcnruey4222i">View 
Post</a></figcaption></figure></div><p>Several startups, including <a href="https://www.prequel.co/">Prequel</a>, <a href="https://www.bobsled.co/">Bobsled</a>, and <a href="http://generalfolders.com">General Folders</a>, now seem to be taking this use case seriously. The one I&#8217;m most excited about is <a href="https://www.prequel.co/">Prequel</a>. They call themselves a &#8220;data export platform.&#8221; The product marketing on their website looks rather reverse ETL-ish, except that the destinations are meant to be external organizations rather than internal tools such as Salesforce or Zendesk (as is usually the case with reverse ETL). But when I spoke to Prequel, they were quite focused on one particular pattern: exporting data to S3 in Parquet files.</p><p>The idea is actually fairly straightforward: modern data processing is centralizing around data lakehouses using <a href="https://aws.amazon.com/s3/">S3</a>, <a href="https://iceberg.apache.org/">Apache Iceberg</a>, <a href="https://parquet.apache.org/">Apache Parquet</a>, and data lake query engines such as <a href="https://duckdb.org/">DuckDB</a> and <a href="https://trino.io/">Trino</a>. SaaS vendors can upload Parquet files to a shared S3 bucket managed by an Iceberg catalog. Customers can then query the files simply by adding an external table to their existing data warehouse or by querying the data directly using a data lake query engine.</p><p>Offering customers data in a data lake format maintains some of the benefits that SFTP has historically provided: data transfers are very fast, access can be managed centrally in the catalog, and files can be atomically exposed to customers (no partial uploads). This architecture comes with some additional benefits, too. Centralizing around a strongly typed, columnar format like Parquet and a table format like Iceberg makes managing data models much easier. 
And query engines already support these formats, so users can query their data directly without having to ETL it.</p><p>Schema evolution challenges remain. Application engineers might decide to drop a field that customers are using. This problem has existed <a href="https://materializedview.io/p/change-data-capture-still-breaks-db-encapsulatio">for a long time</a>, but it becomes a critical outage when it impacts customers. Fortunately, <a href="https://www.montecarlodata.com/blog-data-contracts-explained/">data contracts</a> are getting attention. <a href="https://www.gable.ai/">Gable.ai</a>&#65129; and its founder, <a href="https://www.linkedin.com/in/chad-sanderson">Chad Sanderson</a>, have been doing a lot of evangelism in this area, which helps. But much like the reverse ETL companies, data contract companies seem to be focused on internal use cases. So when I recently spoke to <a href="https://vakamo.com/">Vakamo</a>, the creators of <a href="https://github.com/lakekeeper/lakekeeper">Lakekeeper</a>, I was very heartened to hear that they&#8217;re thinking about data contracts for data catalogs. They directly called out customer data integration as a use case for this feature.</p><p>I&#8217;m cognizant that a customer data lake doesn&#8217;t address all the use cases I enumerated above. Data is loaded in batch, rather than streamed over WebSockets or polled through HTTP APIs. You can have any data format you want, so long as it&#8217;s Parquet. SaaS vendors are still left to roll their own visualization features if they&#8217;re needed. But maybe this is OK. I suspect a data lakehouse with periodic batch loads might be enough.</p><p>Customer data lakes might actually end up being a driving force for lakehouse adoption. Many companies are still in the &#8220;What is this, why do I need it, and how do I use it?&#8221; phase of lakehouse adoption. 
These same companies are very motivated to get access to their SaaS data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kwvr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757b029-5aaa-4cc8-b58c-29780c48defc_1168x378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kwvr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757b029-5aaa-4cc8-b58c-29780c48defc_1168x378.png 424w, https://substackcdn.com/image/fetch/$s_!Kwvr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757b029-5aaa-4cc8-b58c-29780c48defc_1168x378.png 848w, https://substackcdn.com/image/fetch/$s_!Kwvr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757b029-5aaa-4cc8-b58c-29780c48defc_1168x378.png 1272w, https://substackcdn.com/image/fetch/$s_!Kwvr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757b029-5aaa-4cc8-b58c-29780c48defc_1168x378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kwvr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757b029-5aaa-4cc8-b58c-29780c48defc_1168x378.png" width="1168" height="378" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0757b029-5aaa-4cc8-b58c-29780c48defc_1168x378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:378,&quot;width&quot;:1168,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:114729,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kwvr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757b029-5aaa-4cc8-b58c-29780c48defc_1168x378.png 424w, https://substackcdn.com/image/fetch/$s_!Kwvr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757b029-5aaa-4cc8-b58c-29780c48defc_1168x378.png 848w, https://substackcdn.com/image/fetch/$s_!Kwvr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757b029-5aaa-4cc8-b58c-29780c48defc_1168x378.png 1272w, https://substackcdn.com/image/fetch/$s_!Kwvr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757b029-5aaa-4cc8-b58c-29780c48defc_1168x378.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://bsky.app/profile/chris.blue/post/3lbahryj54c2v">View Post</a></figcaption></figure></div><p>Giving a customer&#8217;s analysts DuckDB and pointing them at an S3 bucket is pretty compelling. It could drive a land-and-expand transition for lakehouse architectures as companies initially adopt lakehouse query engines to get access to their SaaS data, then realize the benefits for their own data over time.</p><p>These lakehouses could also give analytics engineers something to do: something that drives revenue. I <a href="https://materializedview.io/p/merge-analytics-and-data-engineers">recently lamented</a> that the role didn&#8217;t drive enough revenue to justify its existence. But someone&#8217;s going to need to curate these customer data lakehouse data models and make sure the files are loaded on time. Seems like a good fit.</p><p>It&#8217;s still early days for all of this, but I see a lot of tailwinds. 
Customers want their data, lakehouses seem like a good fit, and we have analytics engineers that are eager to help.</p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[The Quest for a Distributed POSIX-Compatible Filesystem]]></title><description><![CDATA[Distributed POSIX filesystems have proven elusive, but we're getting closer. Perhaps that's all we need.]]></description><link>https://materializedview.io/p/the-quest-for-a-distributed-posix-fs</link><guid isPermaLink="false">https://materializedview.io/p/the-quest-for-a-distributed-posix-fs</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Thu, 05 Dec 2024 20:32:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3baf0c9c-5053-4589-95d5-408723569c37_1792x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Years ago, I was working on <a href="https://samza.apache.org/">Apache Samza</a> with <a href="https://www.linkedin.com/in/jaykreps/">Jay Kreps</a>. At one point during a discussion about Samza&#8217;s state management system, Jay turned to me and said, &#8220;You know, we wouldn&#8217;t need any of this if we had a distributed filesystem that worked.&#8221; A scalable remote filesystem with normal <a href="https://en.wikipedia.org/wiki/POSIX">POSIX</a> semantics would let us build distributed systems that were stateless services; we could use the distributed filesystem to store everything. This comment stuck with me and I still think about it a lot.</p><p>Alas, we didn&#8217;t have such a system at the time. 
But object storage systems like <a href="https://aws.amazon.com/s3/">S3</a> have grown to give us all the properties we need; they are <a href="https://www.youtube.com/watch?v=sc3J4McebHE">insanely scalable</a> and provide the <a href="https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/">atomic operations</a> needed for transactional workloads. Object stores are still missing POSIX semantics, though. You can&#8217;t take any old system that uses filesystem I/O libraries and use S3 as its storage layer.</p><p>In the absence of a POSIX API for S3, two approaches have emerged to leverage what object stores have to offer. The first is building S3 support directly into the systems themselves, which most database and streaming companies are doing. <a href="https://neon.tech">Neon</a>, <a href="https://warpstream.com">WarpStream</a>, <a href="https://turbopuffer.com">Turbopuffer</a>, and <a href="https://responsive.dev">Responsive</a>&#65129; are all in this category. The other is to wrap S3 in a <a href="https://en.wikipedia.org/wiki/Filesystem_in_Userspace">filesystem in userspace</a> (FUSE) interface or an NFS-based implementation. <a href="https://aws.amazon.com/storagegateway/file/s3/">Amazon S3 File Gateway</a>, <a href="https://juicefs.com/">JuiceFS</a>, <a href="https://github.com/s3fs-fuse/s3fs-fuse">s3fs</a>, and <a href="https://github.com/kahing/goofys">Goofys</a> are examples of such an approach. (If you squint, <a href="https://opendal.apache.org/">Apache OpenDAL</a> fits in this category, but inverts the relationship by wrapping everything in its own API.)</p><p>Direct integration requires a storage-layer rewrite. I/O calls must be converted to S3-compatible API calls. 
Moreover, each system needs to figure out how to deal with higher object storage latencies, a subject I&#8217;ve written about before.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;ad046863-fa8f-4174-b6fb-78378ab6d055&quot;,&quot;caption&quot;:&quot;I believe that the future of database persistence is object storage&#8212;S3, Google Cloud Storage, and so on. New systems like Neon, WarpStream [$], and Turbopuffer persist data in object storage to offer infinite retention, durability, replication, data warehouse integration&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Cloud Storage Triad: Latency, Cost, Durability&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:69592459,&quot;name&quot;:&quot;Chris Riccomini&quot;,&quot;bio&quot;:&quot;A software engineer, startup investor, and advisor. I've worked at PayPal, LinkedIn, and WePay. I co-authored The Missing README: A Guide for the New Software Engineer. 
I've been involved in open source throughout my career and I wrote Apache Samza.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7257377f-ffd3-4455-b305-f2f28f0d31df_1310x1696.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-22T17:23:50.570Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad4afd8d-2d7f-49a4-9741-6a90f872d4b5_1024x1024.webp&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://materializedview.io/p/cloud-storage-triad-latency-cost-durability&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143763656,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:19,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Materialized View&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a9aa647-ffea-4b83-8a65-2f854d4e5de3_720x720.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>Direct object storage integration makes sense for systems like databases, whose primary job is to store and query data. But for systems, frameworks, and libraries that just need to read and write files as part of a broader workload, rewriting the storage layer for object storage is too burdensome.</p><p>FUSE and NFS-based implementations present their own challenges. s3fs and other FUSE-based systems implement only a subset of the POSIX interface, often missing features such as random access writes or appends, metadata operations, atomic renames, hardlinks, and inotify features. Such limitations simply won&#8217;t work for many systems and libraries. 
It&#8217;s not clear if RocksDB, for example, <a href="https://slatedb.io/docs/faq#how-is-this-different-from-rocksdb-on-efs">can safely be run on Amazon Elastic File System (EFS)</a>. And some implementations, such as JuiceFS, store files in their own block format, which limits interoperability.</p><p>The need for scalable filesystems has grown, too. Database and streaming use cases have been around for a long time, but the growth of Kubernetes and AI workloads is new. Kubernetes has made every system distributed. And with AI, multi-modal data such as audio, text, and video is a critical ingredient. Many ML and AI libraries are built to work with local filesystems, and those that support object storage often don&#8217;t have good caching to speed up workloads.</p><p>I think we&#8217;re finally getting to the point where we have both the technology and the demand to get us what we need: a distributed POSIX-compatible filesystem. <a href="https://regattastorage.com/">Regatta Storage</a> is building in this direction, with a stated goal of replacing EFS and <a href="https://aws.amazon.com/ebs/">elastic block store</a> (EBS) <a href="https://aws.amazon.com/ebs/general-purpose/">general purpose</a> (gp3) use cases.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T0_1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054d5492-9aba-4c87-8f97-4750c8bcfbee_687x161.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T0_1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054d5492-9aba-4c87-8f97-4750c8bcfbee_687x161.png 424w, 
https://substackcdn.com/image/fetch/$s_!T0_1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054d5492-9aba-4c87-8f97-4750c8bcfbee_687x161.png 848w, https://substackcdn.com/image/fetch/$s_!T0_1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054d5492-9aba-4c87-8f97-4750c8bcfbee_687x161.png 1272w, https://substackcdn.com/image/fetch/$s_!T0_1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054d5492-9aba-4c87-8f97-4750c8bcfbee_687x161.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T0_1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054d5492-9aba-4c87-8f97-4750c8bcfbee_687x161.png" width="687" height="161" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/054d5492-9aba-4c87-8f97-4750c8bcfbee_687x161.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:161,&quot;width&quot;:687,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36874,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T0_1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054d5492-9aba-4c87-8f97-4750c8bcfbee_687x161.png 424w, 
https://substackcdn.com/image/fetch/$s_!T0_1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054d5492-9aba-4c87-8f97-4750c8bcfbee_687x161.png 848w, https://substackcdn.com/image/fetch/$s_!T0_1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054d5492-9aba-4c87-8f97-4750c8bcfbee_687x161.png 1272w, https://substackcdn.com/image/fetch/$s_!T0_1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054d5492-9aba-4c87-8f97-4750c8bcfbee_687x161.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://x.com/jhleath/status/1864688036694642873">View Post</a></figcaption></figure></div><p>Though Regatta doesn&#8217;t have complete POSIX compatibility, their offering is compelling. They are using NFS now, and moving to their own protocol. This should give them a lot of flexibility as they try to implement more obscure POSIX features. Even if they never get complete POSIX support, they should be able to get closer than current solutions. Regatta also provides a generic cache to reduce latency, a key ingredient for a disk-like experience. Such an architecture is akin to a generic version of Neon&#8217;s <a href="https://neon.tech/docs/introduction/architecture-overview">Safekeepers and Pageservers</a>, something <a href="https://bsky.app/profile/archive.chris.blue/post/3l7k2th2pno2l">I&#8217;ve been dreaming of for a while</a>. Plus, unlike JuiceFS, files appear in object storage as normal files rather than opaque blocks that must be accessed only through the JuiceFS interface.</p><p>I expect this area to get more competitive. Much as competition with S3 has driven AWS to improve its offering, I anticipate more EFS features in the future. 
Other object store providers such as <a href="https://tigrisdata.com/">Tigris</a>&#65129; are also well positioned to pursue this area. JuiceFS and Alluxio will continue to make progress, too. AI-specific offerings could also emerge. Fortunately, I think the total addressable market (TAM) is big enough (and the use cases diverse enough) to support winners for different use cases, even if the underlying technology is largely the same.</p><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[It's Time to Merge Analytics and Data Engineering (Again)]]></title><description><![CDATA[Data pipelines are commoditized and analytics engineers don't provide enough value.]]></description><link>https://materializedview.io/p/merge-analytics-and-data-engineers</link><guid isPermaLink="false">https://materializedview.io/p/merge-analytics-and-data-engineers</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Mon, 18 Nov 2024 19:37:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b06b7af9-2e96-432f-b24a-392e65d338b4_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://www.linkedin.com/in/joshfgray">Josh Gray</a>, CEO of <a href="https://artemisdata.io">Artemis</a>&#65129;, interviewed me about data stack fragmentation, <a href="https://github.com/dbt-labs/dbt-core">dbt</a>, cost savings, and more. Josh&#8217;s questions really got me thinking, and they are the subject of today&#8217;s post. 
Here&#8217;s the interview:</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:151623503,&quot;url&quot;:&quot;https://artemisdata.substack.com/p/fragmentation-in-the-data-stack-and&quot;,&quot;publication_id&quot;:3318721,&quot;publication_name&quot;:&quot;Artemis Blog&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43dd7d7-1e8c-4302-a9df-12aecfac7b99_408x408.png&quot;,&quot;title&quot;:&quot;Fragmentation in the Data Stack and cost structure with Chris Riccomini &quot;,&quot;truncated_body_text&quot;:null,&quot;date&quot;:&quot;2024-11-14T15:55:50.661Z&quot;,&quot;like_count&quot;:3,&quot;comment_count&quot;:0,&quot;bylines&quot;:[{&quot;id&quot;:128818922,&quot;name&quot;:&quot;Josh Gray&quot;,&quot;handle&quot;:&quot;joshfgray&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/103cfde5-6777-4abd-95a8-e8f27eb690e9_1200x938.jpeg&quot;,&quot;bio&quot;:null,&quot;profile_set_up_at&quot;:&quot;2023-08-04T19:55:09.996Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:3380868,&quot;user_id&quot;:128818922,&quot;publication_id&quot;:3318721,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:3318721,&quot;name&quot;:&quot;Artemis Blog&quot;,&quot;subdomain&quot;:&quot;artemisdata&quot;,&quot;custom_domain&quot;:&quot;blog.artemisdata.io&quot;,&quot;custom_domain_optional&quot;:true,&quot;hero_text&quot;:&quot;Behind the Scenes &#8212; Updates from the Artemis 
team&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d43dd7d7-1e8c-4302-a9df-12aecfac7b99_408x408.png&quot;,&quot;author_id&quot;:128818922,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2024-11-08T23:46:27.818Z&quot;,&quot;rss_website_url&quot;:null,&quot;email_from_name&quot;:&quot;Josh from Artemis&quot;,&quot;copyright&quot;:&quot;Josh Gray&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;is_personal_mode&quot;:false}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:false,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://artemisdata.substack.com/p/fragmentation-in-the-data-stack-and?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!U4NJ!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43dd7d7-1e8c-4302-a9df-12aecfac7b99_408x408.png"><span class="embedded-post-publication-name">Artemis Blog</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">Fragmentation in the Data Stack and cost structure with Chris Riccomini </div></div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">a year ago &#183; 3 likes &#183; Josh Gray</div></a></div><p>I also did a podcast interview on <a href="https://podcasts.apple.com/us/podcast/the-joe-reis-show/id1676305617">The Joe Reis Show</a>. 
Joe and I had a leisurely talk about all things data infrastructure. Best enjoyed in a hammock with a beverage of choice.</p><div class="apple-podcast-container" data-component-name="ApplePodcastToDom"><iframe class="apple-podcast " data-attrs="{&quot;url&quot;:&quot;https://embed.podcasts.apple.com/us/podcast/chris-riccomini-building-and-writing-about-data/id1676305617?i=1000676786547&quot;,&quot;isEpisode&quot;:true,&quot;imageUrl&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/podcast-episode_1000676786547.jpg&quot;,&quot;title&quot;:&quot;Chris Riccomini - Building (and Writing About) Data Intensive Applications&quot;,&quot;podcastTitle&quot;:&quot;The Joe Reis Show&quot;,&quot;podcastByline&quot;:&quot;&quot;,&quot;duration&quot;:2865000,&quot;numEpisodes&quot;:&quot;&quot;,&quot;targetUrl&quot;:&quot;https://podcasts.apple.com/us/podcast/chris-riccomini-building-and-writing-about-data/id1676305617?i=1000676786547&amp;uo=4&quot;,&quot;releaseDate&quot;:&quot;2024-11-13T10:39:00Z&quot;}" src="https://embed.podcasts.apple.com/us/podcast/chris-riccomini-building-and-writing-about-data/id1676305617?i=1000676786547" frameborder="0" allow="autoplay *; encrypted-media *;" allowfullscreen="true"></iframe></div><div><hr></div><p>I <a href="https://bsky.app/profile/chris.blue/post/3lb3lprtbr22v">posted a Bluesky thread this weekend</a> arguing that analytics engineers and data engineers should be folded back into a single role. 
I decided to make the argument after coming across <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Benn Stancil&quot;,&quot;id&quot;:5667744,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/a317e60a-9bd1-4c75-bb54-66d517f735dc_1100x1100.jpeg&quot;,&quot;uuid&quot;:&quot;ae855a9a-f51d-4e73-874d-dd7e7364e44e&quot;}" data-component-name="MentionToDOM"></span>&#8217;s post, <a href="https://benn.substack.com/p/disband-the-analytics-team">Disband the analytics team</a>. The post wrestles with the struggles that analytics teams face and offers the choice: disband or rebrand.</p><blockquote><p>In short, my answer is that analytics&#8212;not as an industry or as a technology ecosystem, but <em>as a discipline</em>&#8212;might not work. The average company may never be able to make better decisions by hiring a team of average analysts. We can <a href="https://benn.substack.com/p/data-is-for-dashboards">make dashboards</a> and be <a href="https://benn.substack.com/p/the-end-of-our-purple-era">operational accountants</a>. But the fun, exploratory, &#8220;valuable&#8221; work may always be an indulgent, empty dessert, and never the entr&#233;e we want it to be. 
&#8212; <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Benn Stancil&quot;,&quot;id&quot;:5667744,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/a317e60a-9bd1-4c75-bb54-66d517f735dc_1100x1100.jpeg&quot;,&quot;uuid&quot;:&quot;818d2250-2c9d-4b88-9357-d9b8e991e3d5&quot;}" data-component-name="MentionToDOM"></span>, <a href="https://benn.substack.com/p/disband-the-analytics-team">Disband the analytics team</a></p></blockquote><p>I&#8217;ve <a href="https://bsky.app/profile/archive.chris.blue/post/3l7htj4huw42b">long held</a> that creating the &#8220;analytics engineer&#8221; role was a mistake. <a href="https://www.getdbt.com/what-is-analytics-engineering">dbt Labs says</a>, &#8220;Analytics engineers provide clean data sets to end users, modeling data in a way that empowers end users to answer their own questions.&#8221; I don&#8217;t believe that this set of activities provides enough value to justify a full headcount; it&#8217;s too limited in scope and too far removed from revenue generation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aC0w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d063aa6-0fa9-4181-8c7b-d50e0b3894b0_1462x758.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aC0w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d063aa6-0fa9-4181-8c7b-d50e0b3894b0_1462x758.png 424w, 
https://substackcdn.com/image/fetch/$s_!aC0w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d063aa6-0fa9-4181-8c7b-d50e0b3894b0_1462x758.png 848w, https://substackcdn.com/image/fetch/$s_!aC0w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d063aa6-0fa9-4181-8c7b-d50e0b3894b0_1462x758.png 1272w, https://substackcdn.com/image/fetch/$s_!aC0w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d063aa6-0fa9-4181-8c7b-d50e0b3894b0_1462x758.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aC0w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d063aa6-0fa9-4181-8c7b-d50e0b3894b0_1462x758.png" width="728" height="377.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d063aa6-0fa9-4181-8c7b-d50e0b3894b0_1462x758.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:755,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:181382,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aC0w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d063aa6-0fa9-4181-8c7b-d50e0b3894b0_1462x758.png 424w, 
https://substackcdn.com/image/fetch/$s_!aC0w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d063aa6-0fa9-4181-8c7b-d50e0b3894b0_1462x758.png 848w, https://substackcdn.com/image/fetch/$s_!aC0w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d063aa6-0fa9-4181-8c7b-d50e0b3894b0_1462x758.png 1272w, https://substackcdn.com/image/fetch/$s_!aC0w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d063aa6-0fa9-4181-8c7b-d50e0b3894b0_1462x758.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://bsky.app/profile/archive.chris.blue/post/3l7htj4huw42b">View Post</a></figcaption></figure></div><p>Extracting, transforming, and loading (ETL&#8217;ing) data used to be handled by one team: the data warehouse team. But several trends have encouraged a schism in warehouse teams. Some, the data engineers, now work on data pipelines (extract and load), while others, the analytics engineers, work on data marts (&#8220;clean data sets&#8221;, as dbt Labs calls them). In short, data engineers do the E and L, and analytics engineers do the T. The following trends contributed to this bifurcation.</p><ul><li><p>We switched from ETL to ELT when we adopted data lake architectures. Dumping garbage into an object store made it easy for data engineers to ignore transformations and gave analytics engineers something to do. </p></li><li><p>Similarly, adopting data integration with Kafka and Kafka Connect greatly expanded the number (and importance) of data pipelines in an organization, which gave the data engineers something to do as well.</p></li><li><p><a href="https://en.wikipedia.org/wiki/Shift-left_testing">Shift-left</a> became a data philosophy that encouraged everyone to be their own analyst, which left analysts squeezed.</p></li><li><p><a href="https://en.wikipedia.org/wiki/Zero_interest-rate_policy">ZIRP</a> ended, and CFOs took a hard look at the cost of analyst and data teams, which further squeezed analysts. </p></li><li><p>dbt Labs, <a href="https://motherduck.com">MotherDuck</a>, and other <a href="https://www.moderndatastack.xyz/">MDS vendors</a> were all too willing to create a new role to sell their products, which dovetailed nicely with analysts&#8217; desire to <a href="https://bsky.app/profile/chad-isenberg.bsky.social/post/3lb42pnf2ms2t">be engineers and get paid more</a>.</p></li><li><p>LLMs are replacing analysts in some cases. 
Screech all you want, but it&#8217;s happening. There are dozens of data chatbots now (<a href="https://www.cimba.ai/">Cimba.ai</a>&#65129;, <a href="https://datachat.ai/">DataChat</a>, <a href="https://julius.ai/">Julius.ai</a>), and LLMs write pretty good SQL.</p></li></ul><p>We&#8217;re in a new world now, though. ZIRP is gone, most of the connectors that data engineers were working on have been built, there are <a href="https://fivetran.com">many</a> <a href="https://airbyte.com">vendors</a> <a href="https://www.decodable.co/">you</a> can pay to run your data pipeline, and chatbots can answer data questions. It&#8217;s time to merge data engineers and analytics engineers back into a single data team that&#8217;s responsible for E, T, and L.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4ul5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F009de8c6-2534-47eb-9f80-f31f9e7ff20a_1466x806.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4ul5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F009de8c6-2534-47eb-9f80-f31f9e7ff20a_1466x806.png 424w, https://substackcdn.com/image/fetch/$s_!4ul5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F009de8c6-2534-47eb-9f80-f31f9e7ff20a_1466x806.png 848w, https://substackcdn.com/image/fetch/$s_!4ul5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F009de8c6-2534-47eb-9f80-f31f9e7ff20a_1466x806.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4ul5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F009de8c6-2534-47eb-9f80-f31f9e7ff20a_1466x806.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4ul5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F009de8c6-2534-47eb-9f80-f31f9e7ff20a_1466x806.png" width="1456" height="801" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/009de8c6-2534-47eb-9f80-f31f9e7ff20a_1466x806.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:801,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238420,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4ul5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F009de8c6-2534-47eb-9f80-f31f9e7ff20a_1466x806.png 424w, https://substackcdn.com/image/fetch/$s_!4ul5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F009de8c6-2534-47eb-9f80-f31f9e7ff20a_1466x806.png 848w, https://substackcdn.com/image/fetch/$s_!4ul5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F009de8c6-2534-47eb-9f80-f31f9e7ff20a_1466x806.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4ul5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F009de8c6-2534-47eb-9f80-f31f9e7ff20a_1466x806.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://bsky.app/profile/chris.blue/post/3lb3lptvku22v">View Post</a></figcaption></figure></div><p>I&#8217;m happy to see companies and projects showing up to ease this transition. 
The most notable one is <a href="https://dlthub.com/">dltHub</a>, which adopts <a href="https://aws.amazon.com/what-is/sdlc/">SDLC</a> best practices for data pipelines much as dbt did for transformations. Tools such as this should make it easier for analytics engineers to take ownership of data pipelines. I&#8217;ve also seen several tools like <a href="https://tabsdata.com">tabsdata</a>&#65129; that merge ETL back into a single tool for analytics and data engineers, rather than having both dltHub and dbt. I expect to see a collapse of data engineering and analytics engineering back to a single team in the next few years.</p><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, 
https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, 
https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[DuckDB Is Not a Data Warehouse]]></title><description><![CDATA[DuckDB is a tool, not a product.]]></description><link>https://materializedview.io/p/duckdb-is-not-a-data-warehouse</link><guid isPermaLink="false">https://materializedview.io/p/duckdb-is-not-a-data-warehouse</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Mon, 04 Nov 2024 12:37:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/82a33fa9-3f95-4bd0-9592-690e2b565702_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Before I get to DuckDB, I&#8217;ve got three house-cleaning items this week: Bluesky, Materialized View&#8217;s one year anniversary, and P99 CONF.</p><p>Let&#8217;s begin with social media. I&#8217;ve moved to <a href="https://bsky.app">Bluesky</a> &#129419;. 
Follow me <a href="https://bsky.app/profile/chris.blue">@chris.blue</a> if you&#8217;ve enjoyed my Twitter posts over the past 15 years. You can crosspost with <a href="https://fedica.com/">Fedica</a> or <a href="https://buffer.com/">Buffer</a> if you like. There are some great <a href="https://bsky.app/search?q=data+go.bsky.app">starter packs</a> to bootstrap your feed, too. Here are a few:</p><ul><li><p><a href="https://bsky.app/starter-pack-short/SCZe42X">Infrastructure Engineers</a> (mine)</p></li><li><p><a href="https://bsky.app/starter-pack/embano1.mgasch.com/3l7i37p3prf2p">Distributed Systems on &#129419;</a></p></li><li><p><a href="https://bsky.app/starter-pack-short/2RLCt3J">AI, ML, Data Science, and Security</a></p></li><li><p><a href="https://bsky.app/starter-pack-short/EjGVoDC">ML, Data, &amp; Tech</a></p></li><li><p><a href="https://bsky.app/starter-pack-short/8TdEfdK">Data People Starter Pack</a></p></li><li><p><a href="https://bsky.app/starter-pack-short/7D4NApV">AI and Data</a></p></li></ul><p>I don&#8217;t know what this means for my Twitter account. All I can say is that I&#8217;ve been using Bluesky exclusively for the past week and it&#8217;s absolutely buzzing. It feels like the good old days. 
I haven&#8217;t missed Twitter at all.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HBtw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8043ec6e-08f7-43fb-a1c1-e39ff7f48baa_1182x316.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HBtw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8043ec6e-08f7-43fb-a1c1-e39ff7f48baa_1182x316.png 424w, https://substackcdn.com/image/fetch/$s_!HBtw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8043ec6e-08f7-43fb-a1c1-e39ff7f48baa_1182x316.png 848w, https://substackcdn.com/image/fetch/$s_!HBtw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8043ec6e-08f7-43fb-a1c1-e39ff7f48baa_1182x316.png 1272w, https://substackcdn.com/image/fetch/$s_!HBtw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8043ec6e-08f7-43fb-a1c1-e39ff7f48baa_1182x316.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HBtw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8043ec6e-08f7-43fb-a1c1-e39ff7f48baa_1182x316.png" width="728" height="194.62605752961082" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8043ec6e-08f7-43fb-a1c1-e39ff7f48baa_1182x316.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:316,&quot;width&quot;:1182,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:92632,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HBtw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8043ec6e-08f7-43fb-a1c1-e39ff7f48baa_1182x316.png 424w, https://substackcdn.com/image/fetch/$s_!HBtw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8043ec6e-08f7-43fb-a1c1-e39ff7f48baa_1182x316.png 848w, https://substackcdn.com/image/fetch/$s_!HBtw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8043ec6e-08f7-43fb-a1c1-e39ff7f48baa_1182x316.png 1272w, https://substackcdn.com/image/fetch/$s_!HBtw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8043ec6e-08f7-43fb-a1c1-e39ff7f48baa_1182x316.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://bsky.app/profile/chris.blue/post/3l7lqkhapiz27">View Post</a></figcaption></figure></div><p>Next, Materialized View <a href="https://materializedview.io/p/hello-world">turned one</a> on October 31 &#127875;. It&#8217;s been an incredible year for the newsletter. 
I&#8217;ve published <a href="https://materializedview.io/archive">50 posts</a> and the newsletter just passed 4,000 subscribers. I&#8217;ve also received a lot of positive feedback. Almost everyone I meet mentions Materialized View. Thanks again for all the support and encouragement.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LnrR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f91c712-e954-4afd-b906-25a0e9a708fb_2312x896.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LnrR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f91c712-e954-4afd-b906-25a0e9a708fb_2312x896.png 424w, https://substackcdn.com/image/fetch/$s_!LnrR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f91c712-e954-4afd-b906-25a0e9a708fb_2312x896.png 848w, https://substackcdn.com/image/fetch/$s_!LnrR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f91c712-e954-4afd-b906-25a0e9a708fb_2312x896.png 1272w, https://substackcdn.com/image/fetch/$s_!LnrR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f91c712-e954-4afd-b906-25a0e9a708fb_2312x896.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LnrR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f91c712-e954-4afd-b906-25a0e9a708fb_2312x896.png" width="1456" height="564" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f91c712-e954-4afd-b906-25a0e9a708fb_2312x896.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:564,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:118040,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LnrR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f91c712-e954-4afd-b906-25a0e9a708fb_2312x896.png 424w, https://substackcdn.com/image/fetch/$s_!LnrR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f91c712-e954-4afd-b906-25a0e9a708fb_2312x896.png 848w, https://substackcdn.com/image/fetch/$s_!LnrR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f91c712-e954-4afd-b906-25a0e9a708fb_2312x896.png 1272w, https://substackcdn.com/image/fetch/$s_!LnrR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f91c712-e954-4afd-b906-25a0e9a708fb_2312x896.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Finally, <a href="https://www.linkedin.com/in/rohanpd/">Rohan Desai</a> and I presented at <a href="https://www.p99conf.io/">P99 CONF</a> and the video is now online. 
Along with my <a href="https://www.youtube.com/watch?v=wEAcNoJOBFI">The Geek Narrator</a> interview, our talk is a great starting point to learn about <a href="https://slatedb.io">SlateDB&#8217;s</a> internals.</p><div id="youtube2-8L_4kWhdzNc" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;8L_4kWhdzNc&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/8L_4kWhdzNc?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div><hr></div><p>A consequence of drinking from the <a href="https://bsky.app/profile/gecky.me/feed/aaans52l2ufzc">dataBS</a> firehose is that you will get a lot of <a href="https://duckdb.org/">DuckDB</a> chatter. For the unfamiliar, DuckDB is essentially <a href="https://www.sqlite.org/">SQLite</a> for columnar data. It has a number of interesting properties. It&#8217;s very portable: it runs locally on your laptop, inside an application, or even in a browser. It&#8217;s also <a href="https://benchmark.clickhouse.com/">very fast</a> (though, I&#8217;m told <a href="https://motherduck.com/blog/perf-is-not-enough/">that&#8217;s not enough</a>). Most importantly, it can connect to remote storage to read <a href="https://duckdb.org/docs/data/parquet/overview.html">Apache Parquet files</a> and <a href="https://duckdb.org/docs/extensions/iceberg.html">Apache Iceberg tables</a>.</p><p>These properties have made DuckDB a favorite among analytics and data engineers. All kinds of creative DuckDB uses have popped up. 
<a href="https://www.okta.com/">Okta</a> <a href="https://www.datacouncil.ai/talks/processing-trillions-of-records-at-okta-with-mini-serverless-databases">uses DuckDB to cheaply transform data</a> before it enters <a href="https://www.snowflake.com/">Snowflake</a>. MotherDuck also <a href="https://motherduck.com/blog/motherduck-kestra-etl-pipelines/">showcases ETL examples</a>. <a href="https://www.rilldata.com/blog/why-we-built-rill-with-duckdb">Rill</a> and <a href="https://mode.com/blog/how-we-switched-in-memory-data-engine-to-duck-db-to-boost-visual-data-exploration-speed">Mode</a> have both adopted DuckDB as their in-memory query engines. PostgreSQL has been overrun with DuckDB extensions such as <a href="https://github.com/duckdb/pg_duckdb">pg_duckdb</a>, <a href="https://mooncake.dev/">pg_mooncake</a>, and <a href="https://github.com/paradedb/pg_analytics">pg_analytics</a>. You can even use DuckDB to query <a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page">New York City taxi data</a> straight from your laptop (or <a href="https://modal.com/docs/examples/s3_bucket_mount">from Modal</a>&#65129;).</p><p>Given that DuckDB is an <a href="https://en.wikipedia.org/wiki/Online_analytical_processing">online analytical processing (OLAP)</a> database, you might expect to see stories of DuckDB replacing <a href="https://snowflake.com">Snowflake</a>, <a href="https://aws.amazon.com/redshift/">Redshift</a>, <a href="https://cloud.google.com/bigquery">BigQuery</a>, or <a href="https://www.databricks.com/">Databricks</a> as a data warehouse. There are <a href="https://www.definite.app/blog/duckdb-datawarehouse">some</a>, but not many. I&#8217;ve always been skeptical of the idea that DuckDB is a viable solution for an enterprise data warehouse. 
<span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Ananth Packkildurai&quot;,&quot;id&quot;:3520227,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/4f38fa68-8a30-4357-a48e-6833efe28c0f_989x989.jpeg&quot;,&quot;uuid&quot;:&quot;8bd1876e-1cfa-4166-9516-3e9835148c15&quot;}" data-component-name="MentionToDOM"></span> (of <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Data Engineering Weekly&quot;,&quot;id&quot;:73271,&quot;type&quot;:&quot;pub&quot;,&quot;url&quot;:&quot;https://open.substack.com/pub/dataengineeringweekly&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/6f51c1bf-abc9-4cd3-ad69-22cc8e3f1ef2_1080x1080.png&quot;,&quot;uuid&quot;:&quot;89a9d907-fe6e-4e01-a7cb-e7db696da100&quot;}" data-component-name="MentionToDOM"></span> fame) posted an observation that resonated with me:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kO13!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadefdbbd-d2c7-4041-b793-1d04fc4c3bca_1172x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kO13!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadefdbbd-d2c7-4041-b793-1d04fc4c3bca_1172x700.png 424w, https://substackcdn.com/image/fetch/$s_!kO13!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadefdbbd-d2c7-4041-b793-1d04fc4c3bca_1172x700.png 848w, 
https://substackcdn.com/image/fetch/$s_!kO13!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadefdbbd-d2c7-4041-b793-1d04fc4c3bca_1172x700.png 1272w, https://substackcdn.com/image/fetch/$s_!kO13!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadefdbbd-d2c7-4041-b793-1d04fc4c3bca_1172x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kO13!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadefdbbd-d2c7-4041-b793-1d04fc4c3bca_1172x700.png" width="728" height="434.8122866894198" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/adefdbbd-d2c7-4041-b793-1d04fc4c3bca_1172x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:700,&quot;width&quot;:1172,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:198122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kO13!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadefdbbd-d2c7-4041-b793-1d04fc4c3bca_1172x700.png 424w, https://substackcdn.com/image/fetch/$s_!kO13!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadefdbbd-d2c7-4041-b793-1d04fc4c3bca_1172x700.png 848w, 
https://substackcdn.com/image/fetch/$s_!kO13!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadefdbbd-d2c7-4041-b793-1d04fc4c3bca_1172x700.png 1272w, https://substackcdn.com/image/fetch/$s_!kO13!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadefdbbd-d2c7-4041-b793-1d04fc4c3bca_1172x700.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://bsky.app/profile/chris.blue/post/3l7tyr2llko2f">View 
Post</a></figcaption></figure></div><p>DuckDB&#8217;s deployment model and limited scalability are what I struggle with. If you&#8217;re in an enterprise, your data warehouse users are going to include product managers, customer support, risk analysts, business analysts, finance teams, operations teams&#8212;virtually everyone at the company. I don&#8217;t see how DuckDB can be deployed in such an organization. It&#8217;s untenable to install DuckDB on everyone&#8217;s laptop, grant everyone access to data lake buckets, and ask them to run queries from the CLI. </p><p>Even if a company wanted to use DuckDB as their data warehouse, they couldn&#8217;t. DuckDB can&#8217;t handle the largest queries an enterprise might wish to run. MotherDuck <a href="https://motherduck.com/blog/redshift-files-hunt-for-big-data/">has rightly pointed out</a> that <a href="https://motherduck.com/blog/big-data-is-dead/">most queries are small</a>. What they don&#8217;t say is that the most valuable queries in an organization <em>are</em> large: financial reconciliation, recommendation systems, advertising, and others. These are the revenue drivers. They might comprise a minority of all the queries an organization runs, but they make the money. 
DuckDB just can&#8217;t handle such queries.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gWEu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75cd778-4993-468f-9492-822cc7c4cc87_1384x1064.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gWEu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75cd778-4993-468f-9492-822cc7c4cc87_1384x1064.png 424w, https://substackcdn.com/image/fetch/$s_!gWEu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75cd778-4993-468f-9492-822cc7c4cc87_1384x1064.png 848w, https://substackcdn.com/image/fetch/$s_!gWEu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75cd778-4993-468f-9492-822cc7c4cc87_1384x1064.png 1272w, https://substackcdn.com/image/fetch/$s_!gWEu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75cd778-4993-468f-9492-822cc7c4cc87_1384x1064.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gWEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75cd778-4993-468f-9492-822cc7c4cc87_1384x1064.png" width="1384" height="1064" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b75cd778-4993-468f-9492-822cc7c4cc87_1384x1064.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1064,&quot;width&quot;:1384,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:330595,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gWEu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75cd778-4993-468f-9492-822cc7c4cc87_1384x1064.png 424w, https://substackcdn.com/image/fetch/$s_!gWEu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75cd778-4993-468f-9492-822cc7c4cc87_1384x1064.png 848w, https://substackcdn.com/image/fetch/$s_!gWEu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75cd778-4993-468f-9492-822cc7c4cc87_1384x1064.png 1272w, https://substackcdn.com/image/fetch/$s_!gWEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75cd778-4993-468f-9492-822cc7c4cc87_1384x1064.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://x.com/bernhardsson/status/1595603484736380928">View Post</a></figcaption></figure></div><p>To be a viable data warehouse, DuckDB needs a centralized deployment model, a better UI, and a way to scale. This is exactly what MotherDuck is building, and it sounds a lot like Snowflake or BigQuery. As much as MotherDuck would like to be the DuckDB vendor, they&#8217;re a cloud data warehouse that just happens to use DuckDB.</p><p>This begs the question: why should I switch from my current data warehouse to MotherDuck? It seems like the answer right now is cost. Cloud data warehouses are expensive. MotherDuck saves money by running DuckDB on small data sets. But it&#8217;s also really expensive to change data warehouses. It&#8217;s often easier to cut costs in your existing data warehouse by auditing queries and data retention.</p><p>Smaller companies can adopt DuckDB or MotherDuck and scale cheaply as they grow. 
This is a reasonable story for SMBs, but not for enterprises that already have a warehouse. But SMBs can also adopt the PostgreSQL extensions that I mentioned earlier. If I were tasked with rolling out DuckDB in an organization, that&#8217;s probably how I&#8217;d do it.</p><p>So, on the one hand, MotherDuck has picked a fight with some of the nastiest apex predators out there: Snowflake, BigQuery, and Databricks. On the other, they&#8217;re getting squeezed by PostgreSQL extensions and DuckDB on the laptop. This is a tough environment. MotherDuck has raised <a href="https://techcrunch.com/2023/09/20/database-startup-motherduck-lands-52-5m-to-grow-its-duckdb-based-platform">a lot of money</a>, so perhaps they can find enough SMB customers and wait for them to scale.</p><p>As for DuckDB itself, I think <a href="https://bsky.app/profile/pedram.lol">Pedram</a> and Erik have it right in the Tweet above. It&#8217;s amazing middleware, much like SQLite. I don&#8217;t see it as a data warehouse, though.</p><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, 
https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, 
https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. 
See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[Virtual Machines Are Getting Better]]></title><description><![CDATA[Unikernels, GPU checkpointing, and VM migration are going to reshape the cloud.]]></description><link>https://materializedview.io/p/virtual-machines-are-getting-better</link><guid isPermaLink="false">https://materializedview.io/p/virtual-machines-are-getting-better</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Wed, 23 Oct 2024 12:11:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c47142fe-4028-4dd3-86e6-13e8b9574f08_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Virtual machines (VMs), containers, and serverless execution models have been around for a while. Until recently, these technologies have offered fairly generic features. Consequently, they <em>kind of</em> work for many use cases, but <em>truly</em> work for relatively few.</p><p>Microservices, serverless functions, AI, batch computation, stateful services, and other use cases all need special features to work well. Serverless functions need fast start times, AI and LLM workloads need GPUs, and many use cases could benefit from better state snapshotting technology. Yet serverless functions suffer from cold start latencies <a href="https://docs.aws.amazon.com/lambda/latest/operatorguide/execution-environments.html#cold-start-latency">between 100ms and 1 second</a>. Snapshots are slow and migrate only certain pieces of state, such as memory, while leaving other pieces, such as GPU state or network addresses, behind. And GPU support in these execution models is spotty at best.</p><p>VM snapshots are useful when recovering from a failure or migrating execution to a new machine. 
Moving your workload to a different machine (or cloud) could save money, unlock better GPUs, or speed up training and inference. <a href="https://www.cedana.ai/">Cedana</a>&#65129;, <a href="https://modal.com/">Modal</a>&#65129;, and <a href="https://microsoft.com">Microsoft</a> have been <a href="https://arxiv.org/pdf/2202.07848">working on this problem</a> for some time.</p><p>This is why I&#8217;m excited to see a spate of recent developments that target cold start, snapshot, and GPU requirements.</p><p>GPU features are evolving quickly. <a href="https://gvisor.dev/">gVisor</a> <a href="https://gvisor.dev/docs/user_guide/gpu/">added GPU support</a> last year, <a href="https://www.nvidia.com/">NVIDIA</a> recently open sourced their <a href="https://github.com/NVIDIA/cuda-checkpoint">cuda-checkpoint</a> tool, and <a href="https://firecracker-microvm.github.io/">Firecracker</a> <a href="https://github.com/firecracker-microvm/firecracker/issues/1179">had a meeting</a> on October 9th, 2024 to discuss GPU support.</p><p>NVIDIA&#8217;s cuda-checkpoint is particularly important. <a href="https://developer.nvidia.com/cuda-toolkit">CUDA</a> offers GPU APIs meant for general-purpose computation (not gaming). Such APIs are widely used in AI models and LLMs. As developers execute operations on a GPU, the GPU&#8217;s memory accumulates state. This GPU data is very difficult to read directly, which poses a problem if you wish to snapshot a machine&#8217;s state. Now cuda-checkpoint offers a simple, bare-bones, free tool for GPU checkpoint and restore.</p><p>For serverless functions, <a href="https://en.wikipedia.org/wiki/Unikernel">unikernels</a> such as <a href="https://unikraft.org">Unikraft</a> now boast <a href="https://unikraft.org/docs/concepts/performance">single-digit millisecond boot times</a> and <a href="https://unikraft.cloud/how-it-works/">fast snapshotting</a>. This should enable faster cold starts and scale-to-zero, which will result in cost savings. 
Many unikernels tout <a href="https://unikraft.org/docs/concepts/security">increased</a> <a href="https://nanovms.com/security">security</a> in multi-tenant environments, as well. As unikernels add <a href="https://kubernetes.io/">Kubernetes</a> support, I expect adoption to increase, so non-serverless workloads like microservices will benefit. I&#8217;ve also heard there are benefits for specific verticals such as gaming.</p><p>Meanwhile, <a href="https://loopholelabs.io/">Loophole Labs</a> has launched <a href="https://architect.run">Architect</a> to simplify VM migrations. They&#8217;ve built some really slick tech that incrementally snapshots and migrates both memory state and network bindings between machines. Architect purports to migrate a workload faster than a spot instance preemption occurs, which would allow for huge cost savings for many workloads.</p><p>These developments will have a big impact on how we think about the cloud. We will, for example, see more multi-cloud adoption as state becomes easier to migrate and GPUs become harder to come by. <a href="https://yingjunwu.github.io/">Yingjun Wu</a> (Founder of <a href="https://risingwave.com/">RisingWave</a>) pointed me to <a href="https://sky.cs.berkeley.edu/project/skypilot/">SkyPilot</a>, a <a href="https://www.berkeley.edu/">UC Berkeley</a> project that offers multi-cloud deployment specifically for AI, LLM, and batch workloads. Serverless functions might supplant service-oriented applications, too. <a href="https://vercel.com/">Vercel</a> has been doing <a href="https://x.com/criccomini/status/1846684272608035244">great work here</a>. 
These are big shifts with big implications.</p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[Small Batch, Artisanal ETL is Back]]></title><description><![CDATA[Batch ETL is back, and we're bringing everyone this time!]]></description><link>https://materializedview.io/p/small-batch-artisanal-etl-is-back</link><guid isPermaLink="false">https://materializedview.io/p/small-batch-artisanal-etl-is-back</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Thu, 10 Oct 2024 09:59:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/427b21c6-c467-4638-871b-30206122916e_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I joined <a href="https://x.com/thegeeknarrator">The Geek Narrator</a> to talk about <a href="https://slatedb.io">SlateDB</a> last week. This is the most complete description of what SlateDB is, why we built it, and how it works (at least until our <a href="https://www.p99conf.io/">P99 CONF</a> talk later this month). 
Check out the video below for the full interview:</p><div id="youtube2-wEAcNoJOBFI" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;wEAcNoJOBFI&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/wEAcNoJOBFI?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Speaking of P99 CONF, I did a little <a href="https://www.p99conf.io/2024/10/03/p99-conf-speaker-spotlight-chris-riccomini/">speaker spotlight</a> with them last week. I talk about the projects I&#8217;m working on and which talks I&#8217;m most excited about at P99 CONF.</p><div><hr></div><p>In 2019, I gave a <a href="https://qconsf.com/">QCon</a> talk entitled <a href="https://www.youtube.com/watch?v=ZZr9oE4Oa5U">The Future of Data Engineering</a>. It discussed the evolution of our data pipeline at <a href="https://wepay.com/">WePay</a> over the years. The talk presents a series of stages that most organizations go through when building pipelines:</p><ul><li><p>Stage 0: None</p></li><li><p>Stage 1: Batch</p></li><li><p>Stage 2: Realtime</p></li><li><p>Stage 3: Integration</p></li><li><p>Stage 4: Automation</p></li><li><p>Stage 5: Decentralization</p></li></ul><p>I won&#8217;t discuss each one of these (see <a href="https://www.youtube.com/watch?v=ZZr9oE4Oa5U">the video</a> or read <a href="https://www.infoq.com/news/2019/11/data-engineering-future-qconsf/">the blog</a>), but one point I make in the talk is that it&#8217;s reasonable, if not advisable, to take your time with this evolution. It&#8217;s possible you won&#8217;t even need the later stages. In fact, WePay had a batch data pipeline&#8212;stage 1&#8212;running on <a href="https://airflow.apache.org">Airflow</a> nearly the entire time I was there. 
It ran at 15-minute intervals, and it was remarkably effective.</p><p>These days, the ETL ecosystem is very diverse. An enterprise can still stand up an ETL pipeline like WePay&#8217;s using Airflow, <a href="https://prefect.io">Prefect</a>, or <a href="https://dagster.io/">Dagster</a>. Or an enterprise might choose to adopt a data integration platform such as <a href="https://airbyte.com/">Airbyte</a>, <a href="https://www.fivetran.com/">Fivetran</a>, or <a href="https://docs.confluent.io/platform/current/connect/index.html">Kafka Connect</a>. Many cloud providers offer one-click integration between their cloud SQL services and their data warehouses as well.</p><p>For many smaller organizations, these tools are overkill. Though I don&#8217;t fully agree with <a href="https://motherduck.com/">MotherDuck</a>&#8217;s <a href="https://motherduck.com/blog/big-data-is-dead/">Big Data is Dead</a> thesis, they&#8217;re right that a lot of organizations just have a few terabytes of data. Developers are thinking: if I can get by with a small query engine (<a href="https://duckdb.org/">DuckDB</a>), maybe I can get by with a small ETL tool.</p><p>Pipeline ownership has shifted as well. WePay had a team of data engineers to manage our data pipelines. These days, we&#8217;re asking ML, AI, analytics, and product engineers to build data pipelines. These users are accustomed to interacting with data in a different way. They like to transform data in batch through SQL and Python with tools like <a href="https://www.getdbt.com/">dbt</a>, <a href="https://jupyter.org/">Jupyter notebooks</a>, and <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html">Pandas data frames</a>. They want an ETL tool that integrates with this flow.</p><p>The closest thing that I&#8217;ve come across so far is <a href="https://github.com/dlt-hub/dlt">data load tool (dlt)</a>. 
As the name suggests, dlt is philosophically similar to dbt; <a href="https://dlthub.com/why-dlt">it&#8217;s trying to give these new users a tool that fits into their flow</a>, but that incorporates software development life cycle (SDLC) best practices. Unlike dbt, which focuses more on transformations, dlt offers sources and sinks and a simple API to work with:</p><blockquote><p>The proliferation of Python libraries such as Pandas, Jupyter notebooks, NumPy or PyTorch revolutionised the ML/AI space by allowing millions of practitioners to actively build the ecosystem.</p><p>We aim to bring such revolution to the data space. dlt is a pip-installable, minimalistic library that anyone writing Python code can use. dlt enables those people to create new datasets and move them to tools they use - be it other Python projects or engines or tooling from the Modern Data Stack.</p></blockquote><p>dlt fits nicely into many so-called <a href="https://www.moderndatastack.xyz/">modern data stack (MDS)</a> tools out there, but also runs on its own. It&#8217;s a nice stepping stone from stage 1 to stage 2 in the list of stages above.</p><p>There are challenges to this approach. If everyone is managing their own data pipelines, there will be duplication. There is also likely to be inefficiency as engineers opt for simpler solutions rather than cheaper ones. Moreover, the pipelines could become a tangled mess with no clear ownership as employees come and go. These challenges are probably familiar to anyone working with dbt in a large organization.</p><p>But I think dlt has recognized something: data consumers want control of their data pipelines. I&#8217;m excited to see where this goes. 
There&#8217;s a lot of work to be done to get this right.</p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> and <a href="https://materializedview.capital">Materialized View Capital</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[The New Era of Flexible Infrastructure Deployment]]></title><description><![CDATA[Flexible deployment is now table stakes. Infrastructure must run embedded, client-side, single-node, clustered, as SaaS, BYOC, and self-hosted.]]></description><link>https://materializedview.io/p/the-new-era-of-flexible-infrastructure</link><guid isPermaLink="false">https://materializedview.io/p/the-new-era-of-flexible-infrastructure</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Thu, 26 Sep 2024 18:19:49 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/02249fc8-a5b3-451d-90d1-05f399ab1e42_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://www.confluent.io/">Confluent&#8217;s</a> recent <a href="https://www.warpstream.com/">WarpStream</a>&#65129; <a href="https://www.confluent.io/blog/confluent-acquires-warpstream/">acquisition</a> came as a surprise to me. Why buy WarpStream four months after announcing <a href="https://www.confluent.io/blog/introducing-confluent-cloud-freight-clusters/">Freight Clusters</a>, a direct competitor to WarpStream? Confluent says WarpStream will enable bring your own cloud (BYOC) deployment for its customers.</p><p>I was a bit skeptical of Confluent&#8217;s messaging at first&#8212;acquisition motivation and communication are rarely the same. But I&#8217;m coming around to the idea. 
Freight can&#8217;t support BYOC deployment because it&#8217;s built on <a href="https://www.confluent.io/blog/cloud-native-data-streaming-kafka-engine/">Kora</a>, the proprietary system Confluent built for its SaaS offering. Kora is not easily deployed in a customer&#8217;s environment, whereas WarpStream is.</p><p>Confluent can now offer self-hosted, BYOC, and cloud versions of their product. I suspect their customers were asking for BYOC, a trend I expect to increase (see <a href="https://materializedview.io/p/bring-your-own-cloud-nuon-and-hosted">Bring Your Own Cloud, Nuon&#65129;, and Hosted SaaS Challenges With Jon Morehouse</a>). Confluent&#8217;s product offering is indicative of what customers want: flexible deployment options. There&#8217;s no one-size-fits-all deployment offering.</p><p>But Confluent is now running (and supporting) three different versions of Kafka: self-managed open-source Kafka, Confluent Cloud with Kora, and WarpStream. If Confluent were starting from scratch now, I don&#8217;t think they would have built things this way. While there&#8217;s no one-size-fits-all deployment offering, I suspect we can still build a single system that can accommodate flexible deployment. Doing so would provide a much better user (and vendor) experience.</p><p><a href="https://clickhouse.com/">ClickHouse</a> is an example of such a system. I really didn&#8217;t understand why everyone was so excited about ClickHouse until I dug into it in my recent post, <a href="https://materializedview.io/p/unpacking-the-buzz-around-clickhouse">Unpacking the Buzz around ClickHouse</a>. Then it clicked: you can run ClickHouse as a single node. In hindsight, it&#8217;s a comically obvious gap in the realtime online analytical processing (OLAP) space.</p><p>ClickHouse is more than a single-node binary, though. 
It can also be run <a href="https://clickhouse.com/docs/en/architecture/cluster-deployment">in a cluster</a> or embedded as an in-process library with <a href="https://github.com/chdb-io/chdb">chDB</a>. The company also offers <a href="https://clickhouse.com/cloud">SaaS cloud</a> and <a href="https://clickhouse.com/cloud/bring-your-own-cloud">BYOC</a> deployment products. This is an incredible amount of flexibility for one system. It scales from a single in-process library all the way to a distributed cluster. And it can be deployed as a self-managed service, as BYOC, or as SaaS. Its highly efficient single-node binary also scales up nicely. This is the future.</p><p>Once this clicked, I started seeing the trend everywhere. PostgreSQL now has <a href="https://pglite.dev/">pglite.dev</a> for client-side and edge deployment. <a href="https://neon.tech/">Neon</a> has added an object storage tier (granted, it was a fork when I last checked). <a href="https://motherduck.com/">MotherDuck</a> is trying to stretch <a href="https://duckdb.org/">DuckDB</a> from analyst laptops to the cloud. 
<a href="https://www.getdaft.io/">Daft</a>&#65129;&#8217;s next-generation query engine is going after not only <a href="https://spark.apache.org/">Spark</a>-scale workloads, but DuckDB as well.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n5_h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72655317-26f2-4d70-a748-5bb69bbb6294_688x511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n5_h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72655317-26f2-4d70-a748-5bb69bbb6294_688x511.png 424w, https://substackcdn.com/image/fetch/$s_!n5_h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72655317-26f2-4d70-a748-5bb69bbb6294_688x511.png 848w, https://substackcdn.com/image/fetch/$s_!n5_h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72655317-26f2-4d70-a748-5bb69bbb6294_688x511.png 1272w, https://substackcdn.com/image/fetch/$s_!n5_h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72655317-26f2-4d70-a748-5bb69bbb6294_688x511.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n5_h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72655317-26f2-4d70-a748-5bb69bbb6294_688x511.png" width="688" height="511" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72655317-26f2-4d70-a748-5bb69bbb6294_688x511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:511,&quot;width&quot;:688,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:114347,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n5_h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72655317-26f2-4d70-a748-5bb69bbb6294_688x511.png 424w, https://substackcdn.com/image/fetch/$s_!n5_h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72655317-26f2-4d70-a748-5bb69bbb6294_688x511.png 848w, https://substackcdn.com/image/fetch/$s_!n5_h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72655317-26f2-4d70-a748-5bb69bbb6294_688x511.png 1272w, https://substackcdn.com/image/fetch/$s_!n5_h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72655317-26f2-4d70-a748-5bb69bbb6294_688x511.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a>
<figcaption class="image-caption"><a href="https://x.com/criccomini/status/1838730044707144125">View Post</a></figcaption></figure></div><p>There are many reasons for this shift. On the product side of the fence, we&#8217;re finally understanding what infrastructure customers want: flexible deployment. On the tech side, new trends such as zero-disk architecture, Rust and C++ adoption, <a href="https://wesmckinney.com/blog/looking-back-15-years/">composable data systems</a>, and <a href="https://webassembly.org/">WebAssembly&#8217;s</a> (Wasm) <a href="https://x.com/lennypruss/status/1831373054360088712">slow but continued growth</a> have enabled us to actually build such systems.</p><p>Infrastructure vendors need to ask themselves: does our system deploy self-managed, BYOC, and as a SaaS cloud? Can we support in-process embedded, single-node, and clustered execution? 
If not, you&#8217;re cooked.</p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[FizzBee, TLA+, and (Practical) Formal Software Verification with JP Kadarkarai]]></title><description><![CDATA[JP, the creator of FizzBee, talks about formal methods, TLA+, distributed systems verification, and a better way forward.]]></description><link>https://materializedview.io/p/fizzbee-tla-and-formal-software-verification</link><guid isPermaLink="false">https://materializedview.io/p/fizzbee-tla-and-formal-software-verification</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Wed, 18 Sep 2024 11:24:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/95aba24a-24cf-4dba-b0d1-eaa2caa95e98_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I recently had the chance to interview <a href="https://www.linkedin.com/in/jayaprabhakar/overlay/about-this-profile/">Jayaprabhakar (JP) Kadarkarai</a>. JP is the creator of FizzBee, a design specification language and model checker to specify distributed systems. Prior to FizzBee, JP worked at Clumio, Lyft, and was a tech lead at Google.</p><div><hr></div><p><em><strong>C.R.: Let's start with how you got into formal methods. It looks like you spent 12 years at Google before moving on to Lyft and Clumio. At what point did you first come across formal methods, what problem were you solving, and which tools did you use?</strong></em></p><p>J.P.: Before joining <a href="https://google.com">Google</a>, the systems I worked with were relatively small-scale, and we managed consistency through centralized databases. 
However, at Google, everything we built had to be large-scale, highly available, and eventually consistent. This shift required a deep understanding of how to maintain data integrity and fault tolerance in a decentralized environment.</p><p>As I worked with systems like <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf">GFS</a>, <a href="https://cloud.google.com/bigtable">Bigtable</a>, and <a href="https://cloud.google.com/spanner">Spanner</a>, and delved into their design papers, I saw firsthand the importance of ensuring system correctness, especially in scenarios involving data consistency and failure recovery. These papers often included rigorous proofs, highlighting the complexity of building robust systems. This experience sparked my interest in exploring more rigorous methods for verifying correctness.</p><p>After leaving Google, I noticed a gap in the broader industry, where microservices and NoSQL databases were often adopted without a solid theoretical foundation. At both <a href="https://www.lyft.com/">Lyft</a> and <a href="https://clumio.com/">Clumio</a>, I encountered consistency issues that often led to customer escalations and the need for refunds.</p><p>Having a habit of reading papers on various topics, I explored distributed systems classics and came across works by <a href="https://en.wikipedia.org/wiki/Leslie_Lamport">Leslie Lamport</a> and <a href="https://people.csail.mit.edu/lynch/">Nancy Lynch</a>. This led me to discover <a href="https://lamport.azurewebsites.net/tla/tla.html">TLA+</a>, which opened up a whole class of formal methods tools. Among these, TLA+ stood out as the most accessible and widely used.</p><p>My experience with formal methods, particularly TLA+, is relatively recent, spanning just a couple of years. The first project I used TLA+ for was at Clumio, where we needed to charge customers for their backups. 
The previous system had design issues, and fixing them with code changes like retries or locks didn't address the root cause&#8212;it often just moved the problem elsewhere. So, when I took over the project, I wanted to prove the system's correctness. Initially, I explained the design, but then I wrote the TLA+ specs as well. Unfortunately, I couldn't convince anyone to review the TLA+ specs.</p><p><em><strong>C.R.: I think I can guess, but why didn&#8217;t anyone want to review the TLA+ spec?</strong></em></p><p>J.P.: The main reason was the complexity of TLA+'s syntax and semantics. Its syntax isn't intuitive&#8212;it relies heavily on ASCII representations of mathematical symbols, which can feel alien to most programmers. The semantics also require a level of mathematical reasoning that many developers aren't used to.</p><p>Moreover, by the time I had learned TLA+ in my spare time and modeled the system, a few weeks had passed, and the spec was no longer a priority for the team. I did get my proposal accepted, but it wasn&#8217;t the smoothest process. If there had been a more accessible formal verification tool, I could have clearly demonstrated the advantages of my proposal without risking misunderstandings or tension.</p><p><em><strong>C.R.: I suppose this is where <a href="https://fizzbee.io/">FizzBee</a>&nbsp;enters the picture. How did you get from TLA+ to FizzBee, and can you give a brief overview of your new project?</strong></em></p><p>J.P.: While working with TLA+, I found myself instinctively translating its specs into a more familiar, Java-like syntax to make sense of them. This led me to realize that a tool with a syntax closer to programming languages would be more intuitive. 
I explored various options and ultimately decided on a Python-like syntax because it felt the most natural and concise for expressing complex ideas.</p><p>For these methods to be widely adopted&#8212;not just for mission-critical applications but for everyday software engineering&#8212;I believe they must be simpler and easier to use. When there&#8217;s a tradeoff between quality and time to market, time to market almost always takes precedence.</p><p>Our goal with FizzBee is to enable developers to ship high-quality software faster. It&#8217;s designed to make it easier for everyday software engineers to ensure system correctness and performance. FizzBee uses a Python-like syntax to make the process of modeling and verification more intuitive and concise. The time taken to model a system in FizzBee is often less than the time it takes to write a traditional design document.</p><p>Additionally, FizzBee automates the generation of visualizations, like block and sequence diagrams, directly from the model, making it easier to communicate and review designs. This approach not only simplifies the design process but also helps catch potential issues early, allowing teams to ship more quickly.</p><p><em><strong>C.R.: What does FizzBee&#8217;s architecture look like? How is it implemented?</strong></em></p><p>J.P.: FizzBee&#8217;s current implementation is fairly straightforward.</p><p>The parser, built with <a href="https://www.antlr.org/">ANTLR</a>, converts the specification into an <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree">abstract syntax tree (AST)</a>. The model checker is an interpreter implemented in Go. It runs a breadth-first search from the initial state, evaluating next states while following FizzBee-specific rules for forking, context switching, crashes, and more. 
It uses the <a href="https://github.com/bazelbuild/starlark">Starlark</a> interpreter to evaluate expressions, checks assertions, and eventually evaluates liveness.</p><p>Once the checking is complete, the states are stored on disk, and we then use them for performance modeling in Python. The online playground and visualizations are implemented in <a href="https://nodejs.org">Node.js</a>/<a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript">JavaScript</a>.</p><p>Currently, the model checker runs in a single process and thread, with all states kept in memory. Distributed model checking is planned, and I expect to start on it in a month. It will be exciting to make this the first project within FizzBee to use FizzBee itself to model its distributed design.</p><p><em><strong>C.R.: Did you consider writing a transpiler to convert FizzBee code to TLA+ so that you could leverage TLA+&#8217;s battle-tested <a href="https://proofs.tlapl.us/doc/web/content/Home.html">proof system</a>? Or do you have any other thoughts about TLA+ interoperability?</strong></em></p><p>J.P.: Yes, I did consider writing a transpiler from FizzBee to TLA+, mainly to leverage its model checker. But I quickly realized that differences in syntax&#8212;like TLA+&#8217;s use of 1-indexed arrays&#8212;would make this impractical and error-prone. In fact, building a transpiler would have been more complex than just creating an interpreter directly.</p><p>Additionally, TLA+ wouldn&#8217;t support many of the features I wanted. For example, TLA+ tracks only state transitions, whereas FizzBee needs to capture more detailed information, like decisions made during those transitions. This is essential to supporting features like performance modeling and interactive visualizations.</p><p>As for TLAPS, while it&#8217;s an impressive proof system, very few people actually use it. Even among TLA+ users, I don&#8217;t see it gaining widespread adoption. 
When I talk to other developers, they&#8217;re more focused on verifying that the implementation works correctly, rather than proving the design is flawless. So, a proof system isn&#8217;t a priority for FizzBee. If we ever do build one, it would more likely involve direct use of solvers, rather than relying on TLAPS.</p><p><em><strong>C.R.: Given that you&#8217;re using Starlark, do you see any opportunity to somehow integrate the code in the design with the code that eventually gets shipped?</strong></em></p><p>J.P.: That&#8217;s an interesting idea. I haven&#8217;t looked into it deeply yet, but I&#8217;m generally cautious about code generation systems where developers need to tweak the generated code. The gap between high-level models and actual implementation can be quite large, so keeping everything in sync is challenging.</p><p>However, embedding Starlark code snippets directly into the implementation could be feasible in some cases. I&#8217;d need to think through it more to identify where it might be practical. It&#8217;s definitely something worth exploring further.</p><p><em><strong>C.R.: Let&#8217;s look to the future a little bit. Paint me a picture of what success looks like for FizzBee. You mentioned a distributed model checker. What other tools and integrations do you want to implement, and where do you see adoption coming from?</strong></em></p><p>J.P.: FizzBee&#8217;s goal is to help teams ship high-quality software faster. By making formal methods practical and easier to use, we're addressing the stagnation in system design validation. Our recent release of interactive visualizations, for example, enhances communication of designs within teams. Ultimately, we aim to directly link these improvements to faster software delivery. Ensuring that the implementation matches the design is the critical missing link.</p><p>That&#8217;s our current focus. 
With FizzBee, from the model and its explored states and transitions, we can automatically generate comprehensive tests before any code is written. This approach delivers the benefits of Test-Driven Development without extra effort. We&#8217;re also exploring lightweight deterministic simulations to identify specific types of bugs. This would expand FizzBee to continuous builds.</p><p>As AI-assisted coding becomes more prevalent, validating that the code performs as intended will be essential for AI adoption. FizzBee will be instrumental in building that confidence.</p><p><em><strong>C.R.: The AI angle is interesting; I hadn&#8217;t considered that. How do you envision FizzBee working with AI-generated code? If an LLM generates code, how would a developer validate it using FizzBee? Have you thought at all about using an LLM to generate code from a FizzBee model?</strong></em></p><p>J.P.: That&#8217;s a great question. When it comes to validating AI-generated code, FizzBee can provide value in two key ways: model-based testing and model-based monitoring.</p><p>First, with model-based testing, once a FizzBee model is built and validated, we can automatically generate comprehensive tests based on the model&#8217;s states and transitions. Whether the code is written by humans or AI, these tests ensure that the implementation behaves as expected, providing strong validation before deployment.</p><p>Second, FizzBee can enable model-based monitoring for post-production validation. FizzBee could monitor real-world behavior by analyzing databases, data lakes, logs, and other systems to detect anomalies. This would require developers to create a projection function that maps the runtime state of the application back to the model&#8217;s state. While this doesn&#8217;t prevent bugs from reaching production, it helps mitigate their impact by identifying them early on.</p><p>As for using an LLM to generate code from a FizzBee model, that&#8217;s trickier. 
FizzBee focuses on modeling complex, distributed systems logic, but a full application often includes a lot of business-specific logic that isn&#8217;t captured by the model. So, generating complete production code directly from a FizzBee model wouldn&#8217;t work&#8212;though it could generate basic scaffolding. We could introduce auxiliary structures to capture the additional requirements, but that&#8217;s beyond FizzBee&#8217;s current focus.</p><p>LLMs could, however, assist in other ways, such as helping generate FizzBee specs from natural language descriptions or enhancing FizzBee&#8217;s review platform by answering "what if" questions about system behavior.</p><p>LLMs have made tremendous progress in recent years, but for full code generation they still need to handle intricate logic more reliably. I&#8217;m optimistic that future breakthroughs, perhaps in areas like reinforcement learning combined with LLMs, will bring us closer to that.</p><p><em><strong>C.R.: Thank you so much for taking the time to talk with me. It&#8217;s been a really thought-provoking conversation. There&#8217;s so much more to cover, but in the interest of time, let&#8217;s call it here. I&#8217;ll give you the final word. Anything you&#8217;d like to share?</strong></em></p><p>J.P.: Thanks, Chris, I really enjoyed our conversation! I&#8217;m excited about the future of software development and the potential of formal methods.</p><p>We&#8217;re currently collaborating with early adopters and design partners through hands-on proof-of-concept projects. If you&#8217;re working on distributed systems and want to see how FizzBee can improve your development process, feel free to reach out at <a href="mailto:jp@fizzbee.io">jp@fizzbee.io</a>. 
You can also explore more at <a href="http://fizzbee.io/">fizzbee.io</a> or check out our GitHub repo at <a href="https://github.com/fizzbee-io/fizzbee">github.com/fizzbee-io/fizzbee</a>.</p>]]></content:encoded></item><item><title><![CDATA[Modular Monoliths Are a Good Idea, Actually]]></title><description><![CDATA[Microservices aren't the only way to get high cohesion and low coupling.]]></description><link>https://materializedview.io/p/modular-monoliths-are-a-good-idea</link><guid isPermaLink="false">https://materializedview.io/p/modular-monoliths-are-a-good-idea</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Fri, 13 Sep 2024 19:07:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0625bf29-765b-4c61-9cee-66a16dc7d828_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>I&#8217;ve been doing personal angel investing for several years. I&#8217;m excited to announce that I&#8217;ve launched <a href="https://materializedview.capital/">Materialized View Capital (MVC)</a>. MVC is a micro VC fund where I&#8217;ll continue investing in early stage infrastructure startups. I&#8217;ll also continue tagging any companies that I mention on my newsletter with a &#65129; if I&#8217;ve invested in them. Thanks for all your support!</em></p><p><em>In other news, my <a href="https://www.prefect.io/events/prefect-summit-2024">Prefect Summit 2024</a> keynote is up. 
Check out <a href="https://www.youtube.com/watch?v=8X0_RymDOHY">4 infrastructure trends in 20 minutes</a> to learn about primary persistence on object storage, composable databases, PostgreSQL's renaissance, and durable execution.</em></p><div id="youtube2-8X0_RymDOHY" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;8X0_RymDOHY&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/8X0_RymDOHY?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div><hr></div><p>It&#8217;s a story as old as time. A tech startup is born. Early engineers work night and day to build a product that customers want. They iterate furiously&#8212;adding new features and repurposing old code. No time to refactor; they need revenue. And then, if they&#8217;re lucky, miraculously, the startup gets customers. Product market fit is achieved, and it&#8217;s time to put the pedal to the metal. More customers, more features, more scale, and more engineers.</p><p>Somewhere along the way, the codebase goes from 100,000 lines to 10,000,000. The application that the early engineers built&#8212;a monolithic application in a single repository&#8212;is now a house of cards. Every change breaks something. It&#8217;s taking longer to build, test, and deploy. Even checking the code out is cause for a coffee break.</p><p>And then the FAANG engineers arrive. What are you even doing, they ask. You need scale, you need isolation, you need to decouple. You have a fever, and the only prescription is more microservices. And so the &#8220;escape the monolith&#8221; death march begins.</p><p>Scaling a monolith is hard, no doubt. 
To date, the only real tool we&#8217;ve had in our toolbox has been to switch to a service-oriented architecture. Services can be built and deployed independently. This isolation keeps build times short and tests passing, and makes deployment easier.</p><p>At least that&#8217;s the pitch. In practice, microservices can be just as tough to wrangle as monoliths. Services get tightly coupled; deploying a change in one can break another. Deploying thousands of services independently requires an immense amount of tooling. Developers now need to spin up dozens of services to test locally, or manage their own cloud environment. Remote procedure call (RPC) frameworks need tooling to enforce compatible schema changes. Operations has to wrangle service meshes, distributed tracing, Kubernetes, Terraform, and so much more.</p><p>I&#8217;ve lived this story twice. First as an understudy at LinkedIn, then as an instigator at WePay. Both projects were brutal. At LinkedIn, we had to halt much of engineering for several months while we tore our monolith apart. Kevin Scott, our SVP of engineering, published a great write-up about this in <a href="https://www.linkedin.com/pulse/when-your-tech-debt-comes-due-kevin-scott/">When Your Tech Debt Comes Due</a>. At WePay, the monolith migration project never ended. I wrote WePay&#8217;s second microservice when I joined in 2015. When I left in 2021, the monolith still housed much of WePay&#8217;s core logic. I have microservice fatigue.</p><p>Yet, I don&#8217;t see a lot of alternative solutions being offered. So I&#8217;m pleased to see the <a href="https://www.milanjovanovic.tech/blog/what-is-a-modular-monolith">modular monolith</a> trend growing. The idea is to build a monolithic application as a <a href="https://www.thoughtworks.com/en-us/insights/blog/microservices/modular-monolith-better-way-build-software">series of modules</a>, each responsible for a portion of business logic. At first blush, this sounds silly. 
Isn&#8217;t this just good engineering? I was skeptical.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ACGe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b8c13b3-70c4-42de-ae0d-db293a8773b2_689x195.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ACGe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b8c13b3-70c4-42de-ae0d-db293a8773b2_689x195.png 424w, https://substackcdn.com/image/fetch/$s_!ACGe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b8c13b3-70c4-42de-ae0d-db293a8773b2_689x195.png 848w, https://substackcdn.com/image/fetch/$s_!ACGe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b8c13b3-70c4-42de-ae0d-db293a8773b2_689x195.png 1272w, https://substackcdn.com/image/fetch/$s_!ACGe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b8c13b3-70c4-42de-ae0d-db293a8773b2_689x195.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ACGe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b8c13b3-70c4-42de-ae0d-db293a8773b2_689x195.png" width="689" height="195" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b8c13b3-70c4-42de-ae0d-db293a8773b2_689x195.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:195,&quot;width&quot;:689,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37154,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ACGe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b8c13b3-70c4-42de-ae0d-db293a8773b2_689x195.png 424w, https://substackcdn.com/image/fetch/$s_!ACGe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b8c13b3-70c4-42de-ae0d-db293a8773b2_689x195.png 848w, https://substackcdn.com/image/fetch/$s_!ACGe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b8c13b3-70c4-42de-ae0d-db293a8773b2_689x195.png 1272w, https://substackcdn.com/image/fetch/$s_!ACGe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b8c13b3-70c4-42de-ae0d-db293a8773b2_689x195.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>But the more I think about modular monoliths, the more excited I get. Yes, writing modular code is a good idea, and yes it&#8217;s obvious. Yet, we&#8217;re not doing it. Our monoliths always end up as spaghetti code. Rather than fixing the monolithic code, we&#8217;ve jumped straight to microservices. Why? 
I think the answer is tooling.</p><p>What might modular monolith tooling look like? Let&#8217;s start with the benefits that microservices confer: they can be built, tested, and deployed independently; they have isolated databases; and they have clear public APIs. To get similar characteristics from a monolith, developers need:</p><ul><li><p>Incremental build systems</p></li><li><p>Incremental testing frameworks</p></li><li><p>Branch management tooling</p></li><li><p>Code isolation enforcement</p></li><li><p>Database isolation enforcement</p></li></ul><p>Incremental build systems speed up monolithic build times. Rather than rebuilding an entire application, only the portions that change are rebuilt. Similarly, incremental testing allows developers and continuous integration (CI) systems to run tests only for the portion of the monolith that&#8217;s changed (including upstream and downstream dependencies). <a href="https://bazel.build/">Bazel</a> is doing a lot of work in this area.</p><p>Since all developers are committing to the same codebase, branch management is also important. There are many options here: <a href="https://docs.github.com/en/get-started/using-github/github-flow">GitHub Flow</a>, <a href="https://about.gitlab.com/topics/version-control/what-is-gitlab-flow/">GitLab Flow</a>, <a href="https://trunkbaseddevelopment.com/">Trunk-Based Development</a>, and so on.</p><p>By extension, CI tooling is important. If a developer breaks the monolith, they&#8217;ve broken it for everyone. Breaking builds must be quickly detected and fixed. Tools to predict whether a change is risky, to detect which change in a batch of commits broke the build, and to manage reverting or fixing forward are all required. Unlike incremental build and test, I find many teams are rolling their own scripts to manage such activities.</p><p>Changes that introduce new cross-module dependencies must be detected. Calls to non-public interfaces must be actively rejected. 
And code owners must be notified when other modules start depending on theirs. Code ownership files that define owners and approvers for each module must be added, and tooling built around them. <a href="https://gauge.sh">Gauge</a>&#65129; is doing interesting work here with their open source dependency management tool, <a href="https://github.com/gauge-sh/tach/">Tach</a>.</p><p>Finally, monolithic databases need to be broken up. This is often the biggest chunk of work for any monolith migration. All parts of the codebase share a single ORM and assume they&#8217;re interacting with a single transactional database. Transaction boundaries must be defined and separated between modules. Tables must be grouped by module and isolated from other modules. Tables must then be migrated to separate schemas either on the same database or a separate one. I am not aware of any tools that help detect such boundaries and enforce isolation right now.</p><p>For monoliths just starting out, it would be great if full stack frameworks like <a href="https://nextjs.org/">Next.js</a>, <a href="https://redwoodjs.com/">Redwood.js</a>, <a href="https://rubyonrails.org/">Rails</a>, and others began to adopt modular concepts in their codebase. This would help developers writing new software to do the right thing from the get-go.</p><p>Best of all, I don&#8217;t see the modular monolith vs. microservices as an either-or choice. I see it as a stepping stone that can extend a monolith&#8217;s life. For some, the modular monolith might be all that&#8217;s needed. For others, it can provide a more natural transition to a service-based architecture. Modular monolith tooling could even facilitate such a migration. And I suspect the result of a migration would be a well-maintained monolith with 7 &#177; 2 services. I find such an architecture far more appealing than 1000s of services. We&#8217;ve spent decades building tooling for microservices. 
It&#8217;s time to give monoliths the same respect.</p><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" 
href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[Search on PostgreSQL, Building Extensions, and pg_analytics with Philippe Noël]]></title><description><![CDATA[Philippe No&#235;l is CEO and co-founder of ParadeDB. In this post, Philippe and I discuss ParadeDB, the experience of building as a PostgreSQL extension, pg_duckdb, pg_lakehouse, and more ...]]></description><link>https://materializedview.io/p/search-on-postgresql-building-extensions</link><guid isPermaLink="false">https://materializedview.io/p/search-on-postgresql-building-extensions</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Tue, 03 Sep 2024 10:18:55 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/96fc519e-1523-47c8-9fb9-ac33cf61b740_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://www.linkedin.com/in/philippemnoel/">Philippe No&#235;l</a> is the CEO and co-founder of <a href="https://www.paradedb.com/">ParadeDB</a>. ParadeDB is a collection of extensions to make <a href="https://www.postgresql.org/">PostgreSQL</a> work for search and analytics use cases. Their first extension, <a href="https://www.paradedb.com/blog/introducing_search">pg_search</a>, added <a href="https://en.wikipedia.org/wiki/Okapi_BM25">BM25</a> for relevance scoring (the standard in the industry). 
They followed that work with the hit extensions <a href="https://github.com/paradedb/pg_analytics">pg_analytics</a> and <a href="https://github.com/paradedb/paradedb/tree/dev/pg_lakehouse">pg_lakehouse</a>, which add support for columnar queries and data lake integration.</p><p>Before ParadeDB, No&#235;l was the co-founder of <a href="https://github.com/whisthq">Whist</a>, a cloud-hybrid browser that accelerated web applications.</p><div><hr></div><p><em><strong>C.R.: You and I have been talking for the past year, give or take. I'm excited to finally have you on Materialized View. Let's start with the basics. What is ParadeDB and how did it come to exist?</strong></em></p><p>P.N.: Hey Chris! I've been a reader since day 1 and am excited to be a part of it.</p><p><a href="https://www.paradedb.com/">ParadeDB</a> is an <a href="https://www.elastic.co/elasticsearch">Elasticsearch</a> alternative for <a href="https://www.postgresql.org/">Postgres</a>. We build the core feature set of Elasticsearch (the data store), full-text search and fast analytics, in Postgres via Postgres extensions. The idea is that with ParadeDB, Postgres users can avoid needing to ETL data to Elasticsearch or similar tools in order to power user-facing search and analytics, and can instead stay within Postgres and keep their infrastructure simple. ParadeDB is an open-source project and is compatible with any existing Postgres deployment, including on managed Postgres services like AWS <a href="https://aws.amazon.com/rds/">RDS</a>, etc.</p><p>We came up with the idea while operating Postgres+Elasticsearch ourselves as part of a previous project, Retake. We were frustrated with needing to maintain an Elasticsearch cluster and an ETL pipeline, and by the lack of real-time and transactional guarantees in our search workload. We then went around asking our friends and fellow <a href="https://www.ycombinator.com/">YC</a> batchmates, and realized that this frustration was shared by a large number of them. 
Realizing the gap between people's love for Postgres and hate for Elastic convinced us to launch into this and thus ParadeDB was born.</p><p><em><strong>C.R.: The overlap between search and analytics is something I&#8217;ve written about recently in my 15 Years of Realtime OLAP series (<a href="https://materializedview.io/p/15-years-of-realtime-olap-part-1">part 1</a> and <a href="https://materializedview.io/p/15-years-of-realtime-olap-part-2">part 2</a>). These use cases are distinct, but they have remarkably similar infrastructure. Faceted search has always looked a lot like columnar aggregation to me, for example. And both have shifted from batch to realtime ingest.</strong></em></p><p><em><strong>How do you think about these use cases? ParadeDB has pg_search, pg_lakehouse, and pg_analytics extensions. You seem to be suggesting that they don&#8217;t warrant different infrastructure (i.e. Elastic), but you&#8217;ve implemented different extensions for each.</strong></em></p><p>P.N.: We think of these use cases from the perspective of the customer. While it is true that they are distinct, they are fundamentally the same: enable users to filter data to derive insights/take actions. Every major customer who comes to us for full-text search is also interested in faceted search, which as you point out is very similar to columnar aggregations. In fact, our faceted search uses a custom columnar implementation built by <a href="https://github.com/quickwit-oss/tantivy">Tantivy</a>.</p><p>These customers typically build products that contain both user-facing search and live dashboards. 
You can get very far powering real-time dashboards with faceted search, but sometimes you need to fetch data that's stored in object storage, or you need to perform aggregations over JOINed tables, which inverted indexes aren't as optimal for.</p><p>We believe customers should pick a product by thinking about their business need(s) rather than about the infrastructure of the product itself. ParadeDB <a href="https://github.com/Casecommons/pg_search">pg_search</a> delivers <a href="https://en.wikipedia.org/wiki/Okapi_BM25">BM25</a> full-text search and faceted search in Postgres. We've also built <a href="https://github.com/paradedb/pg_analytics">pg_analytics</a> to offer on-disk columnar analytics in Postgres, and finally <a href="https://blog.paradedb.com/pages/introducing_lakehouse">pg_lakehouse</a> to offer analytics in Postgres over data in object storage like AWS <a href="https://aws.amazon.com/s3/">S3</a>. We're now combining pg_lakehouse and pg_analytics into a single extension, pg_analytics, which will offer both capabilities to streamline development and adoption.</p><p>The use case is fundamentally the same: Customers want to offer user-facing search and analytics. They build their backend on top of Postgres, and ParadeDB enables them to offer these features on top of Postgres, no matter where their data lives.</p><p><em><strong>C.R.: That last comment&#8212;building on PostgreSQL&#8212;seems to be the real differentiator for ParadeDB. PostgreSQL has been experiencing a real renaissance lately. Why do you think that is, and how has your experience been working with it?</strong></em></p><p>P.N.: In my view Postgres isn't having a renaissance but is rather "coming of age". The last 30 years of smart design decisions have slowly made it the best open-source RDBMS. 
It's extremely reliable and very extensible, which is a really big deal.</p><p>The development of Postgres is open-source (not backed by any single company) and moves slowly. Maintainers are very opinionated and getting any patch in core is a serious task. This is good, as it guarantees quality. However, it means development is slow and Postgres could be left behind by faster-moving database projects. pgvector was able to be developed very quickly as it sits outside of core, as an extension. Extensions enable Postgres to stay current with innovations in the data space while also maintaining extreme care to keep core as small and robust as possible. Every now and then, an extension becomes so ubiquitous that it&#8217;s merged into core, like pg_stat_statements. The clear separation of work granted by extensions allows Postgres to uniquely accomplish the best of both worlds: Stay relevant while being rock-solid.</p><p>Working with Postgres has been wonderful.&nbsp;The extension ecosystem and API are very mature. Thanks to <a href="https://github.com/eeeebbbbrrrr">Eric Ridge</a> (<a href="https://www.zombodb.com/">ZomboDB</a>)'s <a href="https://github.com/pgcentralfoundation/pgrx">pgrx</a> project, we can build our extensions in <a href="https://www.rust-lang.org/">Rust</a> and benefit from the entire Rust open-source data community, without which ParadeDB couldn't exist. Oftentimes we find ourselves reading Postgres source and mailing lists, which is dense but the quality of code in core Postgres is quite amazing.&nbsp;Hacking on Postgres is really hard. <a href="https://www.linkedin.com/in/robertmhaas/">Robert Haas</a> wrote <a href="https://rhaas.blogspot.com/2024/05/hacking-on-postgresql-is-really-hard.html">a big post</a> about it, for the curious. It's hard to find developers who know Postgres core, though. We didn't before we started. 
We're hiring, so if a reader is looking to get into Postgres internals <a href="https://join.slack.com/t/paradedbcommunity/shared_invite/zt-2lkzdsetw-OiIgbyFeiibd1DG~6wFgTQ">come say hi</a>!</p><p><em><strong>C.R.: You mentioned that you&#8217;re combining pg_analytics and pg_lakehouse. <a href="https://www.hydra.so/">Hydra</a> and <a href="https://motherduck.com/">MotherDuck</a> <a href="https://motherduck.com/blog/pg_duckdb-postgresql-extension-for-duckdb-motherduck/">recently announced</a> <a href="https://github.com/duckdb/pg_duckdb">pg_duckdb</a> alongside <a href="https://duckdblabs.com/">DuckDB Labs</a>, <a href="https://neon.tech/">Neon</a>, and <a href="https://www.microsoft.com/">Microsoft</a>. The Postgres analytics ecosystem seems really hot and really fast changing right now.</strong></em></p><p><em><strong>How is pg_duckdb different from what ParadeDB is doing, why are you combining your two analytics extensions, and how do you see things shaking out in this space?</strong></em></p><p>P.N.: Analytics in Postgres should be a single extension, whether it's pulling data from object storage or from Postgres tables. That is the reason we&#8217;re consolidating pg_analytics and pg_lakehouse.</p><p>pg_duckdb is a cool project! I'm very excited to see it come to life. I had previously talked with DuckDB Labs and MotherDuck about donating our analytics work to DuckDB Labs to be the foundation of pg_duckdb, but others seem to have beaten us to it. I'm excited and cautiously optimistic it will flourish. If it does, it would be very good for everyone in Postgres, including for ParadeDB.</p><p>Building analytics in Postgres is really hard. Doing it properly is much harder than what ParadeDB could do today. My hope is that many companies work on pg_duckdb so it becomes a good foundation that we can rebase pg_analytics on top of. 
The vision is for many Postgres companies to build on top of pg_duckdb to easily create an analytics-in-Postgres extension for their platform.</p><p>For now, though, pg_duckdb is still very early. It also appears to be more DuckDB-centric than Postgres-centric, which concerns me. I am worried that MotherDuck, who seems to be the primary company driving this, does not care enough about Postgres to really make the project what it needs to be.&nbsp;Hopefully Neon gets involved, as they are one of the few players who have the Postgres expertise to really build it correctly. It's too early to tell, but we are following along.</p><p>On our side, ParadeDB is building an Elasticsearch alternative. We have plenty to do with search and already have a strong analytics offering with faceted search and our existing pg_analytics extension. If pg_duckdb evolves in the right direction, we'll be very eager to adopt it and to contribute to it. Fingers crossed!</p><p><em><strong>C.R.: On the Elasticsearch front, what challenges have you found in implementing search in Postgres, and what does the roadmap look like for pg_search?</strong></em></p><p>P.N.: ParadeDB pg_search is built on top of a <a href="https://lucene.apache.org/">Lucene</a>-inspired library called Tantivy. Tantivy stores data in its own file format. Because Tantivy uses its own format, we do not store data inside Postgres block storage (yet!). This means we have had to implement various native Postgres functionalities, like backup, WALs, etc., manually on top of this new storage. This is a lot of work, and we&#8217;re continuously improving upon it.</p><p>Today, pg_search is built via a multithreaded Postgres background writer and dynamic functions via the Postgres `CALL` API. It's quite uncommon, but very elegant. 
This enables us to play inline with Postgres as much as possible and has made our work integrating Tantivy much cleaner.</p><p>Roadmap-wise, we're working on features like:</p><ul><li><p>BM25 indexes over partitioned tables</p></li><li><p>BM25 indexes over JOINed tables (highly requested, but very difficult)</p></li><li><p>Full transaction isolation for faceted search</p></li></ul><p>Eventually, we're also planning to integrate with <a href="https://www.citusdata.com/">Citus</a> to offer a horizontally-scalable version of ParadeDB.</p><p><em><strong>C.R.: Partnering with <a href="https://www.citusdata.com/">Citus</a> is an interesting twist. ParadeDB&#8217;s extensions are licensed under the <a href="https://en.wikipedia.org/wiki/GNU_Affero_General_Public_License">AGPL</a>. I know <a href="https://blog.paradedb.com/pages/agpl">you&#8217;ve written about your license choice</a> in the past. You also don&#8217;t have a cloud offering right now. What&#8217;s your strategy for bringing ParadeDB to market?</strong></em></p><p>P.N.: Today, we sell commercial licenses of ParadeDB containing a few enterprise-relevant features on top of our open-source offerings. Our customers self-host ParadeDB clusters from those licenses.</p><p>We're working on a bring-your-own-cloud (BYOC) solution to make this process smoother, and are going to announce partnerships with Postgres platform providers to power a ParadeDB Cloud.</p><p><em><strong>C.R.: Wonderful. There&#8217;s plenty more to say, but I&#8217;ll save that for a future interview! Let&#8217;s call it here. Any parting thoughts?</strong></em></p><p>P.N.: For anyone interested in our project, you can find (and star, hihi) our repo here: <a href="https://github.com/paradedb/paradedb">https://github.com/paradedb/paradedb</a>. We welcome community contributions and are hiring for an early engineer role. If working on search and analytics in Postgres interests you, shoot me a note.</p><p>Thanks for having me, Chris! 
&#128578;</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://materializedview.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://materializedview.io/subscribe?"><span>Subscribe now</span></a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://materializedview.io/p/search-on-postgresql-building-extensions?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://materializedview.io/p/search-on-postgresql-building-extensions?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, 
https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, 
https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> for a complete list.</p>]]></content:encoded></item><item><title><![CDATA[Unpacking the Buzz around ClickHouse]]></title><description><![CDATA[A look at the excitement around ClickHouse. 
I break down what makes it great, and look at the challenges ahead.]]></description><link>https://materializedview.io/p/unpacking-the-buzz-around-clickhouse</link><guid isPermaLink="false">https://materializedview.io/p/unpacking-the-buzz-around-clickhouse</guid><dc:creator><![CDATA[Chris]]></dc:creator><pubDate>Thu, 29 Aug 2024 19:58:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6be7c97e-357e-4b01-bc15-ee0c0dd5cca1_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>I have some exciting news! I&#8217;m helping <a href="https://www.linkedin.com/in/martinkleppmann">Martin Kleppmann</a> with the second edition of his popular book, <a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/">Designing Data-Intensive Applications</a>. Martin and I first met at LinkedIn, where we worked together on <a href="https://samza.apache.org/">Apache Samza</a>. I&#8217;m very excited to contribute in a small way to such an important and popular book. An early release of the first three chapters is now available to <a href="https://www.oreilly.com/online-learning/">O&#8217;Reilly Learning</a> subscribers, with more to come. See <a href="https://bsky.app/profile/martin.kleppmann.com/post/3l2rulyljdk2y">Martin&#8217;s post</a> for more details.</em></p><p><em>I also had a very pleasant conversation in the inaugural episode of <a href="https://techontherocks.show/">Tech on the Rocks</a>. We discussed stream processing, <a href="https://www.rust-lang.org/">Rust</a>, <a href="https://slatedb.io/">SlateDB</a>, and more. 
Give it a listen here:</em></p><div class="apple-podcast-container" data-component-name="ApplePodcastToDom"><iframe class="apple-podcast " data-attrs="{&quot;url&quot;:&quot;https://embed.podcasts.apple.com/us/podcast/stream-processing-lsms-and-leaky-abstractions-with/id1763670562?i=1000666321439&quot;,&quot;isEpisode&quot;:true,&quot;imageUrl&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/podcast-episode_1000666321439.jpg&quot;,&quot;title&quot;:&quot;Stream processing, LSMs and leaky abstractions with Chris Riccomini&quot;,&quot;podcastTitle&quot;:&quot;Tech on the Rocks&quot;,&quot;podcastByline&quot;:&quot;&quot;,&quot;duration&quot;:3186000,&quot;numEpisodes&quot;:&quot;&quot;,&quot;targetUrl&quot;:&quot;https://podcasts.apple.com/us/podcast/stream-processing-lsms-and-leaky-abstractions-with/id1763670562?i=1000666321439&amp;uo=4&quot;,&quot;releaseDate&quot;:&quot;2024-08-23T05:28:56Z&quot;}" src="https://embed.podcasts.apple.com/us/podcast/stream-processing-lsms-and-leaky-abstractions-with/id1763670562?i=1000666321439" frameborder="0" allow="autoplay *; encrypted-media *;" allowfullscreen="true"></iframe></div><div><hr></div><p>My two previous posts, 15 Years of Realtime OLAP (<a href="https://materializedview.io/p/15-years-of-realtime-olap-part-1">part 1</a>, <a href="https://materializedview.io/p/15-years-of-realtime-olap-part-2">part 2</a>), documented my experience with home-grown realtime OLAP systems, <a href="https://druid.apache.org/">Apache Druid</a>, and Apache Pinot. I also discussed the use cases I had for these systems: user facing product analytics and fraud detection. My intent was to lay the foundation for this post, where I investigate the buzz around <a href="https://clickhouse.com/">ClickHouse</a>.</p><p>ClickHouse has been appearing a lot in some of my recent interactions. <a href="https://www.tinybird.co/">Tinybird</a>, a user-facing analytics product, uses ClickHouse as its database. 
Several startups I&#8217;ve talked to are working on ClickHouse-related products. My Twitter feed has a lot of ClickHouse mentions, too.</p><p>What&#8217;s more, much of the feedback about ClickHouse appears to be quite positive, too. This caught my eye&#8212;we engineers tend to be a critical bunch. Moreover, when I looked at ClickHouse, I saw another realtime OLAP system like Druid or Pinot. Why all the attention?</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E6p2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2059d00-09b7-4da4-9ff1-a52810f3fa01_692x190.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E6p2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2059d00-09b7-4da4-9ff1-a52810f3fa01_692x190.png 424w, https://substackcdn.com/image/fetch/$s_!E6p2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2059d00-09b7-4da4-9ff1-a52810f3fa01_692x190.png 848w, https://substackcdn.com/image/fetch/$s_!E6p2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2059d00-09b7-4da4-9ff1-a52810f3fa01_692x190.png 1272w, https://substackcdn.com/image/fetch/$s_!E6p2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2059d00-09b7-4da4-9ff1-a52810f3fa01_692x190.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E6p2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2059d00-09b7-4da4-9ff1-a52810f3fa01_692x190.png" width="692" 
height="190" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2059d00-09b7-4da4-9ff1-a52810f3fa01_692x190.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:190,&quot;width&quot;:692,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40009,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E6p2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2059d00-09b7-4da4-9ff1-a52810f3fa01_692x190.png 424w, https://substackcdn.com/image/fetch/$s_!E6p2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2059d00-09b7-4da4-9ff1-a52810f3fa01_692x190.png 848w, https://substackcdn.com/image/fetch/$s_!E6p2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2059d00-09b7-4da4-9ff1-a52810f3fa01_692x190.png 1272w, https://substackcdn.com/image/fetch/$s_!E6p2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2059d00-09b7-4da4-9ff1-a52810f3fa01_692x190.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://x.com/criccomini/status/1814090012352491741">View Post</a></figcaption></figure></div><p>The responses to this post were interesting. 
The feedback seems to boil down to three things: speed, ease of install, and ease of operations.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I_bz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58080b42-1995-40cd-a8ee-fe440381000f_689x96.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I_bz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58080b42-1995-40cd-a8ee-fe440381000f_689x96.png 424w, https://substackcdn.com/image/fetch/$s_!I_bz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58080b42-1995-40cd-a8ee-fe440381000f_689x96.png 848w, https://substackcdn.com/image/fetch/$s_!I_bz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58080b42-1995-40cd-a8ee-fe440381000f_689x96.png 1272w, https://substackcdn.com/image/fetch/$s_!I_bz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58080b42-1995-40cd-a8ee-fe440381000f_689x96.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I_bz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58080b42-1995-40cd-a8ee-fe440381000f_689x96.png" width="685" height="95.44267053701016" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58080b42-1995-40cd-a8ee-fe440381000f_689x96.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:96,&quot;width&quot;:689,&quot;resizeWidth&quot;:685,&quot;bytes&quot;:15975,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I_bz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58080b42-1995-40cd-a8ee-fe440381000f_689x96.png 424w, https://substackcdn.com/image/fetch/$s_!I_bz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58080b42-1995-40cd-a8ee-fe440381000f_689x96.png 848w, https://substackcdn.com/image/fetch/$s_!I_bz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58080b42-1995-40cd-a8ee-fe440381000f_689x96.png 1272w, https://substackcdn.com/image/fetch/$s_!I_bz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58080b42-1995-40cd-a8ee-fe440381000f_689x96.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Ovais Tariq, founder of TigrisData, <a href="https://x.com/ovaistariq/status/1814090188572041481">View Post</a></figcaption></figure></div><h2>Speed</h2><p>ClickHouse is indeed very fast. A friend told me, &#8220;They have the mentality of a team on a budget. Like they didn&#8217;t have 1000s of machines to throw at the problem. 
They had to make it work on what they had.&#8221; This might or might not be true&#8212;ClickHouse <a href="https://clickhouse.com/blog/introducing-click-house-inc">comes from Yandex</a>&#8212;but the software definitely has this vibe.</p><p>As proof, ClickHouse runs a <a href="https://benchmark.clickhouse.com/">benchmark project</a>. Database vendors may submit their databases to see how they fare. ClickHouse has been dominating its competition (at least until 6 months ago when <a href="https://umbra-db.com/">Umbra</a> <a href="https://benchmark.clickhouse.com/">showed up</a>). One can quibble over the workloads tested, but ClickHouse is clearly a very fast database.</p><h2>Installation</h2><p>Though speed is nice, it&#8217;s <a href="https://motherduck.com/blog/perf-is-not-enough/">not as important</a> as it used to be. Where ClickHouse really shines is its installation experience. Realtime OLAP systems are notoriously annoying to get running. Apache Druid and Apache Pinot both use bash scripts that spawn multiple local JVM-based services to get the system up and running. Either that or you&#8217;ve got to run Docker and use Helm charts, as is <a href="https://docs.starrocks.io/docs/quick_start/shared-nothing/">the case with StarRocks</a>.</p><p>ClickHouse, by contrast, installs with a single cURL command to <a href="https://clickhouse.com">clickhouse.com</a>. The server is smart enough that it recognizes the lack of user-agent in the HTTP request and automatically gives you a bash script to install a native binary for your host. 
It just works.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f7Xc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bd2512-c92f-4420-b280-729504045814_719x301.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f7Xc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bd2512-c92f-4420-b280-729504045814_719x301.png 424w, https://substackcdn.com/image/fetch/$s_!f7Xc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bd2512-c92f-4420-b280-729504045814_719x301.png 848w, https://substackcdn.com/image/fetch/$s_!f7Xc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bd2512-c92f-4420-b280-729504045814_719x301.png 1272w, https://substackcdn.com/image/fetch/$s_!f7Xc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bd2512-c92f-4420-b280-729504045814_719x301.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f7Xc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bd2512-c92f-4420-b280-729504045814_719x301.png" width="719" height="301" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26bd2512-c92f-4420-b280-729504045814_719x301.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:301,&quot;width&quot;:719,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42926,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f7Xc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bd2512-c92f-4420-b280-729504045814_719x301.png 424w, https://substackcdn.com/image/fetch/$s_!f7Xc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bd2512-c92f-4420-b280-729504045814_719x301.png 848w, https://substackcdn.com/image/fetch/$s_!f7Xc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bd2512-c92f-4420-b280-729504045814_719x301.png 1272w, https://substackcdn.com/image/fetch/$s_!f7Xc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26bd2512-c92f-4420-b280-729504045814_719x301.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://clickhouse.com/docs/en/install#quick-install">ClickHouse Quick Install</a></figcaption></figure></div><p>They&#8217;ve occupied a really nice spot between embedded OLAP systems like <a href="https://duckdb.org/">DuckDB</a> and distributed realtime OLAP systems like Apache Pinot and Apache Druid. It&#8217;s surprising to me that there aren&#8217;t more single-process realtime OLAP systems out there; it seems obvious in hindsight. ClickHouse seems unique in this regard, aside from recent <a href="https://www.postgresql.org/">PostgreSQL</a> OLAP developments (more on this later).</p><p>Another way of phrasing all this is that the developer experience (DX) is really nice. And a great developer experience leads to a lot of rave reviews on Twitter. I suspect this is where a lot of the buzz is coming from.</p><h2>Operations</h2><p>A great DX is nice and all, but does it scale and is it easy to operate? 
Here too, the feedback is positive, but more mixed. ClickHouse&#8217;s speed and efficiency mean it can scale up quite nicely&#8212;you can continue to run it on one big machine for quite a while.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cIKQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69358cd5-847f-4622-8aa0-9acdfbca2268_689x135.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cIKQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69358cd5-847f-4622-8aa0-9acdfbca2268_689x135.png 424w, https://substackcdn.com/image/fetch/$s_!cIKQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69358cd5-847f-4622-8aa0-9acdfbca2268_689x135.png 848w, https://substackcdn.com/image/fetch/$s_!cIKQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69358cd5-847f-4622-8aa0-9acdfbca2268_689x135.png 1272w, https://substackcdn.com/image/fetch/$s_!cIKQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69358cd5-847f-4622-8aa0-9acdfbca2268_689x135.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cIKQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69358cd5-847f-4622-8aa0-9acdfbca2268_689x135.png" width="689" height="135" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69358cd5-847f-4622-8aa0-9acdfbca2268_689x135.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:135,&quot;width&quot;:689,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31496,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cIKQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69358cd5-847f-4622-8aa0-9acdfbca2268_689x135.png 424w, https://substackcdn.com/image/fetch/$s_!cIKQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69358cd5-847f-4622-8aa0-9acdfbca2268_689x135.png 848w, https://substackcdn.com/image/fetch/$s_!cIKQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69358cd5-847f-4622-8aa0-9acdfbca2268_689x135.png 1272w, https://substackcdn.com/image/fetch/$s_!cIKQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69358cd5-847f-4622-8aa0-9acdfbca2268_689x135.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Alasdair Brown, Product Marketer at Tinybird (<a href="https://x.com/sdairs/status/1814306357337567730">View Post</a>)</figcaption></figure></div><p>Once you&#8217;re ready to move beyond one machine, you&#8217;ll need to introduce another ClickHouse service: <a 
href="https://clickhouse.com/blog/clickhouse-keeper-a-zookeeper-alternative-written-in-cpp">ClickHouse Keeper</a>. Here, too, the developer experience is excellent. ClickHouse used to rely on ZooKeeper to coordinate its nodes in a distributed environment. Running ZooKeeper is tough, so ClickHouse wrote their own drop-in replacement, which they <a href="https://clickhouse.com/docs/en/guides/sre/keeper/clickhouse-keeper#how-to-run">bundle into their binary</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4l1t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276edddc-0402-4385-b892-a356493946f7_683x164.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4l1t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276edddc-0402-4385-b892-a356493946f7_683x164.png 424w, https://substackcdn.com/image/fetch/$s_!4l1t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276edddc-0402-4385-b892-a356493946f7_683x164.png 848w, https://substackcdn.com/image/fetch/$s_!4l1t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276edddc-0402-4385-b892-a356493946f7_683x164.png 1272w, https://substackcdn.com/image/fetch/$s_!4l1t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276edddc-0402-4385-b892-a356493946f7_683x164.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4l1t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276edddc-0402-4385-b892-a356493946f7_683x164.png" width="683" height="164" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/276edddc-0402-4385-b892-a356493946f7_683x164.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:164,&quot;width&quot;:683,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36743,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4l1t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276edddc-0402-4385-b892-a356493946f7_683x164.png 424w, https://substackcdn.com/image/fetch/$s_!4l1t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276edddc-0402-4385-b892-a356493946f7_683x164.png 848w, https://substackcdn.com/image/fetch/$s_!4l1t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276edddc-0402-4385-b892-a356493946f7_683x164.png 1272w, https://substackcdn.com/image/fetch/$s_!4l1t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276edddc-0402-4385-b892-a356493946f7_683x164.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The operational flexibility to scale up or out without adding a bunch of services is really valuable. 
And it has been run on some very large workloads. Tinybird has some customers <a href="https://www.tinybird.co/blog-posts/how-tinybird-scales">doing 300K-600K events per second</a>. <a href="https://uber.com/">Uber</a> adopted ClickHouse for their <a href="https://www.uber.com/blog/logging/">log analytics platform</a> (more on this later, too). And I assume Yandex&#8217;s usage is still fairly large.</p><p>Of course, there will always be operational challenges. <a href="https://www.linkedin.com/in/javisantana/">Javi Santana</a>, Tinybird&#8217;s co-founder, says it&#8217;s <a href="https://x.com/javisantana/status/1815426749049835943">&#8220;super hard to run at scale.&#8221;</a> I also noticed some discussion about <a href="https://x.com/themoah/status/1814513479547605136">disk usage and imbalances</a>. And its XML configuration files are clunky.</p><h2>Challenges</h2><p>As nice as ClickHouse appears to be, I see a few challenges. The first and most significant is cost. Remember Uber&#8217;s ClickHouse log system I mentioned above? They&#8217;re moving off ClickHouse. <a href="https://www.linkedin.com/in/yupeng-fu-593115b/">Yupeng Fu</a> presented an excellent talk at <a href="https://startree.ai/">StarTree&#8217;s</a> <a href="https://www.rtasummit.com/">RTA Summit 2024</a> called <a href="https://www.rtasummit.com/agenda/sessions/551916">Evolution of OLAP at Uber</a>. 
The talk discusses how Uber is replacing several pieces of infrastructure, including ClickHouse, with Apache Pinot.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-Fsj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b503da-09ca-4407-b578-e48b7c3205fe_2334x1323.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-Fsj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b503da-09ca-4407-b578-e48b7c3205fe_2334x1323.png 424w, https://substackcdn.com/image/fetch/$s_!-Fsj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b503da-09ca-4407-b578-e48b7c3205fe_2334x1323.png 848w, https://substackcdn.com/image/fetch/$s_!-Fsj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b503da-09ca-4407-b578-e48b7c3205fe_2334x1323.png 1272w, https://substackcdn.com/image/fetch/$s_!-Fsj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b503da-09ca-4407-b578-e48b7c3205fe_2334x1323.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-Fsj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b503da-09ca-4407-b578-e48b7c3205fe_2334x1323.png" width="1456" height="825" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49b503da-09ca-4407-b578-e48b7c3205fe_2334x1323.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:805500,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-Fsj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b503da-09ca-4407-b578-e48b7c3205fe_2334x1323.png 424w, https://substackcdn.com/image/fetch/$s_!-Fsj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b503da-09ca-4407-b578-e48b7c3205fe_2334x1323.png 848w, https://substackcdn.com/image/fetch/$s_!-Fsj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b503da-09ca-4407-b578-e48b7c3205fe_2334x1323.png 1272w, https://substackcdn.com/image/fetch/$s_!-Fsj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b503da-09ca-4407-b578-e48b7c3205fe_2334x1323.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Yupeng says that Uber&#8217;s <a href="https://www.uber.com/blog/logging/">log analytics platform</a> migration in 2020 resulted in 50% cost savings when compared to their previous ELK-based log analytics system. But ELK is very expensive to run on large datasets. A 50% gain isn&#8217;t really that much. Since the migration, the team has hit cost and performance challenges. Stories like these are somewhat alarming for large-scale enterprises.</p><p>Another more subtle (and perhaps more minor) challenge with ClickHouse is its behavior with materialized views. Materialized views are important for many realtime analytics use cases. By updating aggregates when a write occurs, reads become very fast. Entire systems like <a href="https://materialize.com/">Materialize</a> and <a href="https://www.feldera.com/">Feldera</a> are built around this concept. 
ClickHouse supports materialized views, but updates are only triggered when the &#8220;main&#8221; table&#8212;the first table in a join&#8212;is written to. For many queries, especially those without joins, this is perfectly acceptable. But for more sophisticated use cases, it simply isn&#8217;t good enough.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CZqk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e165275-2403-4f95-8092-205f0dafe027_698x272.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CZqk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e165275-2403-4f95-8092-205f0dafe027_698x272.png 424w, https://substackcdn.com/image/fetch/$s_!CZqk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e165275-2403-4f95-8092-205f0dafe027_698x272.png 848w, https://substackcdn.com/image/fetch/$s_!CZqk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e165275-2403-4f95-8092-205f0dafe027_698x272.png 1272w, https://substackcdn.com/image/fetch/$s_!CZqk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e165275-2403-4f95-8092-205f0dafe027_698x272.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CZqk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e165275-2403-4f95-8092-205f0dafe027_698x272.png" width="698" height="272" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e165275-2403-4f95-8092-205f0dafe027_698x272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:272,&quot;width&quot;:698,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74340,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CZqk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e165275-2403-4f95-8092-205f0dafe027_698x272.png 424w, https://substackcdn.com/image/fetch/$s_!CZqk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e165275-2403-4f95-8092-205f0dafe027_698x272.png 848w, https://substackcdn.com/image/fetch/$s_!CZqk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e165275-2403-4f95-8092-205f0dafe027_698x272.png 1272w, https://substackcdn.com/image/fetch/$s_!CZqk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e165275-2403-4f95-8092-205f0dafe027_698x272.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Frank McSherry, Co-founder and Chief Scientist at Materialize (<a href="https://x.com/frankmcsherry/status/1814363285573251314">View Post</a>)</figcaption></figure></div><p>And finally, the elephant in the room: PostgreSQL is becoming an OLAP system. <a href="https://hydra.so">Hydra</a> recently published <a href="https://github.com/duckdb/pg_duckdb">pg_duckdb</a> with backing from <a href="https://motherduck.com/">MotherDuck</a>, <a href="https://www.microsoft.com/">Microsoft</a>, <a href="https://neon.tech/">Neon</a>, and others. Hydra&#8217;s extension integrates DuckDB (an even <em>more</em> buzzy project than ClickHouse) with PostgreSQL.
And <a href="https://www.paradedb.com/">ParadeDB</a> has seen a lot of adoption with its <a href="https://www.paradedb.com/blog/introducing_lakehouse">pg_lakehouse</a>, <a href="https://github.com/paradedb/pg_analytics">pg_analytics</a>, and <a href="https://www.paradedb.com/blog/introducing_search">pg_search</a> PostgreSQL extensions.</p><p>As PostgreSQL&#8217;s OLAP extensions mature, PostgreSQL will be a great solution for the exact space where ClickHouse shines: single-node scale-up realtime OLAP with a great DX. If PostgreSQL takes ClickHouse&#8217;s single-node and small-scale usage, and systems like Pinot and Druid take its large-scale market, there&#8217;s not much left. This is the biggest long-term threat that I see for ClickHouse.</p><p>Still, as things stand now, ClickHouse is a robust system, and a reasonable solution for many use cases. I look forward to seeing how things shake out; I have a real soft spot for realtime analytics.</p><div><hr></div><h4>Book</h4><p>Support this newsletter by purchasing <a 
href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838">The Missing README: A Guide for the New Software Engineer</a> for yourself or gifting it to someone.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CI0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png" width="146" height="164.12413793103448" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de442631-41a6-4119-a99a-62957cd53edb_870x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:870,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CI0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 424w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 848w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1272w, https://substackcdn.com/image/fetch/$s_!CI0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde442631-41a6-4119-a99a-62957cd53edb_870x978.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838&quot;,&quot;text&quot;:&quot;Buy Now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" 
href="https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838"><span>Buy Now</span></a></p><div><hr></div><h4>Disclaimer</h4><p>I occasionally invest in infrastructure startups. Companies that I&#8217;ve invested in are marked with a &#65129; in this newsletter. See my <a href="https://www.linkedin.com/in/riccomini/">LinkedIn profile</a> for a complete list.</p>]]></content:encoded></item></channel></rss>