Durable Execution: Justifying the Bubble
From Temporal to an overflowing market, durable execution is having a moment. The space is too crowded and frameworks are hard to use. What needs to change?
There’s been a surge in durable execution frameworks over the past 6 to 12 months. Temporal has been the go-to for a while, but many new projects and companies are emerging. Let’s look at why, and what needs to change.
Durable execution explained
Temporal’s Building Reliable Distributed Systems in Node blog post does a decent job of defining durable execution:
Durable execution systems run our code in a way that persists each step the code takes. If the process or container running the code dies, the code automatically continues running in another process with all state intact, including call stack and local variables.
Essentially, this is workflow orchestration with transactional state management.
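To make that concrete, here’s a minimal sketch using Temporal’s Python SDK (the workflow, activities, and timeouts are my own illustrative stand-ins, not Temporal’s example code):

```python
# Minimal durable-execution sketch with Temporal's Python SDK. Each awaited
# activity is a persisted step: if the worker process dies mid-run, another
# worker replays the recorded history and resumes with local variables
# (like `rows` below) intact, without re-executing completed steps.
from datetime import timedelta
from temporalio import activity, workflow


@activity.defn
async def fetch_report_data(day: str) -> int:
    return 42  # stand-in for side-effecting work (API calls, DB writes)


@activity.defn
async def publish_report(day: str, row_count: int) -> None:
    pass  # stand-in for publishing the result somewhere durable


@workflow.defn
class DailyReport:
    @workflow.run
    async def run(self, day: str) -> None:
        rows = await workflow.execute_activity(
            fetch_report_data, day, start_to_close_timeout=timedelta(minutes=5)
        )
        # A crash here doesn't lose `rows`; replay restores the call stack.
        await workflow.execute_activity(
            publish_report,
            args=[day, rows],
            start_to_close_timeout=timedelta(minutes=5),
        )
```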
We needed durable execution at WePay, the last company I worked for. WePay did payment processing and had to safely move money between accounts.
A money movement request can be in dozens of different states (like the “pending” state you see when you pay at a gas pump). State changes sometimes happen over long periods of time (weeks or months) and systems sometimes attempt invalid state changes. And state changes must happen transactionally.
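A toy version of such a state machine might look like this (hypothetical states and transitions, not WePay’s actual model):

```python
# Hypothetical payment state machine: only whitelisted transitions are
# legal, and in a real system each transition would be applied in the same
# transaction as the business data it affects.
from enum import Enum


class PaymentState(Enum):
    CREATED = "created"
    PENDING = "pending"  # e.g., the gas-pump hold
    CAPTURED = "captured"
    FAILED = "failed"
    REFUNDED = "refunded"


VALID_TRANSITIONS = {
    PaymentState.CREATED: {PaymentState.PENDING, PaymentState.FAILED},
    PaymentState.PENDING: {PaymentState.CAPTURED, PaymentState.FAILED},
    PaymentState.CAPTURED: {PaymentState.REFUNDED},
    PaymentState.FAILED: set(),
    PaymentState.REFUNDED: set(),
}


def transition(current: PaymentState, target: PaymentState) -> PaymentState:
    # Systems sometimes attempt invalid state changes; reject them here.
    if target not in VALID_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

The hard part, and what durable execution frameworks take on, is persisting where you are in this machine across crashes, retries, and weeks-long waits.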
Payment processing is a very common durable execution use case. So much so that nearly every durable execution system uses payments (or shopping carts) as their canonical example. Here’s Temporal’s MoneyTransfer(…) example. And here’s Restate’s addTicketToCart. Oh, and here’s Orkes’s deposit_payment.
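The money-movement versions share roughly the same shape: withdraw, then deposit, and compensate if the deposit fails. Here’s a hedged sketch of that shape in Temporal’s Python SDK (illustrative names, not any project’s actual sample code):

```python
# The canonical money-transfer shape, sketched with Temporal's Python SDK
# (illustrative names, not any project's actual sample code): withdraw,
# then deposit, and compensate with a refund if the deposit fails.
from datetime import timedelta
from temporalio import activity, workflow


@activity.defn
async def withdraw(account: str, cents: int) -> None: ...

@activity.defn
async def deposit(account: str, cents: int) -> None: ...

@activity.defn
async def refund(account: str, cents: int) -> None: ...


@workflow.defn
class MoneyTransfer:
    @workflow.run
    async def run(self, src: str, dst: str, cents: int) -> None:
        opts = {"start_to_close_timeout": timedelta(seconds=30)}
        await workflow.execute_activity(withdraw, args=[src, cents], **opts)
        try:
            await workflow.execute_activity(deposit, args=[dst, cents], **opts)
        except Exception:
            # The withdraw already happened durably, so compensate rather
            # than leaving the money in limbo.
            await workflow.execute_activity(refund, args=[src, cents], **opts)
            raise
```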
The space is crowded
There are now many durable execution companies: Temporal, Restate, Orkes, LittleHorse, Convex, and Rama, to name just the ones that appear in this post. I’m sure I’ve missed many more (Azure Durable Functions?), to say nothing of the various open source projects.
The current market can’t sustain this many startups. For there to be even a few big winners, the market has to become much larger. And for the market to become larger—to expand beyond payments and shopping carts—new use cases must be added.
Durable execution can subsume many common tasks such as work queues, batch workflows, stream processing, ETL, and more. Temporal already showcases business transaction, business process, and infrastructure management use cases.
Yet durable execution frameworks are rarely an application developer’s first choice for such workloads.
Frameworks are hard to use
The adoption cost for durable execution is too high because the frameworks are too hard to use. Temporal is probably the pinnacle of usability at the moment, yet even its Python course is daunting. I don’t use these frameworks unless I have to.
Chris Gillum, the creator of Azure Durable Functions, summarizes many challenges in his post, Common Pitfalls with Durable Execution Frameworks, like Durable Functions or Temporal. To move beyond payments-style use cases, the issues Gillum identifies must be addressed.
The good news is there’s a lot of experimentation happening! I’ve already highlighted LittleHorse’s user tasks, which marry durable execution with traditional BPMN-style workflows; Camunda is coming from the opposite direction, from BPMN toward durable execution; Chris Gillum is working on durabletask-go; StealthRocket is playing with durable coroutines; and Rama is just… way, way out there.
Reconciling with stream processing
Stream processing is the biggest opportunity for experimentation. Many of the common pitfalls that Chris lists in his post are exactly the same problems that make stream processing hard: non-determinism, idempotency, at-least-once semantics, schema evolution, payload size, dead letter queues, and the list goes on. We dealt with the same problems when building Apache Samza (LinkedIn’s stream processor) ten years ago.
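To see the overlap, consider the classic at-least-once pitfall in plain Kafka consumer code, sketched here with the confluent-kafka Python client (topic, group, and handler names are made up):

```python
# At-least-once processing with the confluent-kafka Python client (topic,
# group, and handler are made up). Offsets are committed only after
# processing, so a crash between the two redelivers the message, and the
# handler must therefore be idempotent. Durable execution frameworks hit
# the same wall and solve it by deduplicating on workflow/step identity.
from confluent_kafka import Consumer


def handle_payment(payload: bytes) -> None:
    pass  # stand-in for real business logic


consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "payments-processor",
    "enable.auto.commit": False,  # commit manually, after processing
})
consumer.subscribe(["payment-requests"])

seen: set[bytes] = set()  # stand-in for a durable idempotency store

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    if msg.key() not in seen:  # dedupe: apply each key's effect at most once
        handle_payment(msg.value())
        seen.add(msg.key())
    # A crash before this commit means redelivery; the dedupe check above
    # is what turns at-least-once delivery into effectively-once effects.
    consumer.commit(message=msg)
```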
This week, Responsive’s [$] founder, Apurva Mehta, pointed out that stream processing can address many durable execution use cases. Maxim Fateev (Temporal’s CEO) pushed back.
I agree with Maxim, but stream processors could be used for these use cases with proper APIs and some Kafka improvements. Three such improvements are two-phase commit (KIP-939), Kafka queues (KIP-932), and optimistic locking on message keys (KAFKA-2260). We actually implemented the last one (locking) in Waltz, which we used to—you guessed it—build a durable execution framework for our payments state machine.
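Here’s a toy illustration of the optimistic locking idea (my own sketch, not Waltz’s or Kafka’s actual API): writers state the version of the key they read, and the log rejects the append if the key has since moved on.

```python
# Toy compare-and-append log illustrating optimistic locking on message
# keys (not Waltz's or Kafka's real API). Writers read a key's version,
# compute a state change, and append conditionally; a conflict means
# someone else won the race, so re-read and retry.
class ConflictError(Exception):
    pass


class OptimisticLog:
    def __init__(self) -> None:
        self.entries: list[tuple[str, int, object]] = []  # (key, version, value)
        self.versions: dict[str, int] = {}

    def append(self, key: str, expected_version: int, value: object) -> int:
        current = self.versions.get(key, 0)
        if current != expected_version:
            raise ConflictError(f"{key}: expected v{expected_version}, at v{current}")
        self.versions[key] = current + 1
        self.entries.append((key, current + 1, value))
        return current + 1
```

Layer the earlier payment state machine on top and you get transactional state changes: read the current state and its version, validate the transition, and append conditionally.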
Stream processing also offers solutions to some of the problems that Gillum lists. Versioning, in particular, is something that the Kafka community has spent 15 years thinking about. Gillum even mentions this in his post (emphasis added).
There are two primary approaches that I’ve seen for dealing with the code versioning challenge. One is to make the code aware of different versions by adding if/else checks against version numbers. *It’s not unlike putting schema version numbers in queue messages.* … Durable Functions instead proposes deploying code changes into a separate copy of the app… Doing so removes the problem of needing to be careful about code changes but places a burden on the developer to manage multiple versions of their apps running side-by-side.
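Temporal’s Python SDK ships a first-class version of that first approach, workflow.patched(); here’s a sketch (the patch name and fraud check step are made up):

```python
# Version-gating inside workflow code, per the first approach Gillum
# describes. Temporal's Python SDK exposes workflow.patched() for this;
# the patch name and fraud-check step here are made up.
from datetime import timedelta
from temporalio import activity, workflow


@activity.defn
async def withdraw(account: str, cents: int) -> None: ...

@activity.defn
async def fraud_check(account: str, cents: int) -> None: ...

@activity.defn
async def deposit(account: str, cents: int) -> None: ...


@workflow.defn
class Transfer:
    @workflow.run
    async def run(self, src: str, dst: str, cents: int) -> None:
        opts = {"start_to_close_timeout": timedelta(seconds=30)}
        await workflow.execute_activity(withdraw, args=[src, cents], **opts)
        # Histories recorded before this change replay the old path; new
        # executions take the new one. Same idea as schema version numbers
        # in queue messages.
        if workflow.patched("add-fraud-check"):
            await workflow.execute_activity(fraud_check, args=[src, cents], **opts)
        await workflow.execute_activity(deposit, args=[dst, cents], **opts)
```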
There’s an opportunity for collaboration and convergence here.
Restate, LittleHorse, Convex, and Rama are companies to look at in this space. They bridge stream processing, serverless functions, and transactional state management. These new frameworks often use a log like Kafka to store state.
Restate’s core is a distributed Raft log of events and commands, with indexes, the ability to compute over the log (to actualize commands), and the ability to invoke functions/handlers based on events.
Convex does the same. The first FAQ question in Restate’s Why We Built Restate post is literally, “Well, isn’t this just like Kafka plus Temporal?”, and their follow-on blog post is Restate + Kafka. LittleHorse is perhaps the best example; it actually is built on top of Kafka and Kafka Streams. A log-based architecture is standard for durable execution frameworks. Merging compute paradigms and APIs is a logical next step.¹
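The shared shape is easy to caricature. Here’s a toy, not Restate’s or LittleHorse’s actual design: commands go onto a durable log, a processor consumes them and invokes handlers, and the resulting events go back onto the log.

```python
# Toy caricature of a log-based durable execution core (not Restate's or
# LittleHorse's actual design). Commands and events both live in the log,
# so a crashed processor restarts from its last committed offset and
# catches up; all state is recoverable by replaying the log.
from typing import Callable

log: list[dict] = []  # stand-in for Kafka / a Raft log
handlers: dict[str, Callable[[dict], dict]] = {}


def handler(command_type: str):
    def register(fn: Callable[[dict], dict]):
        handlers[command_type] = fn
        return fn
    return register


@handler("debit")
def debit(cmd: dict) -> dict:
    return {"type": "debited", "account": cmd["account"], "cents": cmd["cents"]}


def process(offset: int) -> int:
    # Consume commands from `offset`, invoke handlers, append events.
    while offset < len(log):
        entry = log[offset]
        if entry["kind"] == "command":
            event = handlers[entry["type"]](entry)
            log.append({"kind": "event", **event})
        offset += 1
    return offset  # the caller durably commits this offset
```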
Moving forward
Right now, we have a collection of first-generation durable execution and stream processing frameworks. No one has really nailed the hard parts, and the APIs are clunky. That’s not a dig on any of these teams; these problems are really, really tough. I have the scars to prove it.
My hope is that second-generation frameworks will unify stream processing and durable execution with a more user-friendly API. Such a product would greatly expand the total addressable market (TAM) for durable execution (or stream processing, for that matter) and justify the current durable execution bubble.
I occasionally invest in infrastructure startups. Companies that I’ve invested in are marked with a [$] in this newsletter. See my LinkedIn profile for a complete list.
¹ Andreessen Horowitz writes about durable execution in The Modern Transactional Stack. The authors coin the term application logic transactional platform (ALTP). While the post is an excellent overview, I think its framing is wrong. Where a16z sees a database, I see a write-ahead log. And when you have a write-ahead log, stream processing (or event-driven serverless functions) is the natural building block for computation. Thus it’s stream processing and durable execution (ALTPs) that should be reconciled, not workflow-centric and database-centric approaches, as they suggest.