Virtual Machines Are Getting Better

Unikernels, GPU checkpointing, and VM migration are going to reshape the cloud.

Oct 23, 2024

Virtual machines (VMs), containers, and serverless execution models have been around for a while. Until recently, these technologies have offered fairly generic features. Consequently, they work passably for many use cases, but work well for relatively few.

Microservices, serverless functions, AI, batch computation, stateful services, and other use cases all need special features to work well. Serverless functions need fast start times, AI and LLM workloads need GPUs, and many use cases could benefit from better state snapshotting technology. Yet serverless functions suffer from cold starts that take between 100ms and 1 second. Snapshots are slow and migrate only certain pieces of state, such as memory, while leaving other pieces, such as GPU state or network addresses, behind. And GPU support in these execution models is spotty at best.

VM snapshots are useful when recovering from a failure or migrating execution to a new machine. Moving your workload to a different machine (or cloud) could save money, unlock better GPUs, or speed up training and inference. Cedana﹩, Modal﹩, and Microsoft have been working on this problem for some time.

This is why I’m excited to see a spate of recent developments that target cold start, snapshot, and GPU requirements.

GPU features are evolving quickly. gVisor added GPU support last year, NVIDIA recently open-sourced its cuda-checkpoint tool, and the Firecracker project held a meeting on October 9th, 2024 to discuss GPU support.

NVIDIA’s cuda-checkpoint is particularly important. CUDA offers GPU APIs meant for generic computation (not gaming). Such APIs are widely used in AI models and LLMs. As developers execute operations on a GPU, the GPU’s memory accumulates state. This GPU data is very difficult to read directly, which poses a problem if you wish to snapshot a machine’s state. Now cuda-checkpoint offers a simple, bare-bones, free tool to do GPU checkpoint and recovery.
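To make the checkpoint and recovery flow a bit more concrete, here is a minimal Python sketch that only builds (and prints) the commands involved; it does not run them. It assumes cuda-checkpoint is paired with CRIU, a separate Linux checkpoint tool that this post does not otherwise cover, and the flags shown are my recollection of the cuda-checkpoint and CRIU documentation, so verify them against the versions you have installed. The function names, the PID, and the image directory are hypothetical.

```python
import shlex


def checkpoint_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Commands to checkpoint a CUDA process, in order (assumed flags)."""
    return [
        # Ask the CUDA driver to move the process's GPU state into host
        # memory and release the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # With the GPU state now in host memory, dump the process with CRIU.
        ["criu", "dump", "--shell-job", "--images-dir", images_dir,
         "--tree", str(pid)],
    ]


def restore_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Commands to restore the process, in order (assumed flags)."""
    return [
        # Bring the CPU-side process back from the saved images.
        ["criu", "restore", "--shell-job", "--restore-detached",
         "--images-dir", images_dir],
        # Toggle again so the driver copies the saved state back to the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


if __name__ == "__main__":
    # Hypothetical PID and directory, for illustration only.
    for cmd in checkpoint_commands(4242, "demo") + restore_commands(4242, "demo"):
        print(shlex.join(cmd))
```

The ordering is the point: GPU state has to be moved into host memory before the process is dumped, and moved back onto the GPU only after the process has been restored.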

For serverless functions, unikernels such as Unikraft now boast single-digit millisecond boot times and fast snapshotting. This should enable faster cold starts and scale-to-zero, which will result in cost savings. Many unikernels tout increased security in multi-tenant environments, as well. As unikernels add Kubernetes support, I expect adoption to increase, so non-serverless workloads like microservices will benefit. I’ve also heard there are benefits for specific verticals such as gaming.

Meanwhile, Loophole Labs has launched Architect to simplify VM migrations. They’ve built some really slick tech that incrementally snapshots and migrates both memory state and network bindings between machines. Architect purports to migrate a workload faster than a spot instance can be preempted, which would allow for huge cost savings for many workloads.
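For readers unfamiliar with incremental migration, the toy simulation below shows the general idea using iterative pre-copy, a textbook live-migration technique. I am not claiming this is how Architect works; it is a sketch under my own assumptions, it covers memory pages only (not network bindings), and every name in it is hypothetical.

```python
def precopy_migrate(source, writes_per_round, max_dirty=1):
    """Simulate iterative pre-copy migration of memory pages.

    source: dict of page number -> bytes (mutated in place as the
        simulated workload keeps writing).
    writes_per_round: list of dicts with the pages the workload writes
        while each copy round is in flight.
    max_dirty: once this few pages remain dirty, pause and finish.
    Returns (target memory, number of pages sent in each round).
    """
    target = {}
    dirty = set(source)  # round 0: every page still has to be sent
    sent_per_round = []
    pending = iter(writes_per_round)

    while True:
        # Copy the currently dirty pages while the workload keeps running.
        for page in dirty:
            target[page] = source[page]
        sent_per_round.append(len(dirty))

        # Pages written during this round must be sent again.
        writes = next(pending, {})
        source.update(writes)
        dirty = set(writes)
        if len(dirty) <= max_dirty:
            break

    # Final stop-and-copy: the workload is paused, so nothing new gets dirty.
    for page in dirty:
        target[page] = source[page]
    sent_per_round.append(len(dirty))
    return target, sent_per_round


memory = {page: b"v0" for page in range(8)}
writes = [{0: b"v1", 1: b"v1", 2: b"v1"}, {1: b"v2"}]
target, sent = precopy_migrate(memory, writes)
print(sent)              # pages sent per round
print(target == memory)  # True if the target matches the source
```

Each round sends fewer pages than the last, so the final pause only has to move whatever was written during the previous round. That short final pause is what makes it plausible to finish a migration within a tight deadline.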

These developments will have a big impact on how we think about the cloud. We will, for example, see more multi-cloud adoption as state becomes easier to migrate and GPUs become harder to come by. Yingjun Wu (Founder of RisingWave) pointed me to SkyPilot, a UC Berkeley project that offers multi-cloud deployment specifically for AI, LLM, and batch workloads. Serverless functions might supplant service-oriented applications, too. Vercel has been doing great work here. These are big shifts with big implications.

Book

Support this newsletter by purchasing The Missing README: A Guide for the New Software Engineer for yourself or gifting it to someone.

Disclaimer

I occasionally invest in infrastructure startups. Companies that I’ve invested in are marked with a ﹩ in this newsletter. See my LinkedIn profile and Materialized View Capital for a complete list.