The Quest for a Distributed POSIX-Compatible Filesystem
Distributed POSIX filesystems have proven elusive, but we're getting closer. Perhaps close is all we need.
Years ago, I was working on Apache Samza with Jay Kreps. At one point during a discussion about Samza’s state management system, Jay turned to me and said, “You know, we wouldn’t need any of this if we had a distributed filesystem that worked.” A scalable remote filesystem with normal POSIX semantics would let us build distributed systems as stateless services; we could use the distributed filesystem to store everything. This comment stuck with me, and I still think about it a lot.
Alas, we didn’t have such a system at the time. But object storage systems like S3 have grown to give us nearly all the properties we need; they are insanely scalable and provide the atomic operations that transactional workloads require. Object stores are still missing POSIX semantics, though. You can’t take any old system that uses filesystem I/O libraries and use S3 as its storage layer.
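To make “atomic operations” concrete: S3 now supports conditional writes, which turn a PutObject into a create-if-absent primitive. Here’s a minimal sketch in Python using boto3; the bucket, key, and helper name are hypothetical:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def create_if_absent(bucket: str, key: str, body: bytes) -> bool:
    """Atomically create `key` only if no object exists there yet."""
    try:
        s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=body,
            IfNoneMatch="*",  # S3 rejects the write with a 412 if the key exists
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return False  # lost the race; another writer got there first
        raise
```

Primitives like this are what systems build leases, leader election, and commit markers on top of.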
In the absence of a POSIX API for S3, two approaches have emerged to leverage what object stores have to offer. The first is building S3 directly into the system itself, which most database and streaming companies are doing. Neon, WarpStream, Turbopuffer, and Responsive﹩ are all in this category. The other is to wrap S3 in a filesystem in userspace (FUSE) interface or an NFS-based implementation. Amazon S3 File Gateway, JuiceFS, s3fs, and Goofys are examples of this approach. (If you squint, Apache OpenDAL fits in this category, but inverts the relationship by wrapping everything in its own API.)
Direct integration requires a storage-layer rewrite. I/O calls must be converted to S3-compatible API calls. Moreover, each system needs to figure out how to deal with higher object storage latencies, a subject I’ve written about before.
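To give a feel for that conversion, here’s the same logical read written both ways, a sketch with hypothetical paths and bucket names. A seek-and-read against a local file becomes an HTTP range request against S3:

```python
import boto3

# POSIX: seek and read a byte range from a local file.
with open("/data/segment-00042.log", "rb") as f:
    f.seek(4096)
    chunk = f.read(1024)

# S3: the same range read becomes a GetObject call with a Range header.
s3 = boto3.client("s3")
resp = s3.get_object(
    Bucket="my-bucket",
    Key="data/segment-00042.log",
    Range="bytes=4096-5119",  # inclusive byte range, same 1,024 bytes
)
chunk = resp["Body"].read()
```

The second version also has very different latency and failure characteristics, which is why the conversion is rarely mechanical.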
Direct object storage integration makes sense for systems like databases, whose primary job is to store and query data. But for systems, frameworks, and libraries that just need to read and write files as part of a broader workload, rewriting the storage layer for object storage is too burdensome.
FUSE and NFS-based implementations present their own challenges. s3fs and other FUSE-based systems implement only a subset of the POSIX interface, often missing features such as random-access writes, appends, metadata operations, atomic renames, hardlinks, and inotify support. Such limitations simply won’t work for many systems and libraries; it’s not clear, for example, that RocksDB can safely be run on EFS. And some implementations, such as JuiceFS, store files in their own block format, which limits interoperability.
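If you want to check a particular mount yourself, a quick probe like this sketch (the mount path is hypothetical) exercises a few of the operations these implementations commonly miss:

```python
import os

MOUNT = "/mnt/s3"  # hypothetical mount point for the filesystem under test

def probe(name, fn):
    try:
        fn()
        print(f"{name}: ok")
    except OSError as e:
        print(f"{name}: failed ({e})")

path = os.path.join(MOUNT, "probe.txt")
with open(path, "w") as f:
    f.write("hello")

def append():
    with open(path, "a") as f:
        f.write(" world")

def random_write():
    with open(path, "r+b") as f:
        f.seek(1)
        f.write(b"E")  # in-place overwrite, not a full rewrite

probe("append", append)
probe("random-access write", random_write)
# Note: this checks that rename works at all, not that it's atomic.
probe("rename", lambda: os.rename(path, path + ".renamed"))
probe("hardlink", lambda: os.link(path + ".renamed", path + ".link"))
```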
The need for scalable filesystems has grown, too. Database and streaming use cases have been around for a long time, but Kubernetes and AI workloads are new. Kubernetes has made every system distributed. And with AI, multi-modal data such as audio, text, and video is a critical ingredient. Many ML and AI libraries are built for local filesystems, and those that support object storage often lack the caching needed to speed up workloads.
I think we’re finally getting to the point where we have both the technology and the demand to get us what we need: a distributed POSIX-compatible filesystem. Regatta Storage is building in this direction, with a stated goal of replacing EFS and Elastic Block Store (EBS) general-purpose (gp3) use cases.
Though Regatta doesn’t have complete POSIX compatibility, their offering is compelling. They are using NFS now and are moving to their own protocol, which should give them a lot of flexibility as they try to implement more obscure POSIX features. Even if they never get complete POSIX support, they should be able to get closer than current solutions. Regatta also provides a generic cache to reduce latency, a key ingredient for a disk-like experience. Such an architecture is akin to a generic version of Neon’s Safekeepers and Pageservers, something I’ve been dreaming of for a while. Plus, unlike JuiceFS, files appear in object storage as normal objects rather than opaque blocks that can be accessed only through the JuiceFS interface.
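The caching pattern itself is simple to sketch, even though doing it transparently and consistently at scale is the hard part. Here’s a toy read-through cache over S3 to illustrate the general idea; this is my own sketch, not Regatta’s design, and the cache directory is hypothetical:

```python
import os
import boto3

CACHE_DIR = "/var/cache/objfs"  # hypothetical local cache location
s3 = boto3.client("s3")

def read(bucket: str, key: str) -> bytes:
    cache_path = os.path.join(CACHE_DIR, bucket, key)
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()  # cache hit: local-disk latency
    # Cache miss: fetch from object storage, the source of truth.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    tmp = cache_path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(body)
    os.rename(tmp, cache_path)  # atomically publish into the cache
    return body
```

The real work is everything this sketch omits: write-back, invalidation, eviction, and keeping many clients coherent.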
I expect this area to get more competitive. Much as competition with S3 has driven AWS to improve its offering, I anticipate more EFS features in the future. Other object store providers such as Tigris﹩ are also well positioned to pursue this area, and JuiceFS and Alluxio will continue to make progress, too. AI-specific offerings could also emerge. Fortunately, I think the TAM is big enough (and the use cases diverse enough) to support winners for different use cases, even if the underlying technology is largely the same.
Book
Support this newsletter by purchasing The Missing README: A Guide for the New Software Engineer for yourself or gifting it to someone.
Disclaimer
I occasionally invest in infrastructure startups. Companies that I’ve invested in are marked with a ﹩ in this newsletter. See my LinkedIn profile and Materialized View Capital for a complete list.