ublkstor

Replicated block storage with copy-on-write snapshots, a Raft-replicated control plane, S3 disaster recovery, and formal verification at every layer that matters, all with zero metadata overhead in the IO path.

The problem

Replicated block storage usually forces bad trade-offs. Replica-per-node models cap a volume at one server’s capacity. Random placement (Ceph CRUSH style) means that with N-way replication, any N concurrent failures can lose data, and the risk grows with cluster size. Poll-driven data paths burn CPU on idle volumes. Metadata services drag in a Postgres + Patroni + pgbouncer stack that someone then has to operate. And crash consistency is asserted, not proven.

What I built

ublkstor presents standard Linux block devices (via ublk) backed by shards spread across NVMe-over-TCP storage servers. The data path is zero-copy Zig on io_uring; the control plane is Go with embedded Raft. A volume is a collection of shards distributed across a chosen set of servers, so a 10 TB volume can span five 3 TB servers. Stride (the size of that server set) and replica count are set per volume at creation, so each volume picks its own durability/capacity trade-off.
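To make the stride and replica knobs concrete, here is a minimal Python sketch of how shards might land on a volume's server set. It is an illustration under stated assumptions, not ublkstor's placement code: the fixed shard size, the round-robin rule, and the names `Volume`, `place_shard`, and `bytes_per_server` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    name: str
    size_bytes: int
    shard_bytes: int    # hypothetical fixed shard size
    stride_set: tuple   # servers chosen for this volume at creation
    replicas: int       # copies kept of every shard


def shard_count(vol):
    # Ceiling division: a partial tail still needs a whole shard.
    return -(-vol.size_bytes // vol.shard_bytes)


def place_shard(vol, shard_index):
    """Servers holding each replica of one shard (round-robin is an assumption)."""
    n = len(vol.stride_set)
    if not 1 <= vol.replicas <= n:
        raise ValueError("replica count must be between 1 and the stride")
    first = shard_index % n
    # Replicas go to consecutive servers, so no server holds a shard twice.
    return [vol.stride_set[(first + r) % n] for r in range(vol.replicas)]


def bytes_per_server(vol):
    """Raw bytes each server in the stride set must provide for this volume."""
    usage = {server: 0 for server in vol.stride_set}
    for i in range(shard_count(vol)):
        for server in place_shard(vol, i):
            usage[server] += vol.shard_bytes
    return usage
```

Every shard of the volume stays inside its stride set, and the raw capacity a volume needs is its size multiplied by its replica count, spread evenly across that set in this round-robin model.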

A Kubernetes CSI driver presents ublkstor volumes as standard PersistentVolumes, so any workload that consumes a PVC can be backed by ublkstor without changes.

app / fs ──ublk──► clientd (Zig) ──NVMe-TCP──► storaged (Zig) × N
                        │ gRPC
                        ▼
                   metad (Go + Raft)   ← 3-node StatefulSet, never in the IO path
                        │
                        ▼ async
                       S3   — DR snapshots

storaged runs one ublk device per backing block device: each ublk device is an independent data store that holds a collection of shards as byte ranges. Placing ublk between the NVMe-TCP front-end and the backing SSD gives storaged a software seam where per-backing-store metrics, per-shard access control, and the mapping layer to the actual backing store all live, without paying the cost of one ublk device or NVMe target per shard.
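The "shards as byte ranges" idea can be sketched as a small mapping layer. This is a hedged Python illustration only: the slot-based layout, the `BackingStore` class, and its method names are assumptions made for the example, not storaged's actual on-disk format.

```python
class BackingStore:
    """One backing device, carved into fixed-size shard slots (hypothetical layout)."""

    def __init__(self, capacity_bytes, shard_bytes):
        self.shard_bytes = shard_bytes
        # Reversed so that pop() hands out the lowest free slot first.
        self.free_slots = list(range(capacity_bytes // shard_bytes))[::-1]
        self.slot_of = {}  # shard id -> slot index

    def allocate(self, shard_id):
        if shard_id in self.slot_of:
            raise KeyError(f"shard {shard_id!r} already allocated")
        if not self.free_slots:
            raise RuntimeError("backing store full")
        self.slot_of[shard_id] = self.free_slots.pop()
        return self.slot_of[shard_id]

    def release(self, shard_id):
        # The freed slot becomes available to the next allocate() call.
        self.free_slots.append(self.slot_of.pop(shard_id))

    def translate(self, shard_id, offset, length):
        """Turn a shard-relative IO into a device-absolute (offset, length)."""
        if offset < 0 or length <= 0 or offset + length > self.shard_bytes:
            raise ValueError("IO falls outside the shard")
        return self.slot_of[shard_id] * self.shard_bytes + offset, length
```

Because every IO passes through one `translate`-style step per backing device, that step is a natural place to count metrics or check that a client is allowed to touch a given shard.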

Hard problems

Bounded failure domains via copyset placement. Each volume picks a stride set of servers at creation; all its shards live on that set, bounding blast radius per volume while overlapping stride sets spread load across the cluster.
Per-volume durability tuning. Stride and replica count are knobs per volume, not cluster-wide: a single ublkstor cluster can host a 2-way replicated archive volume next to a 3-way replicated database volume.
Incremental resilver. Failures trigger resync using per-block dirty bitmaps: only diverged blocks resync, not the whole volume.
Metadata-free data path. Reads, writes, and FLUSH never call metad; all IO flows directly over NVMe-TCP. Metadata ops happen asynchronously, never in the critical path.
Cheap CoW snapshots. Generation tracking makes snapshots O(dirty shards), not O(volume); shard recycling eliminates 70–90% of allocation work.
No idle CPU burn. io_uring completion-driven, not poll-driven: an idle volume costs zero CPU.
Replicated metadata, formally pinned. metad is a 3-node Raft cluster running as a Kubernetes StatefulSet, with provenance markers (role/epoch/quorum) and a TLA+-modelled recovery protocol, including an eager on-sync stamp, jittered self-reset, and explicit S3 disaster recovery.
Kubernetes-native via CSI. A CSI driver presents ublkstor volumes as standard PersistentVolumes: any pod that consumes a PVC works unchanged.
Operationally simple. No Postgres, no Patroni, no pgbouncer. The whole control plane is one Go binary, three pods, and an S3 bucket, and every recovery decision is pinned to a TLA+ model that exhaustively explores its state space.
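The incremental resilver item above can be sketched in a few lines of Python. This is a single-threaded toy under stated assumptions: a plain `set` stands in for the Roaring bitmap, and the `Replica` and `ReplicatedShard` classes are hypothetical names, not ublkstor's types. It also ignores writes that arrive while a resilver is in progress.

```python
class Replica:
    def __init__(self, n_blocks):
        self.blocks = [b""] * n_blocks
        self.online = True


class ReplicatedShard:
    def __init__(self, n_blocks, n_replicas):
        self.replicas = [Replica(n_blocks) for _ in range(n_replicas)]
        # One dirty set per replica: the blocks it missed while offline.
        self.dirty = [set() for _ in range(n_replicas)]

    def write(self, block, data):
        for i, replica in enumerate(self.replicas):
            if replica.online:
                replica.blocks[block] = data
            else:
                self.dirty[i].add(block)  # remember the divergence

    def resilver(self, i):
        """Bring replica i back online, copying only the blocks it missed."""
        source = next((r for r in self.replicas if r.online), None)
        if source is None:
            raise RuntimeError("no healthy replica to copy from")
        target = self.replicas[i]
        copied = sorted(self.dirty[i])
        for block in copied:
            target.blocks[block] = source.blocks[block]
        self.dirty[i].clear()
        target.online = True
        return copied
```

The work done by `resilver` is proportional to the number of distinct blocks written while the replica was away, not to the size of the shard or the volume.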

Formal verification

Six TLA+ models exhaustively explore billions of states across both the data and control planes, with zero invariant violations on the verified models:

UblkDrv: ublk driver and FLUSH semantics
ReplicatedShard: replication and failure recovery on a single shard
ResilverRecovery: incremental resilver via dirty bitmaps
MergeForward: snapshot correctness and CoW merge
MetadShardDispatcher: shard placement and dispatch
MetadRaftRecover: Raft-cluster recovery, on-sync stamping, S3 DR

The control-plane recovery model (MetadRaftRecover) is co-developed with the metad implementation: design decisions are pinned to TLA+ first, then implemented, so the recovery protocol’s correctness is established before the code.
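For readers who have not used a model checker, the Python sketch below shows what "exhaustively exploring a state space" means on a deliberately tiny model. It is not one of the six TLA+ models: the state shape, the three actions, the invariant, and the `copy_on_recover` switch are all invented for illustration.

```python
from collections import deque

MAX_VERSION = 3  # bound that keeps the state space finite
N = 2            # replicas in the toy model


def next_states(state, copy_on_recover=True):
    """Yield every state reachable in one step: write, fail, or recover."""
    latest, versions, online = state
    if latest < MAX_VERSION and any(online):
        # Write: only online replicas see the new version.
        yield (latest + 1,
               tuple(latest + 1 if online[i] else versions[i] for i in range(N)),
               online)
    for i in range(N):
        if online[i] and sum(online) > 1:
            # Fail: replica i goes offline, at least one stays up.
            yield (latest, versions,
                   tuple(o and j != i for j, o in enumerate(online)))
        if not online[i] and any(online):
            # Recover: replica i rejoins, copying from an online peer.
            source = online.index(True)
            new_versions = tuple(
                versions[source] if (j == i and copy_on_recover) else v
                for j, v in enumerate(versions))
            yield (latest, new_versions,
                   tuple(o or j == i for j, o in enumerate(online)))


def invariant(state):
    """Every online replica must hold the latest version."""
    latest, versions, online = state
    return all(versions[i] == latest for i in range(N) if online[i])


def check(copy_on_recover=True):
    """Breadth-first search over all reachable states; return a violation or None."""
    init = (0, (0,) * N, (True,) * N)
    seen, frontier = {init}, deque([init])
    while frontier:
        state = frontier.popleft()
        if not invariant(state):
            return state
        for nxt in next_states(state, copy_on_recover):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return None
```

`check()` finds no violation, while `check(copy_on_recover=False)` returns a state in which a replica has rejoined with stale data. That is the kind of design bug this approach catches before any implementation exists.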

Where it’s going

CSI VolumeSnapshots. Snapshots already exist as a first-class concept in the data plane: generation-tracked, O(dirty shards), CoW under the hood. The next step is wiring them through the CSI VolumeSnapshot API so Kubernetes-native backup and restore tooling (Velero, kubectl) can drive them directly.
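Why generation tracking makes a snapshot cost O(dirty shards) can be shown with a toy Python model. It is a sketch under assumptions: `CowVolume` and its methods are hypothetical, and Python's immutable `bytes` stand in for on-disk shards, so replacing a reference plays the role of the copy-on-write step.

```python
class CowVolume:
    """Toy generation-tracked volume; each snapshot stores only a delta."""

    def __init__(self):
        self.generation = 0
        self.live = {}       # shard index -> current contents
        self.dirty = set()   # shards written in the current generation
        self.snapshots = []  # snapshots[g] = {shard: contents} for generation g

    def write(self, shard, data):
        # Older snapshots keep their own reference to the previous bytes.
        self.live[shard] = data
        self.dirty.add(shard)

    def snapshot(self):
        """Freeze the current generation; work is proportional to len(dirty)."""
        self.snapshots.append({s: self.live[s] for s in self.dirty})
        self.dirty.clear()
        snap_id = self.generation
        self.generation += 1
        return snap_id

    def read_snapshot(self, snap_id, shard):
        """Walk back through generations until one recorded this shard."""
        for delta in reversed(self.snapshots[:snap_id + 1]):
            if shard in delta:
                return delta[shard]
        return None  # shard was never written before this snapshot
```

A snapshot taken after a single shard write records one entry, however large the volume is; untouched shards are resolved by walking back to the generation that last wrote them.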

S3-backed replicas. The ublk-per-backing-device design opens a door: a backing device doesn’t have to be a local SSD. A replica whose backing store is an S3 bucket rather than a local disk lets a volume keep a single local replica and get the rest of its durability from S3-backed replicas, cutting the cost of durable storage without changing the replication protocol clientd sees.

Tech & scale

Zig (io_uring, zero-copy data path) + Go (Raft-replicated control plane) + a Go CSI driver, 1,579 commits, solo. NVMe-TCP, ublk, TLA+, SQLite, Raft, CSI, S3. Builds on my open-source Zig libraries (rbitz Roaring bitmaps for dirty tracking, and others).

Links

Private repository, access available on request. Public Zig building blocks: github.com/graveland.