ublkstor
Replicated block storage with copy-on-write snapshots, a Raft-replicated control plane, S3 disaster recovery, and formal verification at every layer that matters, all with zero metadata overhead in the IO path.
The problem
Replicated block storage usually forces bad trade-offs: replica-per-node models cap a volume at one server’s capacity; random placement (Ceph CRUSH style) makes any N concurrent failures a data-loss risk that grows with cluster size; poll-driven data paths burn CPU on idle volumes; metadata services drag in a Postgres + Patroni + pgbouncer stack to operate; and crash consistency is asserted, not proven.
What I built
ublkstor presents standard Linux block devices (via ublk) backed by shards spread
across NVMe-over-TCP storage servers. The data path is zero-copy Zig on io_uring;
the control plane is Go with embedded Raft. A volume is a collection of shards
distributed across a chosen set of servers, so a 10 TB volume can span five 3 TB
servers. Stride (the size of that server set) and replica count are set per volume
at creation, so each volume picks its own durability/capacity trade-off.
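
To make the stride and replica knobs concrete, here is a minimal Go sketch of how a shard's replicas might be assigned within a volume's server set. The round-robin rule and the name `placeShard` are illustrative assumptions, not ublkstor's actual placement policy, which this description does not spell out.

```go
package main

import "fmt"

// placeShard returns the servers holding the replicas of one shard.
// strideSet is the volume's chosen server set; its length is the stride.
// ASSUMPTION: simple round-robin within the stride set, for illustration only.
func placeShard(strideSet []string, shard, replicas int) ([]string, error) {
	if shard < 0 {
		return nil, fmt.Errorf("negative shard index %d", shard)
	}
	if replicas < 1 || replicas > len(strideSet) {
		return nil, fmt.Errorf("replica count %d out of range for stride %d", replicas, len(strideSet))
	}
	out := make([]string, 0, replicas)
	for r := 0; r < replicas; r++ {
		// Consecutive slots guarantee the replicas land on distinct servers.
		out = append(out, strideSet[(shard+r)%len(strideSet)])
	}
	return out, nil
}

func main() {
	set := []string{"s1", "s2", "s3", "s4", "s5"} // stride 5
	for shard := 0; shard < 3; shard++ {
		servers, err := placeShard(set, shard, 2) // 2-way replication
		if err != nil {
			panic(err)
		}
		fmt.Println("shard", shard, "->", servers)
	}
}
```

Whatever the real rule is, the property that matters is the same: every replica of every shard stays inside the volume's own server set, so failures outside that set cannot touch the volume.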
A Kubernetes CSI driver presents ublkstor volumes as standard PersistentVolumes, so any workload that consumes a PVC can be backed by ublkstor without changes.
app / fs ──ublk──► clientd (Zig) ──NVMe-TCP──► storaged (Zig) × N
                       │ gRPC
                       ▼
                   metad (Go + Raft)  ← 3-node StatefulSet, never in the IO path
                       │
                       ▼ async
                   S3: DR snapshots

storaged runs one ublk device per backing block device: each ublk is an
independent data store that holds a collection of shards as byte ranges. By placing ublk between the NVMe-TCP front-end and the backing SSD, storaged gains a software
seam where per-backing-store metrics, per-shard access control, and a mapping layer
to the actual backing store all live, without paying the cost of one ublk or
NVMe target per shard.
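
The mapping layer described above can be pictured as a table from shard ID to a byte range on the backing device. The Go sketch below is a guess at the shape of that translation, not storaged's code (which is Zig); the `extent` type, its fields, and the shard ID used are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// extent is one shard's byte range on a backing device (hypothetical type).
type extent struct {
	base   uint64 // first byte of the shard on the backing device
	length uint64 // shard size in bytes
}

var errOutOfRange = errors.New("IO falls outside the shard's extent")

// translate maps a shard-relative IO (offset, size) to an absolute device
// offset. IOs that would cross the shard boundary are rejected, which is
// also where a per-shard access check could sit.
func (e extent) translate(off, n uint64) (uint64, error) {
	// Written to avoid overflow: never compute off+n directly.
	if n == 0 || off >= e.length || n > e.length-off {
		return 0, errOutOfRange
	}
	return e.base + off, nil
}

func main() {
	// One backing store holding a single 64 MiB shard at the 1 GiB mark.
	shards := map[uint32]extent{
		7: {base: 1 << 30, length: 64 << 20},
	}
	dev, err := shards[7].translate(4096, 8192)
	fmt.Println(dev, err)
}
```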
Hard problems
- Bounded failure domains via copyset placement. Each volume picks a stride set of servers at creation; all its shards live on that set, bounding blast radius per volume while overlapping stride sets spread load across the cluster.
- Per-volume durability tuning. Stride and replica count are knobs per volume, not cluster-wide: a single ublkstor cluster can host a 2-way replicated archive volume next to a 3-way replicated database volume.
- Incremental resilver. Failures trigger resync using per-block dirty bitmaps: only diverged blocks resync, not the whole volume.
- Metadata-free data path. Reads, writes, and FLUSH never call metad; all IO flows directly over NVMe-TCP. Metadata ops happen asynchronously, never in the critical path.
- Cheap CoW snapshots. Generation tracking makes snapshots O(dirty shards), not O(volume); shard recycling eliminates 70–90% of allocation work.
- No idle CPU burn. io_uring completion-driven, not poll-driven: an idle volume costs zero CPU.
- Replicated metadata, formally pinned. metad is a 3-node Raft cluster running as a Kubernetes StatefulSet, with provenance markers (role/epoch/quorum) and a TLA+-modelled recovery protocol, including an eager on-sync stamp, jittered self-reset, and explicit S3 disaster recovery.
- Kubernetes-native via CSI. A CSI driver presents ublkstor volumes as standard PersistentVolumes: any pod that consumes a PVC works unchanged.
- Operationally simple. No Postgres, no Patroni, no pgbouncer. The whole control plane is one Go binary, three pods, and an S3 bucket, and every recovery decision is pinned to a TLA+ model that exhaustively explores its state space.
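
As an illustration of the incremental-resilver item above, the Go sketch below tracks dirty blocks in a bitmap and resyncs only those. It uses a plain fixed-size bitset as a stand-in for the Roaring bitmaps the project uses, and `dirtyMap`, `mark`, and `resilver` are hypothetical names rather than ublkstor's API.

```go
package main

import (
	"fmt"
	"math/bits"
)

// dirtyMap holds one bit per block. A plain bitset is used here for brevity;
// it is NOT the Roaring bitmap implementation the project relies on.
type dirtyMap []uint64

func newDirtyMap(blocks int) dirtyMap { return make(dirtyMap, (blocks+63)/64) }

// mark records that a block diverged, e.g. it was written while a replica
// was unavailable.
func (d dirtyMap) mark(block int) { d[block/64] |= 1 << (block % 64) }

// resilver calls copyBlock for each dirty block in ascending order and
// clears the bit only after the copy succeeds, so a failed run can resume.
// It returns the number of blocks copied.
func (d dirtyMap) resilver(copyBlock func(block int) error) (int, error) {
	copied := 0
	for w := range d {
		for d[w] != 0 {
			b := bits.TrailingZeros64(d[w]) // lowest dirty bit in this word
			if err := copyBlock(w*64 + b); err != nil {
				return copied, err
			}
			d[w] &^= 1 << b
			copied++
		}
	}
	return copied, nil
}

func main() {
	d := newDirtyMap(1 << 20) // 1 Mi blocks tracked
	for _, b := range []int{3, 64, 70000} {
		d.mark(b)
	}
	n, err := d.resilver(func(block int) error {
		fmt.Println("resync block", block)
		return nil
	})
	fmt.Println("copied", n, "blocks, err:", err)
}
```

The work done is proportional to the number of set bits, not to the size of the volume, which is the point of the bullet.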
Formal verification
Six TLA+ models exhaustively explore billions of states across both the data and control planes, with zero invariant violations on the verified models:
- UblkDrv: ublk driver and FLUSH semantics
- ReplicatedShard: replication and failure recovery on a single shard
- ResilverRecovery: incremental resilver via dirty bitmaps
- MergeForward: snapshot correctness and CoW merge
- MetadShardDispatcher: shard placement and dispatch
- MetadRaftRecover: Raft-cluster recovery, on-sync stamping, S3 DR
The control-plane recovery model (MetadRaftRecover) is co-developed with the metad
implementation: design decisions are pinned to TLA+ first, then implemented, so the
recovery protocol’s correctness is established before the code.
Where it’s going
CSI VolumeSnapshots. Snapshots already exist as a first-class concept in the data
plane: generation-tracked, O(dirty shards), CoW under the hood. The next step is
wiring them through the CSI VolumeSnapshot API so Kubernetes-native backup and
restore tooling (Velero, kubectl) can drive them directly.
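
For readers unfamiliar with generation tracking, here is a toy Go model of why a snapshot can cost O(dirty shards). It is a sketch under assumptions: the `volume` type, its fields, and the rule that a write copies a shard only when the shard belongs to an older generation are illustrative, not a description of ublkstor's on-disk format.

```go
package main

import (
	"fmt"
	"sort"
)

// volume is a toy model: each shard remembers the generation that owns its
// current copy, and the volume tracks which shards the open generation wrote.
type volume struct {
	gen   uint64           // generation currently accepting writes
	owner map[int]uint64   // shard -> generation owning its current copy
	dirty map[int]struct{} // shards written in the open generation
}

func newVolume() *volume {
	return &volume{owner: map[int]uint64{}, dirty: map[int]struct{}{}}
}

// write reports whether the shard had to be copied first (copy-on-write).
// A shard already owned by the open generation is written in place.
func (v *volume) write(shard int) (copied bool) {
	g, existed := v.owner[shard]
	if existed && g == v.gen {
		return false
	}
	v.owner[shard] = v.gen
	v.dirty[shard] = struct{}{}
	return existed // existing shard from an older generation: CoW copy
}

// snapshot seals the open generation. Only the dirty set is visited, so the
// cost scales with shards written since the previous snapshot.
func (v *volume) snapshot() (sealed uint64, shards []int) {
	for s := range v.dirty {
		shards = append(shards, s)
	}
	sort.Ints(shards)
	sealed = v.gen
	v.gen++
	v.dirty = map[int]struct{}{}
	return sealed, shards
}

func main() {
	v := newVolume()
	v.write(1)
	gen, shards := v.snapshot()
	fmt.Println("sealed generation", gen, "captured shards", shards)
	fmt.Println("write after snapshot copies shard:", v.write(1))
}
```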
S3-backed replicas. The ublk-per-backing-device design opens a door: a backing
device doesn’t have to be a local SSD. A replica whose backing store is an S3 bucket
rather than a local disk lets a volume drop its local replica count to one while
keeping a higher durability count via S3-backed replicas, collapsing the cost of
durable storage without changing the replication protocol clientd sees.
Tech & scale
Zig (io_uring, zero-copy data path) + Go (Raft-replicated control plane) + a Go CSI
driver, 1,579 commits, solo. NVMe-TCP, ublk, TLA+, SQLite, Raft, CSI, S3. Builds on
my open-source Zig libraries (rbitz Roaring bitmaps for dirty tracking, and others).
Links
Private repository, access available on request. Public Zig building blocks: github.com/graveland.