popper: the PatroniSet operator

A Go Kubernetes operator that deploys and manages Patroni-managed Postgres replication topologies as a single declarative resource, across both AWS and Azure.

The problem

Running highly-available Postgres on Kubernetes means orchestrating Patroni clusters: StatefulSets, services, config, storage, leader election, and failover, all kept consistent as nodes come and go. Doing that by hand, per cluster, does not scale. StatefulSets in particular get in the way: immutable VolumeClaimTemplates, deterministic pod ordering that fights primary placement, rolling updates that ignore which pod is the leader. The right answer is to replace them with a domain-aware operator.

What I built

popper is a custom-resource operator: you declare one PatroniSet, and it reconciles all the Kubernetes objects needed to stand up and maintain a Patroni replication topology. It is zone-aware, packaged with Helm, ships images via GitHub Actions, and is the production operator behind the managed-Postgres offering at Timescale/TigerData. It integrates with the sc fleet tooling and pgBackRest.

Some of the design moves that make it work at fleet scale:

  • Actions-as-typed-domain. The reconciler doesn’t have a single giant reconcile function. Instead, it picks from 26 named action types (provision_pod, provision_pvc, provision_replacement_pvc, perform_switchover, restart_postgres, restart_pod, delete_failed_pod, delete_redundant_pod, create_overprovisioner_pod, sync_pod_labels, update_volume, and more). Each action is its own file with its own preconditions and effects: testable, composable, easy to reason about.
  • Multi-cloud. A cloud-provider abstraction lets the same operator drive Patroni topologies across both AWS and Azure.
  • Coordinated watches. The controller watches 8 different Kubernetes resource streams (PatroniSets, Pods, PVCs, PVs, Snapshots, Endpoints, Nodes, Namespaces) and coordinates reconciliation across all of them without races.
  • Cache-only read path. Reads come from the controller-runtime cache, period: the reconciler never reads from the kube-apiserver. That keeps reconcile latency decoupled from apiserver health, enforces a coherent point-in-time snapshot across every read in a single loop, and bounds the apiserver load the operator imposes at fleet scale. It also drives the watch coverage above: anything the reconciler needs to read must be a watched resource.
  • Robust against stale reads. popper tracks the ResourceVersion of every write and refuses to act on cache reads older than the most recent confirmed write to that resource, giving “I won’t act on stale data” semantics without the cost of linearizable reads. A contract-strict fallback (set membership over observed write versions, slower but always correct) is ready if Kubernetes’ current ResourceVersion-ordering behavior ever changes.
  • Status update optimization. The operator reconciles continuously but only writes back to etcd when something meaningful has changed, dramatically reducing etcd write load at fleet scale without losing observability.
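
To make the stale-read guard concrete, here is a minimal sketch of the ordering-based check. It is not popper's code: the writeTracker type and its methods are hypothetical names, and the sketch assumes that a ResourceVersion parses as a monotonically increasing decimal integer, which is how current apiservers behave but is the very assumption the contract-strict fallback exists to remove.

```go
package main

import (
	"fmt"
	"strconv"
)

// writeTracker remembers, per object key, the ResourceVersion returned by the
// most recent confirmed write. Hypothetical type, for illustration only.
type writeTracker struct {
	lastWrite map[string]uint64
}

func newWriteTracker() *writeTracker {
	return &writeTracker{lastWrite: make(map[string]uint64)}
}

// parseRV treats a ResourceVersion as a decimal integer.
// ASSUMPTION: versions are numeric and ordered; the sketch relies on that.
func parseRV(rv string) (uint64, error) {
	return strconv.ParseUint(rv, 10, 64)
}

// RecordWrite stores the version the apiserver returned for a successful write.
func (w *writeTracker) RecordWrite(key, rv string) error {
	v, err := parseRV(rv)
	if err != nil {
		return fmt.Errorf("record write %s: %w", key, err)
	}
	if v > w.lastWrite[key] {
		w.lastWrite[key] = v
	}
	return nil
}

// IsFresh reports whether a cached object is at least as new as our last
// confirmed write to it. Keys we never wrote are always considered fresh.
func (w *writeTracker) IsFresh(key, cachedRV string) (bool, error) {
	v, err := parseRV(cachedRV)
	if err != nil {
		return false, fmt.Errorf("check %s: %w", key, err)
	}
	return v >= w.lastWrite[key], nil
}

func main() {
	tracker := newWriteTracker()
	_ = tracker.RecordWrite("default/pod-0", "105")

	stale, _ := tracker.IsFresh("default/pod-0", "101") // cache lags our write
	fresh, _ := tracker.IsFresh("default/pod-0", "105") // cache caught up
	fmt.Println(stale, fresh)                           // prints "false true"
}
```

In a reconciler, a false result would mean "requeue and wait for the cache to catch up" rather than acting on the object.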

Hard problems

  • Topology reconciliation. Continuously converge the live cluster to the declared PatroniSet spec (replicas, storage, config) without disrupting a healthy primary.
  • Failover-aware operations. Coordinate operator actions with Patroni’s own leader election so reconciliation never fights failover.
  • Storage & zone placement. Manage persistent volumes and spread replicas across zones for availability.
  • Ephemeral-replica migrations. Any change that’s bound to a particular pod or volume (downsizing a volume, which Kubernetes itself can’t do; migrating between storage classes; draining a cordoned node; replacing an instance) uses the same primitive: spin up an ephemeral replica with the new desired configuration, let Patroni replicate to it, then switch over. One workflow, many use cases, no Kubernetes-imposed limitations on what you can change.
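
The ephemeral-replica workflow can be read as a small state machine in which each reconcile pass picks one action. The sketch below illustrates that idea only. The action names provision_pod, perform_switchover, and delete_redundant_pod come from the list earlier in this document, but the MigrationState type, the wait and done values, and the selection logic (including which action removes the old pod) are assumptions, not popper's implementation.

```go
package main

import "fmt"

// Action is a named reconcile step.
type Action string

const (
	ProvisionPod       Action = "provision_pod"
	PerformSwitchover  Action = "perform_switchover"
	DeleteRedundantPod Action = "delete_redundant_pod"
	Wait               Action = "wait" // hypothetical: nothing to do this pass
	Done               Action = "done" // hypothetical: migration finished
)

// MigrationState is a hypothetical summary of what one reconcile pass observes
// while moving a single member onto a replacement pod or volume.
type MigrationState struct {
	OldExists           bool // the member being replaced is still present
	OldIsLeader         bool // that member is currently the Patroni leader
	ReplacementExists   bool // the ephemeral replica has been created
	ReplacementCaughtUp bool // Patroni has replicated to it
}

// nextAction picks the single next step; every case's precondition is explicit.
func nextAction(s MigrationState) Action {
	switch {
	case !s.OldExists:
		return Done
	case !s.ReplacementExists:
		return ProvisionPod
	case !s.ReplacementCaughtUp:
		return Wait
	case s.OldIsLeader:
		return PerformSwitchover
	default:
		return DeleteRedundantPod
	}
}

// plan simulates successive passes, assuming every action succeeds at once.
func plan(s MigrationState) []Action {
	var steps []Action
	for i := 0; i < 10; i++ { // bounded so a logic bug cannot loop forever
		a := nextAction(s)
		if a == Done {
			break
		}
		steps = append(steps, a)
		switch a {
		case ProvisionPod:
			s.ReplacementExists = true
		case Wait:
			s.ReplacementCaughtUp = true
		case PerformSwitchover:
			s.OldIsLeader = false
		case DeleteRedundantPod:
			s.OldExists = false
		}
	}
	return steps
}

func main() {
	// Replacing the current leader: provision, wait, switch over, delete.
	fmt.Println(plan(MigrationState{OldExists: true, OldIsLeader: true}))
}
```

Because the old member is never removed before a caught-up replacement exists and leadership has moved, the same sequence would apply whether the trigger is a volume downsize, a storage-class change, or a node drain.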

Testing & verification

popper has an unusual depth of testing for a Kubernetes operator:

  • Property-based simulation. A simulation test driven by rapid, the Go property-based testing library, draws random topologies, mutations, and step sequences (thousands of cases per run) and asserts convergence properties. Failing cases are saved as reproducible seeds so a regression can be replayed deterministically.
  • End-to-end integration suite. 17+ named scenarios (clone, PITR, replace pod, replace volume, storage-class migration, zone placement, downsizing, add/remove members, maintenance, restart-postgres, sync-pod-labels, and more) runnable against either controller-runtime’s envtest or a real kind cluster.
  • Golden-test snapshots. Generated Kubernetes manifests are diffed against approved snapshots, so the exact shape of what the operator produces is pinned by tests.
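
The seed-replay idea behind the simulation test can be shown with a toy model. This sketch deliberately uses the standard library's math/rand in place of rapid, so it does not show rapid's API. The cluster type, reconcileStep, and runCase are hypothetical, and the model tracks only a replica count rather than a real Patroni topology.

```go
package main

import (
	"fmt"
	"math/rand"
)

// cluster is a toy model: only the replica count matters here.
type cluster struct{ desired, running int }

// reconcileStep performs at most one action per call, like a single pass.
func reconcileStep(c *cluster) {
	switch {
	case c.running < c.desired:
		c.running++ // stand-in for provisioning a pod
	case c.running > c.desired:
		c.running-- // stand-in for deleting a redundant pod
	}
}

// runCase derives one random scenario from seed and checks convergence.
// The same seed always yields the same scenario, so a failure can be replayed.
func runCase(seed int64) error {
	rng := rand.New(rand.NewSource(seed))
	c := cluster{desired: 1 + rng.Intn(5), running: rng.Intn(8)}

	// Phase 1: interleave random mutations with reconcile steps.
	for i := 0; i < 20; i++ {
		switch rng.Intn(3) {
		case 0: // a pod is lost
			if c.running > 0 {
				c.running--
			}
		case 1: // the spec is edited
			c.desired = 1 + rng.Intn(5)
		}
		reconcileStep(&c)
	}

	// Phase 2: mutations stop; the model must converge in a bounded number
	// of steps (running stays within 0..7, so 8 steps are always enough).
	for i := 0; i < 8 && c.running != c.desired; i++ {
		reconcileStep(&c)
	}
	if c.running != c.desired {
		return fmt.Errorf("seed %d: did not converge: running=%d desired=%d",
			seed, c.running, c.desired)
	}
	return nil
}

func main() {
	for seed := int64(0); seed < 1000; seed++ {
		if err := runCase(seed); err != nil {
			fmt.Println("FAIL:", err) // the message carries the seed to replay
			return
		}
	}
	fmt.Println("1000 cases passed")
}
```

A failing seed reported this way can be passed straight back to runCase to reproduce the exact scenario, which is the property the saved seeds provide in the real suite.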

Tech & scale

Go, controller-runtime, Kubebuilder v4, Patroni, Helm, GitHub Actions, AWS + Azure cloud providers. #1 contributor across 367 merged commits on main, ~1.6× the second author, over 3 years.

Public writeup

Co-designed and built with Andrew Charlton, who wrote the public engineering blog post for Tiger Data describing the system and the rationale for replacing StatefulSets:

Replacing StatefulSets With a Custom K8s Operator in Our Postgres Cloud Platform, Tiger Data engineering blog, October 2024.

Links

Timescale/TigerData repository, described here; source not publicly linkable.