sc
An internal multi-mode admin platform for Tiger Data’s managed-Postgres fleet. One Go binary, five deployment shapes: a local CLI for engineers, a Kubernetes node DaemonSet for canned investigation tools, an MCP server for agent-driven operations, a Slack bot for chat-driven ops, and a gRPC control plane that ties together fleet-wide actions across AWS and Azure clusters.
What it is
sc started as a CLI for administrative actions on Tiger Cloud’s managed Postgres fleet. Over three years it grew into the operational substrate that runs every fleet-touching action at Tiger: investigations, deployments, service lifecycle, workflow orchestration, support-engineer tooling, and the agent-facing surface for the platform.
It’s a single Go codebase that runs as:
- sc CLI: what engineers and SREs invoke locally to query and operate the fleet.
- MCP stdio server: sc exposes itself as a Model Context Protocol server so an agent (Claude, Cursor, etc.) can drive any sc tool against any cluster in the fleet with the right authorization.
- Kubernetes node DaemonSet: the same binary, started with --mode node, runs on every Tiger-Cloud node. It exposes carefully-scoped investigation tools that require root locally (e.g. inspecting process state, kernel diagnostics) to less-privileged support engineers, so a support engineer can run a ps-equivalent on a node without an SSH session or full root credentials.
- Slack bot: chat-driven operations for routine fleet tasks.
- gRPC + HTTP control plane: the cross-cluster fleet manager that coordinates actions across AWS and Azure Kubernetes clusters.
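A minimal sketch of how one binary could pick its role at startup. Only --mode node is named above; the other mode names (cli, mcp, slack, control-plane), the runners table, and selectRunner are hypothetical placeholders, not sc's actual flags or code.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// runners maps a mode name to its entry point. Only "node" comes from the
// text; every other key is an assumed name used for illustration.
var runners = map[string]func() error{
	"cli":           func() error { fmt.Println("running as local CLI"); return nil },
	"mcp":           func() error { fmt.Println("running as MCP stdio server"); return nil },
	"node":          func() error { fmt.Println("running as node DaemonSet"); return nil },
	"slack":         func() error { fmt.Println("running as Slack bot"); return nil },
	"control-plane": func() error { fmt.Println("running as gRPC + HTTP control plane"); return nil },
}

// selectRunner returns the entry point for mode, or an error if the mode is unknown.
func selectRunner(mode string) (func() error, error) {
	run, ok := runners[mode]
	if !ok {
		return nil, fmt.Errorf("unknown mode %q", mode)
	}
	return run, nil
}

func main() {
	mode := flag.String("mode", "cli", "deployment shape to run as")
	flag.Parse()

	run, err := selectRunner(*mode)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Keeping the dispatch in a single table means every deployment shape shares the same build and the same startup path, which is the property the list above describes.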
A core design goal: bounded delegation for support engineers
A primary driver of sc’s architecture is enabling support engineers to do real investigations without holding broad cluster credentials. The whole admin layer (savannah-admin) exists as a privileged proxy that exposes a carefully-scoped set of investigation and remediation tools to less-privileged identities, with every request authenticated, authorized, and audited.
Concretely, that means a support engineer working a customer ticket can:
- Inspect a node’s process state without an SSH session
- Pull service logs across the fleet without direct apiserver credentials
- Drive a multi-step diagnose against a service without permission to mutate it
- Collect a core dump and run analysis against it from their laptop
- Submit a constrained remediation through sc’s workflow system with approval gating
…all without ever holding the root or full-cluster permissions that would let them act outside that bounded set. They get the capability they need while the privileges stay scoped, with authentication and authorization enforced at every layer: a security-engineering posture baked into the architecture, not bolted on.
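The idea of a privileged proxy with a bounded tool set can be sketched as follows. This is an illustration under assumptions, not savannah-admin's implementation: the Identity, Policy, Proxy, and AuditEntry types, the role name, and the tool names are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Identity is a hypothetical authenticated caller. In a real deployment it
// would be derived from the caller's verified credentials.
type Identity struct {
	Name string
	Role string
}

// Policy maps a role to the set of tools that role may invoke.
type Policy map[string]map[string]bool

// AuditEntry records one authorization decision.
type AuditEntry struct {
	Caller  string
	Tool    string
	Surface string
	Allowed bool
}

// ErrDenied is returned when a role is not allowed to use a tool.
var ErrDenied = errors.New("tool not permitted for role")

// Proxy makes the allow/deny decision in one place and audits every request.
type Proxy struct {
	policy Policy
	audit  []AuditEntry
}

// Authorize decides from role and tool only. The surface (CLI, MCP, Slack)
// is recorded for the audit trail but never influences the decision.
func (p *Proxy) Authorize(id Identity, tool, surface string) error {
	allowed := p.policy[id.Role][tool]
	p.audit = append(p.audit, AuditEntry{Caller: id.Name, Tool: tool, Surface: surface, Allowed: allowed})
	if !allowed {
		return fmt.Errorf("%s calling %s: %w", id.Name, tool, ErrDenied)
	}
	return nil
}

func main() {
	p := &Proxy{policy: Policy{"support": {"node.ps": true, "logs.tail": true}}}
	alice := Identity{Name: "alice", Role: "support"}

	fmt.Println(p.Authorize(alice, "node.ps", "cli"))        // allowed: prints <nil>
	fmt.Println(p.Authorize(alice, "service.delete", "mcp")) // denied: prints the error
	fmt.Println(len(p.audit), "decisions audited")
}
```

Denied requests are audited as well as allowed ones, so the trail shows what was attempted, not only what succeeded.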
Driving production LLM integration for the fleet
sc is also the substrate I use to push LLM-driven operations at Tiger Data:
improving time-to-resolution on outages and making diagnosis more systematic
across both the cloud database fleet and the backend platform itself. The MCP server exposes every fleet-touching capability to agents. The Slack bot brings the same capabilities to chat-driven incident response.
The sc mcp install command packages and installs curated Claude Skills:
domain-specific guidance shaped by what we’ve learned actually works in
production incidents, and self-update keeps every engineer’s installation
current as the skills evolve. The goal is leverage: every on-call engineer
should benefit from the team’s collective learning about what to look at first,
what to rule out, and what tools to reach for. I’m one of the engineers
driving that integration end-to-end: building the platform surface, the
skills, and the operational practices that make LLM-assisted incident response
real rather than aspirational.
The substrate
Some of the engineering choices that make a single binary safely span all of those modes and surfaces:
- End-to-end authenticated and authorized. Every interaction between sc modes (CLI to control plane, control plane to node DaemonSet, MCP to everything) is authenticated via Kubernetes service identities and protected by mTLS, with full authz checks at each hop. A support engineer using sc to inspect a node can only do what their identity is allowed to do, no more, and the authorization decision lives in the same place whether they reach the node via CLI, MCP, or Slack.
- externalconnect: transparent in-cluster and out-of-cluster operations. A dedicated package lets sc operate against Kubernetes resources without caring whether the calling process is running inside a cluster, outside it, in a different cluster, or behind a private network. The same code paths work whether sc is invoked from a laptop on Tailscale, a CI runner with cluster credentials, or the node DaemonSet itself.
- Concurrency primitives and fleet-wide rate limiting. Operations that span the whole fleet (running a query across every service, rolling out a configuration change to every cluster) go through a structured concurrency layer with built-in rate limiting (--rate 10/1m style, global across every sc command), so a single sc invocation can’t accidentally hammer the apiserver, the network, or a downstream service. Per-target backpressure and bounded fan-out keep fleet-wide campaigns safe by construction.
- Workflows: making the case for vendor-grade infrastructure. sc has always run its own workflow subsystem (campaign scheduling, per-target progress, approval gates, retry/resume) which is how every fleet-wide rollout has shipped. Tiger as a whole still leans on a hand-rolled Go workflow controller that predates that work. To show engineering what a real workflow engine could do for the company, I stood up Temporal and Hatchet side-by-side inside sc as a head-to-head evaluation: same submit / list / show / watch / approve / tail surface for both, real workloads driving the comparison. Once a winner is chosen, sc, the internal testing framework, and customer-facing workflows all migrate onto it.
- 62 gRPC services with 603 RPC methods. The control plane exposes a protobuf-defined surface that’s both internally rich (every fleet operation has a typed RPC) and externally accessible to agents via the MCP layer.
- Auto gRPC tunneling through admin clusters. sc transparently routes gRPC calls through the right admin cluster for the target service: engineers and agents don’t need to know which control plane sits in front of which workload cluster.
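The rate-limited, bounded fan-out described in the list above can be sketched with the standard library alone. This is a simplified illustration, not sc's concurrency layer: the N/duration grammar is inferred from the --rate 10/1m example, parseRate and fanOut are hypothetical names, and per-target backpressure is left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"sync"
	"time"
)

// parseRate turns a spec such as "10/1m" into the minimum spacing between
// operations (6s for that example). The grammar is an assumption.
func parseRate(spec string) (time.Duration, error) {
	count, window, ok := strings.Cut(spec, "/")
	if !ok {
		return 0, fmt.Errorf("rate %q: want N/duration", spec)
	}
	n, err := strconv.Atoi(count)
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("rate %q: bad count", spec)
	}
	d, err := time.ParseDuration(window)
	if err != nil || d <= 0 {
		return 0, fmt.Errorf("rate %q: bad window", spec)
	}
	return d / time.Duration(n), nil
}

// fanOut calls fn once per target. It starts at most one call per interval
// (which must be positive) and keeps at most maxInFlight calls running at
// once. It returns the number of calls that failed.
func fanOut(targets []string, interval time.Duration, maxInFlight int, fn func(string) error) int {
	if maxInFlight < 1 {
		maxInFlight = 1
	}
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		failures int
	)
	sem := make(chan struct{}, maxInFlight)
	tick := time.NewTicker(interval)
	defer tick.Stop()

	for i, t := range targets {
		if i > 0 {
			<-tick.C // pacing: wait for the next slot before starting another call
		}
		sem <- struct{}{} // bounded fan-out: blocks while maxInFlight calls are running
		wg.Add(1)
		go func(target string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := fn(target); err != nil {
				mu.Lock()
				failures++
				mu.Unlock()
			}
		}(t)
	}
	wg.Wait()
	return failures
}

func main() {
	interval, err := parseRate("10/1m")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("spacing between operations:", interval)

	// Use a much shorter spacing so the demo finishes quickly.
	demo, _ := parseRate("100/1s")
	failed := fanOut([]string{"svc-a", "svc-b", "svc-c"}, demo, 2, func(target string) error {
		fmt.Println("visiting", target)
		return nil
	})
	fmt.Println("failures:", failed)
}
```

The ticker caps how fast new work starts, while the semaphore caps how much work is in flight; a slow target therefore holds a slot without letting the campaign pile up behind it.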
Capabilities exposed
A non-exhaustive sample of what an engineer (or agent) can drive through sc:
- Service lifecycle: create, clone, resize, switchover, replace, upgrade, pause/resume, delete, force-restart of Postgres services across the fleet.
- Storage operations: volume resize, replacement, snapshot, restore, PVC management, storage-class migration.
- Investigation tooling: service logs, ES log shipping, core-dump collection and analysis, database query tracing, Postgres parameter inspection, etcd maintenance.
- Automated diagnostics: a multi-step diagnose engine that walks through a checklist of service-health probes and produces a structured result for humans, agents, or support workflows.
- Workflow campaigns: fleet-wide rollouts (e.g. Postgres-version upgrades, hotfix installs) submitted through sc’s workflow system (and the Temporal/Hatchet PoCs) with per-stage approval gates and streaming progress.
- Declarative dev environments (sc vm). Per-project .vm/config.yaml declares AMI-baked, role-based EC2 topologies (single, cluster, storage-cluster, storage-cluster-nvme) that spin up real cloud-deployed dev environments in minutes: k3s + the full Tiger operator stack, with per-role healthchecks and diagnostics declared in config. A 30+ subcommand surface covers lifecycle, code sync, background-job submission and watching, SSH and EC2 serial-console access, AMI builds, and structured diagnostic capture, so engineers replace lima/local-VM workflows with reproducible cloud environments that match production architecture.
- Built-in testing & benchmarking framework (sc test tsdb). Provisions real TimescaleDB instances on demand, runs tsbench benchmark profiles (per-query timing across the full aggregate-function × type matrix), writes per-test results to structured TSVs, and annotates the test pod itself with the run outcome via the Kubernetes API. Streaming --watch output, runs end-to-end in a few minutes, integrates with the same workflow + rate limiting substrate as the rest of sc.
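The multi-step diagnose engine listed above (a checklist of probes that yields a structured result) might be shaped roughly like this. The Probe, Finding, and Report types, the probe names, and the JSON field names are all hypothetical; sc's real engine and output format are not shown in this document.

```go
package main

import (
	"encoding/json"
	"errors"
	"os"
)

// Probe is one read-only checklist step run against a service.
type Probe struct {
	Name string
	Run  func(service string) error
}

// Finding is the structured outcome of a single probe.
type Finding struct {
	Probe  string `json:"probe"`
	OK     bool   `json:"ok"`
	Detail string `json:"detail,omitempty"`
}

// Report is the structured result for humans, agents, or support workflows.
type Report struct {
	Service  string    `json:"service"`
	Healthy  bool      `json:"healthy"`
	Findings []Finding `json:"findings"`
}

// Diagnose runs every probe in order. A failing probe does not stop the
// walk, so the report always covers the whole checklist.
func Diagnose(service string, probes []Probe) Report {
	r := Report{Service: service, Healthy: true}
	for _, p := range probes {
		f := Finding{Probe: p.Name, OK: true}
		if err := p.Run(service); err != nil {
			f.OK, f.Detail = false, err.Error()
			r.Healthy = false
		}
		r.Findings = append(r.Findings, f)
	}
	return r
}

// probeOK and probeLag are stand-in probes used only for this example.
func probeOK(string) error  { return nil }
func probeLag(string) error { return errors.New("replication lag above threshold") }

func main() {
	probes := []Probe{
		{Name: "pod-ready", Run: probeOK},
		{Name: "replication-lag", Run: probeLag},
	}
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	_ = enc.Encode(Diagnose("example-service", probes))
}
```

Emitting one typed report instead of free-form log lines is what lets the same diagnosis feed a terminal, an agent, or a support workflow without separate parsers.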
Tech & scale
Go, gRPC, Protocol Buffers, Kubernetes client-go, Temporal, Hatchet, AWS, Azure,
etcd, mTLS, MCP, Slack, Claude Skills. Sole author from inception. 1,544
commits on main (83% of all commits), ~38× the second contributor, over ~3
years.
Links
Timescale/TigerData repository, described here; source not publicly linkable.