The coordinator: Raft, BoltDB, and assignment

The coordinator is BigFleet’s Tier 1 — the global owner of which shard owns what, and nothing else. It is a small, low-write-rate replicated state machine: shard membership, the cluster→shard map, the topology-domain→shard map, quota slices, and the provider registry. At 100M nodes that whole table is a few megabytes and steady-state writes sit around 10/sec (ADR-0002). It makes no provisioning decisions — that is the shard’s hot path. Its defining property is that it is off that hot path: pkg/shard must not import pkg/coordinator, so the data plane keeps provisioning from its existing allocations while the coordinator is down or failing over. This doc is the code-grounded tour of how the FSM, the replicated State, BoltDB, quorum formation, and offline restore actually work; it complements docs/architecture.md’s shape diagram and the shard-side view in shard-hot-path.md.

One Raft group, single region

ADR-0002 fixes the topology: a single hashicorp/raft group of three replicas in three AZs of one operator-chosen region. Not multi-region (cross-region quorum pays inter-region RTTs on every write, and burst events — mass spot reclamation, AZ failure — are exactly when the latency floor bites), not hierarchical (a new tier, a new divergence failure mode, no requirement from the papers). The cost is a documented SPOF: a regional outage in the coordinator’s region halts cross-shard rebalancing and preemption fleet-wide for the duration. The data plane does not notice. Static stability is the load-bearing safety net that makes a single-region coordinator defensible — and is why the pkg/shard → pkg/coordinator non-import is a hard rule, not a guideline.

Coordinator.start() (pkg/coordinator/coordinator.go:179) wires the standard hashicorp/raft stack: a TCP transport bound to RaftBindAddress and advertising AdvertiseAddress (coordinator.go:200), two BoltDB stores — raft-log.db for the log and raft-stable.db for term/vote/config — via raft-boltdb/v2 (coordinator.go:206, :212), a FileSnapshotStore (coordinator.go:218), and raft.NewRaft over the FSM. SnapshotInterval is 30s and SnapshotThreshold is 1024 log entries (coordinator.go:186). Raft’s own logging still goes to stderr via hclog; slog isn’t bridged yet (coordinator.go:188).

The FSM: what gets replicated, and what deliberately doesn’t

FSM (pkg/coordinator/fsm.go:75) is the raft.FSM over *State. Mutations are Commands — JSON-encoded, because the volumes are tiny and protobuf would be premature (fsm.go:31). Seven command kinds (fsm.go:18): AddShard, RemoveShard, BindCluster, AssignDomain, UnassignDomain, SetQuota, UpsertProvider. Apply decodes and dispatches through applyCommand (fsm.go:113); each case calls the matching State method and returns its error, which Raft surfaces back to the proposer as the Apply future’s Response().

The write path is Coordinator.Apply (coordinator.go:279): encode the command, submit through raft.Apply with a 5s default timeout (clamped to the caller’s context deadline), then inspect both the future’s transport error and its FSM-returned Response() — an FSM error like ErrShardExists comes back as a non-nil error response and is returned to the caller. Apply is leader-only by construction; on a follower raft.Apply returns raft.ErrNotLeader. Three outcome labels feed CoordinatorApplyTotal (coordinator.go:295, :300, :304): error, fsm_error, success.

What is not replicated is as important as what is. Two categories of leader-local state live outside the FSM:

Fencing counters. FSM.seq (fsm.go:82) plus Raft’s term form the (term, sequence) fencing token stamped on every outgoing CoordinatorInstruction. NextSequence (fsm.go:96) is a plain mutex-guarded counter, not Raft-replicated: on failover the new leader starts a fresh sequence under whatever term Raft assigns it, and shards reject any instruction whose term is below their high-water mark (coordinator.proto:254). Monotonicity within a term is all that’s needed; the term itself supplies cross-failover ordering.
Soft state. Shard summaries and shortfalls (below) are leader-only and re-derived from fresh reports after a failover. The package doc (state.go:14) is explicit: shard reports/shortfalls are soft state held only on the leader, recomputed from fresh reports after a failover. Putting them through Raft would be replicating data that’s stale within one report cycle.

Snapshot and restore (in-process)

FSM.Snapshot (fsm.go:177) takes the State RLock and serialises the entire table to one JSON blob — shards, cluster→shard, domain→shard (as []domainShardPair), quotas (as []quotaPair with copied per-shard maps), providers. fsmSnapshot.Persist (fsm.go:249) writes that blob to the sink in one shot. Restore (fsm.go:214) decodes into a fresh State and replays each entry through the same State mutators (AddShard, BindCluster, …) so the snapshot path can’t construct an invalid table that the live path would reject; errors from those mutators are deliberately swallowed (_ =) because a well-formed snapshot can’t trip them. The whole-state dump is cheap precisely because of ADR-0002’s “a few megabytes” sizing.

The replicated state: per-domain assignment, first-contact binding

State (pkg/coordinator/state.go:60) is an RWMutex-guarded set of maps; every read returns a deep copy or a sorted snapshot so callers can’t alias internal maps. The five tables:

Shards (shards map[ShardID]ShardEntry). AddShard (state.go:128) rejects duplicate IDs with ErrShardExists; RemoveShard (state.go:145) cascades — it deletes the entry and every cluster and domain pointing at the gone shard. MarkHeartbeat (state.go:169) bumps LastHeartbeat and is a deliberate no-op for unknown shards.

Cluster → shard (clusterToShard). BindCluster (state.go:203) is the first-contact-wins contract: binding to the same shard is idempotent; binding an already-bound cluster to a different shard returns ErrClusterAlreadyBound. Per ADR-0007 the binding mechanism is operator-chosen at deploy time — the cluster owner sets the operator’s --shard-addr, the shard records the cluster on its first Shard.Session, and reconnects always dial the same shard ordinal. There is no registration RPC and no coordinator-driven routing; routing the operator’s dial through the coordinator would put a coordinator dependency on every reconnect and break static stability. The coordinator’s clusterToShard is the system-wide-visible copy, populated by an admin path when wanted, not the binding’s source of truth.

Domain → shard (domainToShard map[DomainKey]ShardID). This is the §0.1 decision made concrete: assignment is per topology domain, not per machine — ~100K entries at 100M-node scale, not 100M (state.go:71, plan §10.1). A DomainKey is a (label-key, label-value) pair (state.go:44), e.g. topology.kubernetes.io/rack=r17. AssignDomain (state.go:232) is idempotent to the same shard and returns ErrDomainAlreadyAssigned for a conflicting owner; DomainsForShard (state.go:263) returns a shard’s domains sorted for determinism. A Same-rack request that can’t be placed within a shard becomes a shortfall — topology constraints never cross shard boundaries, and there is no cross-shard resolution of them.

Quotas (quotas map[QuotaKey]QuotaAllocations). A QuotaKey is (provider, region) (state.go:53); the value is a per-shard slice count of speculative machines (state.go:94). SetQuota (state.go:282) overwrites the whole slice for a key. There is no quota/admission/entitlement API and no write RPC in v1 (ADR-0008): quota is bootstrap data, changed by restarting with new bootstrap state. ListQuotas is read-only, added in M24 so on-call can answer “Phase 1 has provider capacity but isn’t being asked.”

Providers (providers map[string]ProviderEntry). The registry of configured backends — name, dial address, region (state.go:101). Providers themselves are out-of-tree; this table just names them.

The gRPC surface: ReportShard, the admin RPCs, and identity

GRPCServer (pkg/coordinator/grpc_server.go:70) wraps the Coordinator. Its leader-local fields — latestSummary, latestShortfalls, and the pending instruction queue — live here, not on the FSM, “so the FSM stays focused on Raft-replicated facts” (grpc_server.go:75).

ReportShard (grpc_server.go:126) is the one steady-state RPC and the data plane’s only coordinator touchpoint. It is leader-only — followers reject with FailedPrecondition so the shard redirects (grpc_server.go:127). It does four things: (1) self-registers an unknown shard by Raft-Applying AddShard with the dial address the shard carries on every report (grpc_server.go:146), swallowing ErrShardExists on the already-registered path; (2) heartbeats via MarkHeartbeat; (3) stores the report’s ShardSummary and top-N Shortfalls into leader-local soft state (grpc_server.go:167); (4) clears acked instructions and returns the still-pending queue plus the coordinator’s current term (grpc_server.go:199). The ShardReport.cycle is a monotonic counter the coordinator uses to discard out-of-order delivery (coordinator.proto:152). Crucially, instructions are pulled: the shard initiates ReportShard (~30s), the coordinator piggybacks CoordinatorInstructions on the ReportAck, and the shard piggybacks InstructAcks on its next report (coordinator.proto:17). One unary RPC, shard-initiated — there is no coordinator→shard push channel, which is what keeps the shard’s coordinator dependency to a poll it controls.

Shortfall carries a resource-vector deficit, not a machine count (ADR-0027, coordinator.proto:215), and a bucketed interruption_penalty so the coordinator can price cross-shard preemption — distinct from reclamation_penalty, which is the shard-side machine-value signal and never appears here. Instructions (coordinator.proto:254) are the cross-shard mechanism: AssignDomain/UnassignDomain for split/merge, ReassignSpeculative (pure quota bookkeeping), CrossShardDrain (donor scores victims, emits reclaims to its operators), TransferOwnership (hand a now-idle machine to another shard). snapshotPending sorts the queue by sequence_number for deterministic wire order (grpc_server.go:264); the same instruction_id may be re-sent across acks until the shard acks it (coordinator.proto:243).

The admin surface — AssignDomain, UnassignDomain, RemoveShard, ListShards, ListDomainAssignments (M15), ListQuotas (M24), JoinRaftCluster, SnapshotSave (M75) — is uniformly leader-only (ADR-0008). Reads go through the leader’s State RLock rather than being served from followers: it costs the leader a little, but it avoids an undocumented stale-read contract and the client-side leader-cache logic that would otherwise leak into every caller. bigfleetctl is the single client.

Identity (grpc_server.go:20). ADR-0008’s original posture was “trust the cluster-internal network, unauthenticated, sidecar for external”; ADR-0048 supersedes the transport posture with opt-in mTLS while leaving the leader-only contract intact. Under mTLS the verified client cert’s bigfleet:// URI SAN gates every RPC: requireShardIdentity (grpc_server.go:37) binds the caller-asserted shard_id to bigfleet://shard/<id> (the same binding Shard.Session applies to a cluster’s Hello), and requireAdminIdentity (grpc_server.go:53) requires bigfleet://admin on the whole admin surface. Coordinator replicas carry the admin SAN themselves because they call JoinRaftCluster on each other — they are the admin domain (grpc_server.go:28). On plaintext transports both checks short-circuit to allow (the zero-config default): PeerIdentity reports mtls=false, and the check returns nil (grpc_server.go:39).

Quorum formation: ordinal-keyed bootstrap, reconciling join

ADR-0047 fixed a real bug found in the production-readiness audit: the documented 3-replica install was bootstrapping three independent single-node clusters — every replica got --bootstrap, AddVoter had zero callers, and whichever replica a shard’s heartbeat happened to reach owned a private copy of fleet state. The fix is the conventional hashicorp/raft StatefulSet pattern, with the bootstrap/join split moved into the binary because the distroless image has no shell to vary args per pod: the chart passes identical args to every replica (--bootstrap + --join-addr=<headless Service>), and cmd/bigfleet/coordinator.go:68 parses the StatefulSet ordinal out of --id. Ordinal 0 honours --bootstrap (still gated on raft.HasExistingState so it never re-bootstraps over data — coordinator.go:231); every other ordinal sets effectiveBootstrap = false and joins. Without --join-addr, nothing changes — dev/all-in-one/test single-node runs keep plain --bootstrap.

Join is a reconciler, not a one-shot (joinLoop, pkg/coordinator/join.go:41), and that matters because the TCP transport advertises a resolved IP (ResolveTCPAddr on the pod’s DNS name), and in Kubernetes every pod restart changes that IP while Raft’s configuration keeps pointing at the dead one. So the loop runs on every ordinal, every start, until membershipCurrent() (join.go:77) sees both a live leader and this node listed as a Voter at its current advertise address:

Fresh join: dial JoinRaftCluster against --join-addr (the headless Service, so re-resolved DNS spreads attempts across replicas until one is the leader; followers answer FailedPrecondition), backing off 1s→15s (join.go:42).
Restart, same address: membershipCurrent is true within a heartbeat and the loop exits without a join.
Restart, new IP: ask the leader to AddVoter at the new address; AddVoter of an existing voter is idempotent in hashicorp/raft — the config change rewrites the address in place (join.go:33). If this node won the election in the meantime, it fixes its own entry locally since the RPC is leader-only and it is the leader (join.go:53).

JoinRaftCluster (grpc_server.go:421) is the leader-only RPC that drives this — membership changes go through the leader’s log — gated on bigfleet://admin since joining replicas present the coordinator’s own cert.

Snapshot export and offline single-voter restore

Two DR concerns, both ADR-0047 / plan §10.8, both leader-only:

Export (pkg/coordinator/snapshot_export.go). snapshotExportLoop ticks every SnapshotExportInterval (default 5min); only the leader exports, since only it has a complete state to share, and a promoted former follower resumes exports (snapshot_export.go:33). Each export triggers a fresh raft.Snapshot() — treating ErrNothingNewToSnapshot as success, not failure, because it means the stored snapshot already reflects every committed apply (snapshot_export.go:61) — opens the most-recent stored snapshot, and copies meta.json + state into a timestamped dir under SnapshotExportDir, updating a latest symlink atomically via rename (snapshot_export.go:117). The reference impl writes a local path to stay free of cloud SDK deps; operators mount that path from durable object storage (FUSE, sidecar uploader). SnapshotSave (grpc_server.go:443) streams the same artefacts on demand for bigfleetctl snapshot save — meta_json frame first, then 1 MiB state chunks (grpc_server.go:468) — sharing openLatestSnapshot with the export loop. Leader-only on purpose: a follower’s snapshot can lag arbitrarily, and a backup that silently dropped recent writes is worse than a failed RPC.

Restore (pkg/coordinator/snapshot_restore.go). OpenSnapshotArchive (snapshot_restore.go:76) reads either layout — an export directory or a bigfleetctl tar — into the decoded raft.SnapshotMeta plus a state reader; decodeMeta rejects an index-0 meta as not a valid coordinator snapshot (snapshot_restore.go:143). RestoreSnapshot (snapshot_restore.go:175) rebuilds a stopped coordinator’s data dir. The load-bearing trick: it writes the snapshot back with a single-voter configuration — just the restoring node — instead of the original membership, so the restored node elects itself immediately rather than waiting on quorum among peers whose data is gone (snapshot_restore.go:196). This is the single-survivor recovery shape, equivalent to hashicorp’s peers.json / RecoverCluster: the other replicas must start with empty data dirs and re-form quorum through the ordinal-join path above. Two correctness details: existing raft-log.db/raft-stable.db/snapshots are removed first (snapshot_restore.go:182) because a snapshot at index N alongside an unrelated log is undefined behaviour; and CurrentTerm is seeded to the snapshot’s term in the fresh stable store (snapshot_restore.go:222) — with term 0 the restored node would campaign at term 1, below the snapshot’s term, and append post-restore entries with regressed terms.

What lives where

Concern	Replicated through Raft?	Where
Shard membership, cluster→shard, domain→shard, quotas, providers	Yes (FSM `Command`s)	`State`, `pkg/coordinator/state.go`
Fencing sequence	No — leader-local, fresh per failover	`FSM.seq`, `fsm.go:82`
Shard summaries, shortfalls, pending instructions	No — soft state, re-derived from reports	`GRPCServer`, `grpc_server.go:77`
Raft log, term, vote	BoltDB on local disk	`raft-log.db`, `raft-stable.db`
FSM snapshots	Local `FileSnapshotStore`; exported off-host	`DataDir/snapshots`, `SnapshotExportDir`