ADR-0047: Coordinator quorum formation by ordinal join; offline snapshot restore as single-voter recovery
Status
Accepted, 2026-06-12. Ops scope — M75 (plan §12). Companion to ADR-0002 (single Raft group) and ADR-0008 (admin surface posture).
Context
The production-readiness audit (docs/production-readiness-2026-06.md,
arc 5) verified that the documented 3-replica HA install bootstraps
three independent single-node Raft clusters: every replica got
--bootstrap, coordinator.AddVoter existed with zero callers, and
the chart had no join mechanism. Each replica elected itself leader
of its own cluster; whichever one a shard’s heartbeat reached owned a
private copy of fleet state. The same audit found a snapshot export
loop that nothing could restore from — no tooling, no procedure.
Two design choices needed making beyond the conventional patterns: where the bootstrap/join split lives, and what membership a restored snapshot carries.
Decision
Quorum formation
The conventional hashicorp/raft StatefulSet pattern: ordinal 0 bootstraps; ordinals >0 start unbootstrapped and join by asking the leader to AddVoter them.
- The split lives in the binary, keyed on the ordinal in
--id. The image is distroless — there is no shell in the pod to vary args per ordinal — so the chart passes identical args to every replica (--bootstrapplus--join-addr=<headless Service>:<grpc port>) andbigfleet coordinatorparses the StatefulSet ordinal from its own--id(the same parsing the shard already does for seed striping). Ordinal 0 honours--bootstrap(still gated onraft.HasExistingState, so it never re-bootstraps over data); every other ordinal ignores it and joins. Without--join-addrnothing changes: dev, all-in-one and test single-node runs keep plain--bootstrap. - Join is one leader-only admin RPC,
JoinRaftCluster(node_id, raft_address), on the existing M15 coordinator surface — not a new mechanism, port, or sidecar. ADR-0008’s open-trust posture is unchanged; M74 authenticates this RPC along with the rest of the surface. - The join loop is a membership reconciler, run on every ordinal.
It retries with backoff, dialing
--join-addrfresh per attempt (fresh dials re-resolve DNS, so pointing the flag at the headless Service spreads attempts across replicas until one is the leader; followers answerFailedPreconditionlike every other admin RPC), and exits only when this replica observes a leader and is a voter at its current advertise address. The address half is load-bearing in Kubernetes: the stock raft TCP transport advertises a resolved IP (ResolveTCPAddron the pod DNS name), and every pod restart changes that IP — the cluster’s configuration keeps dialing the dead one until anAddVoterrewrites it. A restarted replica that won an election anyway (it dials peers outbound) fixes its own entry locally instead of calling the leader-only RPC. - Re-join is idempotent. hashicorp/raft’s
AddVoterof an existing voter rewrites the address in place rather than appending a duplicate membership entry (pinned byTestAddVoter_ExistingVoterIsIdempotent), and a replica restarting with an unchanged address hears the leader’s heartbeat before its first attempt fires and exits without joining. - Chart consequences of leadership-gated readiness. Readiness is
grpc_health_v1on the named Coordinator service, SERVING once the replica observes a leader; liveness is the default service, SERVING for the life of the process (quorum loss must never restart-loop the replicas that could re-form it). That forcespodManagementPolicy: Parallel(OrderedReady deadlocks a full restart: pod 0 can’t be ready without a quorum that needs pod 1) andpublishNotReadyAddresses: trueon the headless Service (the leader must resolve a not-yet-ready joiner’s DNS to replicate to it).
Offline restore
bigfleetctl snapshot save <file> streams the leader’s freshest Raft
snapshot over a new leader-only SnapshotSave RPC into a single-file
archive (same meta.json + state payload the export loop writes, so
exported directories restore identically). bigfleetctl snapshot restore is offline — it rebuilds a STOPPED coordinator’s data
dir; there is deliberately no online restore RPC for v1.
hashicorp/raft restore semantics: at startup, raft installs the
newest snapshot in its store — FSM state and the membership
configuration recorded in the snapshot’s meta. Restore exploits that
by writing the snapshot back with a single-voter configuration
(just the restoring node) instead of the original membership, so the
restored node elects itself immediately instead of waiting for quorum
among peers whose data is gone — the same single-survivor shape as
hashicorp’s peers.json / RecoverCluster recovery. The fresh
stable store is seeded with CurrentTerm = snapshot term so
post-restore entries don’t regress below the snapshot’s term. The
other replicas restart with empty data dirs and re-form the
quorum through the join path above. Full procedure: operator guide →
Disaster recovery.
What a restore loses is bounded by the export interval (default 5m): shard registrations, cluster→shard bindings, and domain assignments committed since the snapshot. All three self-heal — shards re-register on their next heartbeat, and an unbound cluster re-binds on first contact — but a re-bind may land on a different shard than the lost binding, so machines owned by the original shard are reclaimed as unattributed rather than recognised. Static stability bounds the blast radius: shards and clusters keep running through the entire outage and restore; only new cross-shard placement decisions wait.
Consequences
- The 3-replica chart actually forms a 3-voter cluster, kills a
leader, and keeps committing (pinned by
TestCoordinator_ThreeNodeQuorum_JoinAndFailover). coordinator.bootstrap=trueis now safe to leave set for the life of the install; the old “flip to false after first install” ceremony is gone.- Known footgun, accepted: if ordinal 0 alone loses its PVC while 1 and 2 keep theirs, ordinal 0 sees no existing state and mints a fresh single-node cluster — the standard hazard of this pattern. The old cluster (1, 2) retains quorum and keeps working; the recovery is to wipe ordinal 0’s data dir again and let it join via the leader, not to restore. Documented in the DR runbook.
- The join RPC widens the unauthenticated admin surface by one
membership-changing call until M74 lands. Same posture as
RemoveShard, which could already evict state.