ADR-0047: Coordinator quorum formation by ordinal join; offline snapshot restore as single-voter recovery

Status

Accepted, 2026-06-12. Ops scope — M75 (plan §12). Companion to ADR-0002 (single Raft group) and ADR-0008 (admin surface posture).

Context

The production-readiness audit (docs/production-readiness-2026-06.md, arc 5) verified that the documented 3-replica HA install bootstraps three independent single-node Raft clusters: every replica got --bootstrap, coordinator.AddVoter existed with zero callers, and the chart had no join mechanism. Each replica elected itself leader of its own cluster; whichever one a shard’s heartbeat reached owned a private copy of fleet state. The same audit found a snapshot export loop that nothing could restore from — no tooling, no procedure.

Beyond the conventional patterns, two design choices had to be made: where the bootstrap/join split lives, and what membership a restored snapshot carries.

Decision

Quorum formation

Quorum formation follows the conventional hashicorp/raft StatefulSet pattern: ordinal 0 bootstraps; ordinals above 0 start unbootstrapped and join by asking the leader to add them via AddVoter.

The split lives in the binary, keyed on the ordinal in --id. The image is distroless — there is no shell in the pod to vary args per ordinal — so the chart passes identical args to every replica (--bootstrap plus --join-addr=<headless Service>:<grpc port>) and bigfleet coordinator parses the StatefulSet ordinal from its own --id (the same parsing the shard already does for seed striping). Ordinal 0 honours --bootstrap (still gated on raft.HasExistingState, so it never re-bootstraps over data); every other ordinal ignores it and joins. Without --join-addr nothing changes: dev, all-in-one and test single-node runs keep plain --bootstrap.
Join is one leader-only admin RPC, JoinRaftCluster(node_id, raft_address), on the existing M15 coordinator surface — not a new mechanism, port, or sidecar. ADR-0008’s open-trust posture is unchanged; M74 authenticates this RPC along with the rest of the surface.
The join loop is a membership reconciler, run on every ordinal. It retries with backoff, dialing --join-addr fresh per attempt (fresh dials re-resolve DNS, so pointing the flag at the headless Service spreads attempts across replicas until one is the leader; followers answer FailedPrecondition like every other admin RPC), and exits only when this replica observes a leader and is a voter at its current advertise address. The address half is load-bearing in Kubernetes: the stock raft TCP transport advertises a resolved IP (ResolveTCPAddr on the pod DNS name), and every pod restart changes that IP — the cluster’s configuration keeps dialing the dead one until an AddVoter rewrites it. A restarted replica that won an election anyway (it dials peers outbound) fixes its own entry locally instead of calling the leader-only RPC.
Re-join is idempotent. hashicorp/raft’s AddVoter of an existing voter rewrites the address in place rather than appending a duplicate membership entry (pinned by TestAddVoter_ExistingVoterIsIdempotent), and a replica restarting with an unchanged address hears the leader’s heartbeat before its first attempt fires and exits without joining.
Chart consequences of leadership-gated readiness. Readiness is grpc_health_v1 on the named Coordinator service, SERVING once the replica observes a leader; liveness is the default service, SERVING for the life of the process (quorum loss must never restart-loop the replicas that could re-form it). That forces podManagementPolicy: Parallel (OrderedReady deadlocks a full restart: pod 0 can’t be ready without a quorum that needs pod 1) and publishNotReadyAddresses: true on the headless Service (the leader must resolve a not-yet-ready joiner’s DNS to replicate to it).
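
The bootstrap/join split and the reconciler's exit condition described above can be sketched as pure functions. This is a minimal illustration, not the coordinator's actual code: ordinalFromID, planStartup, joinDone, the voter type, and the example address coordinator-headless:9090 are all hypothetical names, and the sketch assumes the ID follows the StatefulSet "<name>-<ordinal>" convention and that plain --bootstrap without --join-addr is also gated on existing state.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ordinalFromID extracts the StatefulSet ordinal from an ID shaped like
// "coordinator-2". Hypothetical helper; the real parser may differ.
func ordinalFromID(id string) (int, error) {
	i := strings.LastIndex(id, "-")
	if i < 0 || i == len(id)-1 {
		return 0, fmt.Errorf("id %q has no ordinal suffix", id)
	}
	n, err := strconv.Atoi(id[i+1:])
	if err != nil {
		return 0, fmt.Errorf("id %q has a non-numeric ordinal suffix", id)
	}
	return n, nil
}

// startupPlan says what a replica does with the identical args the chart passes.
type startupPlan struct {
	Bootstrap bool // mint a new single-voter cluster
	Join      bool // run the membership reconciler
}

// planStartup applies the split: only ordinal 0 honours --bootstrap, and only
// when no Raft state exists; every ordinal reconciles when --join-addr is set.
func planStartup(id string, bootstrapFlag bool, joinAddr string, hasExistingState bool) (startupPlan, error) {
	if joinAddr == "" {
		// Single-node runs (dev, all-in-one, tests): plain --bootstrap.
		// Gating on existing state here is an assumption of this sketch.
		return startupPlan{Bootstrap: bootstrapFlag && !hasExistingState}, nil
	}
	ord, err := ordinalFromID(id)
	if err != nil {
		return startupPlan{}, err
	}
	return startupPlan{
		Bootstrap: ord == 0 && bootstrapFlag && !hasExistingState,
		Join:      true,
	}, nil
}

// voter is a trimmed stand-in for one entry of the Raft configuration.
type voter struct {
	ID      string
	Address string
}

// joinDone is the reconciler's exit condition: a leader is observed AND this
// replica is a voter at its current advertise address.
func joinDone(leaderObserved bool, voters []voter, selfID, advertiseAddr string) bool {
	if !leaderObserved {
		return false
	}
	for _, v := range voters {
		if v.ID == selfID {
			return v.Address == advertiseAddr
		}
	}
	return false
}

func main() {
	for _, id := range []string{"coordinator-0", "coordinator-1"} {
		plan, err := planStartup(id, true, "coordinator-headless:9090", false)
		fmt.Println(id, plan, err)
	}
}
```

The address comparison in joinDone is what makes a restarted pod with a new IP keep retrying: it is still listed as a voter, but at a stale address, so the loop does not exit until an AddVoter rewrites the entry.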

Offline restore

bigfleetctl snapshot save <file> streams the leader’s freshest Raft snapshot over a new leader-only SnapshotSave RPC into a single-file archive (same meta.json + state payload the export loop writes, so exported directories restore identically). bigfleetctl snapshot restore is offline — it rebuilds a STOPPED coordinator’s data dir; there is deliberately no online restore RPC for v1.

hashicorp/raft restore semantics: at startup, raft installs the newest snapshot in its store — FSM state and the membership configuration recorded in the snapshot’s meta. Restore exploits that by writing the snapshot back with a single-voter configuration (just the restoring node) instead of the original membership, so the restored node elects itself immediately instead of waiting for quorum among peers whose data is gone — the same single-survivor shape as hashicorp’s peers.json / RecoverCluster recovery. The fresh stable store is seeded with CurrentTerm = snapshot term so post-restore entries don’t regress below the snapshot’s term. The other replicas restart with empty data dirs and re-form the quorum through the join path above. Full procedure: operator guide → Disaster recovery.
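
The membership rewrite at the heart of restore can be illustrated without the raft library. The sketch below is an assumption-laden model, not the bigfleetctl implementation: snapshotMeta, server, restorePlan, and planRestore are hypothetical types and names that stand in for the relevant fields of the snapshot's meta and the stable store.

```go
package main

import (
	"errors"
	"fmt"
)

// server is a trimmed stand-in for one entry of a Raft membership configuration.
type server struct {
	ID      string
	Address string
	Voter   bool
}

// snapshotMeta models only the snapshot meta fields that matter for restore.
type snapshotMeta struct {
	Index         uint64
	Term          uint64
	Configuration []server
}

// restorePlan is what the offline restore would write into the stopped
// coordinator's data dir: the rewritten meta and the seeded CurrentTerm.
type restorePlan struct {
	Meta        snapshotMeta
	CurrentTerm uint64
}

// planRestore keeps the snapshot's index and term but replaces the original
// membership with a single voter (the restoring node), so the node can elect
// itself without waiting for peers whose data is gone.
func planRestore(orig snapshotMeta, selfID, selfAddr string) (restorePlan, error) {
	if selfID == "" || selfAddr == "" {
		return restorePlan{}, errors.New("restoring node needs an id and an address")
	}
	meta := orig // struct copy; the Configuration slice is replaced, not mutated
	meta.Configuration = []server{{ID: selfID, Address: selfAddr, Voter: true}}
	return restorePlan{
		Meta: meta,
		// Seed the fresh stable store so post-restore entries do not
		// regress below the snapshot's term.
		CurrentTerm: orig.Term,
	}, nil
}

func main() {
	orig := snapshotMeta{
		Index: 1200,
		Term:  7,
		Configuration: []server{
			{ID: "coordinator-0", Address: "10.0.0.4:7000", Voter: true},
			{ID: "coordinator-1", Address: "10.0.0.5:7000", Voter: true},
			{ID: "coordinator-2", Address: "10.0.0.6:7000", Voter: true},
		},
	}
	plan, err := planRestore(orig, "coordinator-0", "10.0.0.9:7000")
	fmt.Println(plan, err)
}
```

The FSM state payload is untouched in this model; only the membership and the term bookkeeping change, which is why the other replicas can rejoin an otherwise faithful copy of fleet state.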

What a restore loses is bounded by the export interval (default 5m): shard registrations, cluster→shard bindings, and domain assignments committed since the snapshot. All three self-heal — shards re-register on their next heartbeat, and an unbound cluster re-binds on first contact — but a re-bind may land on a different shard than the lost binding, so machines owned by the original shard are reclaimed as unattributed rather than recognised. Static stability bounds the blast radius: shards and clusters keep running through the entire outage and restore; only new cross-shard placement decisions wait.

Consequences

The 3-replica chart now forms a real 3-voter cluster that survives the loss of its leader and keeps committing (pinned by TestCoordinator_ThreeNodeQuorum_JoinAndFailover).
coordinator.bootstrap=true is now safe to leave set for the life of the install; the old “flip to false after first install” ceremony is gone.
Known footgun, accepted: if ordinal 0 alone loses its PVC while 1 and 2 keep theirs, ordinal 0 sees no existing state and mints a fresh single-node cluster — the standard hazard of this pattern. The old cluster (1, 2) retains quorum and keeps working; the recovery is to wipe ordinal 0’s data dir again and let it join via the leader, not to restore. Documented in the DR runbook.
The join RPC widens the unauthenticated admin surface by one membership-changing call until M74 lands. Same posture as RemoveShard, which could already evict state.