ADR-0055: coordinator-driven cross-shard rebalancing (§9 idle→quota→preempt)
Status
Proposed, 2026-06-19 — author decision (2026-06-19) to BUILD cross-shard rebalancing rather than remove the stub handlers. Realises bigfleet.md §9 (“Coordinator response: (1) reassign idle from other shards, (2) reassign speculative quota, (3) cross-shard preemption — drain-first, expensive, rare”). Builds on ADR-0002 (single-region single-Raft, the cross-shard rebalancing SPOF), and is the cross-shard counterpart to the in-shard anti-oscillation work of ADR-0045 (Phase-3 = pure diff over Phase-1’s claimed set) and ADR-0052 (shard counts its own in-flight commitment). Reuses the M20/M69 drain path. No new instruction types; one soft-state field and one author-gated wire selector field.
Author decisions (resolved 2026-06-19)
The author approved this design as Proposed — to revisit before staged implementation begins — and resolved the two architectural forks:
- Machine-id sourcing: donor-resolves. The coordinator addresses the
three instructions by a
count + (instance_type, zone)selector; the donor shard picks the concrete idle ids (honouring itsreclamation_penaltyidle tiebreak) and echoes them. Selector fields are added toReassignSpeculative/CrossShardDrain/TransferOwnership; the §9 ~240-byte/shard summary budget and “shards own capacity” hold. - Ownership partition: shard-local persisted owned-set (NOT the draft’s
provider-side ownership tracking). Each shard persists the set of machine
ids it owns; its
reconcileFullconsults that set so a transferred machine survives the next reconcile without any change to the provider contract. Trade-off: a shard restart rebuilds the owned-set fromprovider.Listand re-establishes the partition as the coordinator re-asserts assignments — acceptable, and it keeps the six-RPC provider surface untouched (this realises the “ownership-partition mechanism” stage, superseding the draft’s option (a)).
Remaining tunables (cooldown ≥ 8 cycles, age > 5 cheap / > 12 preempt, ~5-cycle cadence, soft leader-local cooldown, donor drain grace via the existing Phase-2 grace function) stand at the recommended defaults pending implementation. Status stays Proposed until the author returns to greenlight the staged build.
Context
The wire and delivery for cross-shard rebalancing already exist and are
sound. CoordinatorInstruction (coordinator.proto:251-312) carries the
(coordinator_term, sequence_number, instruction_id) fence and a oneof
with reassign_speculative(6) / cross_shard_drain(7) /
transfer_ownership(8). Delivery is pull-based:
GRPCServer.EnqueueInstruction(shardID, instr) →
g.pending[shard][instruction_id] → drained into ReportAck.instructions
on the shard’s next ReportShard, sorted by sequence_number; the shard
acks via ShardReport.instruction_acks; the coordinator resends until
acked; the shard dedups by instruction_id and term-fences against its
high-water mark. This path needs no changes.
Two halves are missing:
- Coordinator emission does not exist.
EnqueueInstructionhas zero non-test callers. There is no leader-side loop that constructs these instructions. The only periodic leader loops aresnapshotExportLoopandjoinLoop(coordinator.go:164,167), both started fromRunand self-gated onraft.Leader. A rebalancer is the natural third loop. - Shard execution is stubbed by design.
shardAdapter’sOnReassignSpeculative/OnCrossShardDrain/OnTransferOwnership(coordclient/adapter.go:72-74) are literalreturn nilno-ops; the dispatch, dedup, term-fence and ack plumbing around them (coordclient.go) is complete.
The coordinator already holds every input except two gaps:
- Per-shard
ShardSummarySoft(TotalMachines, FreeMachines, InstanceTypeCounts, ZoneCounts) and[]ShortfallSoft(Priority, Deficit vector, AgeCycles, InterruptionPenaltyBucket) are collected on everyReportShardand exposed viaLatestSummary/LatestShortfalls. domain→shard, cluster→shard and quota allocations are in Raft state. - Gap 1 (soft-state):
ShortfallSoftDROPS the requirements list (grpc_server.go:181-186 copies only four fields). The shard sends requirements incl. theSameoperator (adapter.go:38-45), but the coordinator cannot today tell a topology-constrained shortfall from a plain one — which the first hard-rule gate requires. - Gap 2 (wire-vs-state): all three instructions are addressed by
repeated machine_ids, butShardSummaryis COUNTS only. The coordinator can decide “shard B has 40 free type-X/zone-Z and shard A has a priority-700 deficit for type-X/zone-Z” but cannot name the ids. Inflating the summary with per-machine ids breaks the §9 ~240-byte/shard budget.
The known danger is oscillation. ADR-0045 makes Phase 3 a memoryless diff (excess = bound − demand at machine granularity); a machine transferred into shard A arrives Idle/unbound and, if A has no matching open deficit, is surplus that A’s Phase 3 reclaims next cycle — the cross-shard analog of the M77a ping-pong. The donor, having lost the machine, may re-open its deficit and re-request. ADR-0052 adds a one-cycle window where a transferred-in Idle machine is uncredited until the receiver’s Phase 1 claims it. The coordinator has no hysteresis state today; it must be built in as a first-class requirement.
Decision
Build a leader-only periodic rebalanceLoop in pkg/coordinator that
emits the three existing instructions under a strict §9 cost-ordered
escalation, and make the three shard stub handlers real. Six elements:
1. Loop placement and trigger
A third goroutine started from Coordinator.Run, ticker-driven on
--rebalance-interval (default a small multiple of the shard report
cycle, ~5×), self-gated if c.raft.State() != raft.Leader { continue }.
Periodic sweep, not per-report: donor selection needs a global view
across all shards’ soft state, which one shard’s report cannot give, and
a periodic pass damps oscillation. The flag MUST be wired into
deploy/helm/ in the same change (lessons_learned:586 — a phantom
--rebalance-interval once crashlooped the published chart).
2. Eligibility gate (hard rules first)
Per pass, build the eligible-shortfall set across all shards. A shortfall is cross-shard-eligible iff:
AgeCycles > 5(the §9 escalation gate — primary trigger and primary hysteresis), AND- it is NOT topology-constrained. Soft-state fix: add a precomputed
TopologyConstrained boolto the wireShortfall(shard sets it true iff any requirement usesOperatorSame) and copy it intoShortfallSoft. We ship the bool, not the full requirements list (YAGNI: the coordinator would only re-scan it forSame). Skip any shortfall flaggedTopologyConstrained—Same-rack/Same-zone deficits remain within-shard shortfalls forever.In(zone)/In(type)still qualify (they pin attributes, not co-location).
Sort eligible shortfalls by (Priority desc, AgeCycles desc).
3. Cost-ordered escalation (strictly ascending; stop at first tier that covers)
- TIER 1 — reassign idle (cheapest, non-destructive). At age>5. Find
a donor whose Idle subset (FreeMachines, per-type/per-zone counts)
matches the deficit with a retained surplus margin. Emit
TransferOwnershipof Idle machines donor→requester. No cluster moves, no Configure interrupted. - TIER 2 — reassign speculative quota. At age>5 when Tier 1 finds no
idle donor. Find a donor with spare speculative quota for the matching
QuotaKey{Provider,Region}. EmitReassignSpeculativeand commitCmdSetQuotathrough Raft so the moved slice survives failover (quota is authoritative Raft state; the soft instruction is the shard-side materialization, the Raft commit is the durable ledger). - TIER 3 — cross-shard preempt (expensive, rare). ONLY at a HIGHER
sustained-age threshold (~age>12) AND after Tiers 1–2 are exhausted
fleet-wide for that shortfall AND with a strict cost gate: preemptor
priority > donor victim priority AND victim
InterruptionPenaltyBucket< requester’s (interrupt cheap work for expensive demand, never the reverse). EmitCrossShardDrain{selector, preemptor_priority}to the donor; the donor reusesexecuteDrain(execute.go:413) via adecision.Action{Kind: ActionKindPreempt}, scoring its own victims with the existing Phase-2 victim score and deriving grace from preemptor_priority. When the donor’s next summary shows the machine Idle, a subsequent pass emits the Tier-1TransferOwnership. Drain-first is mandatory; CrossShardDrain never moves ownership itself.
4. Machine-id sourcing — donor-resolves (author wire call)
The coordinator addresses a count + (instance-type, zone) selector to the
donor; the donor picks concrete Idle machine_ids honoring its own
reclamation_penalty idle-tiebreak (give up lowest-penalty Idle first) and
echoes them in its InstructAck / next report; the coordinator then issues
TransferOwnership naming those ids to both donor (from_shard_id) and
recipient (to_shard_id), each self-selecting its role by comparing
s.ID(). This keeps the §9 summary budget intact and preserves “shards
own capacity”. It requires adding count+type+zone selector fields to the
instruction messages (or a two-phase offer/resolve handshake) — author
must rule (§Author decisions).
5. Shard handlers (pkg/shard, no coordinator import)
OnCrossShardDrain→ per machine_id, route throughexecuteDrainas aPreemptaction; mark the resulting Idle machine with a Go-onlyreservedForTransferflag (TTL’d) so the donor’s own Phase 3 cannot reclaim it before TransferOwnership lands.OnTransferOwnership→ role-self-select; verifyState==Idle && Cluster==""first (else OUTCOME_FAILED — never force-drain inside a transfer); donorinv.Remove(id)+ suppress re-adoption, recipientinv.Insert(idle, unbound).OnReassignSpeculative→ pure inv bookkeeping (Remove/Insert Speculative rows) mirroring the seed path.
All handlers idempotent against redelivery (the executeBootstrap/executeDelete pattern); any error becomes OUTCOME_FAILED and the coordinator resends.
6. Anti-oscillation (four layered, leader-local soft defenses)
- Age gate (free hysteresis): age>5 (cheap tiers) / age>12 (preempt).
- Post-move cooldown: per-(machine | (type,zone)-flow, shard-pair) timestamp set on every move; that unit is ineligible to move again (either direction) for N≥8 cycles. Kills B→A→B and quota ping-pong.
- Min-benefit + donor-surplus margin: a move fires only if the donor retains margin after donating AND the move closes a meaningful fraction of the deficit. No oscillation around break-even.
- Demand-pull invariant: instructions only ever go to a shard that
reported a matching eligible shortfall, sized ≤ its deficit. Idle is
never PUSHED. The arriving machine lands against real demand so the
receiver’s Phase 1 (before Phase 3) claims it the same cycle — it is
never surplus. The donor-side
reservedForTransferflag closes the symmetric drain-then-reclaim hazard.
Cooldown/age state is soft (lost on failover); acceptable because a new leader re-derives shortfalls from fresh reports and the age gate re-arms. The cooldown duration must exceed the report cadence so within-leader thrash is impossible.
Fencing & failover
Each instruction is stamped instr.CoordinatorTerm = c.RaftTerm() +
instr.SequenceNumber = FSM.NextSequence() + a stable instruction_id
before EnqueueInstruction. The soft pending queue is leader-local and
lost on failover; the new leader re-derives from fresh reports under a
new term, and stale-leader instructions are rejected by the shard’s term
high-water mark. Quota moves (Tier 2) survive via CmdSetQuota. The one
sharp edge — leader dies between CrossShardDrain and its follow-up
TransferOwnership — is bounded by the donor’s reservedForTransfer TTL:
the marker self-clears and the machine returns to the donor’s normal Idle
pool. Worst case is a wasted drain, never an orphan or double-adopt
(provider fencing fences the loser at Configure).
Consequences
- §9 realised with no new instruction types and no new delivery
machinery — the new code is the leader-side loop, three handler bodies,
the
TopologyConstrainedpropagation, and the selector wire field. - Preemption stays “expensive, rare” by construction via the graduated age gate + strict priority/penalty-bucket cost gate; cheap non-destructive tiers carry the common case.
- Static stability preserved by construction — emission in coordinator, mechanical execution in shard, no new import; TestStaticStability_ShardDoesNotImportCoordinator stays green; a coordinator outage pauses rebalancing while the data plane runs.
- Anti-oscillation is first-class, attacking each enumerated thrash mode (B→A→B, quota ping-pong, drain-then-reclaim, escalation churn) and the ADR-0045/ADR-0052 cross-shard analog independently.
- Cheap at 200 shards: soft state unchanged (~240 B/shard summary + ≤100 shortfalls/shard); per-pass O(shards × shortfalls × donor-scan) ≈ low-millions of integer/map ops, sub-millisecond, every ~5 cycles, leader-only; no per-pass Raft except rare Tier-2 quota commits.
- New tunable surface (rebalance-interval, age>5/age>12, cooldown N, donor margin, min-benefit) must be helm-wired and validated against demand realism (ADR-0043) on the uber-* ladder before the thresholds are trusted.
- Two preconditions gate end-to-end function (not the loop): the
donor-resolves wire field, and the ownership-partition source of truth
(provider.List today returns all machines with no shard scoping and
reconcileFull re-adopts everything, so
inv.Removeon the donor is undone next reconcile). Until ownership-partition is chosen, the handlers are correct-but-inert and the loop + gates are testable in isolation.
Alternatives considered
- greedy-shortfall (per-shortfall, locally-greedy). Serves the oldest-highest-priority shortfall first through the same idle→quota→ preempt escalation, but with a weaker (non-global) donor view and a single age gate. Rejected: no global donor selection (can strand a donor that would better serve a younger shortfall) and it does not make preemption rare as strongly as a graduated gate.
- periodic-global-pass (single age gate, all tiers). The closest alternative and mechanically identical; rejected only because one age gate for all three tiers does not enforce “preemption expensive, rare” — the graduated age>12 gate + strict cost gate is the decisive difference the paper’s language demands. Its global-best-donor framing is adopted into the chosen design.
- Per-report inline emission inside ReportShard. Rejected: a single shard’s report gives no global donor view and races other shards’ soft state, amplifying oscillation.
- Coordinator names machine_ids directly (shard reports per-machine Idle ids). Rejected: inflates the summary past the §9 ~240-byte/shard budget and breaks “shards own capacity”; donor-resolves is the chosen path.
- Durable Raft “transfer intent” for in-flight cross-shard moves.
Rejected for v1: couples the loop to Apply latency for a window the
reservedForTransferTTL + re-derivation already bound to a wasted drain; revisit if wasted drains prove material at scale. - Remove the stubs (don’t build §9 rebalancing). Rejected by author 2026-06-19.