The shard controller hot path
This is the code-level deep-dive on pkg/shard — Tier 2, the part of BigFleet that
actually decides and actuates. Read docs/architecture.md first for
the two-tier shape and the Phase 1/2/3 summary, and docs/concepts.md for
the vocabulary (Need, Profile, Shortfall, effective cost). This page goes underneath both:
how one cycle runs end to end, how actions execute off the cycle barrier, how the
operator-facing session multiplexes everything, how the shard reports to the coordinator
without depending on it, and exactly which invariants the code enforces and why. The
decision engine itself (pkg/decision, the Phase 1/2/3 algorithms) is covered in
decision-engine.md and is in scope here only as the cycle calls it;
this is about the controller that drives the engine. The papers are the final authority
(docs/papers/bigfleet.md §6–§11); where the code diverges — notably ADR-0003, now
superseded — this page flags it rather than papering over it.
1. The central invariant: pkg/shard must not import pkg/coordinator
Static stability is the load-bearing safety property of the whole system: clusters keep running with BigFleet entirely down, and shards run autonomously while the coordinator is unreachable or mid-failover (paper §6). The architectural rule that makes that defensible is a single import constraint — the shard’s hot path has no compile-time path to the coordinator. There is then nothing to block on when the coordinator is down, because there is no call from the cycle into coordinator code.
This is verified mechanically, not by convention. pkg/shard/no_coordinator_dep_test.go
imports pkg/shard via go/build and fails CI if any import — production or test — is
prefixed github.com/intUnderflow/bigfleet/pkg/coordinator
(pkg/shard/no_coordinator_dep_test.go:24-33). Confirmed against the live graph
(go list): pkg/shard’s only intra-repo imports are conv, decision, decision/occ,
fencing, grpcutil, inventory, machine, metrics, needs, proto, and provider
— no coordinator, no coordclient.
The dependency arrow points the other way, which is how the shard still gets coordinator service without importing it:
coordclient lives in a sub-package the hot path does not import
(pkg/shard/coordclient/coordclient.go:1-12). It depends on pkg/shard (to read the
shard’s summary and dispatch instructions back), never the reverse. The two are stitched
together at the binary boundary: cmd/bigfleet/shard.go:788-802 constructs the coordclient
with coordclient.ViewFromShard(sh) and runs it in its own goroutine
(go func() { errCh <- cc.Run(ctx) }()), entirely separate from sh.Run(ctx) at
shard.go:770. An empty --coordinator-addr skips the coordclient goroutine altogether
(cmd/bigfleet/shard.go:788) and the shard runs cycles unchanged — single-shard / dev runs
need no coordinator at all.
The two readiness probes reinforce this. /healthz is “the HTTP server answers” and
/readyz is sh.FirstReconcileDone() — the first provider reconcile completing
(cmd/bigfleet/shard.go:812-821). Neither touches the coordinator, by explicit design: a
shard must come up alive and ready with zero coordinators reachable. firstReconcileDone is
latched (atomic.Bool, shard.go:283) so readiness never regresses on a later transient
provider error.
2. The cycle hot path (shard.go)
A shard is one process: one needs.Table, one inventory.Inventory, one
fencing.CoordinatorTerm, a map of operator sessions, and (under Run) a persistent pool
of execute workers (Shard struct, shard.go:224-341). Run (shard.go:487-518) is the
whole controller loop:
Run: spawn max(1, ExecuteConcurrency) executeWorker goroutines, draining actionQueue (ADR-0021) loop select: ctx.Done → return ticker.C → runCycle (CycleInterval, default 10s) wakeup → runCycle (a rollup arrived; coalesced, buffer-1)Cycles fire on a fixed interval and on a rollup-triggered wakeup. triggerCycle
(shard.go:603-608) is a non-blocking send into a buffer-1 channel, so a burst of rollups
collapses into at most one extra queued cycle — the shard reacts to new demand fast without
self-DoSing.
runCycleCapturing (shard.go:627-1010) is the body, shared by production (runCycle,
shard.go:621) and the simulator (Step, shard.go:614-616). One cycle, in order:
- Reconcile inventory against the provider so the engine decides against a fresh view
(
shard.go:648-656; §3). Success latchesfirstReconcileDone. - Snapshot the inventory and NeedsTable (
shard.go:662-674).decision.NormalizeDemandfolds sub-machineSame-Needs into atomic plain aggregates once, against this cycle’s snapshot, so Phase 1, Phase 2 (via Phase 1’s residuals) and the attribution probe all reason over identical normalized demand (ADR-0041).stampParkedNeedsthen marks aged-out structurally-unsatisfiableSame-gangs (ADR-0042 Addendum; §2.2). - Phase 1 / 2 / 3 (
shard.go:676-703). All three run on the same snapshot, so their actions are independent and execute-order between phases is irrelevant. Phase 3 consumesp1.Claimed— the single attribution — and is shrinkage-only (ADR-0045): it reclaims the Configured remainder demand left unclaimed, plus, since M73, emits Delete for Idle machines past theirCapacityTypehold (paper §8 release order). It runs no demand walk of its own, and is gated per-cluster onFirstRollupReceived(ADR-0036; §2.1). - Emit AvailableCapacity hints (
shard.go:766; §7). - Collect, rail-check, and enqueue all actions (
shard.go:777-953; §2.1 rails, §4 pool). - Record shortfalls from Phase 2’s residual (
shard.go:958-962; §6) and emit metrics.
The cycle never blocks on action execution. Phases compute and enqueue; the worker pool drains. A slow operator handler delays a binding; it does not stall the cycle (ADR-0021, §4).
2.1 The three actuation safety rails (ADR-0046)
The rails live at the actuation/ingest boundary — pkg/decision stays pure, computing the
paper-faithful answer every cycle; the rails only govern what crosses from decision to
execution, and what an inbound rollup may do to the NeedsTable
(safety.go:1-12). None ever reorders priority. This placement is load-bearing: the phases
stay testable against the papers, and a rail can change how fast the engine’s answer
executes, never what it is. The §16 tension (“priority is the sole throttling mechanism”)
is resolved in the ADR: the rails bound actuation volume, not allocation among claimants.
| Rail | Where | What it does |
|---|---|---|
| 1 — reclaim blast-radius cap | capReclaims, safety.go:150-193, called at shard.go:793-802 | Per cluster per cycle, at most max(1, ⌊fraction × Configured⌋) Reclaims execute (reclaimCap, safety.go:133-139); production default fraction 0.05 (DefaultReclaimCapFraction, safety.go:37). The kept reclaims are the head of the emission sequence — Phase 3’s paper-§8 release order — so the cap defers the tail, never reorders. Surplus rolls over (Phase 3 re-derives next cycle) and deliberately does not trigger an immediate follow-up cycle, because the cap’s job is to spread risk over wall-clock time. Only Reclaim is capped: Preempts are priority-driven allocation (§16-exempt), Bootstraps/Provisions are acquisitions. |
| 2 — empty-rollup quarantine | rollupGuard.admit, safety.go:100-126, called from ApplyRollup, shard.go:442-470 | A full-replacement rollup retaining < 10% of the cluster’s previously accepted Need rows, when that baseline spanned ≥ 10 rows, is held until 3 consecutive rollups confirm the drop (constants safety.go:42-59). Quarantine, not reject — genuine mass scale-down proceeds after ~2 rollup intervals. The held rollup still clears the ADR-0036 first-rollup gate: the operator did report (ApplyRollup defers markFirstRollupReceived unconditionally, shard.go:443). |
| 3 — global kill switch | Config.ActuationPaused, branch at shard.go:827-843 | The cycle runs in full — reconcile, phases, shortfalls, AvailableCapacity, metrics, probes — but no action executes. Suppressed actions are counted per kind in ShardActionsSuppressed (kept out of ShardActionsTotal), logged, audited. Static config, flipped by redeploy, never by RPC — so the kill switch never depends on the coordinator being reachable. |
A fourth disposition, dry-run / shadow mode (Config.DryRun, branch at
shard.go:844-866), shares the kill switch’s suppression point with the opposite posture:
the day-one adoption mode that reports every decided action (ShardActionsDryRun, Info
log, audit) without executing. When both are set, ActuationPaused wins the counting
(shard.go:826-827 checks it first) — an emergency stop during a shadow run reads as a
pause, not a shadow. Rail 1 and MaxActionsPerCycle are skipped under pause/shadow
(shard.go:793, 810): nothing executes, so there is no drain rate or execute cost to bound,
and MaxActionsPerCycle’s deferral would otherwise busy-loop the wakeup channel re-deriving
a surplus that never drains.
Every executed, suppressed, or dry-run action lands in the decision audit log if configured
(auditAction, safety.go:239-252) — one JSONL record per disposition, carrying the
classified outcome. Note the audit record stamps s.cycleCount.Load(), which for an executed
action is the executing cycle and under the async pool (§4) can trail the deciding cycle.
2.2 Acquisition parking (ADR-0042 Addendum)
A structurally-unsatisfiable Same-gang (e.g. a Same-rack request larger than any rack the
shard owns) would otherwise churn Bootstrap/Reclaim forever — partially assemble, fail to
complete, reclaim the scatter, repeat. The shard tracks per-class consecutive-unsatisfiable
cycles in unsatSameAge (recordUnsatSameAges, shard.go:1121-1143), keyed by
(cluster, group, fingerprint) (unsatSameKey, shard.go:1110-1112). Faithful to
concentrate-then-park: a class only ages when it is unsatisfied and acquired nothing this
cycle — any forward progress (Acquired > 0) or the existence of a satisfiable bucket
(SameSatisfiable, a lost claim race, not structural) resets it. Past parkAfterCycles (8),
stampParkedNeeds (shard.go:1150-1163) marks the Need AcquisitionParked so both phases
make a creditable-only domain choice and stop driving actions; every reprobeEveryCycles
(32) the stamp lifts for one cycle so parked demand re-attempts and naturally un-parks the
moment supply appears (shard.go:1101-1104). The map is bounded by the live unsatisfied
set — entries not re-seen in a cycle are deleted, and only the cycle goroutine touches it
(no lock).
The flag-gated phase-attribution probe (
PhaseAttributionLog,shard.go:705-758) is the diagnostic instrument that found and now watches theSame-domain crediting defects across ADR-0040 → ADR-0051. It is read-only over the cycle’s own snapshot + phase results, default-off, zero hot-path cost when off. Seedomain-attribution.mdfor the saga it instruments.
3. Reconciliation: List + Get, no Watch (reconcile.go)
There is no Watch provider RPC by design — only the six (Create, Configure, Drain, Delete, Get, List). Reconciliation is a List per cycle, run before the decision phases so the
engine never decides against a stale inventory (reconcile, reconcile.go:29-34, called at
shard.go:649). Two modes, gated on Config.IncrementalReconcile:
- Full (default, correct for any provider) —
reconcileFull(reconcile.go:39-62):provider.Listwith an empty filter, apply each returned machine, then walk the inventory snapshot to find machines the provider no longer reports andRemovethem. Linear in inventory size. - Incremental (above the §10.6 conformance threshold) —
reconcileIncremental(reconcile.go:69-81): pumps the opaqueSinceRevisioncursor (reconcileCursor, read/written only by the cycle goroutine, so no lock —shard.go:261) and processes only deltas. O(k changes) instead of O(N). The removal snapshot-walk is skipped — deltas don’t yet communicate “this id is gone” (tombstones are deferred until a real provider needs them). A fresh cursor (cold start, shard restart) returns the full inventory, same as the full path. This matters because the M11.21 phase dump showed reconcile is ~85% of the cycle at 500K (reconcile.go:64-68).
applyReconciledMachine (reconcile.go:116-165) is where a provider machine merges into
inventory, and it carries three load-bearing subtleties, checked in this order:
- Skip in-flight machines (
reconcile.go:129-131). IfisPending(id)— a worker is mid-RPC on this machine (§4) — the provider’sListview lags the in-flight RPC and would overwrite the worker’s authoritative transitional state (e.g. flipConfiguring → Idleunder a runningConfigure, causing the post-Configure transition to fail with “invalid transition Idle → Configured”).bigfleet-uber #23root-caused 17–26% of Provision failures to exactly this race. - State-match fast path (
reconcile.go:132-135). If local state already equals the provider’s, do nothing — no proto round-trip, noApply, no allocation, no Invariant screen. This is the common case at steady state and especially the cycle right after execute fans out (execute already locallyapplyTransition-ed each machine to its ack state). - Shard memory is authoritative for tracked machines (
reconcile.go:139-147). On a genuine state divergence the locally-knownAssigned*fields (priority, the two penalties, the Need fingerprint, the gang group) are preserved — the provider doesn’t own assignment state. For a machine with no local record (genuinely new, or being rebuilt after a process restart wiped memory), the provider-echoedshard_metadatais decoded to restore the assignment state (DecodeShardMetadata,reconcile.go:159-164; M72 — see §4/§5).
Slow-path records are screened by validateProviderMachine (safety.go:204-213) before they
touch inventory — machine.Invariant bounds the provider-declared cost-formula inputs
(price ≥ 0, interruption_probability ∈ [0,1]; the locked formula is
effective_cost = price + interruption_probability × interruption_penalty). A rejected record
is logged, counted (ShardMachinesRejected, with a price / interruption_probability /
structural label from machineRejectReason), and the inventory keeps its last-known-good
state. Because reconcileFull marks the id seen first (reconcile.go:47), a rejection
never masquerades as a removal (ADR-0046 addendum / M70). The state-match fast path is not
screened: it ingests no fields.
Divergence: ADR-0003 is superseded
ADR-0003 (“shard inventory snapshots are eventually consistent on the cycle hot path”)
describes a background fold goroutine and a CycleSnapshot() API returning a stale, O(1)
cached snapshot — chosen for its O(N) build cost (≈700 ms at 500K on M5). The code no longer
works this way. The cycle calls the synchronous, fresh s.inv.Snapshot() (shard.go:663),
and the ADR’s own Status line records it: “Superseded by M44.4 Drop A — the shard cycle
switched to synchronous Snapshot() … fold goroutine and live triple-indexes removed at
M66.1.” The reason: at real write rates, stale snapshots caused ~50% wasted Bootstraps for
already-Configured machines (shard.go:658-661). Per the source-of-truth order, treat the
ADR as historical context for why eventual consistency was considered safe (every phase
action is idempotent against an unchanged snapshot, and applyTransition rejects illegal
re-attempts), not as a description of current behaviour. The cost that motivated it was
instead reclaimed by the incremental-reconcile path above and by inventory’s own pre-built
per-bucket indexes.
4. The persistent execute pool and actuation (execute.go, ADR-0021)
The cycle does not execute actions inline and does not wg.Wait on them. Run spawns
max(1, ExecuteConcurrency) executeWorker goroutines once at start (shard.go:500-503);
each drains the shared actionQueue (sized 2 × concurrency, shard.go:500) and derives
its own per-action context capped by ExecuteTimeout (default 30s), independent of any
cycle deadline (executeWorker, shard.go:525-549). This is the ADR-0021 decoupling:
pre-change the cycle wall-clock was max(action_latency) because the cycle ctx was the
parent of every action and the cycle waited for the slowest worker — a slow operator handler
dragged the whole cycle and, worse, cycle-ctx cancellation cascaded into machine state.
Post-change, throughput is concurrency / mean(latency), and a slow handler delays one
binding. (The scaleway-50k chain in the ADR sat at a ~25 binds/sec ceiling — below the
~41 CRs/sec churn rate — purely from the per-cycle batch barrier.)
The decoupling forces three correctness guards against stale/duplicate dispatch, because an action emitted in cycle T may run in cycle T+5:
pendingActionsledger (pendingMu,shard.go:318-319). A per-machine in-flight set, consulted at enqueue (shard.go:905-912): a machine already pending drops the duplicate. Workers clear the entry when execute returns, success or failure (shard.go:543-545) — a failed action re-derives cleanly next cycle. Also read byisPendingduring reconcile (§3).actionStillApplicable(shard.go:583-599), the “Drop K” live-state guard at emit time. ThependingActionsledger dedups while a worker is mid-flight but not when one just finished between the cycle’s snapshot read and this emit. So at enqueue, re-check live inventory: a Bootstrap needsIdle, Provision needsSpeculative, Reclaim/Preempt needConfigured, Delete needsIdle; otherwise the action is moot (already done, or invalidated by a competing transition) and is skipped (ShardActionsDeduped). It fails open on aninv.Geterror, matching pre-Drop-K behaviour and leaving the worker’s own state check as the final authority. At 50K-Pod cloud scale this prevented 39/sec wasted dispatches.- Idempotent re-application inside executors (“Drop J”,
executeBootstrap,execute.go:268-272): a machine alreadyConfiguredfor the same cluster + fingerprint the action targets is a no-op success — the desired end state already holds. Only a different cluster/fingerprint is a real conflict the state machine refuses.
If the queue is full (workers can’t keep up), the new emission is dropped and re-derives next
cycle (shard.go:919-925, ShardActionsDropped) — fall-through, not blocking. Step/cycles
called outside Run (simulator, tests) have a nil actionQueue, so they fall back to inline
serial execute (shard.go:934-942); production never hits this.
The state machine each executor walks
Machines have eight states — three stable (Speculative, Idle, Configured), four
transitional (Creating, Configuring, Draining, Deleting), and Failed
(pkg/machine/machine.go:32-41; only stable states are allocatable, IsStable,
machine.go:69-71). execute (execute.go:45-83) dispatches by Action.Kind; each handler
walks transitional states and calls the matching provider RPC, using applyTransition
(shard.go:1399-1440) which validates against the legal-transition table
(validTransitions, machine.go:235-247) and emits a NodeStateUpdate (§5) on every change.
executeProvision(execute.go:181-235):Speculative → Creating,provider.Create, reflect the ack’s host/profile/price/interruption into inventory. The Create ack is where provider-declared cost-formula inputs enter; it isvalidateProviderMachine-screened, and garbage drives the machine to Failed (execute.go:216-221). Hand off to Bootstrap if Create reached Idle.executeBootstrap(execute.go:242-399):Idle → Configuring, stamping the destination cluster + Need fingerprint + gang group early (execute.go:286-294) so Phase 1 counts the in-flight machine as supply for its own Need — without this, Phase 1 over-emits Bootstraps on fresh Idle machines for the same fingerprint (“Drop G”: 318K of 522K emits were dedup-skipped). Then it pulls a kubelet bootstrap blob — via the operator’sShard.Sessionstream (sess.requestBootstrap, production) or theLocalBootstraphook (simulator/test) — callsprovider.Configure, walksConfiguring → Configured, and stamps priority + both penalties for Phase 2 victim scoring (execute.go:363-372). TheConfigurerequest also carriesEncodeShardMetadata(priority, intPen, recPen, fingerprint, group)(execute.go:352): Configure is the only writer of the provider-side echo, so a restart can rebuild protection state fromList(§3). A blob-fetch timeout rolls back to Idle for retry; a real failure goes Failed (handleBootstrapBlobErr,execute.go:94-105) — the distinction matters because the blob fetch happens beforeprovider.Configure, so rollback leaks no provider state.executeDrain(execute.go:405-463) handles both Reclaim (Phase 3) and Preempt (Phase 2): notify the operator first (sendReclaimInstruction) so the cordon + PDB-respecting eviction runs ahead of the provider drain (ADR-0009), thenConfigured → Draining,provider.Drain,→ Idle, clearing allAssigned*fields. On a voluntary ReclaimPreemptorPriorityis 0 — there is no preemptor (execute.go:415-427); if no session is connected the provider drain still runs (kubelet default grace) but the skipped PDB pass gets its own alertable log line.executeDelete(execute.go:475-516): the M73 paper-§7 idle release —Idle → Deleting,provider.Delete,→ Speculative(the host is released entirely; only the quota slot survives,ShardIdleReleases). ADeleterejection (incl.ErrNotSupported) means the provider’sCapacityTypedeclaration and its Delete support disagree — a contract violation, so the machine goes Failed, not retried.
classifyExecuteError (execute.go:114-176) maps every return path to a bounded outcome
label for ShardActionExecuteOutcomes. Two are worth knowing: transition_raced_to_idle
(errProvisionRacedToIdle, execute.go:38) is the benign Configuring→Idle race that recovers
next cycle, kept distinct so alerting on real transition_error stays meaningful; and
fenced (outcomeFenced, execute.go:110, matched via provider.ErrFenced before the
message-string buckets so it can’t be miscounted as provider_error) is a provider rejecting
our fencing token (paper §11) — logged at Error as a zombie-shard incident, do not retry.
5. The operator-facing session (session.go)
Operators are outbound-only: each dials the shard and holds one bidirectional
Shard.Session stream; the shard has no inbound listener on the operator and never opens an
outbound connection to a cluster. All cluster ↔ shard traffic multiplexes on that one
stream. Session (session.go:36-108) is the server side.
- First frame is
Hello(session.go:38-46), carrying thecluster_id. On an mTLS transport the client certificate’s URI SAN must matchClusterURI(cluster_id)(ADR-0048,session.go:54-63) — otherwiseHello.cluster_idis a free-text impersonation vector (receive another cluster’s reclaim instructions, or zero its capacity with a forged full-replacement rollup;ShardSessionIdentityRejectedcounts failures). Plaintext transports skip the check; identity is only as strong as the transport. - At most one session per cluster. A new connection replaces and closes the prior one
(
installSession,session.go:147-158);removeSessiononly deletes the map entry if it still points at this session (session.go:162-178), so a replaced session’s deferred cleanup can’t evict its successor. - Split recv pump from handling (
session.go:94-108;routeOperatorMessage,session.go:119-143). The recv loop only routes; fast paths (BootstrapResponse, ReclaimAck, Hello-ack — channel sends / trivial Sends) run inline; the slow path (Rollup — decode +needs.Replaceover full demand + cycle trigger) is handed to a per-session goroutine viaenqueueRollup. This is M44.4 Drop B: a slow rollup running inline used to block delivery of the very BootstrapBlobResponses an in-flightexecuteBootstrapwas waiting on, pushing machines to Failed for an orchestration timeout unrelated to them. Per-session ordering is preserved within each lane (one rollup goroutine); cross-lane ordering is intentionally relaxed.
Rollup ingest (rollupWorker, session.go:250-291). The rollup channel is buffer-1
because rollups are full replacement (paper §3.1): only the latest matters, so a newer
arrival supersedes a queued one (enqueueRollup drains-and-replaces, session.go:226-244).
The worker decodes (conv.NeedsFromRollup; a malformed rollup is rejected loudly,
ShardRollupsRejected, and the NeedsTable keeps last-known-good demand — M68b,
session.go:262-272), then sh.ApplyRollup (shard.go:442-470) replaces the cluster’s
NeedsTable slice and marks the ADR-0036 first-rollup gate. ApplyRollup is the single
ingest point — both the production session path and the sim/test runners flow through it, so
rail 2 (empty-rollup quarantine) has one implementation and one test surface. A held rollup
still marks first-rollup but returns false, so the cycle isn’t triggered and
demandObservedAt keeps tracking the accepted fingerprints, not the held ones
(session.go:280-283).
Outbound frames ride the same stream:
requestBootstrap(session.go:317-343): sends aBootstrapRequestwith a minted crypto-randomrequest_id, registers a buffer-1 response channel, and blocks on it (or ctx) until the matchingBootstrapBlobResponsearrives (deliverBootstrapResponseroutes by id; late responses drop silently,session.go:348-361).sendReclaimInstruction(session.go:367-381): fire-and-forget the drain instruction; ack tracking is best-effort telemetry only.SendNodeStateUpdate/SendAvailableCapacityUpdate(session.go:403-417): coalescing frames whosesupersedes_key(node:<id>/available:<fingerprint>) lets the operator drop stale queued frames safely on reconnect.
NodeStateUpdate carries node identity (ADR-0016)
notifyNodeState (shard.go:1449-1505) fires from applyTransition on every state change of
a cluster-bound machine. Beyond (machine_id, state, provider_id, last_error), it carries the
node’s shape — labels (including the synthesised node.kubernetes.io/instance-type and
topology.kubernetes.io/zone) and per-machine Allocatable resources — so the operator can
populate UpcomingNode.Spec with what the kubelet will register (ADR-0016,
shard.go:1471-1501). Empty values are valid for transitional states with no bound host. Two
details earn comments in the code:
- It sends
EffectiveAllocatable()(per-machine capacity), notProfile.Resources(per-replica request shape), so density>1 bin-packing works (ADR-0022/M45.5,shard.go:1489-1501).EffectiveAllocatablefalls back toProfile.Resourcesfor pre-M45 inventory, keeping the historical 1-Pod-per-machine math. - The bound cluster is captured before the mutating callback runs (
prevCluster,shard.go:1420), because a terminalDraining→Idletransition clearsMachine.Clusterand the update must still route to the cluster whoseUpcomingNodeGC keys on it (M68b); otherwise every drained machine leaked a stale Draining-phaseUpcomingNode.
Delivery is best-effort — a missing or disconnected session is silently skipped; the operator reconciles from full state on reconnect.
6. Reporting to the coordinator (report.go) and the off-hot-path client (coordclient/)
The shard’s contribution to coordinator state is built by pure read methods in report.go,
which import nothing from coordclient or coordinator — they are plain accessors over the
shard’s own state:
Summary(report.go:23-43): a coarse inventory snapshot (total / free machines, per-instance-type and per-zone counts). Bounded; not the per-machine inventory.Shortfalls(report.go:70-94): the unresolved demand this shard couldn’t satisfy from its own inventory and couldn’t resolve via in-shard preemption — sorted by priority desc (tiebreak: older first) and bounded to the top 100 (paper §3.5 shortfall protocol bound). Shortfalls are a resource vector (Deficit), not a machine count (ADR-0027).recordShortfalls(report.go:114-148) ages entries per cycle and drops those no longer present (eventually-consistent). Distinct unresolved Needs sharing a Profile fingerprint — several parked gangs of one shape differing only by co-location Group — are summed per fingerprint (resource-vector add) and aged once per fingerprint per cycle (M68); the ledger stays fingerprint-keyed because its only consumer, the coordinator, seeks supply by Profile and can’t act on per-gang identity.
Topology constraints never cross shard boundaries: a Same-rack request unsatisfiable
in-shard becomes a shortfall and is reported up; it is never resolved cross-shard (a BigFleet
hard rule). The coordinator may rebalance supply between shards, but it does not resolve one
shard’s topology constraint using another’s machines.
coordclient (coordclient/coordclient.go) is the only shard-side code that knows the
coordinator exists, and it runs in its own goroutine off the hot path. Client.Run
(coordclient.go:157-187) dials the coordinator (mTLS per ADR-0048, with a shard URI SAN),
fires one ReportShard immediately, then ticks every ReportInterval (default 30s, per
paper). Each runOnce (coordclient.go:191-235) builds a ShardReport from the ShardView
(the coordclient.ViewFromShard adapter delegating to *shard.Shard, adapter.go:14-59),
sends it, validates the returned coordinator_term, and dispatches any returned instructions
— rebalance instructions ride back on the shard-pulled report; there is no inbound RPC to
the shard. Static stability is explicit in the failure handling: a ReportShard error
re-queues the pending acks and returns; the shard’s data plane continues against existing
allocations regardless (coordclient.go:212-220).
Shard self-registration via heartbeat (ADR-0006)
There is no registration RPC. The ShardReport carries ShardAddress (= the
configured AdvertiseAddress, coordclient.go:200-209); on the first report from an unknown
shard_id the coordinator Raft-Applies AddShard{ID, Address} itself, then subsequent
reports take the cheap heartbeat path. One coordinator RPC for the whole shard lifecycle —
the shard already pulls instructions via ReportShard, so a separate Register would mean
two coordinator interactions where one suffices (ADR-0006). Registration succeeding or not is
bookkeeping that lets domain assignments and admin tooling see the shard; the shard runs
cycles either way.
Inbound instructions are fenced and idempotent (handleInstruction, coordclient.go:242-291):
each is deduped against doneInstructions (a redelivery just re-acks, ACCEPTED), validated
against the CoordinatorTerm high-water mark (stale → REJECTED_STALE), then dispatched to a
ShardView handler. AssignDomain/UnassignDomain mutate the shard’s assignedDomains set
(report.go:158-190; empty = “take everything”, dev/single-shard mode);
ReassignSpeculative/CrossShardDrain/TransferOwnership are M6.3 stubs awaiting the
rebalancing logic (adapter.go:70-74). The coordinator may redeliver the same instruction
across reports until it sees an ack — hence the dedup ledger and the re-queue-on-error path.
7. AvailableCapacity reporting (available_capacity.go)
Once per cycle, after Phase 1/2/3, emitAvailableCapacity (available_capacity.go:130-164)
pushes one AvailableCapacityUpdate per (cluster, distinct demanded profile fingerprint)
down the matching operator session — a paper §6.2 eventually-consistent hint the operator
surfaces as an AvailableCapacity CR. Emission is keyed on the cluster’s demanded shapes
only (not every profile inventory could match), so CR cardinality on the cluster apiserver is
bounded by the cluster’s distinct workload shapes, not by inventory size.
The hint is cheap to compute — buildAvailableCapacityUpdate (available_capacity.go:170-210)
reads pre-built per-(state, instance-type) snapshot buckets (each Count is O(1)), no
per-machine walk — and reports a confidence ladder: HIGH if an Idle machine matches (no
wait on Create), MEDIUM if only a Speculative slot matches, NONE otherwise. Per
ADR-0027 a Profile carries no single resource shape, so the hint carries
requirements + count + cost + confidence, not a Resources value.
The churn control is availableCapacityCache (available_capacity.go:36-104), two layers:
coalesce (skip if the (count, confidence, cost) tuple matches the last emit) and
rate-limit (at most one emit per minInterval, default 5s, per (cluster, fingerprint)).
Without it, a 50-cluster harness at ~10 Hz cycles would emit ~500 AC writes/sec into the
operator-side kine sqlite — dwarfing the load AC itself describes. The cache survives
reconnects per-cluster: removeSession calls acCache.forget(cluster) (session.go:175-177)
so a reconnecting operator that lost its CRs gets a fresh emit, while other clusters keep
their dedup state (no fleet-wide re-emit storm).
8. Provisioning latency (provisioning_latency.go)
Paper §10.7 end-to-end provisioning latency = time from a demand first appearing in a rollup
to a machine reaching Configured for it. observeRolledUpDemand (provisioning_latency.go:15-40)
stamps the first-seen time for each (cluster, profile fingerprint) at rollup ingest (called
from rollupWorker, session.go:281) and prunes fingerprints no longer demanded.
observeProvisioningLatency (provisioning_latency.go:58-67), called from the Configure
success path (execute.go:395-397), records time.Since(first-observed) into
ShardProvisioningLatency — then deletes the tracking entry (observe-once). The delete is
load-bearing: without it, every Configure during a 30-minute soak measured ~the soak duration
as latency, saturating the histogram’s +Inf bucket and pinning the metric to a constant
327.68s. observeRolledUpDemand re-inserts the entry on the next rollup that still demands
the fingerprint, so the next sample measures the latency for the freshly re-observed demand.
Best-effort — silently skipped when the fingerprint isn’t tracked (e.g. an action from
Phase 2/3 paths that aren’t gated on rolled-up demand).
9. Where to read next
decision-engine.md— the Phase 1/2/3 algorithms the cycle calls; effective cost, victim score, drain grace, theClaimed/Unsatisfied/SatisfiedGangsresults this controller consumes.machine-lifecycle.md— the eight-state machine and the legal transitionsapplyTransitionenforces.static-stability.md— the design class the no-coordinator-import rule rules out, end to end.wire-protocols.md/provider-protocol.md— theShard.Sessionstream, the six provider RPCs,supersedes_keycoalescing.coordinator-raft.md— the other end ofcoordclient: the Raft FSM,AddShard, domain assignment.- ADRs: 0003 (superseded — read for the eventual-consistency rationale), 0006, 0016, 0021, 0036, 0045, 0046.