The domain-attribution saga (ADR-0040 → ADR-0051)
A Same-Profile Need — a co-location gang that must land in one topology domain — is the one demand shape where BigFleet’s “decide whether to fulfill, never model packing” contract (ADR-0045) does not give a stable claimed-set for free. Over a year of scale-test diagnosis the engine chased a single pathology — a sustained Bootstrap≈Reclaim lockstep at static demand, with shortfalls=0 and domains or machines flip-flopping cycle to cycle — down through four granularities, each time declaring “the last layer” and each time defeated by the same sub-binding perturbation (#64: the engine’s own ~3-cycle bootstrap dwell). This doc is the why-grounded story of that descent: what the bug actually was at each layer, why each fix was necessary-but-insufficient, and the final mental model. It assumes decision-engine.md (the three-phase split, the single-attribution contract) and phase1-occ.md (the seed pre-pass, ChooseSameBucket, the OCC broker); the saga lives almost entirely in pkg/decision/occ/seed.go and samebucket.go.
The shape of the bug, stated once
Every layer of this saga is the same failure mode with a different incumbent unit. A gang is served — its machines are bound, demand is covered — yet each cycle the engine re-derives where the gang should live and reaches a different answer than last cycle. The previous answer’s machines are now off-domain (or off-set) strays; Phase 3, which reclaims any Configured machine the cycle’s attribution walk left unclaimed (phase3_reclaim.go, ADR-0045 shrinkage-only), faithfully reclaims them; the reclaim mutates the supply totals; next cycle the re-derivation lands somewhere else again. Bootstrap and Reclaim run in lockstep forever at a static demand that should be a fixed point. The fix at every layer is the same principle — the domain/machine choice must follow the gang’s existing bindings, never lead them — applied at progressively finer granularity, because the attribution metadata the chooser reads kept being too coarse to express “this machine is mine.”
The reason it is hard, and the reason it was misdiagnosed repeatedly, is that the perturbation that re-triggers the re-derivation each cycle lives below the binding layer. ADR-0045 says only bound capacity counts; a machine still Creating or Configuring is not yet a binding. With configure-phase p99 (~4.5–9.8s) exceeding the cycle (~3.3s), a handful of machines are persistently in flight (#64), so the acquirable snapshot the chooser ranks on genuinely moves every cycle. A deterministic synchronous simulator freezes that pool and self-damps to a fixed point — which is why every offline confirmation in this saga was a false green and the devpod was the only arbiter (see scaletest-harness.md).
ADR-0040 — unify attribution: every crediting site is domain-aware
The first defect was not subtle once instrumented: the engine evaluated Same with two different semantics depending on where it asked. Acquisition (occ.FindSame) was strict — bucket candidates by the Same key’s value, take from the single best bucket only, a gang is served by one domain or not at all (match.go keeps MatchProfile’s Same case deliberately per-machine and vacuous; group-wide co-location is the autoscaler’s job, not a per-machine match). But crediting — Phase 1’s seed pre-pass and Phase 3’s old claimMatching — walked machines through bare MatchProfile, crediting a Same-Profile’s supply across domains. claimMatching’s comment claimed “identical attribution rules, so the two phases agree”; for Same-Profiles that invariant was simply false.
The uber-5k probe measured the consequence directly (ADR-0040 Context): 93.5% of the shard’s Needs were co-located; ~570 were reported unsatisfied every cycle; and the discriminating probe — “Phase 3 reclaimed a machine matching a Need Phase 1 declared unsatisfied this cycle” — read zero throughout. The two phases never even disagreed about the same machines. Phase 1 (strict) saw scattered per-domain supply as unsatisfiable and kept bootstrapping toward gangs it could never finish; Phase 3 (vacuous) saw the same Needs as satisfied cross-rack and reclaimed the non-co-located over-provision Phase 1 created. A self-sustaining ~8–21/sec equilibrium, Configured +77% over seed, a shortfall that never closed.
ADR-0040’s decision: make every supply-crediting site domain-aware, mirroring FindSame’s bucket-and-pick rule. The unsatisfiable residual is a shortfall (paper §16: a Same request unsatisfiable within a shard becomes a shortfall; §9 ages and escalates it), never a churn source. No new wire field, no gang-mode bit.
The Addendum, the same day, found the cascade reduced but not ended (~14/sec floor): the domain was still being chosen twice per cycle, independently. The seed pass chose the Need’s domain over creditable supply; findCandidatesFor then called FindSame over acquirable supply with no anchor to the credited domain, so acquisition independently picked the best Idle bucket — typically a different domain, per-domain supply being shallow. Phase 1 itself assembled a cross-domain group; Phase 3 reclaimed the half that didn’t match next cycle’s single-bucket credit. The completion: choose the domain once per Need per cycle, jointly over creditable (the cluster’s Configured + Configuring) and acquirable (shard-wide unclaimed Idle + Speculative) supply, record it on state, and confine both credit and acquisition to it. Phase 3 mirrors the identical joint scoring. This is the architecture that survives to today: ChooseSameBucket ranks joint bucket totals, the choice is recorded in seedSameProfile, and the single choosing site means the phases agree by construction rather than by mirroring (seed.go:122-265).
ADR-0041 — fold sub-machine gangs into atomic aggregates
The next signature was a starvation, not a churn: bind 97.5% (the scheduler placed and packed everything) but 78% of gangs reported unsatisfiable, Configured +48% toward one-machine-per-gang. The claim ledger is machine-granular and exclusive — it claims a whole machine per Need and subtracts its full Allocatable. That was truthful for the old demand shape (a few large aggregated Needs per fingerprint). But ADR-0024 made each co-location group its own Need, and ADR-0039 made every Pod carry a CR, so every gang is visible — ~2,400 gang Needs at uber-5k, and at density 10–100 a gang of 3–8 Pods is a fraction of one machine. Exclusive claiming up-rounds each sub-machine gang to a whole machine: ~2,400 gangs demanding ~2,400 exclusive machines where ~540 exist, while kube-scheduler happily packs many gangs per node.
The resolving insight: a gang whose entire aggregate fits on one machine does not need the Same machinery at all — any single machine with room hosts it, so co-residency is automatic, and the wire already has the atomicity primitive (Fleet-Scale Kubernetes §7: min_unit is “the largest atomic schedulable unit”). So the shard runs decision.NormalizeDemand per cycle before the phases: a Same-Need whose AggregateResources fit at least one matching machine in the snapshot is foldable — strip the Same requirement, sum the aggregates, set min_unit = one gang’s aggregate. Same survives only for the genuinely cross-machine case: a gang too large for any machine. This restored the OCC exclusivity assumption (folded aggregates do consume whole machines) and collapsed Need cardinality from O(co-location groups) to O(fingerprints × gang sizes).
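The fold can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not the body of decision.NormalizeDemand: the Resources and Need types, the fitsAny and clone helpers, and the fold key are hypothetical, and profile matching is omitted, so every machine in the snapshot is treated as a matching one.

```go
package main

import "fmt"

// Resources is a simplified resource vector (hypothetical stand-in).
type Resources map[string]int64

// Need is a pared-down demand record; all field names are illustrative.
type Need struct {
	Fingerprint string    // profile fingerprint
	Same        string    // Same key; "" means no co-location requirement
	Aggregate   Resources // total resources demanded
	MinUnit     Resources // largest atomic schedulable unit
}

// fitsAny reports whether want fits, in every dimension, on at least one machine.
func fitsAny(want Resources, machines []Resources) bool {
	for _, have := range machines {
		ok := true
		for k, v := range want {
			if have[k] < v {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

// clone copies a resource vector so folded Needs never alias their inputs.
func clone(r Resources) Resources {
	out := make(Resources, len(r))
	for k, v := range r {
		out[k] = v
	}
	return out
}

// normalizeDemand folds every Same-Need whose whole gang fits on one machine
// into a plain aggregate keyed by (fingerprint, gang size).
func normalizeDemand(needs []Need, machines []Resources) []Need {
	var out []Need
	index := map[string]int{} // fold key -> position in out
	for _, n := range needs {
		if n.Same == "" || !fitsAny(n.Aggregate, machines) {
			out = append(out, n) // not a gang, or genuinely cross-machine: unchanged
			continue
		}
		// fmt prints maps in sorted key order, so the key is deterministic.
		key := fmt.Sprintf("%s|%v", n.Fingerprint, n.Aggregate)
		if i, seen := index[key]; seen {
			for k, v := range n.Aggregate {
				out[i].Aggregate[k] += v // sum the aggregates
			}
			continue
		}
		index[key] = len(out)
		out = append(out, Need{
			Fingerprint: n.Fingerprint,      // Same is stripped
			Aggregate:   clone(n.Aggregate), // running sum starts at one gang
			MinUnit:     clone(n.Aggregate), // min_unit = one gang's aggregate
		})
	}
	return out
}

func main() {
	machines := []Resources{{"cpu": 32, "gpu": 8}}
	needs := []Need{
		{Fingerprint: "web", Same: "rack", Aggregate: Resources{"cpu": 4}},
		{Fingerprint: "web", Same: "rack", Aggregate: Resources{"cpu": 4}},
		{Fingerprint: "train", Same: "rack", Aggregate: Resources{"gpu": 16}},
	}
	for _, n := range normalizeDemand(needs, machines) {
		fmt.Printf("%s same=%q aggregate=%v min_unit=%v\n", n.Fingerprint, n.Same, n.Aggregate, n.MinUnit)
	}
}
```

In this sketch the two 4-cpu gangs collapse into one Need with no Same requirement, while the 16-gpu gang, which no single machine can host, keeps its Same key.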
ADR-0041 also carries the first sticky-domain rider (decision 3), the seed of every later layer: among satisfiable buckets, ChooseSameBucket prefers one containing creditable supply over an acquirable-only one before the smallest-total comparison — a Need’s currently-serving domain must not lose to a fresh idle domain that merely scores smaller. Staying put costs nothing (excess within the domain is still reclaimed by the claim loop’s stop-when-covered). Crucially the preference is confined to the satisfiable regime — among unsatisfiable buckets the most-covering rule must keep winning (concentrate-then-park), or a 2-machine serving domain would pin a Need away from a 3-machine idle domain it genuinely needs. That confinement is exactly the gap ADR-0042 had to close.
ADR-0042 — unsatisfiable-domain choice sticky at equal coverage
The ADR-0041 residual was genuine multi-machine GPU gangs (2–256 whole a3-highgpu-8g nodes, rack-coherent) that no single rack can host. The intended behaviour is the Addendum’s concentrate-then-park: assemble what the best domain allows, hold the rest, age in the shortfall buffer, go quiet. The #56 diagnostic proved they never went quiet, and named the mechanism precisely: in the unsatisfiable regime the domain choice has no incumbent preference (the ADR-0041 rider was satisfiable-only). At uber-5k the GPU racks are dozens of identical-total buckets, so most-covering ties constantly; 20 clusters’ sequential claim walks perturb which blocks look acquirable to whom; the count/lexicographic tiebreak resolves differently cycle to cycle. The gang abandons its partial assembly, acquires toward a different identical rack, Phase 3 reclaims the stranded machines — both halves of ~27/sec churn, sustained forever by ~190 unsatisfiable gangs. The closed-loop sim’s park tests passed because their shapes reach zero-acquisition and stop; the cloud’s contended scatter keeps acquisition narrowly non-zero. “The missing piece is not a new suppression state — it is that the domain choice has no memory exactly where memory is what parking means.”
ADR-0042’s decision is one rule, stateless like every rule in the chooser: in the unsatisfiable regime, switch domains only for strictly greater coverage. Among buckets of equal capped coverage, the one holding the Need’s creditable supply (its concentrated partial assembly) wins before count/value. The incumbent signal is CreditableCount > 0, already in the joint fold — no per-Need state, no aging threshold, priority still the sole throttle. This is rule 5 in ChooseSameBucket today (samebucket.go:177-183): only reachable for unsatisfiable pairs of equal joint coverage (satisfiable pairs of differing creditable coverage are caught earlier by rule 2). ADR-0042 was, honestly, cloud-decided/sim-pinned — the deterministic sim resolves the tie identically every cycle and cannot express the perturbation that flips the cloud, so the discriminator was one cloud run with the per-gang probe.
The addendum that became the cautionary tale: aged-acquisition parking
The cloud run came back PARTIAL: churn cut ~3× (≈27 → ≈9/sec), acquired=0 in 23/25 samples, but domains kept re-selecting because per-domain acquirable totals shift slightly every cycle, so coverage is rarely exactly equal and the strictly-greater branch keeps firing on marginal deltas. Exact-tie pinning was too narrow to hold under perturbation. ADR-0042’s own escalation path engaged, shipping three pieces together (the addendum):
- Group identity crosses the wire: CapacityNeed.group (field 9, api/proto/bigfleet/v1alpha1/capacity.proto:92), one opaque value per gang. Until then per-gang bookkeeping fell back to fingerprints and the probe data was class-level.
- Aged acquisition parking: the shard counts consecutive cycles a Same-Need class is Phase 1-unsatisfied with zero acquisition AND no structurally satisfiable bucket (so concentrate-then-park is faithful: a gang still concentrating never ages). At parkAfterCycles (8) the class is stamped AcquisitionParked; every supply site honours the stamp, and the seed pass folds creditable-only (seed.go:217-225: foldAcquirable is skipped, the incumbent wins trivially, acquisition is empty).
- Re-probe: every reprobeEveryCycles (32) a parked class un-parks for one cycle so parking is never forever.
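The aging and re-probe counters can be sketched as a small per-class state machine. The constants carry the values quoted above; the classState type, its fields, and the exact re-probe cadence (31 parked cycles, then one un-parked cycle) are assumptions made for illustration, not the shard's actual ledger.

```go
package main

import "fmt"

const (
	parkAfterCycles    = 8  // consecutive stalled cycles before a class is parked
	reprobeEveryCycles = 32 // a parked class gets one un-parked cycle this often
)

// classState is hypothetical per-class bookkeeping for the parking ledger.
type classState struct {
	stalled   int // consecutive cycles meeting the aging condition
	parkedFor int // cycles elapsed since the class reached the park threshold
}

// observe folds one finished cycle into the state and reports whether the
// class should be treated as AcquisitionParked on the next cycle.
func (s *classState) observe(unsatisfied bool, acquired int, satisfiable bool) bool {
	// A class ages only while it is unsatisfied, acquired nothing, and has no
	// structurally satisfiable bucket; anything else resets the ledger.
	if !unsatisfied || acquired > 0 || satisfiable {
		*s = classState{}
		return false
	}
	s.stalled++
	if s.stalled < parkAfterCycles {
		return false
	}
	s.parkedFor++
	// Every reprobeEveryCycles-th parked cycle is handed back as a probe.
	return s.parkedFor%reprobeEveryCycles != 0
}

func main() {
	var s classState
	for cycle := 1; cycle <= 40; cycle++ {
		parked := s.observe(true, 0, false)
		if cycle == 7 || cycle == 8 || cycle == 39 || cycle == 40 {
			fmt.Printf("cycle %d: parked=%v\n", cycle, parked)
		}
	}
}
```

A gang that is still concentrating (acquired > 0) or that has a satisfiable bucket resets the counter in this sketch, which is the property that keeps concentrate-then-park faithful.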
This is the cautionary tale behind the repo’s demand-realism discipline. The complexity audit later traced this entire layer (rule 5, the parking ledger, two tunables, the SameSatisfiable plumbing, the wire field) to one catalog archetype: gpu-training-large demanded 64–256 rack-coherent whole-GPU nodes against racks that physically hold ~50. Every step was locally rigorous; the unasked question was the first one: would a production fleet ever emit that demand? Real systems place gangs that size at zone/pod scope. The trigger was a harness artifact. That is ADR-0043 (“Demand realism before mechanism”): any ADR whose motivating evidence is harness-observed must first answer, concretely, whether a production fleet would emit the triggering shape, citing the papers, a real-world reference, or physical constraints, because “the catalog generates it” is not an answer. The section is a gate, not a formality, checked the way “does this introduce a hot-path coordinator dependency?” is checked for pkg/shard. Diagnosis needs no gate; only the decision to change engine behaviour does. (The project’s working-discipline list carries the rule; the parking layer survives in code as a correct mechanism against a since-corrected catalog, the standing reminder of what unforced mechanism costs.)
ADR-0045 — capacity counts iff bound; the model that should have ended it
ADR-0045 is the keystone of this whole arc, reached in design dialogue with the author (see reasoning/REASONING-LOG.md §1 — the most expensive-to-reconstruct reasoning in the repo). Its origin was a sibling pathology: at a catalog fill’s tail the shard reported p1_unsatisfied=0 while the cluster’s scheduler held unplaceable pods on fragmented residuals, and Phase 3 reclaimed machines hosting bound pods under unchanged demand. The first draft read this as “consumed capacity is invisible” and proposed feeding per-machine consumption vectors in. The author rejected it as scheduler-shadowing — residual math, consumption vectors, anticipating whether the cluster’s scheduler can use bound capacity all make BigFleet a second scheduler, the cardinal hard-rule violation. Two more over-engineered proposals (grace windows, recent-delivery discounting; an “unmet” telemetry signal) were each cut. The result is strictly smaller than anything reached for:
Capacity counts for a cluster iff it is bound to that cluster. Binding (Configure) is the atomic act of fulfillment; the machine state machine is the only supply ledger. Phase 1: bound < demand → fulfill the difference; bound ≥ demand → done. A binding counts from the moment it’s made (before the node exists), so double-supply is impossible by construction — no grace windows, no in-flight discounting, no second ledger. Phase 3 reclaims only on demand shrinkage. Satisfied-but-stuck is the cluster’s problem (the sanctioned home for that smartness is the operator, the demand-side mirror of out-of-tree providers).
For this saga ADR-0045 did two things. It implemented the single-attribution contract as a deletion (M67: Phase 3 340→106 lines; seedSameProfile becomes the engine’s only crediting site, seed.go:18), making the two phases incapable of disagreeing. And it reframed the sticky-domain rider in its true terms: “a bound machine IS the fulfillment, so the domain choice follows the bindings.” M77f rewrote rule 2 to rank satisfiable domains by creditable coverage of the deficit (capped) rather than presence — the domain holding most of the gang’s bound supply wins (a fully-bound domain at coverage 1.0 cannot lose, whatever the slack around it). M77f went green (#62) and the author’s “removed by construction” looked complete.
It was a false green. M77f’s confirmation ran against M66.2’s phantom cpu:800 GPU machines (one gang per node trivially over-covered cpu/mem, so capped coverage was GPU-dimension-only and stable). ADR-0050 then gave GPU machines their realistic cpu:8 / one-pod-per-node shape — the whole point of a realism catalog — and the oscillation returned (#63, repro’d twice on an unloaded box). M77f was necessary but not sufficient under realistic packing; the sim/gang_fixedpoint_test.go pin that “confirmed” it modelled the old packing.
ADR-0051 — gang-granular domain attribution (M77g), and machine-granular (M77h)
The structural gap M77f’s false green hid: CreditableTotal is cluster-granular. It sums the cluster’s Configured/Configuring in a domain, not this gang’s. When two domains both fully cover the gang (a tie at the capped-coverage ceiling), the engine cannot distinguish “this domain holds my gang’s machines” from “this domain holds an equal count of an unrelated same-class gang’s machines,” and the tie falls through to live acquirable slack (rule 3) — which moves every cycle under the bootstrap dwell. AssignedNeedFingerprint (M72) carries only Profile.Fingerprint, so two same-profile gangs are indistinguishable; that defeats any profile-granular fix too (same-class gangs steal each other’s domains — the offline “steal loop” the M77g probe hit).
The #64 field diagnostic established the perturbation is endogenous and production-real, not a harness artifact: ~7 machines persistently Configuring because configure p99 > cycle time, so a bootstrapped machine spans >1 cycle and the acquirable snapshot genuinely moves. A real cloud provider’s minutes-long bootstraps make it worse. It is invisible offline because instant transitions freeze the pool and the deterministic chooser self-damps — the false-green generator. So this was a true engine signal; the dev-50 gate’s red was correct (the gate caught it on day one, M77a, named the oscillation — that is the gate working; “don’t fix the gate to pass”).
ADR-0051’s decision implements ADR-0045’s principle at the granularity the bindings actually have:
- Record the gang on the binding. An additive shard_metadata key bigfleet.lucy.sh/assigned-group (machine/shardmetadata.go:28), populated at Configure-time from Need.Group, store-and-echoed by the provider verbatim (M72’s pattern: no new RPC, no wire message beyond the existing map). The machine gains AssignedGroup (machine/machine.go:199), set alongside AssignedNeedFingerprint, cleared on drain.
- Break capped-coverage ties on gang-own coverage. seedSameProfile computes CreditableOwnTotal per domain: the sub-total of CreditableTotal from machines whose (profile, AssignedGroup) match this Need (seed.go:199-202: own := m.AssignedGroup == n.Group && m.AssignedNeedFingerprint == n.Profile.Fingerprint()). ChooseSameBucket adds rule 2b (samebucket.go:158-168): among domains tied at the creditable-coverage ceiling, prefer the one holding more of this gang’s own bound machines, above the acquirable-slack tiebreak.
Because Configuring machines carry the attribution from the moment of Configure, a gang’s in-flight bootstraps count toward its own domain, so the choice is stable through the bootstrap dwell that is the perturbation. The tiebreak reads current bindings (the machine state machine), so it is self-correcting and is explicitly not cross-cycle memory. This is why the decision is Option C (attribution recorded on the binding) rather than the alternatives: Option B (carry the chosen domain across cycles) is a genuine second ledger that ADR-0045 forbids, and is fragile (a remembered domain can be wrong); Option A (a stateless uncapped-coverage tiebreak) is provably insufficient because creditable is cluster-granular: the granularity is the bug, and A doesn’t fix it. ADR-0051 also ships sim bootstrap-dwell fidelity (machines stay Configuring for N cycles) so the oscillation is finally sim-reproducible (true red → green), closing the false-green hole.
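A small sketch of the per-domain fold shows why gang-own coverage separates two domains that cluster-granular coverage cannot. Only AssignedGroup, AssignedNeedFingerprint, CreditableTotal, CreditableOwnTotal and the own predicate come from the text; the Machine, Need and Bucket shapes, the scalar Units field, and the foldCreditable and capped helpers are assumptions for illustration.

```go
package main

import "fmt"

// Machine is a pared-down machine record (hypothetical shape).
type Machine struct {
	Domain                  string
	State                   string // "Configured", "Configuring", "Idle", ...
	AssignedGroup           string
	AssignedNeedFingerprint string
	Units                   int64 // scalar stand-in for Allocatable
}

// Need identifies one gang: its profile fingerprint plus its opaque group.
type Need struct {
	Fingerprint string
	Group       string
	Demand      int64
}

// Bucket holds one domain's creditable totals.
type Bucket struct {
	CreditableTotal    int64 // all bound supply in the domain
	CreditableOwnTotal int64 // the part attributed to this gang
}

// foldCreditable sums bound supply per domain, tracking this gang's share.
func foldCreditable(n Need, machines []Machine) map[string]Bucket {
	buckets := map[string]Bucket{}
	for _, m := range machines {
		if m.State != "Configured" && m.State != "Configuring" {
			continue // only bound supply is creditable
		}
		b := buckets[m.Domain]
		b.CreditableTotal += m.Units
		own := m.AssignedGroup == n.Group && m.AssignedNeedFingerprint == n.Fingerprint
		if own {
			b.CreditableOwnTotal += m.Units
		}
		buckets[m.Domain] = b
	}
	return buckets
}

// capped clamps coverage at the demand, so over-coverage cannot win a comparison.
func capped(x, demand int64) int64 {
	if x > demand {
		return demand
	}
	return x
}

func main() {
	n := Need{Fingerprint: "train", Group: "g1", Demand: 2}
	machines := []Machine{
		{Domain: "rack-a", State: "Configured", AssignedGroup: "g1", AssignedNeedFingerprint: "train", Units: 1},
		{Domain: "rack-a", State: "Configuring", AssignedGroup: "g1", AssignedNeedFingerprint: "train", Units: 1},
		{Domain: "rack-b", State: "Configured", AssignedGroup: "g2", AssignedNeedFingerprint: "train", Units: 1},
		{Domain: "rack-b", State: "Configured", AssignedGroup: "g2", AssignedNeedFingerprint: "train", Units: 1},
		{Domain: "rack-b", State: "Idle", Units: 1},
	}
	for domain, b := range foldCreditable(n, machines) {
		fmt.Printf("%s: creditable=%d own=%d\n", domain,
			capped(b.CreditableTotal, n.Demand), capped(b.CreditableOwnTotal, n.Demand))
	}
}
```

Both racks reach the capped-coverage ceiling of 2 here, so the cluster-granular total ties; only the own sub-total, which includes the still-Configuring machine, identifies rack-a as the gang's domain.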
M77h — the same principle one granularity finer (machine selection)
Pinning the domain was necessary but not sufficient. The decisive field run (#65, on the M77g build) confirmed the domain flap was gone — 0 domain flips, 4/16 gangs at a complete fixed point — yet the gate stayed red (reclaimActionsDuringSoak ≈ 311). The residual: 12/16 gangs held their domain but rotated which machines they claimed within it. “Domain follows this gang’s bindings” was true; “the machine set follows this gang’s bindings” was not.
The driver (ADR-0051 addendum): the credit/claim pass claims a domain’s machines in keep-priority order (Configured before Configuring, then price asc / reclamation_penalty desc / ID asc) under stop-when-covered. A non-incumbent machine maturing Configuring → Configured jumps from the back of the walk (the Configuring section) to its sorted position in the Configured section; if that position falls inside the first-N-covering subset it bumps an already-serving incumbent out of the claimed set → unclaimed → Phase 3 reclaims it → it re-bootstraps. The same lockstep, at machine granularity rather than domain.
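The bump can be reproduced with a toy version of the walk. This is a sketch under stated assumptions: the Machine shape is hypothetical, stop-when-covered is reduced to "claim the first n machines", and only the ordering (Configured before Configuring, then price ascending, reclamation penalty descending, ID ascending) is taken from the text.

```go
package main

import (
	"fmt"
	"sort"
)

// Machine is a pared-down candidate record (hypothetical shape).
type Machine struct {
	ID                 string
	Configured         bool // false means still Configuring
	Price              float64
	ReclamationPenalty float64
}

// keepPriorityLess orders a before b in the keep-priority walk.
func keepPriorityLess(a, b Machine) bool {
	if a.Configured != b.Configured {
		return a.Configured // Configured section comes first
	}
	if a.Price != b.Price {
		return a.Price < b.Price // cheaper first
	}
	if a.ReclamationPenalty != b.ReclamationPenalty {
		return a.ReclamationPenalty > b.ReclamationPenalty // costlier to reclaim first
	}
	return a.ID < b.ID
}

// claimFirstN sorts a copy of the candidates and claims the first n of them.
func claimFirstN(ms []Machine, n int) []string {
	sorted := append([]Machine(nil), ms...)
	sort.Slice(sorted, func(i, j int) bool { return keepPriorityLess(sorted[i], sorted[j]) })
	if n > len(sorted) {
		n = len(sorted)
	}
	ids := make([]string, 0, n)
	for _, m := range sorted[:n] {
		ids = append(ids, m.ID)
	}
	return ids
}

func main() {
	fleet := []Machine{
		{ID: "m1", Configured: true, Price: 2},
		{ID: "m2", Configured: true, Price: 3},
		{ID: "m3", Configured: false, Price: 1}, // in-flight, not serving the gang
	}
	fmt.Println("cycle N:  ", claimFirstN(fleet, 2))
	fleet[2].Configured = true // m3 matures between cycles
	fmt.Println("cycle N+1:", claimFirstN(fleet, 2))
}
```

With nothing changed except m3 maturing, the claimed pair moves from m1, m2 to m3, m1 in this sketch: m2 was serving, is now unclaimed, and is exactly what Phase 3 would reclaim.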
M77h: under stop-when-covered, prefer this gang’s own incumbents before the keep-priority sort decides among non-incumbents. seedSameProfile marks each seedCandidate.own (the same (fingerprint, Group) predicate, already computed for CreditableOwnTotal), and the claim loop applies incumbentFirst — a stable partition that moves incumbents ahead while preserving keep-priority order within each group (seed.go:249, 267-311). So a gang keeps its current machines and only the marginal (deficit) selection draws from the sorted fresh pool; a maturing equivalent can no longer bump a serving incumbent. The stability is load-bearing: when a gang’s own incumbents themselves exceed the deficit, the partition preserves their keep-priority order, so the excess it sheds is still the §8 release-order tail — the invariant ADR-0045 ties to the unclaimed remainder. It reads current bindings only; ADR-0045’s no-second-ledger rule holds, exactly as the domain tiebreak does. incumbentFirst returns the slice unchanged when the partition is a no-op (no incumbents / all incumbents / already-ordered), so the dwell-free and no-attribution paths stay byte-identical and allocate nothing.
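The partition itself is small. The following is a minimal sketch of a stable incumbent-first partition with the no-op fast path described above; the seedCandidate shape is reduced to an ID and the own flag, and the body is an illustration written for this doc rather than a copy of the code in seed.go.

```go
package main

import "fmt"

// seedCandidate is reduced to the two fields the partition needs (hypothetical shape).
type seedCandidate struct {
	id  string
	own bool // true when the machine is already attributed to this gang
}

// incumbentFirst moves own candidates ahead of the rest while preserving the
// incoming (keep-priority) order inside each group. When the input is already
// partitioned it returns the same slice and allocates nothing.
func incumbentFirst(cs []seedCandidate) []seedCandidate {
	seenOther, needsMove := false, false
	for _, c := range cs {
		if !c.own {
			seenOther = true
		} else if seenOther {
			needsMove = true // an incumbent sits behind a non-incumbent
			break
		}
	}
	if !needsMove {
		return cs // no incumbents, all incumbents, or already ordered
	}
	out := make([]seedCandidate, 0, len(cs))
	for _, c := range cs {
		if c.own {
			out = append(out, c)
		}
	}
	for _, c := range cs {
		if !c.own {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// Incoming order is the keep-priority order; m3 has just matured to the front.
	walk := []seedCandidate{{"m3", false}, {"m1", true}, {"m4", false}, {"m2", true}}
	for _, c := range incumbentFirst(walk) {
		fmt.Printf("%s own=%v\n", c.id, c.own)
	}
}
```

With a deficit of two, the first two entries of the partitioned walk are the gang's own m1 and m2, so the freshly matured m3 no longer displaces a serving machine.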
This is the final binding-granularity the claim pass has: the engine attributes supply at domain (M77g) and machine (M77h); within a machine there is nothing finer to follow. One residual is deliberately left (ADR-0051 addendum): a gang over-covered by its own machines (all the same attribution) still re-picks which N to keep as an own machine matures, because attribution cannot disambiguate equally-attributed machines — but that is genuine over-coverage resolving in a single §8 shed, not perpetual churn, and not a realistic steady-state shape (the engine does not bootstrap more machines for an already-covered gang). Per ADR-0043 it is not worth a cross-cycle mechanism.
The residual, and why it is a gate decision not a bug
M77h was field-verified ACTIVE but insufficient for zero reclaims (#66: reclaimActionsDuringSoak ≈ 340 on dev-50, ~1.3–1.8× the Configured fleet per soak). Small gangs reach a perfect fixed point — the fix works; large gangs still churn marginally. After three sim agents (~60 configs) and six field runs the residual was characterised by elimination (LIVE-STATE.md): not a domain-choice bug (0 flips in clean runs), not machine-selection over-acquire (surplus lands at Idle, not Configured, so Phase 3 — which reclaims unclaimed Configured — never sees it; the over-acquire is a separate Speculative/Idle-tier efficiency item, parked), not cross-claim, not blast-radius-cap-created, not external micro-churn. It is the endogenous async in-flight self-perturbation: the engine never fully converges (~4 persistently Configuring), and its own acquisition maturing asynchronously re-perturbs the chooser each cycle. A deterministic synchronous sim cannot reproduce it (no async-completion ordering) — the devpod is the only arbiter.
This reframes the question from bug to gate posture, and it is the convergence lesson worth carrying: when the same principle needs a fourth application “one granularity down,” the model — not the granularity — is the question. ADR-0045 promises only that bindings count; a fixed point through the in-flight dwell is asking for something the model deliberately does not promise (a machine mid-bootstrap is not yet a binding, so there is nothing for an incumbency pin to anchor to below the binding layer). The two honest options:
- (A) Amend ADR-0045 so “recently-bound” counts as a binding — a bounded second ledger, crossing the author’s core line.
- (B) Accept that the model does not promise a through-dwell fixed point, so a zero-reclaim gate over-specifies; relax it to bounded-reclaim — marginal reclaim/reacquire under continuous churn plus the ~3-cycle bootstrap dwell is correct capacity movement, not a defect (the ADR-0043/YAGNI read).
(B) is what shipped (commit 7921b07, ADR-0035 amended): the dev-50 reclaim gate was de-tailed to the settled window (settleSeconds) — the prior metric opened its window the instant “steady” was declared while the fleet settled 1–2 more minutes, so the settling tail dominated the integral (~340 ≈ 2–3× the settled rate) — and bounded (maxReclaimActionsDuringSoak), not asserted at zero. Field-validated: reclaimActionsDuringSoak 340 → 9 ≤ 150. The engine was untouched; this is not “fixing the gate to pass” but correcting a measurement bug and recognising what the model promises. (A) remains an author fork, not taken autonomously — it is the exact class of over-engineering the ADR-0045 dialogue corrected three times.
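The measurement change can be illustrated with a toy gate. The names settleSeconds, maxReclaimActionsDuringSoak and reclaimActionsDuringSoak come from the text; the gateConfig struct, the use of plain seconds for timestamps, and the window boundaries are assumptions, and the event times below are made up for the example rather than taken from any field run.

```go
package main

import "fmt"

// gateConfig holds the two knobs named in the text; the struct is hypothetical.
type gateConfig struct {
	settleSeconds               float64
	maxReclaimActionsDuringSoak int
}

// reclaimActionsDuringSoak counts reclaim actions inside the settled window,
// which opens settleSeconds after "steady" is declared and closes at soakEnd.
func (g gateConfig) reclaimActionsDuringSoak(steadyAt, soakEnd float64, reclaimAt []float64) int {
	open := steadyAt + g.settleSeconds
	n := 0
	for _, at := range reclaimAt {
		if at >= open && at < soakEnd {
			n++
		}
	}
	return n
}

// pass reports whether the count stays within the bound (not asserted at zero).
func (g gateConfig) pass(count int) bool {
	return count <= g.maxReclaimActionsDuringSoak
}

func main() {
	// Illustrative timestamps: four reclaims in the settling tail, one afterwards.
	reclaimAt := []float64{5, 30, 70, 110, 400}

	old := gateConfig{settleSeconds: 0, maxReclaimActionsDuringSoak: 0}
	n := old.reclaimActionsDuringSoak(0, 600, reclaimAt)
	fmt.Printf("window opens at steady: count=%d pass=%v\n", n, old.pass(n))

	detailed := gateConfig{settleSeconds: 120, maxReclaimActionsDuringSoak: 150}
	n = detailed.reclaimActionsDuringSoak(0, 600, reclaimAt)
	fmt.Printf("settled window:         count=%d pass=%v\n", n, detailed.pass(n))
}
```

The engine's behaviour is identical in both evaluations; only the window and the bound differ, which is the sense in which the change corrects the measurement rather than the engine.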
The final mental model
- One principle, four granularities. A served gang’s placement follows its bindings, never leads them. Applied at: the crediting site (ADR-0040, unify acquisition and crediting semantics); the folding of sub-machine gangs out of Same entirely (ADR-0041); the domain choice (ADR-0042 unsatisfiable-sticky; ADR-0051/M77g satisfiable gang-own tiebreak); the machine set within a domain (M77h incumbentFirst). Within a machine there is nothing finer. ChooseSameBucket’s total order (samebucket.go:38-198), top to bottom: satisfiable > (rule 2) greatest capped creditable coverage > (rule 2b/M77g) greatest capped gang-own coverage > (rule 3) smallest joint Total > (rule 4) most-covering when none satisfiable > (rule 5/ADR-0042) greatest creditable coverage among equal-coverage unsatisfiable buckets > count > lexicographic value. Every rule is stateless and reads current state.
- The attribution metadata is the load-bearing addition, and it is purely additive: AssignedNeedFingerprint (M72) + AssignedGroup (M77g) on the binding, two shard_metadata keys the provider store-and-echoes, with no proto/RPC change, no second ledger, no cross-cycle memory. The chooser distinguishes “my gang’s machine” from “an equal unrelated machine” because the binding now records which gang it serves.
- The perturbation lives below the binding layer. The ~3-cycle async bootstrap dwell moves the acquirable snapshot every cycle. The incumbency pins (domain and machine) make the bound set a fixed point through that dwell, but the irreducible residual is async self-perturbation the model, by design, does not promise to eliminate. That residual is bounded, not zero, and the gate is calibrated to it.
- Methodological residue, equally load-bearing. The synchronous sim self-damps and produces false greens against this entire class — the devpod/field run is the arbiter (ADR-0051 ships sim dwell-fidelity to narrow this). And ADR-0043 stands as the saga’s most expensive lesson: the ADR-0042 parking layer was rigorous mechanism built against demand a single catalog archetype fabricated. Demand realism before mechanism.
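The total order listed above can be read as a single stateless comparator. The sketch below is one reading of that list, not the code in samebucket.go, and it rests on several assumptions: resources are collapsed to a scalar, "satisfiable" is taken to mean joint Total >= demand, rules 2, 2b and 3 are applied to satisfiable pairs and rules 4 and 5 to unsatisfiable pairs, and the direction of the final count tiebreak (larger first) is a guess the text does not settle. The Bucket shape and the better and choose helpers are hypothetical.

```go
package main

import "fmt"

// Bucket is one topology domain's joint totals (hypothetical, scalar units).
type Bucket struct {
	Value              string // domain value, the last-resort tiebreak
	Total              int64  // joint creditable + acquirable supply
	CreditableTotal    int64  // the cluster's bound supply in the domain
	CreditableOwnTotal int64  // the part attributed to this gang
	Count              int    // number of machines in the bucket
}

// capped clamps coverage at the demand.
func capped(x, demand int64) int64 {
	if x > demand {
		return demand
	}
	return x
}

// better reports whether a should be chosen over b. It reads only the two
// buckets and the demand: no per-Need state, no cross-cycle memory.
func better(a, b Bucket, demand int64) bool {
	aSat, bSat := a.Total >= demand, b.Total >= demand
	if aSat != bSat {
		return aSat // rule 1: satisfiable beats unsatisfiable
	}
	if aSat {
		if x, y := capped(a.CreditableTotal, demand), capped(b.CreditableTotal, demand); x != y {
			return x > y // rule 2: greatest capped creditable coverage
		}
		if x, y := capped(a.CreditableOwnTotal, demand), capped(b.CreditableOwnTotal, demand); x != y {
			return x > y // rule 2b: greatest capped gang-own coverage
		}
		if a.Total != b.Total {
			return a.Total < b.Total // rule 3: smallest joint Total
		}
	} else {
		if a.Total != b.Total {
			return a.Total > b.Total // rule 4: most-covering
		}
		if a.CreditableTotal != b.CreditableTotal {
			return a.CreditableTotal > b.CreditableTotal // rule 5: incumbent wins equal coverage
		}
	}
	if a.Count != b.Count {
		return a.Count > b.Count // ASSUMPTION: tiebreak direction is not given in the text
	}
	return a.Value < b.Value // lexicographic value
}

// choose returns the best bucket under better, or false when there are none.
func choose(buckets []Bucket, demand int64) (Bucket, bool) {
	if len(buckets) == 0 {
		return Bucket{}, false
	}
	best := buckets[0]
	for _, b := range buckets[1:] {
		if better(b, best, demand) {
			best = b
		}
	}
	return best, true
}

func main() {
	buckets := []Bucket{
		{Value: "rack-a", Total: 6, CreditableTotal: 3, CreditableOwnTotal: 0, Count: 3},
		{Value: "rack-b", Total: 9, CreditableTotal: 3, CreditableOwnTotal: 3, Count: 4},
	}
	best, _ := choose(buckets, 3)
	fmt.Println("chosen:", best.Value)
}
```

In the example both racks are satisfiable and tied on capped creditable coverage, so the gang-own rule decides before rack-a's smaller joint Total gets a say.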