ADR-0033: Phase 1 supply-credit must respect bind readiness, not just provider state
Status
Rejected, 2026-05-19. Superseded by ADR-0035.
This ADR was Proposed on 2026-05-18 under the framing that Phase 1’s aggregate-supply credit produces a measurable bind plateau under the realistic catalog. Subsequent diagnostic work (see Postmortem below) established that:
- The bind plateau is a kube-scheduler property under high label-cardinality workloads, not a BigFleet behaviour.
- The plateau only manifests during ramp, not steady state.
- The scaletest harness was treating ramp percentage as a pass/fail SLO, which was the actual problem: ramp behaviour is not an SLO under ADR-0014 / ADR-0017 / ADR-0028.
The fix is therefore in the harness (ADR-0035 reshapes the scaletest to measure steady-state SLOs under churn, with pre-seeded inventory removing the ramp entirely), not in Phase 1 or its supply-credit math. No code from this ADR was shipped.
The Context, Goals, Decision, and Migration sections below are preserved unedited for the historical record.
Postmortem
The misdiagnosis chain that produced this ADR is worth recording so the same pattern is recognised faster next time.
Trigger: the canonical uber-5k scale-test reported a 43 % bind ramp ceiling, down from a previously recorded 99 % pass. The drop was treated as a regression.
First misdiagnosis: the regression was attributed to a Configuring → Idle race in pkg/shard/reconcile.go (commit 10986a6) that had been masking the “real” Phase 1 behaviour by accidentally over-provisioning. After the race was fixed, the plateau settled at ~43 %. This was treated as the regression’s true magnitude and the next thread sought to recover the missing throughput.
Second misdiagnosis (this ADR’s OC1 framing): I proposed that Phase 1’s aggregate-supply credit was over-counting because Configured machines absorb demand at the bind path’s pace, not instantaneously — and that a bind-ready signal from the operator would correctly throttle credit. ADR-0033 captured this, with OC2 (time-gated) and OC3 (static slack) as alternatives.
Third misdiagnosis (the apiserver-CPU thread): a diagnostic run named the kwok apiserver REST handler at 8 vCPU as the gate. A CPU-bump test (8 → 16 → 32) followed and produced no change. The apiserver had never been CPU-saturated; the original brief’s pre-committed decision matrix had biased the diagnosis toward that wrong answer.
Fourth (correct) diagnosis: a higher-agency follow-up profiled the actual scheduler. The gate was kube-scheduler’s serial per-Pod scheduling cycle, with an 85 % NodeAffinity rejection rate against the realistic catalog’s label cardinality. This is a property of kube-scheduler under any K8s deployment, not specific to kwok or BigFleet. OC3’s apparent benefit (recorded during the investigation) came from over-provisioning reducing scheduler preemption-walk overhead, which is a kwok-substrate quirk, not a BigFleet algorithmic gap.
The real question, never asked until late: “Is this bind plateau an SLO regression, or just a ramp behaviour we shouldn’t be gating on?” Treating ramp percentage as if it were a per-CR binding-latency SLO led to weeks of substrate / algorithm investigation that wouldn’t have happened had the methodology question come first.
Lessons:
- Ramp behaviour is not an SLO. Capacity-exploration metrics and SLO metrics serve different purposes; conflating them produces investigation rabbit holes that don’t serve the user. ADR-0035 establishes the discipline.
- High-agency briefs disambiguate faster than over-specified ones. The first three diagnostic briefs in the chain over-specified what to capture and pre-committed decision matrices, which biased the data collection. The fourth brief gave the operator free choice of tools and matrices, and it landed the correct diagnosis in one round-trip.
- The “what should this code be solving?” question should precede “how should this code solve it?”. OC1, OC2, OC3 were all clever answers to a wrong question.
No further work on Phase 1 supply-credit is planned. Phase 1’s existing aggregate-supply math (per ADR-0027) is correct in steady state, which is what ADR-0035 requires the test to measure.
(Original ADR content preserved below for the record. Originally Proposed on 2026-05-18.)
Context
ADR-0029 redesigned Phase 1 as an Omega-style OCC. The pre-pass
SeedConfiguredSupply (pkg/decision/occ/seed.go) walks each
priority-sorted Need’s matching Configured and Configuring
machines and subtracts EffectiveAllocatable from the Need’s
AggregateResources until either supply exhausts or demand is
covered. Anything left over becomes the Need’s deficit, which the
OCC worker pool then tries to satisfy from Idle or Speculative
inventory.
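The pre-pass described above can be sketched as follows. This is a minimal illustration, not the code in pkg/decision/occ/seed.go: it assumes resources collapse to a single scalar (Pod-shaped slots), that a higher Priority value sorts first, that a machine is credited to at most one Need, and that matching is a caller-supplied predicate. The type shapes and the Matches field are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// State is a simplified machine lifecycle state (hypothetical enum).
type State int

const (
	Idle State = iota
	Configuring
	Configured
)

// Machine stands in for the inventory record; one scalar replaces the
// real multi-dimensional EffectiveAllocatable.
type Machine struct {
	ID                   string
	State                State
	EffectiveAllocatable int
}

// Need stands in for a rolled-up demand entry.
type Need struct {
	Name               string
	Priority           int
	AggregateResources int
	Matches            func(Machine) bool // hypothetical selector match
}

// seedConfiguredSupply walks Needs in priority order and credits matching
// Configured / Configuring machines against each Need's demand. It returns
// the per-Need deficit left over for the OCC worker pool.
func seedConfiguredSupply(needs []Need, machines []Machine) map[string]int {
	sorted := append([]Need(nil), needs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Priority > sorted[j].Priority // assumption: higher first
	})
	claimed := make(map[string]bool) // assumption: one Need per machine
	deficit := make(map[string]int, len(sorted))
	for _, n := range sorted {
		remaining := n.AggregateResources
		for _, m := range machines {
			if remaining <= 0 {
				break // demand covered
			}
			if m.State != Configured && m.State != Configuring {
				continue
			}
			if claimed[m.ID] || !n.Matches(m) {
				continue
			}
			claimed[m.ID] = true
			remaining -= m.EffectiveAllocatable
		}
		if remaining < 0 {
			remaining = 0
		}
		deficit[n.Name] = remaining
	}
	return deficit
}

func main() {
	matchAll := func(Machine) bool { return true }
	machines := []Machine{
		{ID: "m1", State: Configured, EffectiveAllocatable: 52},
		{ID: "m2", State: Configuring, EffectiveAllocatable: 52},
		{ID: "m3", State: Idle, EffectiveAllocatable: 52},
	}
	needs := []Need{
		{Name: "low", Priority: 1, AggregateResources: 60, Matches: matchAll},
		{Name: "high", Priority: 10, AggregateResources: 80, Matches: matchAll},
	}
	d := seedConfiguredSupply(needs, machines)
	fmt.Println("high deficit:", d["high"], "low deficit:", d["low"])
}
```

In this example the high-priority Need absorbs both credited machines, so the low-priority Need keeps its full demand as deficit; the Idle machine is never credited by the pre-pass.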
The pre-pass treats a Configured machine as supplying its full
nominal allocatable the instant it reaches Configured state. That
matches the provider’s view of reality and was correct semantics
until the realistic catalog (ADR-0032) exposed the next layer of
the system: kube-scheduler’s bind rate is the steady-state gate, not
Phase 1.
The empirical chain that surfaced this (see
project_lessons_learned.md “M46 OCC validation”):
- uber-5k bigfleet-uber #20 at SHA c955a0d reported 99.1 % bind ramp on the realistic catalog. Looked like an OCC win.
- #22–#25 follow-ups showed the 99.1 % was a confound: a Provision → Idle race in pkg/shard/reconcile.go:applyReconciledMachine was forcing ~25 % of Provisions to fire twice. Those redundant Provisions accidentally kept the bind path fed.
- M48.1 (commit 10986a6) closed the race. Ramp dropped to ~44 % on the same SHA family because Phase 1 now provisions to exact nominal demand.
- bigfleet-uber #25 diagnosis identified the actual bind-path gate at uber-5k 2-host: ~30 Pods/s aggregate, driven by UpcomingNode write rate to inner kwok apiservers (operator-side ~0.84/s aggregate creates).
- M48.4 (commit 29c7a2e) dropped operator-side status_update conflicts from 0.29/s → 0/s via client.MergeFrom patches, and cut NodeStateUpdate Ready p99 by 69 %, but the ramp ceiling stayed at 42.8 % because Phase 1 is also part of the gate.
The mechanism is simple:
- What Phase 1 sees: N Configured machines, each contributing e.g. 52 Pod-shaped slots → 52N capacity available right now.
- What kube-scheduler can actually do: bind at ~30 Pods/s system-wide, capped by NodeStateUpdate write throughput downstream of BigFleet.
- Net effect: once enough Configured machines exist to cover aggregate demand, Phase 1’s absorbed_by_supply outcome dominates and emitted_spec falls to ~0/s. Provisioning stops. But binding is still proceeding at 30/s, so many cycles later the bind path is still chewing through the queue while no new capacity arrives. The system ramps to ~43 % bind-fraction-of-demand and plateaus because that’s what aggregate demand divided by the time the system was provisioning happens to converge to.
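The mismatch between instantly credited capacity and a rate-limited bind path can be made concrete with a little arithmetic. The sketch below uses the 52 slots per machine and ~30 Pods/s figures quoted above; the machine count of 100 is a made-up value for illustration, and drainSeconds is a hypothetical helper, not part of BigFleet.

```go
package main

import "fmt"

// drainSeconds returns how long the bind path needs to consume capacity
// that the pre-pass credited instantly, at a fixed system-wide bind rate.
func drainSeconds(machines, slotsPerMachine int, bindRatePodsPerSec float64) float64 {
	return float64(machines*slotsPerMachine) / bindRatePodsPerSec
}

func main() {
	const machines, slots = 100, 52 // machines is hypothetical; slots is from the text
	fmt.Printf("credited slots: %d\n", machines*slots)
	fmt.Printf("seconds to bind them at 30 Pods/s: %.0f\n", drainSeconds(machines, slots, 30))
}
```

Under these assumed numbers the credit is granted in one cycle while the bind path needs minutes to realise it, which is the gap this section describes.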
The race accidentally fixed this by producing ~25 % more
Configured machines than strictly necessary; closing the race
exposed the real bug, which is that Configured machines should
not contribute their full allocatable to Phase 1’s
supply-credit until the bind path has demonstrated it can consume
that supply.
bigfleet.md §8 specifies Phase 1 as “walk needs top-down by
priority. Prefer Idle. Fall back to Speculative.” Neither the
paper nor ADR-0029 is wrong about Phase 1 — what’s wrong is the
pre-OCC creditExistingSupply heuristic that was carried into
ADR-0029’s SeedConfiguredSupply unchanged. The heuristic is fine
for an inviscid bind path; it over-credits whenever bind is
viscous.
Goals
- Eliminate steady-state over-credit. When the bind path is the gate, Phase 1’s absorbed_by_supply outcome must under-count by exactly the amount of Configured capacity that has not yet been demonstrated bindable. Phase 1 should continue to emit_spec until either demand is covered or the realised bind-ready supply covers it.
- No regression at cold-start. A fresh cluster with zero Configured machines must still emit Provisions for every priority-sorted Need exactly as today.
- No regression in the inviscid regime. A bind path that isn’t the gate (e.g. dev-50, small uber-5k catalogs) must still credit Configured machines and avoid double-provisioning.
- Preserve ADR-0027 stage 5.1 attribution. Phase 1 and Phase 3 must continue to attribute supply identically; if Phase 1 stops crediting an unconfirmed-bindable machine, Phase 3’s reclaim mirror must also stop crediting it.
- Preserve the cost formula and hard rules. No pluggable cost. No cluster-side interruption override. The change is purely to which machines participate in supply-credit, not how their cost is computed.
Non-goals
- Increasing the bind path’s throughput. That’s downstream of BigFleet (kwok / kube-scheduler / kine / apiserver). M44.4 already iterated on the harness side; further work belongs in the scaletest harness, not BigFleet itself.
- Adapting Phase 1 to per-cluster bind-rate signals. Phase 1 is one shard’s view; cross-cluster bind rates aren’t directly observable from the shard. The signal we adopt must be a property of the individual machine, not an aggregate rate.
- Changes to bigfleet.md §8 or ADR-0029’s OCC structure. Only SeedConfiguredSupply’s admission rule changes.
- Changes to Phase 2 / Phase 3 victim selection (only the attribution data they read changes).
Decision
Adopt OC1: bind-ready supply-credit. A Configured machine
contributes to SeedConfiguredSupply only after the cluster
operator confirms the machine’s UpcomingNode has reached
Phase = Ready (i.e. the fake or real Node exists and
kube-scheduler can place Pods on it). Until then the machine
exists, is in Configured state, and is owned by a cluster — but
Phase 1 treats it as having zero available capacity for the
supply-credit pre-pass. The OCC worker pool continues to find
candidates from Idle / Speculative and Provisions as needed.
Why OC1 over OC2 / OC3
- OC2 (time-since-Configured rate gate): credit min(EffectiveAllocatable, k × age) where k is a per-cluster bind-rate constant. Cheaper to implement (Machine gains a ConfiguredAt time.Time; no stream message changes) but fundamentally a proxy: a machine that takes 60 s to become bind-ready and one that takes 2 s are treated identically. The constant k is also workload-dependent: uber-5k binds at 30/s system-wide; uber-50k won’t be the same. Picking a constant we have to retune at each rung is exactly the kind of speculative configurability CLAUDE.md / feedback_yagni.md rules out.
- OC3 (slack factor): Phase 1 over-requests by N % to recreate what the race accidentally provided. Simplest to implement but violates the “Phase 1 emits to exact demand” invariant the rest of the system relies on (e.g. Phase 3’s reclaim mirror, ADR-0027 stage 5.1 attribution). It also has no principled stopping point: the right N depends on bind rate vs demand rate at each scale. Slack factor is the answer if the bind path were a fixed bottleneck we accept; it isn’t. It’s a property we want to observe through to Phase 1, not paper over.
OC1 is the only option that closes the loop with a real signal rather than a tuning knob. The signal already exists in the operator’s UpcomingNode reconciler (we know when Phase becomes Ready — that’s exactly what M48.4 was about); we just don’t currently reflect it back to the shard.
Mechanism
- Machine gains a BindReady bool field. Persists across reconcile cycles. False by default; set true once the operator confirms UpcomingNode.Status.Phase = Ready.
- New stream message: NodeBindReady (operator → shard, multiplexed on Shard.Session per the hard rule of no inbound listeners on the operator). Carries machine_id and a confirmed_at timestamp. Idempotent: re-delivery is a no-op. supersedes_key = machine_id per the §0.1 coalescing rule.
- SeedConfiguredSupply admission rule: skip any Configured machine where BindReady == false. Configuring machines remain admitted as today (they’re already pre-credit for the in-flight Configure RPC; double-restricting would regress the inviscid path).
- Phase 3 reclaim mirror: applies the same admission rule. A non-BindReady Configured machine cannot be reclaimed via the supply-credit attribution path; Phase 3 sees it as “claimed by no Need” exactly as Phase 1 does, and Phase 3’s existing logic for orphaned-Configured handles it.
- Reconcile interaction: BindReady is set by the operator stream, not by provider.List. applyReconciledMachine preserves the existing BindReady value across reconciles (analogous to today’s AssignedPriority / AssignedInterruptionPenaltyDollars preservation in reconcile.go:126-129).
- Failure / cluster-loss handling: if the operator disconnects, BindReady is not cleared: the machine’s bind-readiness in the Kubernetes data plane is independent of the stream’s liveness. If a machine transitions out of Configured (Drain, Delete, Fail), BindReady is cleared as part of the state transition.
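The machine-side rules in the list above could look roughly like the sketch below. It is an illustration under stated assumptions, not the proposed implementation: the State enum, the trimmed-down Machine struct, and the function names admitForSupplyCredit and applyReconciled are all hypothetical, and the real applyReconciledMachine carries far more state.

```go
package main

import "fmt"

// State is a simplified lifecycle enum (hypothetical).
type State int

const (
	Idle State = iota
	Configuring
	Configured
	Draining
)

// Machine is a trimmed-down inventory record.
type Machine struct {
	ID        string
	State     State
	BindReady bool
}

// admitForSupplyCredit is the admission rule shared by the Phase 1
// pre-pass and the Phase 3 reclaim mirror.
func admitForSupplyCredit(m Machine) bool {
	switch m.State {
	case Configuring:
		return true // admitted as today
	case Configured:
		return m.BindReady // only once the operator has confirmed Ready
	default:
		return false
	}
}

// applyReconciled merges a provider-listed machine into the inventory.
// The provider knows nothing about BindReady, so the existing value is
// carried over, and it is cleared once the machine is no longer Configured.
func applyReconciled(existing, listed Machine) Machine {
	out := listed
	out.BindReady = existing.BindReady && listed.State == Configured
	return out
}

func main() {
	ready := Machine{ID: "m1", State: Configured, BindReady: true}
	fmt.Println("credited before reconcile:", admitForSupplyCredit(ready))

	same := applyReconciled(ready, Machine{ID: "m1", State: Configured})
	fmt.Println("BindReady preserved:", same.BindReady)

	drained := applyReconciled(ready, Machine{ID: "m1", State: Draining})
	fmt.Println("BindReady after Drain:", drained.BindReady)
}
```

Keeping the rule in one predicate is one way to make Phase 1 and Phase 3 attribute supply identically, which the Goals section requires.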
Cold-start behaviour
A fresh cluster has zero Configured machines, so
SeedConfiguredSupply credits nothing, and Phase 1 Provisions
every Need from Idle / Speculative. As Configures complete and
the operator confirms Ready, BindReady flips true
machine-by-machine and supply-credit ramps with bind capacity. This is the
desired behaviour and matches what the race was accidentally
producing.
Validation
Two-rung validation, mirroring ADR-0029’s pattern:
- uber-5k 2-host: bind ramp must reach ≥95 % within the realistic-catalog budget. The 42.8 % ceiling observed at 29c7a2e is the regression bar; anything below ~95 % is a failure of the design, not a tuning issue.
- uber-50k: bind ramp must reach ≥95 %; per-cycle wall-clock p99 ≤ ADR-0028’s envelope. Per the uber ladder discipline (no pre-filing above the in-flight scale), runs at uber-500k+ wait for Uber approval before the validation arc continues.
absorbed_by_supply rate must remain non-zero in steady state
(i.e. the bind-ready pool eventually catches up and the system
reaches an equilibrium where some cycles credit existing supply
rather than emitting). A pathology where absorbed_by_supply
stays at 0 indicates the bind path is permanently capped — that’s
a downstream problem, not a Phase 1 regression.
Alternatives considered
OC2: time-since-Configured rate gate
SeedConfiguredSupply credits min(EffectiveAllocatable, k × age) where k is a configurable bind-rate constant per cluster.
Pros: no new stream message; no operator-side change.
Implementation is one new field on Machine
(ConfiguredAt time.Time) and a few lines in seed.go.
Cons: k is a per-workload tuning knob the operator can’t
introspect. Wrong k either over-provisions (low k) or
under-provisions (high k); the latter just reproduces today’s bug at
a different rate. No principled way to set the default.
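For comparison, the OC2 credit formula could be sketched as below. The function names are hypothetical, resources are reduced to a scalar slot count, and the value k = 2 slots/s is an arbitrary illustration rather than a measured bind rate.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// oc2CreditForAge returns min(effectiveAllocatable, k × age) with age in
// seconds. Negative ages (clock skew) are treated as zero.
func oc2CreditForAge(effectiveAllocatable, k, ageSeconds float64) float64 {
	if ageSeconds < 0 {
		ageSeconds = 0
	}
	return math.Min(effectiveAllocatable, k*ageSeconds)
}

// oc2Credit derives the age from the machine's ConfiguredAt timestamp.
func oc2Credit(effectiveAllocatable, k float64, configuredAt, now time.Time) float64 {
	return oc2CreditForAge(effectiveAllocatable, k, now.Sub(configuredAt).Seconds())
}

func main() {
	configuredAt := time.Date(2026, 5, 18, 12, 0, 0, 0, time.UTC)
	ages := []time.Duration{2 * time.Second, 10 * time.Second, 60 * time.Second}
	for _, age := range ages {
		credit := oc2Credit(52, 2.0, configuredAt, configuredAt.Add(age))
		fmt.Printf("age=%v credit=%.0f\n", age, credit)
	}
}
```

The sketch makes the drawback visible: the credit depends only on elapsed time and k, so two machines of the same age get the same credit regardless of whether either one is actually bind-ready.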
OC3: deliberate slack factor
Phase 1 multiplies each Need’s AggregateResources by 1 + slack
(e.g. 1.25 to recreate the ~25 % race effect) before passing to
SeedConfiguredSupply. The extra demand spills past available
supply and Phase 1 emits Provisions for it.
Pros: smallest diff. No new field, no new message. One multiplication.
Cons: violates “Phase 1 emits to exact demand”
(ADR-0027 / ADR-0029 both assume this). Breaks Phase 3
attribution unless Phase 3 also adopts the same factor —
ADR-0027 stage 5.1 invariant is fragile to asymmetric supply
accounting. The right slack is workload-dependent and would
need re-tuning at every uber-* rung. It’s a knob, not a fix.
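The OC3 arithmetic is small enough to show directly. This sketch assumes scalar demand, expresses slack as an integer percentage to avoid floating-point rounding surprises, and rounds up; the function names and the rounding choice are assumptions, not part of the proposal.

```go
package main

import "fmt"

// inflateDemand scales a Need's demand by (100 + slackPercent) / 100,
// rounding up so a non-zero slack never rounds away entirely.
func inflateDemand(aggregate, slackPercent int) int {
	return (aggregate*(100+slackPercent) + 99) / 100
}

// deficitAfterCredit is what the pre-pass leaves for Phase 1 to provision.
func deficitAfterCredit(demand, creditedSupply int) int {
	if creditedSupply >= demand {
		return 0
	}
	return demand - creditedSupply
}

func main() {
	const demand, supply = 100, 100 // supply exactly covers nominal demand
	fmt.Println("deficit at exact demand:", deficitAfterCredit(demand, supply))
	fmt.Println("deficit with 25% slack:", deficitAfterCredit(inflateDemand(demand, 25), supply))
}
```

With supply exactly covering nominal demand, the inflated demand spills past supply and Phase 1 emits Provisions for the difference, which is the over-request the Cons paragraph objects to.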
Status-quo + downstream tuning
Leave Phase 1 alone; bump UpcomingNode write concurrency, kine write throughput, etc. until the bind path is no longer the gate.
Pros: respects layering; Phase 1 isn’t the proximate cause.
Cons: the harness limits we’d be tuning around are properties of the scaletest infrastructure (kwok / kine / apiserver), not BigFleet. We don’t ship those. And the next user to run BigFleet against a slow customer apiserver hits the same wall.
Hard rules touched
This ADR adds one new stream message (NodeBindReady) on
Shard.Session. Per the §0.1 decisions:
- Operator remains outbound-only — the new message is multiplexed on the existing operator-initiated bidi stream, not a new RPC.
- supersedes_key = machine_id per the coalescing rule, so reconnect ordering is safe.
- Conformance: the message is a hint, not a correctness requirement. A provider / operator that doesn’t emit NodeBindReady simply has every Configured machine stay at BindReady = false → behaviour reverts to “Phase 1 always Provisions, never credits”. That’s the cold-start path; it’s correct, just under-efficient. Conformance suite covers the emit path; absence is not a failure.
No other hard rule is touched. Cost formula unchanged. Provider RPC surface unchanged (this is operator ↔ shard, not provider). Static stability preserved (BindReady is set by the operator stream, but its absence falls back to “Provision more” — the shard remains autonomous during coordinator failover and during operator disconnect).
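The hint semantics described above (idempotent delivery, absence falling back to BindReady = false) might be handled on the shard side along these lines. The Go struct mirrors the machine_id and confirmed_at fields named in the Mechanism section, but the Inventory type, the handler name, and the choice to ignore messages for unknown or non-Configured machines are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// NodeBindReady mirrors the proposed stream message.
type NodeBindReady struct {
	MachineID   string
	ConfirmedAt time.Time
}

// Machine is a trimmed-down inventory record (hypothetical shape).
type Machine struct {
	ID         string
	Configured bool
	BindReady  bool
}

// Inventory is a hypothetical in-memory machine index keyed by machine ID.
type Inventory map[string]*Machine

// handleNodeBindReady applies the hint and reports whether state changed.
// Re-delivery, unknown machines, and non-Configured machines are no-ops.
func (inv Inventory) handleNodeBindReady(msg NodeBindReady) bool {
	m, ok := inv[msg.MachineID]
	if !ok || !m.Configured || m.BindReady {
		return false
	}
	m.BindReady = true
	return true
}

func main() {
	inv := Inventory{"m1": &Machine{ID: "m1", Configured: true}}
	msg := NodeBindReady{MachineID: "m1", ConfirmedAt: time.Now()}
	fmt.Println("first delivery changed state:", inv.handleNodeBindReady(msg))
	fmt.Println("re-delivery changed state:", inv.handleNodeBindReady(msg))
}
```

If no message ever arrives, nothing in this handler runs and every machine keeps BindReady = false, which is the fallback the conformance bullet describes.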
Migration plan
Layered behind a BindReadyCredit shard config flag, default
false in the first commit. Once validated at uber-5k and
uber-50k, flip to default true and remove the flag in a
follow-up.
- Stage 0: Wire format. Add NodeBindReady to shard.proto’s Shard.Session message types.
- Stage 1: Domain types. Add Machine.BindReady bool; plumb through proto conversion in pkg/api/conv; preserve across reconcile.
- Stage 2: Operator emit. When the UpcomingNode reconciler observes Status.Phase → Ready, emit NodeBindReady on the shard stream.
- Stage 3: Shard ingest. Handle NodeBindReady in the shard’s stream loop; flip BindReady = true on the inventory machine. Idempotent.
- Stage 4: SeedConfiguredSupply admission rule, gated on the BindReadyCredit config flag. Phase 3 mirror.
- Stage 5: uber-5k bigfleet-uber brief; validate ≥95 % ramp.
- Stage 6: uber-50k bigfleet-uber brief; validate ≥95 % ramp at scale.
- Stage 7: Default the flag to true. Remove the conditional after one more clean validation pass.
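The Stage 4 flag gating could be as small as the sketch below. ShardConfig, the trimmed Machine struct, and creditsSupply are hypothetical names; only the BindReadyCredit flag and the BindReady field come from the plan above.

```go
package main

import "fmt"

// State is a simplified lifecycle enum (hypothetical).
type State int

const (
	Idle State = iota
	Configuring
	Configured
)

// Machine is a trimmed-down inventory record.
type Machine struct {
	State     State
	BindReady bool
}

// ShardConfig carries the rollout flag; false preserves today's behaviour.
type ShardConfig struct {
	BindReadyCredit bool
}

// creditsSupply reports whether a machine participates in supply-credit.
// With the flag off every Configured machine is credited, as today; with
// the flag on a Configured machine must also be BindReady.
func creditsSupply(cfg ShardConfig, m Machine) bool {
	switch m.State {
	case Configuring:
		return true
	case Configured:
		return !cfg.BindReadyCredit || m.BindReady
	default:
		return false
	}
}

func main() {
	m := Machine{State: Configured, BindReady: false}
	fmt.Println("flag off:", creditsSupply(ShardConfig{BindReadyCredit: false}, m))
	fmt.Println("flag on: ", creditsSupply(ShardConfig{BindReadyCredit: true}, m))
}
```

Once Stage 7 removes the conditional, the Configured case would reduce to returning m.BindReady.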
uber-500k+ remains gated on prior Uber approval per the standing
policy (project_uber_scale_ladder.md).
References
- ADR-0027: Roll-up demand is a constrained aggregate resource request.
- ADR-0028: Cycle-p99 SLO is regime-parametric.
- ADR-0029: Phase 1 Omega-style OCC.
- ADR-0032: Realistic catalog production-calibrated workload distribution.
- project_lessons_learned.md § “M46 OCC validation: c955a0d’s 99.1 % ‘pass’ was race-induced over-provisioning”.
- bigfleet-uber #25 (bind-path gate diagnosis) and #26 (M48.4 validation).