ADR-0039: One CapacityRequest per Pod — not per *unschedulable* Pod

Status

Accepted, 2026-05-21.

Context

BigFleet’s demand signal is the operator roll-up, aggregated from CapacityRequest (CR) objects. The reference per-pod controller (pkg/controller/cr, the bigfleet-unschedulable-pod-controller) creates a CR only for Pods it observes with the condition PodScheduled=False, reason=Unschedulable.

The diagnostic chain bigfleet-uber #45 → #48 traced the sustained ~32 machines/sec Bootstrap+Reclaim cascade under stable demand to this filter. #48 measured it directly: ~84 % of bound Pods carry no CR. A traced Pod was scheduled 13 s after creation without ever having been marked Unschedulable, so the controller never created a CR for it. Two paths bypass Unschedulable entirely:

  • the scale-test pre-bind fast-path binds Pods via the Bind API before the scheduler ever marks them Unschedulable;
  • after ADR-0038, a controller-recreated Pod binds straight onto capacity Phase 1 has just Bootstrapped, again skipping Unschedulable.
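
The gap can be illustrated with a small Python model. This is a sketch only, not the code in pkg/controller/cr: the `Condition` and `Pod` classes and the function `old_filter_creates_cr` are hypothetical names introduced here; only the condition values PodScheduled=False, reason=Unschedulable come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Condition:
    type: str
    status: str
    reason: str = ""


@dataclass
class Pod:
    name: str
    # Every condition snapshot the controller observed over the Pod's life
    # (hypothetical representation, for illustration only).
    observed: list = field(default_factory=list)


def is_unschedulable(cond):
    """True for PodScheduled=False, reason=Unschedulable."""
    return (
        cond.type == "PodScheduled"
        and cond.status == "False"
        and cond.reason == "Unschedulable"
    )


def old_filter_creates_cr(pod):
    """Model of the Unschedulable-only filter: a CR exists only if the
    Pod was ever seen in the Unschedulable state."""
    return any(is_unschedulable(c) for c in pod.observed)


# A Pod that waited for capacity passes through Unschedulable first.
waited = Pod("waited", [
    Condition("PodScheduled", "False", "Unschedulable"),
    Condition("PodScheduled", "True"),
])

# A pre-bound Pod, or one that lands on freshly Bootstrapped capacity,
# is only ever observed as scheduled.
prebound = Pod("prebound", [Condition("PodScheduled", "True")])

print(old_filter_creates_cr(waited))    # True
print(old_filter_creates_cr(prebound))  # False
```

Under this model the second Pod is bound and consuming capacity, yet it contributes nothing to the demand signal.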

The consequence is asymmetric. Phase 1 (Bootstrap) is correct on an unmet-demand signal — “what have I not yet satisfied?” is unmet demand. Phase 3 (Reclaim) is structurally broken by it — “do I have excess supply?” requires total demand. Phase 3 reads CR-count per archetype as its demand proxy; with CRs undercounting total demand ~6×, it sees a permanent phantom surplus and reclaims into it every cycle. Each reclaim evicts real bound Pods; ADR-0038’s controllers recreate them; they re-acquire CRs only transiently — sustaining the cascade.
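
The size of the phantom surplus follows from the measured coverage. The arithmetic below is a worked illustration, not Phase 3's actual code: the fleet size of 1000 Pods, the assumption of one unit of supply per Pod, and the function names are all hypothetical. Only the ~84 % figure comes from #48.

```python
def undercount_factor(total_pods, pods_with_cr):
    """How many times smaller the CR count is than real demand."""
    return total_pods / pods_with_cr


def apparent_surplus(supply, demand_proxy):
    """Surplus as seen by a reader that trusts demand_proxy."""
    return max(0, supply - demand_proxy)


total_pods = 1000    # hypothetical fleet size
pods_with_cr = 160   # ~16 % carry a CR (100 % - 84 %)
supply = 1000        # assumption: supply exactly meets real demand

print(undercount_factor(total_pods, pods_with_cr))  # 6.25
print(apparent_surplus(supply, pods_with_cr))       # 840
print(apparent_surplus(supply, total_pods))         # 0
```

With supply exactly matching real demand, the CR-count proxy still reports 840 units of surplus, while the complete signal reports none.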

The papers are unambiguous on the intended model:

  • Fleet-Scale Kubernetes §6.1: “One CR per pod.”
  • BigFleet §13 (scale-down): “CR garbage-collected via ownerRef → next roll-up has fewer needs → Phase 3 reclaims.” This mechanism only works if a running Pod has a CR — otherwise deleting a bound Pod changes nothing in the roll-up.
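
The dependence of §13 on every running Pod having a CR can be shown with a toy roll-up. In this sketch a CR is a plain dict, and the field names `owner_uid` and `archetype`, along with both function names, are hypothetical. The real roll-up is a constrained aggregate (ADR-0027), not a bare counter.

```python
from collections import Counter


def roll_up(crs):
    """Count CRs per archetype, a stand-in for the operator roll-up."""
    return Counter(cr["archetype"] for cr in crs)


def delete_pod(pod_uid, crs):
    """Model ownerRef garbage collection: CRs owned by the deleted
    Pod disappear, all others are untouched."""
    return [cr for cr in crs if cr["owner_uid"] != pod_uid]


# Pods "a" and "b" have CRs. Pod "c" is bound but never got one.
crs = [
    {"owner_uid": "a", "archetype": "gpu"},
    {"owner_uid": "b", "archetype": "gpu"},
]

print(roll_up(delete_pod("a", crs))["gpu"])  # 1: demand visibly drops
print(roll_up(delete_pod("c", crs))["gpu"])  # 2: roll-up is unchanged
```

Deleting the Pod without a CR leaves the roll-up exactly as it was, so nothing downstream can react to the scale-down.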

Yet Fleet-Scale Kubernetes §12 describes the reference controller as one that “watches PodScheduled=False, reason=Unschedulable.” That mechanism yields one-CR-per-pod only under an implicit assumption: that every Pod is Unschedulable at birth. The assumption holds when a BigFleet-managed cluster runs at capacity, so that a new Pod has nowhere to go until BigFleet provisions. When it breaks (a Pod binds onto spare capacity, or is pre-bound), §12’s mechanism under-produces CRs and the §6.1 / §13 invariants are violated. §6.1 and §13 are load-bearing; §12’s Unschedulable filter is an implementation detail that is only incidentally correct.

Decision

The reference per-pod capacity-request controller creates a CR for every Pod, not only those in reason=Unschedulable. It honours the §6.1 contract — one CR per Pod, for the Pod’s lifetime — directly, regardless of whether the Pod ever transits Unschedulable.

The CR remains owner-referenced to its Pod and is garbage-collected when the Pod is deleted — withdrawal stays implicit (§13), unchanged.
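
As a rough sketch of what this produces, the function below builds one CR manifest per Pod as a plain dict. The `ownerReferences` fields are the standard Kubernetes metadata fields. The API group `bigfleet.example/v1alpha1`, the `cr-` naming scheme, and both function names are assumptions made for illustration, and the CR spec is omitted because this document does not describe it.

```python
def capacity_request_for(pod):
    """Build a CR manifest owned by the given Pod (illustrative only)."""
    meta = pod["metadata"]
    return {
        "apiVersion": "bigfleet.example/v1alpha1",  # hypothetical group/version
        "kind": "CapacityRequest",
        "metadata": {
            "name": "cr-" + meta["name"],           # hypothetical naming scheme
            # An owned object must live in its owner's namespace.
            "namespace": meta["namespace"],
            "ownerReferences": [{
                "apiVersion": "v1",
                "kind": "Pod",
                "name": meta["name"],
                "uid": meta["uid"],
                "controller": True,
                "blockOwnerDeletion": False,
            }],
        },
    }


def desired_crs(pods):
    """One CR per Pod, with no filter on PodScheduled or its reason."""
    return [capacity_request_for(p) for p in pods]


pods = [
    {"metadata": {"name": "web-0", "namespace": "default", "uid": "1234"}},
    {"metadata": {"name": "web-1", "namespace": "default", "uid": "5678"}},
]
for cr in desired_crs(pods):
    print(cr["metadata"]["name"], cr["metadata"]["ownerReferences"][0]["uid"])
```

Because the owner reference carries the Pod's UID, the cluster's garbage collector removes the CR when that Pod is deleted, and the controller needs no explicit withdrawal step.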

No pkg/decision or operator change. With the demand signal now complete, Phase 3’s existing claimMatching arithmetic is correct: total demand vs Configured supply yields the true surplus, and Phase 3 reclaims it once and self-arrests.
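
The difference between a sustained cascade and a single self-arresting reclaim can be seen in a deliberately simplified cycle model. This is not claimMatching: the function name, the one-unit-per-Pod assumption, the fixed CR count per cycle, and all numbers are hypothetical. The only behaviour modelled is that Phase 3 reclaims supply above the CR count and Phase 1 then re-provisions whatever real demand is left unmet.

```python
def reclaim_per_cycle(supply, total_demand, cr_count, cycles):
    """Return the number of units reclaimed in each cycle (toy model)."""
    reclaimed = []
    for _ in range(cycles):
        # Phase 3: treat the CR count as demand and reclaim the rest.
        surplus = max(0, supply - cr_count)
        reclaimed.append(surplus)
        supply -= surplus
        # Phase 1: evicted Pods are recreated and real demand is
        # re-provisioned, restoring supply to total demand.
        supply += max(0, total_demand - supply)
    return reclaimed


# Incomplete signal: 160 CRs for 1000 Pods, supply already correct.
print(reclaim_per_cycle(1000, 1000, 160, 3))   # [840, 840, 840]

# Complete signal: one CR per Pod, 200 units of genuine surplus.
print(reclaim_per_cycle(1200, 1000, 1000, 3))  # [200, 0, 0]
```

In the first run every cycle reclaims capacity that is actually in use and Bootstrap restores it. In the second the true surplus is reclaimed once and later cycles find nothing to do.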

Consequences

  • The roll-up carries the cluster’s total desired capacity, matching BigFleet §1 / CLAUDE.md (“roll-ups are the cluster’s complete desired state”) and the author’s stated rule (“total capacity, not the extra, or constant thrashing”). The #45–#48 reclaim cascade ends.
  • CR object count rises to ≈ one per Pod (previously only the unmet-demand fraction, ~16 %). The roll-up wire size is unaffected — ADR-0027 already made it a constrained aggregate, independent of Pod count. The cluster apiserver’s object count roughly doubles (Pods + CRs); measured at validation.
  • Paper divergence recorded: Fleet-Scale Kubernetes §12’s “watches PodScheduled=False, reason=Unschedulable” is superseded — the controller watches all Pods. §6.1 (“one CR per pod”) and §13 (scale-down) are unchanged and are the invariants the controller now satisfies unconditionally.
  • The controller’s name (bigfleet-unschedulable-pod-controller) is now a historical misnomer. Renaming touches the binary, chart, metric, and docs; it is cosmetic and deferred — not done here.