ADR-0025: The load-driver anchors sameRack groups — a gang-scheduler stand-in
Status: Accepted
Date: 2026-05-14
Context
ADR-0024 made the scaletest harness emit real podAffinity on sameRack-archetype pods, so the unschedulable-pod-controller translates it into CapacityRequest.Spec.CoLocation and the operator turns it into a Same requirement at roll-up. This exercises the co-location path end-to-end through real Kubernetes objects.
The kind dev-500 validation of ADR-0024 surfaced a problem: the run reached ~85% binds and stalled, unable to clear the ramp gate. The ~15% of pods from the sameRack archetypes (gpu-training, memory-db) never bound.
The cause is a documented Kubernetes limitation, not a kwok artifact or a BigFleet bug. A sameRack pod carries a self-referential requiredDuringSchedulingIgnoredDuringExecution podAffinity: “schedule me on a rack that already has a pod from my group.” When a whole group is created at once into a cluster with no running peers, the first pod can never schedule — kube-scheduler’s InterPodAffinity filter requires an already-assigned peer pod, and there isn’t one. The Kubernetes pod-affinity design proposal states this directly:
“if all pods in service S have this RequiredDuringScheduling rule in their PodSpec, then the RequiredDuringScheduling rule will block the first pod of the service from ever scheduling, since it is only allowed to run in a zone with another pod from the same service.”
The same document proposes a scheduler self-match exemption (“if the requirement matches a pod’s own labels, and there are no other such pods anywhere, then disregard the requirement”) — but it is listed as a proposed short-term fix only and was never implemented. Current kube-scheduler has no such exemption; the bootstrap deadlock is real on a real cluster too (e.g. k3s-io/k3s#9350).
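The deadlock described above can be illustrated with a toy model. This is a minimal sketch, not kube-scheduler code: the Pod type, rackAllowed, and schedulable are hypothetical names, and the filter is simplified to the rule as this ADR describes it (a rack passes only if it already hosts an assigned peer from the same group, with no self-match exemption).

```go
package main

import "fmt"

// Pod is a hypothetical, stripped-down stand-in for a Kubernetes pod.
type Pod struct {
	Name  string
	Group string // value of the co-location group label
	Rack  string // rack of the node the pod is assigned to; "" while pending
}

// rackAllowed models a required, self-referential podAffinity term with a
// rack topology key: the rack passes only if it already hosts an assigned
// pod from the same group. No self-match exemption is modelled.
func rackAllowed(p Pod, rack string, all []Pod) bool {
	for _, peer := range all {
		if peer.Name != p.Name && peer.Group == p.Group && peer.Rack == rack {
			return true
		}
	}
	return false
}

// schedulable reports whether any rack passes the filter for p.
func schedulable(p Pod, racks []string, all []Pod) bool {
	for _, r := range racks {
		if rackAllowed(p, r, all) {
			return true
		}
	}
	return false
}

func main() {
	racks := []string{"rack-a", "rack-b"}
	// A whole group created at once: every member is still pending.
	pods := []Pod{{Name: "train-0", Group: "g1"}, {Name: "train-1", Group: "g1"}}
	for _, p := range pods {
		fmt.Println(p.Name, "schedulable:", schedulable(p, racks, pods))
	}
}
```

In this model every member of a freshly created group reports false, because each one is waiting for a peer that is itself waiting.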
Production fleets don’t hit this with raw podAffinity. Strict gang co-location is handled in a layer above the autoscaler: gang schedulers (Volcano, the coscheduling plugin, Kueue’s topology-aware scheduling) place the whole group atomically; or a leader-anchor pattern is used (MPIJob’s launcher schedules unconstrained, workers podAffinity onto it); or placement-group labels turn it into nodeAffinity. The autoscaler’s job is to provision the co-located capacity — the atomic placement onto it is a scheduler-layer concern.
This puts the harness in tension with ADR-0023, which deliberately removed harness-side pod binding (pod-shim’s blanket /binding) because “the harness’s choice of scheduler is the dominant variable in the published numbers.” We need a way to let sameRack pods bind without either (a) reverting ADR-0024’s real-podAffinity modelling, or (b) silently re-introducing harness binding without justification.
Decision
The load-driver force-binds one anchor pod per sameRack co-location group, playing the role a gang scheduler plays in a real fleet.
A reconcile loop in the load-driver, running on a short interval:
- Lists pods carrying the scaletest.bigfleet/co-location-group label, grouped by that label’s value.
- For each group with pending pods but no pod yet bound to a node (no anchor): finds a fresh fake-Node, one not yet claimed by any co-location group, whose Allocatable fits the group’s pod shape.
- Force-binds one pending group pod onto that node via the Binding subresource (the typed clientset.CoreV1().Pods(ns).Bind() path; controller-runtime’s cache layer has known issues with the binding subresource, per pod-shim’s experience).
kwok then marks the anchor Running. Real kube-scheduler now sees a running peer with the group label and places the rest of the group onto the same rack via podAffinity, normally.
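The selection part of the reconcile loop can be sketched as a pure planning function. This is an illustration under stated assumptions, not the load-driver's actual code: the Pod and Node types, their fields, and planAnchors are hypothetical, "fits the group's pod shape" is reduced to a CPU and memory comparison against one pending pod, and the real Bind call is left out so the sketch needs only the standard library.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod and Node are hypothetical stand-ins for the Kubernetes objects; the
// field names are illustrative, not the load-driver's real types.
type Pod struct {
	Name     string
	Group    string // value of scaletest.bigfleet/co-location-group; "" if unlabelled
	NodeName string // "" while the pod is pending
	CPUMilli int64
	MemBytes int64
}

type Node struct {
	Name     string
	CPUMilli int64 // allocatable CPU
	MemBytes int64 // allocatable memory
}

// planAnchors returns pod name -> node name for every group that has pending
// pods but no bound pod. A node counts as claimed if it already hosts a pod
// from any co-location group, or if this pass has just assigned it.
func planAnchors(pods []Pod, nodes []Node) map[string]string {
	byGroup := map[string][]Pod{}
	claimed := map[string]bool{}
	for _, p := range pods {
		if p.Group == "" {
			continue // not a co-location pod; kube-scheduler handles it alone
		}
		byGroup[p.Group] = append(byGroup[p.Group], p)
		if p.NodeName != "" {
			claimed[p.NodeName] = true
		}
	}
	groups := make([]string, 0, len(byGroup))
	for g := range byGroup {
		groups = append(groups, g)
	}
	sort.Strings(groups) // deterministic order for the sketch

	plan := map[string]string{}
	for _, g := range groups {
		var pending *Pod
		anchored := false
		for i := range byGroup[g] {
			p := &byGroup[g][i]
			if p.NodeName != "" {
				anchored = true // group already has a bound peer
				break
			}
			if pending == nil {
				pending = p
			}
		}
		if anchored || pending == nil {
			continue
		}
		for _, n := range nodes {
			if !claimed[n.Name] && n.CPUMilli >= pending.CPUMilli && n.MemBytes >= pending.MemBytes {
				claimed[n.Name] = true
				plan[pending.Name] = n.Name
				break
			}
		}
	}
	return plan
}

func main() {
	pods := []Pod{
		{Name: "gpu-0", Group: "g1", CPUMilli: 4000, MemBytes: 8 << 30},
		{Name: "gpu-1", Group: "g1", CPUMilli: 4000, MemBytes: 8 << 30},
		{Name: "db-0", Group: "g2", NodeName: "node-9", CPUMilli: 2000, MemBytes: 16 << 30},
		{Name: "db-1", Group: "g2", CPUMilli: 2000, MemBytes: 16 << 30},
		{Name: "web-0", CPUMilli: 500, MemBytes: 1 << 30},
	}
	nodes := []Node{
		{Name: "node-1", CPUMilli: 2000, MemBytes: 4 << 30},
		{Name: "node-2", CPUMilli: 8000, MemBytes: 32 << 30},
	}
	fmt.Println(planAnchors(pods, nodes)) // prints map[gpu-0:node-2]
}
```

In the example, group g2 already has a bound member and is skipped, node-1 is too small for the g1 pod shape, and the unlabelled pod is ignored, so exactly one anchor is planned. Each planned pair would then go through the Binding subresource.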
This is legitimate, not a credibility dodge. It is not the harness papering over a kube-scheduler weakness — kube-scheduler is behaving correctly; gang bootstrapping is genuinely outside its remit. The load-driver is standing in for the gang scheduler a real fleet would run above the autoscaler. It is the opposite of pod-shim’s blanket binding: pod-shim bound every pod, replacing the scheduler entirely; this binds one anchor per sameRack group (~15% of pods × 1/groupsize), and real kube-scheduler still does all the actual scheduling work.
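The scale of the harness-assisted share follows from the arithmetic above. A minimal sketch, assuming a uniform group size; anchorFraction is a hypothetical helper, and the group size of 8 in the usage is a made-up value for illustration, not a figure from any profile.

```go
package main

import "fmt"

// anchorFraction returns the share of all pods that the load-driver
// force-binds: one anchor per group, so sameRackShare / groupSize.
func anchorFraction(sameRackShare float64, groupSize int) float64 {
	if groupSize <= 0 {
		return 0
	}
	return sameRackShare / float64(groupSize)
}

func main() {
	// 0.15 is this ADR's ~15% sameRack share; 8 is an assumed group size.
	fmt.Printf("force-bound share of all pods: %.4f\n", anchorFraction(0.15, 8))
}
```

With those inputs the force-bound share is just under 2% of all pods; every other pod, including the remaining members of each sameRack group, is placed by kube-scheduler.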
BigFleet is entirely unaffected. It provisions capacity for the aggregated Same Need regardless of whether or where pods bind — the kind run confirmed it provisioned the co-located machines correctly. The anchor loop only moves the user-facing bind metric; it touches no BigFleet code path.
Consequences
What we gain
- dev-500 (and any sameRack-using profile) can clear its ramp gate again, while keeping ADR-0024’s real-podAffinity modelling: the full pod → UPC → CoLocation → operator → Same → shard path runs end-to-end through real objects.
- The harness models the real production topology: the autoscaler provisions co-located capacity, and a gang-scheduler equivalent places the group atomically onto it.
What we lose / the caveats
- sameRack pods’ bind metric is harness-assisted for one anchor per group; it is not 100% pure kube-scheduler. This must be stated when interpreting sameRack bind latency. The ~85% of non-co-located pods remain pure kube-scheduler (ADR-0023).
- The anchor may land on a fresh fake-Node other than the specific machine BigFleet provisioned for that group’s Need. At density-100 with fungible machines this converges (the “displaced” machine is filled by other pods); it is a harness-accounting nicety, not a correctness issue.
- The reconcile loop lists co-location-labelled pods on an interval. The cost is bounded by the sameRack fraction (~15%), but at the largest profiles (scaleway-5m) this is worth revisiting with a field-selector / informer if it shows up in the apiserver budget.
What stays the same
- ADR-0023 holds: the harness runs real kube-scheduler, and it does all scheduling for the ~85% non-co-located pods and for the non-anchor members of every sameRack group.
- BigFleet (pkg/shard, pkg/operator, pkg/decision, the UPC) is unchanged. This is purely a load-driver addition.
Alternatives considered
- A: sameRack pods become plain pods in Mode=pods (no podAffinity); cover the Same path via a Mode=cr profile. Simplest; dev-500 goes green immediately. Rejected as the primary choice because it drops real-podAffinity modelling from the Pod-mode path that every cloud profile uses; the UPC’s podAffinity → CoLocation step would then only ever be unit-tested, never exercised e2e at scale.
- B: keep podAffinity, exclude sameRack pods from the bind gate, verify the BigFleet side via provisioning metrics. Faithful, but it permanently leaves ~15% of the population unbindable in the harness and needs runner gate-query surgery to special-case them: more machinery for a less complete result.
- C (chosen): anchor one pod per group. Keeps real podAffinity, keeps the full e2e path, and the binding is a faithful model of the gang scheduler a real fleet runs.
- Revert ADR-0024’s harness change entirely: would mean Mode=pods never exercises co-location. Rejected for the same reason as A, more so.
References
- ADR-0023: real kube-scheduler in the scaletest harness, retire pod-shim’s binding role. This ADR is the scoped, justified exception to it.
- ADR-0024: co-location via podAffinity; the change that surfaced this.
- Kubernetes pod affinity design proposal: documents the first-pod bootstrap problem and the proposed-but-unshipped self-match exemption.
- k3s-io/k3s#9350: a real-world instance of the same bootstrap deadlock.
- bigfleet-uberbrief #5: the devpod-5k cloud re-run; devpod-5k uses no co-location, so it is unaffected by this ADR.