ADR-0023: Real kube-scheduler in the scaletest harness, retire pod-shim's binding role
Status: Accepted
Date: 2026-05-13 (Proposed) — 2026-05-14 (Accepted after cloud validation)
Context
The scale-test harness ships a custom binary, bigfleet-scaletest-pod-shim (at test/scaletest/cmd/pod-shim/), which performs three jobs inside each kwok-faked Kubernetes cluster:
1. Watches UpcomingNode CRs, creates fake Nodes with Allocatable set per ADR-0022’s density model.
2. Watches Pods, marks them PodScheduled=False reason=Unschedulable when no Node fits (so the unschedulable-pod-controller picks them up and creates a CapacityRequest).
3. Binds Pods to fake Nodes via the /binding subresource, bin-packing to density.
Jobs 2 and 3 are kube-scheduler’s job in production. The harness wraps them inside pod-shim only because the original M43 sequence wanted a minimum-viable Pod → bound loop without taking on the integration of a real scheduler against kwok-Nodes. Pod-shim is a harness shortcut, not a design statement.
The shortcut has caught up with us. In the 2026-05-13 devpod-5k run on Uber infrastructure (5 clusters × 100K Pods), the diagnostic on issue #1 (private bigfleet-uber repo) showed:
- The BigFleet shard, operator, and rollup pipelines all sit comfortably inside their SLOs (cycle p99 ~127ms-1s, operator rollup p99 ~330ms-1.2s, zero shortfalls).
- The runner’s user-facing binding latency SLO (internalBindingLatencyP99Seconds ≤ 20 s) fails at 102.4 s p99.
- The 102 s is entirely on pod-shim’s bind path: 256 binder workers saturate at >30K objects per cluster, and individual /binding writes slow down because pod-shim competes for apiserver writes with the load-driver on the same in-cluster apiserver.
Cluster-shape iteration (10×50K split — issue #2) confirmed: the bottleneck is per-cluster, and we can move the inflection point by sharding clusters smaller. But every variant just verifies the harness’s own scheduling stand-in is slower than a real scheduler would be, while telling us nothing useful about BigFleet’s behaviour at production-realistic scale.
The harness’s choice of scheduler is now the dominant variable in the published numbers. That’s the wrong shape for a credible scale-test result.
Decision
Replace pod-shim’s scheduling role (jobs 2 and 3 above) with a real kube-scheduler process running against each per-cluster kwok apiserver. Keep only job 1 (UpcomingNode → fake-Node creation), as a new small binary bigfleet-scaletest-node-creator.
Concretely:
- Add cmd/scaletest-node-creator, a ~100-line binary. Watches UpcomingNode CRs via controller-runtime, creates fake-Nodes with Allocatable set per ADR-0022. Zero responsibility for Pods.
- Run real kube-scheduler inside each per-cluster apiserver container (entrypoint-apiserver.sh). The container is already running kube-apiserver, kube-controller-manager, and kwok-controller; adding the scheduler is one more binary download and one more supervised process. Configure it with a KubeSchedulerConfiguration that uses the NodeResourcesFit plugin with scoringStrategy: { type: MostAllocated } so it bin-packs to Allocatable rather than spreading. This preserves ADR-0022’s density-100 model without explicit pod-shim placement code.
- Retire pod-shim (test/scaletest/cmd/pod-shim/) once the new flow is validated end-to-end on kind + cloud. Until then keep it in parallel, gated on a harness.scheduler profile field with values pod-shim (legacy) and kube-scheduler (new). The default flips to kube-scheduler once a devpod-5k run on the new flow lands a verdict.
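As a rough illustration of the scheduler configuration described in the second item, the sketch below builds a KubeSchedulerConfiguration as a Python dict and prints it as JSON. The apiVersion, plugin name, and scoringStrategy fields follow the upstream kube-scheduler config API; the kubeconfig path, the function name, and the choice of cpu and memory as scored resources are assumptions for illustration, not values taken from the harness. JSON is also valid YAML, so the output should be usable as a --config file, though that is worth confirming against the kube-scheduler version the harness downloads.

```python
import json


def scheduler_config(kubeconfig_path: str) -> dict:
    """Build a KubeSchedulerConfiguration that scores Nodes with MostAllocated.

    kubeconfig_path is a hypothetical location inside the apiserver container.
    """
    return {
        "apiVersion": "kubescheduler.config.k8s.io/v1",
        "kind": "KubeSchedulerConfiguration",
        "clientConnection": {"kubeconfig": kubeconfig_path},
        "profiles": [
            {
                "schedulerName": "default-scheduler",
                "pluginConfig": [
                    {
                        "name": "NodeResourcesFit",
                        "args": {
                            "scoringStrategy": {
                                # Prefer the fullest Node that still fits the Pod.
                                "type": "MostAllocated",
                                # Assumption: which resources carry the density
                                # model depends on how fake Nodes set Allocatable.
                                "resources": [
                                    {"name": "cpu", "weight": 1},
                                    {"name": "memory", "weight": 1},
                                ],
                            }
                        },
                    }
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical path; the real entrypoint script decides where this lives.
    print(json.dumps(scheduler_config("/etc/kubernetes/scheduler.kubeconfig"), indent=2))
```

Generating the file from a function rather than a static template is only a convenience for this sketch; a checked-in template filled in by entrypoint-apiserver.sh would serve the same purpose.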
Consequences
What we gain
- Published binding-latency numbers are credible. What we measure is what a production fleet would measure: real kube-scheduler against the cluster the operator manages. If the soak SLO passes, it’s a true claim. If it fails, the failure is meaningful (a real-world per-cluster scheduler ceiling, not a harness limitation).
- One less custom binary to maintain. Pod-shim has been a source of bugs across M43, M44, M45 (SetupSignalHandler double-call, IndexField cache-startup race, WatchList interaction, binder thundering herd). Removing it removes a class of failure mode from future scale-test work.
- Density-100 placement comes from a configurable scheduler scoring plugin rather than hand-rolled bin-pack code. Tunable from outside without modifying harness binaries.
- Future scale-test harnesses (e.g. multi-tenancy, NodeAffinity-heavy workloads) work without extending pod-shim — they extend the scheduler config, the established Kubernetes path.
What we lose
- Explicit control over Pod-to-Node placement. Pod-shim’s tryBind deterministically picks the first fitting fake-Node. Real kube-scheduler scores the feasible Nodes and may make different choices. To verify that ADR-0022’s density model holds, we need to confirm the resulting Pod distribution still bin-packs to Allocatable capacity per Node. Probably fine with MostAllocated scoring; needs explicit validation on kind before cloud.
- One more process per kwok apiserver container. The container already runs apiserver + kine/etcd + kube-controller-manager + kwok-controller. Adding kube-scheduler is one more binary, ~50MB RAM idle. Should fit comfortably in current per-pod budgets.
- The transition window during which pod-shim and node-creator both exist. Two code paths, slight maintenance overhead. Limited by retiring pod-shim once new flow is proven.
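To make the placement concern in the first item concrete, here is a minimal simulation of MostAllocated-style scoring for a single resource. It is a simplification, not kube-scheduler’s code: the Node class, the single "slot" resource, and first-Node tie-breaking are assumptions (the real scheduler weighs several resources and breaks ties randomly). It only shows why scoring by post-placement utilisation tends to fill one Node before starting the next.

```python
from dataclasses import dataclass

MAX_NODE_SCORE = 100  # upper bound of a plugin score in kube-scheduler


@dataclass
class Node:
    name: str
    allocatable: int  # hypothetical single resource, e.g. Pod slots
    requested: int = 0


def most_allocated_score(node: Node, pod_request: int) -> int:
    """Utilisation after placing the Pod, scaled to 0..MAX_NODE_SCORE."""
    after = node.requested + pod_request
    if node.allocatable == 0 or after > node.allocatable:
        return 0
    return after * MAX_NODE_SCORE // node.allocatable


def schedule(nodes: list, pod_request: int = 1):
    """Filter Nodes that fit, then bind to the highest-scoring one.

    Returns None when nothing fits, which is where the real scheduler
    would mark the Pod Unschedulable.
    """
    feasible = [n for n in nodes if n.requested + pod_request <= n.allocatable]
    if not feasible:
        return None
    # max() keeps the first Node on ties (an assumption of this sketch).
    best = max(feasible, key=lambda n: most_allocated_score(n, pod_request))
    best.requested += pod_request
    return best


if __name__ == "__main__":
    fleet = [Node(f"fake-node-{i}", allocatable=100) for i in range(3)]
    for _ in range(250):
        schedule(fleet)
    print([n.requested for n in fleet])
```

With three density-100 Nodes and 250 single-slot Pods, this sketch fills the first two Nodes completely and leaves the remainder on the third. Whether the real scheduler reproduces that shape under concurrent Node creation is exactly what the kind validation needs to establish.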
What stays the same
- BigFleet itself: unchanged. None of pkg/shard, pkg/coordinator, pkg/decision, or pkg/operator is affected. This is purely a harness change.
- The chain shape: load-driver → Pods → (now-real-)scheduler-marks-Unschedulable → unschedulable-pod-controller → CR → operator → shard → Bootstrap → operator publishes UpcomingNode → node-creator creates fake-Node → real scheduler binds Pod. Same five stages between BigFleet and the test inputs, just with the scheduler stage now being the real thing.
- The published scaletest-results page: continues to report BigFleet-internal SLOs (shard cycle p99, operator rollup p99, shortfalls) and the now-credible user-facing binding latency.
Implementation sequence
To stay reversible and reviewable:
1. Add cmd/scaletest-node-creator parallel to pod-shim. No behaviour change yet.
2. Add kube-scheduler binary download + KubeSchedulerConfiguration template + launch logic to entrypoint-apiserver.sh. Gated behind an env var so legacy runs are unaffected.
3. Add harness.scheduler chart value, plumbed through entrypoint-workload.sh to either launch pod-shim (legacy) or skip it (new).
4. Update dev-500 profile with harness.scheduler: kube-scheduler. Validate ramp + soak both pass on kind under the new flow. Density-100 must hold; verify by inspecting bigfleet_scaletest_pod_shim_*_total / equivalent node-creator metrics + actual Pod-per-Node distribution.
5. File a devpod-5k brief on bigfleet-uber to validate the new flow on Uber infrastructure.
6. Flip default harness.scheduler to kube-scheduler for all profiles.
7. Delete test/scaletest/cmd/pod-shim/. Update chart templates and profiles to drop podShim.* config.
This ADR moves to Accepted once step 5 returns a clean verdict.
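The Pod-per-Node distribution check in the validation step could be scripted along the following lines. This is a sketch under stated assumptions: the input is a plain mapping of Pod name to bound Node name (however the harness chooses to collect it), and the pass criterion of "no Node over density, at most one partially filled Node" is a hypothetical threshold, not one this ADR defines.

```python
import math
from collections import Counter


def check_density(pod_to_node: dict, density: int = 100,
                  max_partial_nodes: int = 1) -> dict:
    """Summarise how bound Pods are spread across fake Nodes.

    pod_to_node maps Pod name -> Node name for bound Pods only.
    max_partial_nodes is an assumed tolerance for Nodes below density.
    """
    per_node = Counter(pod_to_node.values())
    overfull = {n: c for n, c in per_node.items() if c > density}
    partial = {n: c for n, c in per_node.items() if c < density}
    return {
        "nodes_used": len(per_node),
        # Fewest Nodes that could hold this many Pods at the given density.
        "min_nodes": math.ceil(len(pod_to_node) / density),
        "overfull": overfull,
        "partial": partial,
        "ok": not overfull and len(partial) <= max_partial_nodes,
    }


if __name__ == "__main__":
    # Synthetic example: 250 Pods packed as 100 + 100 + 50.
    packed = {f"pod-{i}": f"fake-node-{i // 100}" for i in range(250)}
    print(check_density(packed))
```

A run where Pods are spread evenly rather than packed would show many partial Nodes and nodes_used well above min_nodes, which is the signal that the scoring strategy is not preserving the density model.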
Alternatives considered
- Tune pod-shim’s knobs (raise binderConcurrency, raise OPERATOR_QPS, etc.). Cheap, but only moves the bottleneck; it doesn’t change the fundamental property that we’re measuring a custom binder, not a production-realistic one.
- Shard clusters smaller (10×50K, 20×25K, etc.). Confirmed in issue #2 that this scales ramp throughput, but soak still fails on the harness’s binder. Same critique: the test’s outcome depends on a harness implementation detail.
- Optimise pod-shim into something approximating kube-scheduler. Significant engineering effort that ends with a worse copy of an existing well-tested binary.
- Raise the soak SLO to a regime where pod-shim doesn’t gate it. Hides the issue rather than fixing it. Published numbers would silently include pod-shim’s overhead. Rejected on grounds of result credibility.
References
- ADR-0017 (per-CR binding latency vs fingerprint fan-out latency): establishes the chain stages we measure.
- ADR-0018 (internal vs user-facing binding latency): the binding-latency dichotomy this ADR resolves more honestly.
- ADR-0022 (Need.Count is Pod count): the density model the new scheduler-config must preserve.
- bigfleet-uber issues #1 and #2 (private): the empirical data driving this decision.