ADR-0038: Scaletest workloads are controller-managed objects, not bare Pods
Status
Accepted, 2026-05-20.
Context
The scaletest load-driver creates workloads as bare Pods — corev1.Pod
objects with no owning controller. The unschedulable-pod controller then
creates a CapacityRequest per Pod, owner-referenced to the Pod, so demand
tracks the Pod population.
bigfleet-uber #45 traced an unbounded supply-thrash to this workload model. BigFleet’s
Phase 3 correctly identifies genuine excess Configured supply (the seed’s
archetype mix never matches the realized demand draw exactly — ~25 surplus
machines per cluster in the uber-5k catalog) and reclaims it. Reclaim drains
the machine: the operator evicts every Pod on the machine’s node. Eviction
deletes a bare Pod — and there is no controller to recreate it. The
deleted Pod’s CapacityRequest cascade-GCs with it, so demand permanently
shrinks by the machine’s Pod count. The next Phase 3 cycle sees even more
apparent excess and reclaims more, which destroys more demand. The result is
a self-sustaining ~26 machines/sec Bootstrap+Reclaim cascade against a demand
that never actually changed — observed across briefs #38–#44, mis-labelled
“steady-state churn” until #45 pinned it.
The cascade is unbounded only because bare Pods do not survive eviction.
In a real cluster every workload is managed by a controller — Deployment
(via ReplicaSet), StatefulSet, Job. When an autoscaler drains a node,
the evicted Pods are recreated by their controller and rescheduled onto the
remaining capacity. Demand is conserved; the autoscaler’s scale-down
self-arrests at the true surplus. Draining nodes that carry workload is the
correct, central behaviour of an autoscaler — the bug is not that BigFleet
reclaims machines with Pods on them, it is that the harness modelled
workloads in a way that makes eviction destroy demand.
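The feedback loop can be illustrated with a toy model. This is a sketch under stated assumptions, not BigFleet's Phase 3 logic: the function name, the 10-Pods-per-machine packing, the 1000-Pod demand, and the even spread of Pods across machines are all hypothetical. Only the 25-machine surplus echoes the figure above. The model also leaves out the Bootstrap half of the observed thrash and looks only at how reclaim interacts with demand.

```python
import math

def simulate(machines, pods, pods_per_machine, cycles, controller_managed):
    """Toy Phase 3 loop: reclaim apparent excess machines, evicting their Pods.

    Returns the (machines, pods) state after each reclaim, starting state first.
    """
    history = [(machines, pods)]
    for _ in range(cycles):
        needed = math.ceil(pods / pods_per_machine)
        excess = machines - needed
        if excess <= 0:
            break  # no apparent surplus: scale-down self-arrests
        # Assumption: Pods are spread evenly, so each drained machine
        # carries pods/machines of them.
        evicted = excess * pods // machines
        machines -= excess
        if not controller_managed:
            # Bare Pods are simply gone, and their owner-referenced
            # CapacityRequests are garbage-collected with them.
            pods -= evicted
        # Controller-managed Pods are recreated, so demand is unchanged.
        history.append((machines, pods))
    return history

bare = simulate(125, 1000, 10, cycles=3, controller_managed=False)
managed = simulate(125, 1000, 10, cycles=3, controller_managed=True)
print(bare)     # demand shrinks on every cycle
print(managed)  # one reclaim of the true surplus, then the loop stops
```

In the bare-Pod run every reclaim lowers demand, which creates fresh apparent excess for the next cycle. In the controller-managed run the loop ends after the true surplus is reclaimed once.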
A harness built on bare Pods also never exercised the real drain → controller-recreates → reschedule path that production has, so it could not have surfaced steady-state behaviour faithfully even without #45.
Decision
The scaletest load-driver creates workloads as real controller-managed objects, and the kwok cluster runs the controllers that reconcile them:
- Stateless archetypes (tiny-stateless, cpu-service, cpu-batch, gpu-inference, gpu-training-*, critical-realtime) → Deployment.
- Stateful archetypes (stateful-db, memory-cache) → StatefulSet (no volumeClaimTemplates; the harness does not model storage, and the StatefulSet is used for its stable-identity / ordered semantics only).
- One workload object per archetype fingerprint per cluster; replicas drawn from a service-size distribution, summed to the per-cluster Pod target. A workload object’s replicas share one shape, which is itself more realistic than the previous per-Pod resource re-draw.
- kube-controller-manager in the kwok apiserver container runs the deployment, replicaset, and statefulset controllers in addition to the existing garbage-collector and namespace controllers.
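One way the archetype mapping and the replica draw could fit together is sketched below. The function names, the log-normal size distribution, and the largest-remainder rounding are assumptions made for illustration; the ADR does not specify which distribution or rounding scheme the load-driver uses. The gpu-training-* family is omitted because its concrete members are not listed here.

```python
import random

# Archetypes this ADR maps to StatefulSet; all others map to Deployment.
STATEFUL = {"stateful-db", "memory-cache"}

def workload_kind(archetype):
    """Return the controller kind for an archetype, per the mapping above."""
    return "StatefulSet" if archetype in STATEFUL else "Deployment"

def draw_replicas(archetypes, pod_target, seed=0):
    """Draw one replica count per archetype so the counts sum to pod_target.

    Assumption: sizes follow a log-normal draw, scaled to the target and
    rounded with the largest-remainder method so the sum is exact.
    """
    rng = random.Random(seed)
    weights = [rng.lognormvariate(0.0, 1.0) for _ in archetypes]
    total = sum(weights)
    shares = [w / total * pod_target for w in weights]
    replicas = [int(s) for s in shares]  # round every share down first
    leftover = pod_target - sum(replicas)
    # Hand the remaining Pods to the shares with the largest fractional part.
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: shares[i] - replicas[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        replicas[i] += 1
    return dict(zip(archetypes, replicas))

archetypes = ["tiny-stateless", "cpu-service", "cpu-batch", "gpu-inference",
              "critical-realtime", "stateful-db", "memory-cache"]
plan = draw_replicas(archetypes, pod_target=1000)
for name, replicas in plan.items():
    print(f"{workload_kind(name):12s} {name:18s} replicas={replicas}")
```

A small draw can round to zero replicas. A real load-driver would have to decide whether to skip such a workload or enforce a minimum of one.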
cpu-batch’s finite-lifetime behaviour stays modelled by lifetime-aging
(delete the Pod; the ReplicaSet recreates it — a fresh batch run). A
Job mapping is a possible future refinement but is out of scope here.
Eviction becomes non-destructive: when Phase 3 reclaims a machine, the operator’s drain evicts the Pods, their controllers recreate them, the recreated Pods reschedule onto the remaining capacity, and demand is conserved. Phase 3 reclaims the genuine surplus once and self-arrests.
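The drain path described above can be sketched at the object level. ToyCluster and its methods are hypothetical stand-ins rather than the harness, kwok, or the Kubernetes API. Demand is approximated as the number of live Pods, on the assumption that each Pod's CapacityRequest is garbage-collected with it.

```python
from itertools import count

class ToyCluster:
    """Minimal stand-in for a cluster: nodes with Pod slots, Pods, owners."""

    def __init__(self, nodes, slots_per_node):
        self.free = {n: slots_per_node for n in nodes}  # node -> free slots
        self.pods = {}     # pod name -> {"owner": name or None, "node": name or None}
        self.desired = {}  # workload name -> desired replicas
        self._ids = count()

    def create_pod(self, owner=None):
        name = f"pod-{next(self._ids)}"
        self.pods[name] = {"owner": owner, "node": None}  # starts Pending
        return name

    def reconcile(self):
        """Controller loop: recreate missing Pods for every owned workload."""
        for owner, replicas in self.desired.items():
            have = sum(1 for p in self.pods.values() if p["owner"] == owner)
            for _ in range(replicas - have):
                self.create_pod(owner)

    def schedule(self):
        """Bind each Pending Pod to the first node with a free slot."""
        for pod in self.pods.values():
            if pod["node"] is None:
                node = next((n for n, f in self.free.items() if f > 0), None)
                if node is None:
                    continue  # no capacity: the Pod stays Pending
                pod["node"] = node
                self.free[node] -= 1

    def drain(self, node):
        """Reclaim a machine: evict (delete) its Pods and remove the node."""
        for name in [n for n, p in self.pods.items() if p["node"] == node]:
            del self.pods[name]
        del self.free[node]

    def demand(self):
        return len(self.pods)  # one CapacityRequest per live Pod (assumption)

def run(controller_managed):
    cluster = ToyCluster(["n1", "n2", "n3"], slots_per_node=4)
    if controller_managed:
        cluster.desired["cpu-service"] = 6
        cluster.reconcile()
    else:
        for _ in range(6):
            cluster.create_pod()  # bare Pods: nothing will recreate them
    cluster.schedule()
    cluster.drain("n2")
    cluster.reconcile()  # a no-op for bare Pods
    cluster.schedule()
    pending = sum(1 for p in cluster.pods.values() if p["node"] is None)
    return cluster.demand(), pending

print(run(controller_managed=True))   # demand is conserved after the drain
print(run(controller_managed=False))  # demand shrinks by the drained Pods
```

With an owning workload, the two Pods evicted from n2 are recreated and land on n3. Without one, they are lost and demand drops from six to four.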
Consequences
- The load-driver’s hand-rolled Pod-count maintenance (reconcileTarget acting as a pseudo-ReplicaSet) is replaced by real controllers. Churn becomes “delete a Pod; its controller recreates it”; scale becomes “adjust replicas”.
- kube-controller-manager does more work per kwok cluster: a modest CPU cost, scoped to three added controllers. Substrate budgets may need a small bump; measured at validation.
- Pre-bind still operates on Pods. The Pods are now created by the ReplicaSet/StatefulSet controllers rather than the load-driver directly, but the Unschedulable → UPC → CapacityRequest → bind chain is unchanged.
- This is a precondition for ADR-0035’s steady-state methodology to hold: without controller-managed workloads, every Phase 3 reclaim permanently destroys demand and steady state is unreachable by construction.
- No BigFleet (pkg/decision, pkg/operator) change. #45’s verdict proposed making Phase 3 avoid machines with bound Pods, or skipping eviction on Reclaim. Both were rejected: they would remove BigFleet’s core ability to scale a running cluster down, and BigFleet structurally does not (and must not) know per-machine Pod counts. The fault was the harness’s workload model, and the fix belongs there.