ADR-0037: Scaletest catalog node-affinity dimensions must be realistic — drop synthetic team/app label axes
Status
Accepted, 2026-05-20.
Context
M35 (“label-axis fingerprint multiplier”) added a labelAxes field to the
scaletest workload-archetype catalog. Its goal was to multiply the number of
distinct demand “fingerprints” the harness presents, so BigFleet’s
MatchProfile machinery is exercised against production-shaped cardinality
rather than the handful of fingerprints that instance-type × zone alone
produces.
The realistic catalog (test/scaletest/profiles/archetypes/realistic.yaml)
declared labelAxes on its two highest-weight archetypes:
- tiny-stateless (70 % of demand): scaletest.bigfleet/team × 40,
scaletest.bigfleet/app × 20 → 800 distinct (team, app) buckets.
- cpu-service (12 %): scaletest.bigfleet/team × 10.
The harness implements labelAxes by emitting them as required Pod
nodeAffinity (load-driver’s MatchExpressions under
RequiredDuringSchedulingIgnoredDuringExecution), and by stamping the same
key/value pairs onto seeded machine Profile.Labels. The CR’s Requirements
are then derived from the Pod’s node-affinity by the unschedulable-pod
controller, so the axes reach BigFleet’s matching as intended.
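As a rough illustration of the mechanism described above, the sketch below renders one (team, app) pick as a required node-affinity stanza and counts the resulting buckets for tiny-stateless. It is not the load-driver's code: the helper names and the value format (team-0, app-0) are assumptions made for illustration. Only the axis keys, the axis sizes, and the Kubernetes Pod API field names come from the text above.

```python
from itertools import product

# Axis keys and sizes as declared for tiny-stateless in the text above.
AXES = {
    "scaletest.bigfleet/team": 40,
    "scaletest.bigfleet/app": 20,
}


def axis_values(key, size):
    # Hypothetical value naming ("team-0", ...); the real catalog may differ.
    short = key.rsplit("/", 1)[-1]
    return [f"{short}-{i}" for i in range(size)]


def required_node_affinity(labels):
    """Render one label pick as a required nodeAffinity stanza (Pod API field names)."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": k, "operator": "In", "values": [v]}
                            for k, v in sorted(labels.items())
                        ]
                    }
                ]
            }
        }
    }


def all_buckets(axes):
    """Enumerate every distinct label combination across the axes."""
    keys = list(axes)
    value_lists = [axis_values(k, axes[k]) for k in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


buckets = all_buckets(AXES)
bucket_count = len(buckets)  # 40 * 20 = 800
example_affinity = required_node_affinity(buckets[0])
```

Because the same key/value pairs act as a scheduler filter, a Pod in one bucket can only be placed on a Node stamped with that exact pair.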
But nodeAffinity is also what kube-scheduler uses to place Pods on Nodes.
bigfleet-uber #41 measured the consequence directly from kube-scheduler’s
/metrics on the canonical catalog: the NodeAffinity filter plugin rejected
98.6 % of placement attempts (322,846 rejects / 4,539 passes), versus
68 % on a catalog with no labelAxes. Each tiny-stateless Pod requires a
Node carrying its specific (team, app) — one of 800 buckets — so the
overwhelming majority of Pods have no matching Node at all, and steady-state
bind plateaued at 9.5 % of target. The synthetic axis cardinality, not the
substrate, was the gate.
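The rejection rate quoted above can be reproduced from the two counts, assuming that "placement attempts" means rejects plus passes for the NodeAffinity filter plugin:

```python
# Counts reported in bigfleet-uber #41 for the NodeAffinity filter plugin.
rejects = 322_846
passes = 4_539

# Assumption: every filter evaluation is either a reject or a pass.
attempts = rejects + passes
reject_rate = rejects / attempts

print(f"attempts={attempts} reject_rate={reject_rate:.1%}")
# prints "attempts=327385 reject_rate=98.6%"
```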
The modelling error is conflating Pod labels with node-affinity dimensions.
In the real world, team and app are Pod labels: ownership and
cost-allocation metadata. Production Pods do
not declare nodeAffinity onto team/app: real node affinity selects
instance-type, zone, GPU model, CPU architecture, and similar hardware
attributes. Routing a synthetic cardinality multiplier through the Pod’s
scheduler-visible nodeAffinity made kube-scheduler reject realistic
placements that a real cluster would accept. It is an artifact, not realism.
Decision
The scaletest catalog’s node-affinity dimensions must mirror realistic
production node affinity: instance-type, zone, and hardware attributes only.
Synthetic ownership axes (team, app) are not valid node-affinity
dimensions.
The labelAxes blocks are removed from realistic.yaml. The labelAxes
mechanism (archetype.LabelAxis, Archetype.PickLabels) is retained as a
general capability — a future catalog may legitimately use it for a real
node-affinity axis such as CPU architecture or GPU model — but the realistic
catalog declares none.
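One way this rule could be enforced is a catalog lint that flags any node-affinity key outside an allowlist of hardware and topology attributes. The sketch below is hypothetical: the function name, the in-memory catalog shape, and the allowlist contents are assumptions, not the repository's actual schema or tooling. The three allowlisted keys are Kubernetes well-known node labels; a GPU-model key would have to be added according to whatever labelling scheme the cluster uses.

```python
# Well-known Kubernetes node labels for instance type, zone, and CPU architecture.
REALISTIC_AFFINITY_KEYS = frozenset({
    "node.kubernetes.io/instance-type",
    "topology.kubernetes.io/zone",
    "kubernetes.io/arch",
})


def invalid_affinity_keys(catalog, allowed=REALISTIC_AFFINITY_KEYS):
    """Return {archetype name: sorted keys not in the allowlist}.

    `catalog` is a hypothetical in-memory shape: a list of dicts with a
    "name" and an optional "labelAxes" mapping of key -> cardinality.
    """
    violations = {}
    for archetype in catalog:
        bad = sorted(k for k in archetype.get("labelAxes", {}) if k not in allowed)
        if bad:
            violations[archetype["name"]] = bad
    return violations


# The pre-ADR realistic catalog, reduced to the two archetypes named in Context.
before = [
    {"name": "tiny-stateless",
     "labelAxes": {"scaletest.bigfleet/team": 40, "scaletest.bigfleet/app": 20}},
    {"name": "cpu-service", "labelAxes": {"scaletest.bigfleet/team": 10}},
]
# After this ADR the labelAxes blocks are removed.
after = [{"name": "tiny-stateless"}, {"name": "cpu-service"}]

violations_before = invalid_affinity_keys(before)
violations_after = invalid_affinity_keys(after)
```

Run against the pre-ADR shape, the check reports both archetypes; against the post-ADR shape it reports nothing, while still allowing a future catalog to declare an axis on an allowlisted key.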
Consequences
- kube-scheduler NodeAffinity rejection drops back toward the
instance-type × zone baseline; steady state becomes reachable, which is the
precondition for ADR-0035’s steady-state SLO measurement.
- BigFleet’s demand-fingerprint cardinality drops to what instance-type × zone
produces across the catalog (~75 distinct fingerprints). This is accepted:
that is the cardinality a real fleet’s node-affinity actually presents.
ADR-0017’s concern (per-CR binding latency versus fingerprint fan-out) still
holds; this ADR only sets the cardinality to a realistic level rather than a
synthetic one.
- M35’s label-axis multiplier no longer affects the realistic catalog. If a
future need to stress BigFleet’s matching beyond realistic node-affinity
cardinality arises, it must be modelled somewhere other than Pod
nodeAffinity, and justified against a real workload pattern first.