ADR-0015: Realistic archetype improvements — fingerprint multiplicity, bimodal lifetimes, concentrated bursts, `Same`-rack workloads, cluster-size skew
Status: Accepted
Date: 2026-05-06
Context
M31 introduced a six-archetype workload catalog (test/scaletest/profiles/archetypes/realistic.yaml) and re-shaped the load-driver + Configured seed to read the same file. That’s an honest improvement on M29’s single-shape harness, but an audit against what production fleets actually look like (Borg, Twine, observed Kubernetes-on-everything) surfaced significant remaining gaps:
| Axis | M31 catalog | Production reality | Gap |
|---|---|---|---|
| Profile fingerprints per cluster | ~6 archetypes × ~3 zones × ~2 instance types per pool ≈ 36 | 50-200+ per large cluster | 3-5× too few |
| Fingerprint cardinality fleet-wide | ~36 (every cluster identical) | thousands (per-team labels, app-version labels, etc.) | very large |
| CR lifetime distribution | uniform 5%/min churn on a fixed Target | bimodal: long-lived services (days) + short-lived batch (minutes-hours) | large — biggest realism gap after fingerprint cardinality |
| Demand burstiness | uniform churn across all clusters | concentrated in time AND in cluster (Friday deploy hits 5 clusters simultaneously) | large — the “1:10 burst regime” tests aren’t actually bursty |
| Topology constraints | none — no Same, no zone-spread | Same-rack pinning + pod-topology-spread are common in production | critical — the protobuf-only Same operator is paper-load-bearing and untested under load |
| Resource-quantity diversity | one fixed value per archetype | each archetype is itself a distribution (cpu-service: 1c-32c) | moderate — multiplies fingerprint cardinality |
| Cluster size skew | uniform: every cluster identical Target | heavy-tailed: 5 huge + 50 medium + 500 small | moderate — affects per-cluster Phase 3 cost distribution |
The takeaway: the M31 catalog is “more honest than M29 was, less honest than it claims to be.” Conclusions from runs against the M31 catalog about Phase 2/3 algorithmic fitness, cycle p99 budget, or CPU efficiency may not generalise to production. Before any further optimisation work or SLO claims rest on these tests, the harness needs to close the largest realism gaps.
Decision
Five extensions land together, each closing one of the named gaps. They share one design principle: every dimension that production fleets distribute over, the harness must also distribute over — not just “vary,” but “match production-shaped distributions.”
1. Per-archetype fingerprint multiplicity
Each archetype expands to N concrete profile fingerprints by combining its instance-type pool with size buckets and zones. A cpu-service archetype with 3 instance types × 3 zones × 4 size buckets generates 36 distinct profiles; each CR draws one of them at random, with the size bucket chosen by weight.
Schema addition:

    archetypes:
      - name: cpu-service
        sizeBuckets:
          - { weight: 40, cpu: "2", memory: "8Gi" }     # most services
          - { weight: 35, cpu: "4", memory: "16Gi" }
          - { weight: 20, cpu: "8", memory: "32Gi" }
          - { weight: 5, cpu: "16", memory: "64Gi" }    # rare big services

When `sizeBuckets` is non-empty, the archetype’s top-level `resources` is ignored and per-CR resources are picked weighted-random from the bucket list. Catalog-wide this produces 100-500 distinct fingerprints per cluster: production-shaped without making the load-driver unmanageable.
2. Bimodal CR lifetimes
The current churnPerMinute: 0.05 model assumes every CR has the same expected lifetime. Real fleets are bimodal: long-running services live days (effectively immortal at scaletest timescales), short-running batch lives minutes.
Schema addition (per-archetype):

    - name: cpu-batch
      meanLifetimeSeconds: 600   # 10-minute mean batch job
    - name: cpu-service
      meanLifetimeSeconds: 0     # 0 = effectively immortal, no aging

The load-driver tracks per-CR creation timestamps. On each tick, every CR with `meanLifetimeSeconds > 0` is independently aged (exponential distribution) and replaced when sampled. Population stays at Target because replacements are created as fast as deletions. Long-lived archetypes contribute to baseline population; short-lived archetypes drive the bulk of the churn. The legacy `churnPerMinute` field stays as a fallback when no archetype has lifetimes set.
3. Concentrated bursts
A new burst runner action injects K extra CRs of a chosen archetype into each of N selected clusters at offset T and optionally removes them M seconds later. It models a Friday-afternoon deploy ramp: most clusters quiet, a few ramping hard simultaneously.
Schema addition:

    runnerActions:
      - atSeconds: 300
        action: burst-archetype
        burst:
          archetype: cpu-service
          clusters: 5            # how many clusters get the spike
          crsPerCluster: 2000    # extra CRs each
          durationSeconds: 60    # auto-reverse after this many seconds

The runner picks `clusters` random clusters at fire time, creates the extra CRs through their respective load-drivers, and (if `durationSeconds > 0`) deletes them after the window. Phase 1’s per-cluster hot-spot behaviour is finally exercised.
4. Same-rack-operator workloads
Per the BigFleet paper, the Same operator constrains a group of machines to share an attribute value (rack, AZ, blast domain). Cluster operators emit it during roll-up; it never appears in the user’s CRD (the CRD uses In/NotIn/Exists/DoesNotExist). The spread-aware paths in Phase 1 and Phase 2 handle it, but no scaletest profile currently exercises them.
Implementation requires two pieces:
- Synthetic rack labels in the seed. Each Configured machine gets a `topology.bigfleet/rack` label drawn from a pool of N racks (e.g., 10 racks per zone). The seed knows the pool and emits labels deterministically. The load-driver-side seed mirror keeps the same rack pool.
- A same-rack archetype attribute. When `sameRack: true`, the load-driver emits a Same requirement on `topology.bigfleet/rack` for the CR’s group. Group sizes are drawn from a small distribution (2-8 nodes typical for tightly-coupled training jobs).

    - name: gpu-training
      sameRack: true
      groupSizeRange: [2, 8]

Phase 3’s reclaim path and Phase 2’s preemption path now have a non-trivial Same-aware load to validate against.
5. Cluster-size skew
The harness currently runs uniform clusters. Real fleets are heavy-tailed: a few huge clusters (running batch in bulk), many medium, a long tail of small.
Schema addition (top-level):

    kwok:
      clusterCount: 100
      clusterSizeDistribution:
        - { fraction: 0.05, targetMultiplier: 5.0 }   # 5% are 5× the base target
        - { fraction: 0.20, targetMultiplier: 2.0 }   # 20% are 2× the base
        - { fraction: 0.50, targetMultiplier: 1.0 }   # 50% baseline
        - { fraction: 0.25, targetMultiplier: 0.3 }   # 25% small

The harness picks each kwok pod’s effective Target from this distribution at deploy time. Per-cluster Phase 1 / Phase 3 work is no longer uniform; the per-cluster Configured count varies 5×-15× across the fleet, exercising the per-(cluster, fingerprint) cost distribution honestly.
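The mapping from distribution to per-cluster Target can be sketched as follows. The ADR says only that the harness "picks" from the distribution; this sketch assumes a deterministic assignment by cumulative fraction so that the class counts are exact, and `SizeClass` and `effectiveTargets` are hypothetical names.

```go
package main

import "math"

// SizeClass mirrors one clusterSizeDistribution entry (type name is illustrative).
type SizeClass struct {
	Fraction         float64
	TargetMultiplier float64
}

// effectiveTargets returns one Target per cluster. Cluster i takes the class
// whose cumulative-fraction interval contains the midpoint (i+0.5)/clusterCount.
// If the fractions sum to less than 1, the remainder falls into the last class.
func effectiveTargets(baseTarget, clusterCount int, classes []SizeClass) []int {
	targets := make([]int, clusterCount)
	for i := range targets {
		mult := 1.0 // no classes configured: every cluster keeps the base Target
		if len(classes) > 0 {
			mult = classes[len(classes)-1].TargetMultiplier
		}
		q := (float64(i) + 0.5) / float64(clusterCount)
		cum := 0.0
		for _, c := range classes {
			cum += c.Fraction
			if q < cum {
				mult = c.TargetMultiplier
				break
			}
		}
		targets[i] = int(math.Round(float64(baseTarget) * mult))
	}
	return targets
}
```

With a base Target of 1000 and the example distribution, this yields 5 clusters at 5000, 20 at 2000, 50 at 1000, and 25 at 300, for a fleet total of 122,500 rather than the uniform 100,000. A profile that must hold the fleet total constant would need to scale the base Target accordingly.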
Consequences
- The realistic.yaml catalog is rewritten to use the new schema (sizeBuckets, meanLifetimeSeconds, sameRack where applicable). Backward compatibility with the M31 inline schema is retained: an archetype with neither `sizeBuckets` nor `resources` falls back to “no resources required.”
- The load-driver, the shard’s seed, and the runner all read the catalog. Adding sizeBuckets / sameRack / meanLifetimeSeconds touches `pkg/scaletest/archetype` (shared types), `test/scaletest/cmd/load-driver/main.go` (CR build path + lifetime aging), `cmd/bigfleet/shard.go` (seed expansion), and `test/scaletest/cmd/scaletest-runner/main.go` (burst runner action).
- A new `scaleway-1m-realistic.yaml` shape lands that uses every dimension. Existing scaleway-1m.yaml and scaleway-5m.yaml stay on the M29 single-shape catalog as compatibility baselines; the realistic profile is the new release-gate target.
- Cycle p99 will likely rise vs. the M31 catalog. M30.{1,2}’s fast paths fire less often: pin-only is gated on archetypes without resources (almost none), min-priority short-circuit is gated on no preemptable victims (rarely true with real priority diversity). Under ADR-0014’s posture this is fine: cycle p99 is a tracked metric, not a release gate; the binding-latency gate is what releases pass / fail on.
- Same-operator code paths get their first under-load exposure. Existing unit and conformance tests cover correctness; the realistic harness validates throughput. Expect surfacing of inefficiencies in the Phase 1/Phase 2 group-aware code paths; those will be tracked as new milestones if found.
- Bimodal lifetimes change the rollup-traffic profile. The operator’s per-CR ack QPS now sees bursts of deletes-and-creates from short-lived archetypes; the steady-state ack p99 SLO (12s) may need re-examination. Add to the open-question list for the first cloud run.
- Cluster-size skew interacts with multi-shard load balance. ADR-0007 binds clusters to shards at deploy time via the harness’s `c % shardReplicas` mapping. With heavy-tailed sizes, shard load can drift up to ~3× across replicas (one shard owns the big cluster). The shard.replicas count is still per-deploy operator-chosen (ADR-0007); the harness records the per-shard load distribution in summary.json so this is measured, not papered over.
- Backward compatibility for existing profiles. Profiles without the new schema fields keep behaving as before. M31’s scaleway-1m-realistic.yaml continues to work; the new profile lands alongside it as `scaleway-1m-realistic-v2.yaml` (or replaces it after one release of overlap; final naming chosen at implementation time).
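The shard-balance consequence can be checked with a small calculation over the `c % shardReplicas` mapping. The sketch below is illustrative only: `shardLoads` and `imbalance` are hypothetical helpers, per-cluster Target stands in for "load", and the layout of the per-shard record in summary.json is not assumed.

```go
package main

// shardLoads sums per-cluster Targets by shard using the c % shardReplicas mapping.
func shardLoads(targets []int, shardReplicas int) []int {
	if shardReplicas <= 0 {
		return nil
	}
	loads := make([]int, shardReplicas)
	for c, t := range targets {
		loads[c%shardReplicas] += t
	}
	return loads
}

// imbalance returns the ratio of the heaviest to the lightest shard load.
// It returns 0 when there are no shards or the lightest shard is empty.
func imbalance(loads []int) float64 {
	if len(loads) == 0 {
		return 0
	}
	lo, hi := loads[0], loads[0]
	for _, l := range loads[1:] {
		if l < lo {
			lo = l
		}
		if l > hi {
			hi = l
		}
	}
	if lo == 0 {
		return 0
	}
	return float64(hi) / float64(lo)
}
```

As a toy case, four clusters with Targets 5000, 300, 300, 300 on two shards give loads of 5300 and 600, an imbalance of about 8.8×. Real fleets with many clusters per shard average this out, which is why the drift is recorded per run instead of being assumed.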