ADR-0034: Scaletest is bring-your-own-substrate
Status
Accepted, 2026-05-19.
Context
test/scaletest/profiles/*.yaml today bundles two unrelated things
into one file: the test definition (scale, catalog, density,
ramp, churn, soak) and the substrate config (per-host
capacity, per-cluster apiserver Pod ceiling, storage backend,
kwok-pod resource requests, cost). The bundle is encoded in the
filename itself: scaleway-50k and uber-50k describe the same
scale of test but with different cluster geometries, kwok-pod
resources, and storage backends because the substrates have
different per-host capacity.
This entanglement has three observable costs:
- Filename leakage. uber-* and scaleway-* name a runtime choice as if it were a property of the system-under-test. A public BigFleet user picks up the repo and sees five “scaleway” profiles plus five “uber” profiles and has to read both to figure out which (if any) matches their cloud. The substrate prefix doesn’t belong in the filename.
- Profile × substrate combinatorial growth. Today’s 5 scales × 2 substrates = 10 files. Adding a third substrate (kind-laptop above dev-50/dev-500, GKE Autopilot, EKS spot) would push to 15, then 20. Each new file is mostly a copy of the same scale description with substrate-specific tuning swapped in.
- No clean reuse for outside users. A public-BigFleet user running their first scaletest writes a profile that conflates “what test” with “where to run” — and then has to re-derive that for every scale they want to validate against.
Inspecting scaleway-50k.yaml and uber-50k.yaml side-by-side
makes the structural point: the scale, catalog, density, ramp,
soak, churn, seed-fraction, priority distribution, and runner-
action fields are identical or trivially-equivalent across the
two files. The only fields that differ are the substrate-specific
ones — cluster count, per-cluster Pods, kwok-pod resources, storage
backend, cost estimate. The framework is already substrate-
agnostic in its test-definition surface; the file layout just
doesn’t reflect it.
Goals
- Scaletest profiles describe the test, not the runtime. A profile YAML carries scale, catalog, density, ramp/soak/churn, priority distribution, failure injection — everything the system-under-test sees. It carries no substrate-specific tuning.
- Substrates are user-supplied and orthogonal. A substrate YAML carries per-host capacity, per-cluster Pod ceiling, storage backend, kwok-pod resource budgets, and cost. BigFleet ships three example substrates as a starting point; users describe their own infra in the same format.
- The runner derives geometry from the cross product. clusterCount = ceil(profile.scale.machines × profile.scale.density / substrate.cluster.podsPerCluster). Cost = hosts × perHostUsdPerHour × duration. The ramp budget is validated against the substrate’s declared bind throughput.
- No public-facing substrate names. The example substrates are named by shape (e.g. example-fat-host, example-mid-host, example-kind-laptop), not by provider.
- Behaviour-neutral migration. The split is a refactor, not a semantic change. Pre-migration scale-test results must reproduce post-migration when the same scale and substrate parameters are supplied as --profile=<scale>.yaml --substrate=<substrate>.yaml.
Non-goals
- Auto-selecting a substrate based on cluster context. The user always passes --substrate= explicitly. Implicit defaults are the wrong direction for “where am I about to spend money”.
- A substrate registry / marketplace. Substrates are local files. There’s no remote catalogue, no online lookup, no “official” substrate beyond the three examples we ship.
- Multi-substrate composition. A run uses one substrate. A test that wants to validate multi-region / multi-substrate behaviour belongs in a dedicated failover-* profile that embeds the cross-substrate scenario.
- Substrate-side parameter sweeps. The runner takes one profile + one substrate. Sweeping is a shell-loop concern, not a runner feature.
- Changes to ADR-0029 / ADR-0033 / ADR-0032 semantics. This is a file-layout refactor; algorithmic content is unchanged.
Decision
Split each existing *-Nk.yaml profile into a substrate-agnostic
test definition + a separately-named substrate file. The runner
takes both and merges them into the Helm values it installs.
Substrate schema (canonical)
    # test/scaletest/substrates/<name>.yaml
    apiVersion: bigfleet.io/scaletest/v1
    kind: Substrate
    metadata:
      name: example-fat-host
      description: "80vCPU / 160GiB hosts with etcd-backed kwok apiservers."

    host:
      # Per-host resource budget. The runner packs cluster.clustersPerHost
      # kwok pods per host plus the BigFleet system-under-test pods
      # (shard, coordinator, prometheus) onto the first host.
      vCPU: 80
      memoryGiB: 160

    cluster:
      # The per-kwok-apiserver operating point. podsPerCluster is the
      # substrate's "comfortable Pod ceiling"; past this point bind
      # throughput tails off (kine sqlite WAL pressure, etcd watch
      # latency, kube-scheduler list-watch cost, etc.).
      podsPerCluster: 25000
      clustersPerHost: 10
      storage: etcd   # or "kine"
      # Empirical per-cluster bind throughput. The runner uses this to
      # validate that profile.rampSeconds × bindThroughputPodsPerSec
      # covers profile's total Pod ramp; emits a warning if not.
      bindThroughputPodsPerSec: 30

    kwokPod:
      # The kwok pod (apiserver + kwok-controller + operator + load-driver)
      # resource budget. Tuned to the substrate's per-host capacity and
      # podsPerCluster.
      requests: { cpu: 2, memory: 4Gi }
      limits: { cpu: 8, memory: 32Gi }
      # Optional. Default 1Gi. Past 25K Pods/cluster, kine's sqlite WAL
      # needs more headroom; etcd-backed substrates can leave this at
      # default.
      sharedVolumeSizeLimit: "2Gi"

    apiserver:
      # Optional. Substrate-specific apiserver tuning (max-mutating-req,
      # max-readonly-req, watch-cache-size). Helm chart applies as
      # --extra-flags. Default: chart defaults (sane for kine on 25K Pods).
      extraFlags: []

    # Cost model. Free / on-prem substrates set perHostUsdPerHour: 0.
    costEstimate:
      perHostUsdPerHour: 0
      notes: ""

    # Optional: provisioning hints. Free-form Markdown the runner prints
    # during --dry-run; not validated.
    provisioning: |
      See substrate docs for how to spin up matching hosts.

Profile schema (what stays)
    # test/scaletest/profiles/<name>.yaml
    apiVersion: bigfleet.io/scaletest/v1
    kind: Profile
    metadata:
      name: 50k
      description: "50K machines, 5M aggregated Pods at density=100."

    scale:
      machines: 50000
      density: 100   # Pods per machine
      # Total Pods = machines × density = 5_000_000

    catalog:
      # ADR-0032: realistic six-archetype workload distribution.
      archetypes: realistic   # or "uniform" for synthetic-bench shape

    seed:
      # Pre-seeded inventory at runner install time.
      configuredFraction: 0.0     # 0 = cold start; 1 = fully pre-seeded
      speculativeMultiplier: 3    # ADR-0026: elastic tier
      idleHeadroomFraction: 0.2   # 20% Idle headroom for Phase 1

    loadProfile:
      rampSeconds: 1800
      soakSeconds: 1800
      churnPerMinute: 0.02

    # Optional. Failure-injection hooks for failover-* profiles.
    runnerActions: []

Runner merge semantics
    clusterCount = ceil(profile.scale.machines × profile.scale.density / substrate.cluster.podsPerCluster)
    hostsNeeded = ceil(clusterCount / substrate.cluster.clustersPerHost) + 1   # +1 for BigFleet system-under-test pods
    estimatedCost = hostsNeeded × substrate.costEstimate.perHostUsdPerHour × (rampSeconds + soakSeconds + 600) / 3600   # +10min teardown

Validation (runs at start, before any helm install):
- Ramp feasibility. profile.scale.machines × profile.scale.density / profile.loadProfile.rampSeconds ≤ clusterCount × substrate.cluster.bindThroughputPodsPerSec. If not, warn: the ramp will tail off past the budget. (Not fatal; some tests want to exercise that regime.)
- Resource fit. hostsNeeded × substrate.host.vCPU is sane (i.e. ≥ what the chart will request). The chart’s resource requests are derived from substrate.kwokPod.requests × clusterCount + system-under-test resources.
- Cost ceiling. estimatedCost ≤ --max-cost-usd (default $50). Prompt for confirmation if cloud-context regex matches, per the existing cost-guardrails surface.
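The merge formulas and the three checks above can be sketched in Go as follows. This is a minimal illustration, not the runner's implementation: the type names, field names, and function names are hypothetical, the system-under-test CPU budget is passed in as an assumed parameter, and treating resource fit and the cost ceiling as hard errors (rather than a confirmation prompt) is a simplification.

```go
package main

import "fmt"

// Profile and Substrate hold only the fields the derivation needs.
// Type and field names are illustrative, not the real Go types.
type Profile struct {
	Machines    int     // scale.machines
	Density     int     // scale.density (Pods per machine)
	RampSeconds float64 // loadProfile.rampSeconds
	SoakSeconds float64 // loadProfile.soakSeconds
}

type Substrate struct {
	HostVCPU                 float64 // host.vCPU
	PodsPerCluster           int     // cluster.podsPerCluster
	ClustersPerHost          int     // cluster.clustersPerHost
	BindThroughputPodsPerSec float64 // cluster.bindThroughputPodsPerSec (per cluster)
	KwokRequestCPU           float64 // kwokPod.requests.cpu
	PerHostUsdPerHour        float64 // costEstimate.perHostUsdPerHour
}

type Geometry struct {
	ClusterCount     int
	HostsNeeded      int
	EstimatedCostUsd float64
}

// ceilDiv returns ceil(a/b) for positive integers, avoiding float rounding.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// Derive applies the three merge formulas from this section.
func Derive(p Profile, s Substrate) Geometry {
	clusters := ceilDiv(p.Machines*p.Density, s.PodsPerCluster)
	hosts := ceilDiv(clusters, s.ClustersPerHost) + 1     // +1 host for system-under-test pods
	hours := (p.RampSeconds + p.SoakSeconds + 600) / 3600 // +10 min teardown
	return Geometry{
		ClusterCount:     clusters,
		HostsNeeded:      hosts,
		EstimatedCostUsd: float64(hosts) * s.PerHostUsdPerHour * hours,
	}
}

// Validate runs the ramp, resource-fit, and cost-ceiling checks.
// A ramp shortfall is only a warning; sutCPU is an assumed stand-in
// for the system-under-test CPU requests.
func Validate(p Profile, s Substrate, g Geometry, sutCPU, maxCostUsd float64) ([]string, error) {
	var warnings []string
	required := float64(p.Machines*p.Density) / p.RampSeconds
	budget := float64(g.ClusterCount) * s.BindThroughputPodsPerSec
	if required > budget {
		warnings = append(warnings,
			fmt.Sprintf("ramp needs %.0f Pods/s, substrate budget is %.0f Pods/s", required, budget))
	}
	requested := s.KwokRequestCPU*float64(g.ClusterCount) + sutCPU
	if capacity := float64(g.HostsNeeded) * s.HostVCPU; capacity < requested {
		return warnings, fmt.Errorf("resource fit: %.0f vCPU requested, %.0f available", requested, capacity)
	}
	if g.EstimatedCostUsd > maxCostUsd {
		return warnings, fmt.Errorf("estimated cost $%.2f exceeds ceiling $%.2f", g.EstimatedCostUsd, maxCostUsd)
	}
	return warnings, nil
}

func main() {
	// 50k profile and example-fat-host values from this ADR; the $1.50/h price
	// and the 16 vCPU system-under-test budget are made up for illustration.
	p := Profile{Machines: 50000, Density: 100, RampSeconds: 1800, SoakSeconds: 1800}
	s := Substrate{HostVCPU: 80, PodsPerCluster: 25000, ClustersPerHost: 10,
		BindThroughputPodsPerSec: 30, KwokRequestCPU: 2, PerHostUsdPerHour: 1.50}
	g := Derive(p, s)
	warnings, err := Validate(p, s, g, 16, 50)
	fmt.Printf("clusters=%d hosts=%d cost=$%.2f warnings=%d err=%v\n",
		g.ClusterCount, g.HostsNeeded, g.EstimatedCostUsd, len(warnings), err)
}
```

With the 50k profile on example-fat-host, the formulas give 5,000,000 / 25,000 = 200 clusters and ceil(200 / 10) + 1 = 21 hosts. The ramp needs about 2,778 Pods/s against a budget of 200 × 30 = 6,000 Pods/s, so no warning is raised.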
File layout (post-migration)
    test/scaletest/
      profiles/
        dev-50.yaml                  # laptop integration gate
        dev-500.yaml                 # laptop rehearsal
        5k.yaml
        50k.yaml
        500k.yaml
        1m.yaml
        5m.yaml
        failover-leader-kill.yaml
        failover-shard-kill.yaml
        failover-partition.yaml
        failover-soak.yaml
      substrates/
        README.md                    # schema doc + how to write your own
        example-fat-host.yaml        # 80/160 GiB, 25K Pods/cluster, etcd
        example-mid-host.yaml        # 32/128 GiB, 100K Pods/cluster, kine
        example-kind-laptop.yaml     # M5 Max kind, 2.5–10K Pods/cluster, tmpfs kine

Failover profiles embed their own cluster count today (50 × 1K) because the test point is static-stability behaviour at a fixed disturbance, not a scale-derived geometry. They stay as-is with a substrate-derived per-host packing.
Canonical invocation
    # A 5K-machine scale-test on the fat-host example substrate:
    scaletest-runner \
      --profile=test/scaletest/profiles/5k.yaml \
      --substrate=test/scaletest/substrates/example-fat-host.yaml \
      --duration=30m

make scale-5k etc. shortcuts continue to exist; they default to
example-fat-host as the canonical validation substrate.
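A minimal sketch of how the runner's flag handling could enforce the "no implicit substrate" rule, using Go's standard flag package. The flag names come from this ADR (--profile, --substrate, --duration, --max-cost-usd with its $50 default); the Options type, the ParseOptions function, and the error messages are hypothetical.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"time"
)

// Options is an illustrative container for the runner's command-line inputs.
type Options struct {
	Profile    string
	Substrate  string
	Duration   time.Duration
	MaxCostUsd float64
}

// ParseOptions parses the flags and rejects a run that does not name both
// a profile and a substrate. There is deliberately no default substrate.
func ParseOptions(args []string) (Options, error) {
	var o Options
	fs := flag.NewFlagSet("scaletest-runner", flag.ContinueOnError)
	fs.StringVar(&o.Profile, "profile", "", "path to the profile YAML (test definition)")
	fs.StringVar(&o.Substrate, "substrate", "", "path to the substrate YAML (required, no default)")
	fs.DurationVar(&o.Duration, "duration", 0, "run duration, e.g. 30m")
	fs.Float64Var(&o.MaxCostUsd, "max-cost-usd", 50, "refuse to run above this estimated cost")
	if err := fs.Parse(args); err != nil {
		return Options{}, err
	}
	if o.Profile == "" {
		return Options{}, errors.New("--profile is required")
	}
	if o.Substrate == "" {
		return Options{}, errors.New("--substrate is required; it is never inferred from cluster context")
	}
	return o, nil
}

func main() {
	// A real runner would pass os.Args[1:]; a fixed slice keeps the sketch self-contained.
	args := []string{
		"--profile=test/scaletest/profiles/5k.yaml",
		"--substrate=test/scaletest/substrates/example-fat-host.yaml",
		"--duration=30m",
	}
	o, err := ParseOptions(args)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("profile=%s substrate=%s duration=%s max-cost-usd=%.0f\n",
		o.Profile, o.Substrate, o.Duration, o.MaxCostUsd)
}
```

Under this sketch, the make shortcuts would still pass --substrate explicitly (pointing at example-fat-host), so the runner itself never has to guess.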
Migration plan
- Stage 0: ADR sign-off (this document).
- Stage 1: Substrate schema + Go types in pkg/scaletest/config. make generate-tested round-trip.
- Stage 2: Runner merge logic. Unit-tested geometry derivation, ramp-feasibility validation, cost estimation.
- Stage 3: Helm chart parameterization. The chart already takes a values file; the runner’s job is to assemble the merged values from profile + substrate. Minimal chart change beyond exposing the geometry knobs that are currently per-profile.
- Stage 4: Migrate existing profiles. Each *-Nk.yaml is split into a substrate-agnostic <scale>.yaml plus the matching example substrate file. dev-* and failover-* profiles pair with example-kind-laptop.yaml and an appropriate substrate respectively, without test-definition reshape.
- Stage 5: Canonical reproduce. Run each new <scale>.yaml + <substrate>.yaml combination and compare bind ramp, cycle p99, and operator rollup p99 to the corresponding pre-migration baseline. Verdict must be “no measurable delta” for every (scale, substrate) pair before the legacy files are deleted.
- Stage 6: Runbook update. The scaletest runbook reframes the profile table as “scales × your substrate”.
- Stage 7: Delete the legacy scaleway-* and uber-* profile files. Site sync picks up the new layout.
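As a rough idea of what Stage 1 involves, the sketch below models a subset of the substrate schema as Go types and checks that a value survives an encode/decode round trip. The type and function names are hypothetical, not the contents of pkg/scaletest/config. It uses encoding/json as a stand-in because the Go standard library has no YAML package; a real implementation would decode YAML.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// Resources mirrors a requests/limits pair. Quantities are kept as strings
// ("4Gi") here, which is an assumption about how they would be modelled.
type Resources struct {
	CPU    string `json:"cpu"`
	Memory string `json:"memory"`
}

// Substrate covers a subset of the substrate schema; names are illustrative.
type Substrate struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Metadata   struct {
		Name        string `json:"name"`
		Description string `json:"description,omitempty"`
	} `json:"metadata"`
	Host struct {
		VCPU      int `json:"vCPU"`
		MemoryGiB int `json:"memoryGiB"`
	} `json:"host"`
	Cluster struct {
		PodsPerCluster           int     `json:"podsPerCluster"`
		ClustersPerHost          int     `json:"clustersPerHost"`
		Storage                  string  `json:"storage"`
		BindThroughputPodsPerSec float64 `json:"bindThroughputPodsPerSec"`
	} `json:"cluster"`
	KwokPod struct {
		Requests              Resources `json:"requests"`
		Limits                Resources `json:"limits"`
		SharedVolumeSizeLimit string    `json:"sharedVolumeSizeLimit,omitempty"`
	} `json:"kwokPod"`
	CostEstimate struct {
		PerHostUsdPerHour float64 `json:"perHostUsdPerHour"`
		Notes             string  `json:"notes,omitempty"`
	} `json:"costEstimate"`
}

// RoundTrip encodes s, decodes the result, and reports whether nothing was lost.
func RoundTrip(s Substrate) (bool, error) {
	raw, err := json.Marshal(s)
	if err != nil {
		return false, err
	}
	var back Substrate
	if err := json.Unmarshal(raw, &back); err != nil {
		return false, err
	}
	return reflect.DeepEqual(s, back), nil
}

// fatHostSubstrate returns the example-fat-host values shown in this ADR.
func fatHostSubstrate() Substrate {
	var s Substrate
	s.APIVersion, s.Kind = "bigfleet.io/scaletest/v1", "Substrate"
	s.Metadata.Name = "example-fat-host"
	s.Host.VCPU, s.Host.MemoryGiB = 80, 160
	s.Cluster.PodsPerCluster, s.Cluster.ClustersPerHost = 25000, 10
	s.Cluster.Storage, s.Cluster.BindThroughputPodsPerSec = "etcd", 30
	s.KwokPod.Requests = Resources{CPU: "2", Memory: "4Gi"}
	s.KwokPod.Limits = Resources{CPU: "8", Memory: "32Gi"}
	s.KwokPod.SharedVolumeSizeLimit = "2Gi"
	return s
}

func main() {
	ok, err := RoundTrip(fatHostSubstrate())
	fmt.Println("round-trip lossless:", ok, "error:", err)
}
```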
Roughly 4–6 hours of careful work. The in-flight OC1/OC2/OC3 disambiguation experiment from ADR-0033 can complete on the legacy naming and its verdict applies unchanged to the new layout — the SHA and behaviour are what matter, not the file path.
Alternatives considered
Option B: rename uber-* → bare scale names, drop scaleway-*
Simplest move. Half a commit. Loses the substrate-aware framing entirely and bakes today’s substrate-specific tuning into the “canonical” profile shape. If a public user shows up wanting to run on a different substrate, they’re back to forking + tuning each profile individually.
Rejected because the entanglement is the actual problem, not just the prefix. Doing the rename without the split is moving the wart, not fixing it.
Status quo: keep scaleway-* + uber-* profiles
Cheapest in lines-of-code. Costs us a substrate prefix in every filename, a confidentiality wart that needs scrubbing per public artifact, and an N × M file growth as new substrates appear.
Rejected because the framing was wrong from the start; the recent runbook update is the third or fourth round of working around it.
Parametric substrate (--substrate-host-vcpu=80 ... CLI flags)
Substrate-as-flags instead of substrate-as-file. Equivalent
information; worse ergonomics. Three example substrate YAMLs are
self-documenting in a way man scaletest-runner isn’t, and a
user who’s iterating on a substrate description benefits from
file-based version control.
Rejected for ergonomic reasons; the data model is the same either way.
Hard rules touched
None of the load-bearing BigFleet hard rules (CLAUDE.md §“Hard rules”) are touched. This ADR is purely about the scaletest harness file layout; the system-under-test contract is unchanged. Specifically:
- Provider RPC surface: unchanged.
- Coordinator / shard / operator wire format: unchanged.
- Cost formula: unchanged.
- Static stability: unchanged. (Failover profiles continue to exercise it; they just get their per-host packing from a substrate file.)
Under this contract the test-vs-runtime separation becomes explicit in the file layout: the profile is the canonical test definition; the substrate is the user’s runtime choice.
References
- ADR-0026 Scaletest models the Speculative tier.
- ADR-0028 Cycle-p99 is regime-parametric.
- ADR-0032 Realistic catalog production-calibrated workload distribution.
- ADR-0033 Phase 1 supply-credit must respect bind readiness (the active design discussion; its in-flight validation experiment applies unchanged to the post-migration layout — the SHA and behaviour matter, not the file path).
- docs/scaletest.md (the runbook; reshapes under Stage 6).