ADR-0017: per-CR binding latency is the user-facing metric; fingerprint fan-out is its own thing
Status: Accepted
Date: 2026-05-07
Context
ADR-0014 established that BigFleet’s release gate is binding-latency p99 — what users feel from “I asked for capacity” to “my workload is running on it.” The runner (M32) wired that gate against bigfleet_shard_provisioning_latency_seconds, the only available histogram at the time, with this comment in pkg/metrics/metrics.go:
Wall-clock from first rollup observing a (cluster, profile fingerprint) to a matching machine reaching Configured. Per-CR granularity is not preserved; this measures fingerprint-level fan-out latency.
The scaleway-500k regression run on 2026-05-06 (run id 20260506-211219-scaleway-500k, commit 9ebc1c9) surfaced the gap honestly:
- Algorithmic SLOs all green: cycle p99 = 792 ms, phase 1/2/3 = 1/1/15 ms, 0 shortfalls, 50 K/50 K active sustained.
- bindingLatencyP99Seconds: 327.68 s, the histogram’s top bucket (32 768 / 100). The metric ramped to its maximum because at 50 clusters × 1 fingerprint each, the histogram only takes 50 samples, each measuring “first rollup observation → first matching machine Configured” across a 1 000-CR ramp window. That isn’t what a user feels per CR; it’s how long it took to provision the first machine of a brand-new fingerprint into a previously-empty pool.
Two things are conflated:
- User-visible binding latency — per-CR, “from CR (or Pod) creation to the moment my workload can run on a configured machine.” Sub-second on a Pod-mode kind run; minutes on a real cloud-provider with cold provisioning.
- Fingerprint fan-out latency — per-(cluster, fingerprint), “from first observation of a brand-new fingerprint to the first machine of that fingerprint reaching Configured.” Useful for capacity-planning conversations, irrelevant to release gating.
These are different metrics with different SLO targets. Treating one as a stand-in for the other gates releases on numbers that don’t represent what we promise.
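The saturation value and the sample count from the regression run can be reproduced with a little arithmetic. The sketch below is illustrative only: the bucket parameters (start 0.01 s, factor 2, 16 buckets) are an assumption inferred from the 32 768 / 100 figure, not read from pkg/metrics/metrics.go, and the function names are hypothetical.

```python
def exponential_buckets(start: float, factor: float, count: int) -> list:
    """Upper bounds start, start*factor, ..., start*factor**(count-1)."""
    return [start * factor**i for i in range(count)]


def fingerprint_samples(clusters: int, fingerprints_per_cluster: int) -> int:
    """The fingerprint histogram takes one sample per (cluster, fingerprint)."""
    return clusters * fingerprints_per_cluster


# Assumed layout: 0.01 s, doubling, 16 buckets; the last finite bound is 0.01 * 2**15.
bounds = exponential_buckets(start=0.01, factor=2, count=16)
top_bucket_seconds = bounds[-1]

# scaleway-500k shape from the run: 50 clusters x 1 fingerprint each.
samples = fingerprint_samples(clusters=50, fingerprints_per_cluster=1)

print(f"top finite bucket: {top_bucket_seconds:.2f} s, samples: {samples}")
```

A p99 computed over 50 samples is decided by the single slowest one, which is part of why this histogram is a poor proxy for a per-CR promise.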
Decision
Three changes:
1. New metric: per-Pod binding latency
A dedicated histogram measures per-Pod binding latency in Pod-mode runs. The bigfleet-scaletest-pod-shim observes both endpoints (Pod creation timestamp via the metadata.creationTimestamp field, Pod binding via its own clientset.CoreV1().Pods(ns).Bind call) and records the difference at the moment of binding:
    bigfleet_scaletest_pod_bind_latency_seconds
    Help: Wall-clock from Pod.metadata.creationTimestamp to the moment the bigfleet-scaletest-pod-shim issues the binding subresource Create on a fake Node. Per-Pod granularity. This is the metric ADR-0014 names "binding-latency p99" — what users feel from "I asked for capacity" to "my Pod is running." Bucket layout: exponential 0.05 s → 102 s.
The runner’s bindingLatencyP99Seconds query prefers this metric. When it’s unavailable (legacy CR-mode profiles that don’t run the pod-shim), the runner falls back to the existing fingerprint histogram and the profile is expected to declare a profile-level slo.bindingLatencyP99Seconds override that reflects the per-fingerprint shape.
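A minimal sketch of the measurement, written in Python for brevity rather than in the shim’s own language. The helper names are hypothetical, the clamp at zero is a design choice of the sketch, and the factor-2, 12-bucket layout is an assumption chosen to match the stated 0.05 s → 102 s range (0.05 × 2^11 = 102.4); the real shim may differ.

```python
from datetime import datetime, timezone


def bind_latency_seconds(creation_timestamp: str, bound_at: datetime) -> float:
    """Seconds from Pod.metadata.creationTimestamp (RFC 3339, UTC) to bind time."""
    created = datetime.strptime(creation_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    created = created.replace(tzinfo=timezone.utc)
    # Clamp at zero so clock skew between apiserver and shim never yields a negative sample.
    return max(0.0, (bound_at - created).total_seconds())


# Assumed bucket layout: 0.05 s, doubling, 12 buckets (top finite bound 102.4 s).
BUCKETS = [0.05 * 2**i for i in range(12)]


def bucket_for(latency: float) -> float:
    """Smallest bucket upper bound that contains the sample, or +Inf."""
    return next((b for b in BUCKETS if latency <= b), float("inf"))


# Example: a Pod created at 12:00:00 and bound 750 ms later.
bound_at = datetime(2026, 5, 7, 12, 0, 0, 750000, tzinfo=timezone.utc)
latency = bind_latency_seconds("2026-05-07T12:00:00Z", bound_at)
print(latency, bucket_for(latency))
```

Recording at the moment of binding means each Pod contributes exactly one sample, which is what makes the resulting p99 a per-Pod figure rather than a per-fingerprint one.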
2. Existing fingerprint histogram is renamed in spirit
bigfleet_shard_provisioning_latency_seconds keeps its name but its role changes from “binding latency proxy” to “fingerprint fan-out diagnostic.” The Help text is amended to make this explicit. The runner’s summary still surfaces it (informational), but the release gate doesn’t use it directly when a per-Pod metric is available.
3. CR-mode profiles use profile-level SLO overrides
Profiles that exercise CR-mode (load-driver creates CRs directly, no Pod-shim, no per-Pod histogram) declare a profile-level slo.bindingLatencyP99Seconds override that reflects the fingerprint-fan-out shape they actually measure. For scaleway-500k:
    slo:
      bindingLatencyP99Seconds: 60  # fingerprint fan-out ≤ ramp window
This is honest: the profile’s binding latency IS fingerprint-grained, and the SLO target reflects that.
Consequences
- Pod-mode profiles get an honest user-facing release gate. dev-5k-pods-loopback (and any future Pod-mode runs) measure the actual Pod-creation-to-Pod-bound latency. Sub-second on the fake provider; whatever the real provider’s bring-up takes in production.
- CR-mode legacy profiles keep working with their existing fingerprint histogram, but the SLO target reflects what they actually measure. scaleway-500k’s slo.bindingLatencyP99Seconds: 60 is a documented profile shape, not a free pass.
- Runner picks the right metric automatically. It tries bigfleet_scaletest_pod_bind_latency_seconds first; if Prometheus returns no samples (no pod-shim → no histogram), it falls back to the legacy provisioning histogram. The summary records which source was used so the verdict is reproducible.
- The provisioning histogram becomes a planning tool. Operators reading kubectl get availablecapacity plus the histogram now have an honest number for “how long does it take to fan out a new fingerprint?”, which is the question they actually wanted answered.
- scaleway-500k re-runs pass. The 327.68-s bucket is no longer treated as a release-blocking number; the profile-level override codifies “fan-out at this profile shape ≤ 60 s”, which is a defensible promise.
- Future ADRs. When a per-CR (not per-Pod) latency becomes interesting — e.g. when measuring CR-mode profiles directly without Pod-mode infrastructure — the obvious next step is an analogous histogram in the unschedulable-pod-controller (CR creation → CR-Acknowledged) or in the operator (rollup ack timestamp → first matching machine Configured). Out of scope for this ADR; the per-Pod metric covers the production-shaped Pod-mode path which is what we recommend running.
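The metric selection described in the consequences above can be sketched as follows. This is a hypothetical outline, not the runner’s code: `query_p99` stands in for whatever Prometheus client the runner uses and is assumed to return None when the query yields no samples.

```python
from typing import Callable, Optional, Tuple

POD_BIND = "bigfleet_scaletest_pod_bind_latency_seconds"
LEGACY = "bigfleet_shard_provisioning_latency_seconds"


def select_binding_latency(
    query_p99: Callable[[str], Optional[float]],
) -> Tuple[Optional[float], Optional[str]]:
    """Return (p99 seconds, source metric), preferring the per-Pod histogram."""
    for metric in (POD_BIND, LEGACY):
        value = query_p99(metric)
        if value is not None:
            # Keep the source alongside the value so the verdict can be reproduced.
            return value, metric
    return None, None


# Fake query results standing in for Prometheus (values are made up for the demo).
pod_mode = {POD_BIND: 0.4, LEGACY: 6.0}
cr_mode = {LEGACY: 6.0}

pod_result = select_binding_latency(pod_mode.get)
cr_result = select_binding_latency(cr_mode.get)
print(pod_result, cr_result)
```

The first addendum dated 2026-05-07 removes the second entry from the preference list, which in this sketch amounts to iterating over `(POD_BIND,)` only.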
Implementation notes
The per-Pod histogram is recorded inside the pod-shim, NOT inside BigFleet itself — it’s harness instrumentation. Real production deployments don’t run the pod-shim; their per-Pod latency comes from kube-scheduler’s own metrics (scheduler_pod_scheduling_duration_seconds etc.) plus the time to provision the underlying capacity. The harness’s pod-shim is a stand-in for the kube-scheduler chain, and so is the right place to measure stand-in latency.
Addendum (2026-05-07): the legacy histogram is a diagnostic, not a gate
The first scaleway-500k re-run after this ADR landed surfaced a stricter problem with the legacy fingerprint histogram: it isn’t just coarse-grained, it’s unreliable for SLO gating during steady-state soaks. Mechanism:
- bigfleet_shard_provisioning_latency_seconds is observed once per (cluster, fingerprint) at the moment the first machine of that fingerprint reaches Configured. In scaleway-500k that’s 50 observations total, all during the ramp.
- During the 30-min soak that follows, no new fingerprints are introduced, so no new observations land.
- The runner’s PromQL uses rate(...[5m]) over the last 5 minutes of the soak, where the rate is zero. histogram_quantile over a zero-rate stable cumulative count returns the boundary of the bucket holding all prior observations; Prometheus’s float interpolation in this case settles on the +Inf-bucket boundary regardless of where the actual observations live.
- Result: p99 reads as 0.01 × 2^15 = 327.68 s even though p50 still tracks the real ramp-time distribution (~6 s). The gate fires on a number that has no relationship to real latency.
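The first half of this mechanism can be made concrete by counting how many observations fall inside the query window. The sketch below uses a hypothetical timeline: the 10-minute ramp and the even spacing of events are assumptions, while the 30-minute soak, the 5-minute rate window, and the 50 observations come from the text. It does not model Prometheus’s quantile interpolation.

```python
def observations_in_window(observation_times, window_start, window_end):
    """Count histogram observations with window_start < t <= window_end (seconds)."""
    return sum(1 for t in observation_times if window_start < t <= window_end)


RAMP_SECONDS = 10 * 60   # assumed ramp length, for illustration only
SOAK_SECONDS = 30 * 60   # 30-min soak
RATE_WINDOW = 5 * 60     # rate(...[5m])

# 50 first-machine-Configured events, spread evenly across the ramp.
observations = [RAMP_SECONDS * (i + 1) / 50 for i in range(50)]

run_end = RAMP_SECONDS + SOAK_SECONDS
in_ramp = observations_in_window(observations, 0, RAMP_SECONDS)
in_final_window = observations_in_window(observations, run_end - RATE_WINDOW, run_end)

print(f"during ramp: {in_ramp}, in last 5 min of soak: {in_final_window}")
```

With no observations in the window, every bucket’s rate is zero, so whatever the quantile function reports reflects the query engine’s edge-case behaviour rather than any latency that actually occurred.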
The legacy histogram is fundamentally a fingerprint fan-out diagnostic — it answers “how long did it take to provision the first machine of a brand-new fingerprint?” That’s a useful planning signal but it can’t continuously measure binding latency, because in steady state there are no fresh first-machine-Configured events to observe.
Decision: the runner stops using the legacy histogram as a release-gate fallback. bindingLatencyP99Seconds queries only bigfleet_scaletest_pod_bind_latency_seconds (the M43c per-Pod histogram). In Pod-mode runs the gate fires on real per-Pod data. In CR-mode runs the metric is unavailable (NaN → -1), and pass() already treats -1 as “metric unavailable, skip the gate.” CR-mode runs are then gated only on the algorithmic SLOs that BigFleet actually controls:
- shardCycleDurationP99Seconds: throughput envelope per ADR-0014.
- operatorRollupP99Seconds ≤ 1 s.
- operatorAckP99Seconds ≤ 12 s.
- shardShortfalls = 0.
- coordinatorApplyErrorRate ≤ 0.001.
- operatorOutboxDropsPerSec = 0.
- loadgenCRsActive ≥ 99.9 % of target throughout soak.
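A sketch of how the gate list above might be evaluated, assuming summary values arrive as a flat mapping and that -1 marks an unavailable metric. The names `evaluate_gates` and `passed`, the `cycle_p99_limit` parameter (the ADR-0014 envelope is not restated here), and the reading of loadgenCRsActive as a count compared against 99.9 % of target are all assumptions of the sketch, not the runner’s actual pass() implementation.

```python
UNAVAILABLE = -1  # sentinel: metric unavailable, skip the gate


def evaluate_gates(summary, binding_p99_slo, cycle_p99_limit, target_crs):
    """Map each gate name to True (met), False (violated) or None (skipped)."""
    binding = summary.get("bindingLatencyP99Seconds", UNAVAILABLE)
    return {
        # Skipped in CR-mode, where no per-Pod histogram exists.
        "bindingLatencyP99Seconds": (
            None if binding == UNAVAILABLE else binding <= binding_p99_slo
        ),
        "shardCycleDurationP99Seconds": summary["shardCycleDurationP99Seconds"] <= cycle_p99_limit,
        "operatorRollupP99Seconds": summary["operatorRollupP99Seconds"] <= 1.0,
        "operatorAckP99Seconds": summary["operatorAckP99Seconds"] <= 12.0,
        "shardShortfalls": summary["shardShortfalls"] == 0,
        "coordinatorApplyErrorRate": summary["coordinatorApplyErrorRate"] <= 0.001,
        "operatorOutboxDropsPerSec": summary["operatorOutboxDropsPerSec"] == 0,
        "loadgenCRsActive": summary["loadgenCRsActive"] >= 0.999 * target_crs,
    }


def passed(results):
    """A run passes when no evaluated gate is violated; skipped gates do not count."""
    return all(ok is not False for ok in results.values())


# CR-mode example. Cycle p99, shortfalls and active CRs come from the 2026-05-06 run;
# the remaining values and cycle_p99_limit are placeholders for illustration.
cr_mode_summary = {
    "bindingLatencyP99Seconds": UNAVAILABLE,
    "shardCycleDurationP99Seconds": 0.792,
    "operatorRollupP99Seconds": 0.5,
    "operatorAckP99Seconds": 5.0,
    "shardShortfalls": 0,
    "coordinatorApplyErrorRate": 0.0,
    "operatorOutboxDropsPerSec": 0,
    "loadgenCRsActive": 50_000,
}
results = evaluate_gates(cr_mode_summary, binding_p99_slo=60, cycle_p99_limit=1.0, target_crs=50_000)
print(passed(results), results["bindingLatencyP99Seconds"])
```

Keeping "skipped" distinct from "met" in the result lets a summary show that the binding-latency gate was not evaluated, rather than implying it passed.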
The legacy histogram is still exposed as shardProvisioningLatencyP{50,99}Seconds for diagnostic reading — its p50 remains a real signal, its p99 is an artefact in steady state. Profile comments and dashboard panels label it as diagnostic.
Future direction. When all profiles run Pod-mode (the realistic harness from M31/M33 + the dev-5k-pods-loopback validation from M43d), the per-Pod metric covers every release-gate path. The legacy histogram retires as a deprecated diagnostic, and the runner can drop the metric entirely.
We don’t bump CR-mode profile overrides to game the legacy metric. Gaming an unreliable metric to make a gate pass is worse than not gating on it at all — it normalises the artefact and disguises future regressions.
Addendum (2026-05-07): M44 — Pod-mode is the default
The previous addendum left CR-mode as the default and Pod-mode as opt-in. After running scaleway-500k under that arrangement we flipped it: Pod-mode is the realistic shape, it’s what users feel, so it’s the default. CR-mode becomes the explicit opt-out for profiles where the per-cluster Pod scale doesn’t fit the kwok kine budget without separate sizing work.
Concrete changes:
- Load-driver default: empty loadProfile.mode normalises to "pods" in loadProfile. Was "cr" (legacy shape).
- Chart kwok defaults bumped to the dev-5k-pods-loopback floor (apiserver + workload at 500m/1Gi req, 2/2Gi lim, 1Gi tmpfs). dev-5k-pods proved that’s the minimum where kine sqlite stops warning under combined CR + Pod + UpcomingNode + Node write load.
- entrypoint-workload.sh defaults POD_MODE env var to pods; the chart only emits the env when loadProfile.mode is non-empty, so unset → default-Pod-mode. Both pod-shim and unschedulable-pod-controller start by default.
- Cloud profiles resized where the new floor doesn’t fit the prior pool: scaleway-{50k,500k} 2× PRO2-M → 2× PRO2-L; failover-* 2× PRO2-M → 2× PRO2-L. Cost ~doubles, buys the user-facing binding-latency gate at scale.
- CR-mode opt-outs kept on profiles where Pod-mode at the per-cluster scale would need separate sizing work: scaleway-{1m,5m} (10K Pods/cluster), scaleway-{1m,5m}-reprovision (1:1 reprovisioning regime, gated on convergence rate not binding latency), homelab-500k (homelab can’t fit 500-cluster Pod-mode floor), cloud-5m (5000 clusters), thundering-herd (peak 5K Pods/cluster), local-50k (M5 Max can’t fit Pod-mode floor at 50 clusters).
- bindingLatencyP99Seconds SLO override added on every Pod-mode profile so pass() actively gates on it. CR-mode profiles still skip the gate via the -1 sentinel.
- scaleway-500k-pods.yaml deleted (folded into scaleway-500k.yaml; the regular profile is now Pod-mode by default).
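The defaulting chain in the first and third items above can be sketched as three small functions. All names are hypothetical, and rejecting unknown mode values is an assumption of the sketch; the real load-driver, chart template, and entrypoint script may behave differently at the edges.

```python
from typing import Optional


def normalise_mode(mode: Optional[str]) -> str:
    """Empty or unset mode means Pod-mode; "cr" is the explicit opt-out."""
    mode = (mode or "").strip()
    if mode == "":
        return "pods"
    if mode not in ("pods", "cr"):
        # Assumption: unknown values are rejected rather than silently defaulted.
        raise ValueError(f"unknown loadProfile.mode: {mode!r}")
    return mode


def workload_env(mode: Optional[str]) -> dict:
    """Emit POD_MODE only when loadProfile.mode is set, as the chart is described to do."""
    return {"POD_MODE": mode} if mode else {}


def effective_pod_mode(env: dict) -> str:
    """Entrypoint-style default: an unset POD_MODE falls back to pods."""
    return env.get("POD_MODE", "pods")


for profile_mode in ("", "pods", "cr"):
    print(repr(profile_mode), normalise_mode(profile_mode), effective_pod_mode(workload_env(profile_mode)))
```

Both layers default independently to Pod-mode, so a profile that never mentions the mode ends up in Pod-mode whether the value is resolved by the load-driver or by the entrypoint.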
The legacy fingerprint histogram stays exposed as a diagnostic until all profiles are Pod-mode; once the 1m/5m reshape lands the metric retires.