ADR-0017: per-CR binding latency is the user-facing metric; fingerprint fan-out is its own thing

Status: Accepted

Date: 2026-05-07

Context

ADR-0014 established that BigFleet’s release gate is binding-latency p99 — what users feel from “I asked for capacity” to “my workload is running on it.” The runner (M32) wired that gate against bigfleet_shard_provisioning_latency_seconds, the only available histogram at the time, with this comment in pkg/metrics/metrics.go:

Wall-clock from first rollup observing a (cluster, profile fingerprint) to a matching machine reaching Configured. Per-CR granularity is not preserved; this measures fingerprint-level fan-out latency.

The scaleway-500k regression run on 2026-05-06 (run id 20260506-211219-scaleway-500k, commit 9ebc1c9) surfaced the gap honestly:

Algorithmic SLOs all green: cycle p99 = 792 ms, phase 1/2/3 = 1/1/15 ms, 0 shortfalls, 50 K/50 K active sustained.
bindingLatencyP99Seconds: 327.68 s — the histogram’s top bucket (32 768 / 100). The metric ramped to its maximum because at 50 clusters × 1 fingerprint each, the histogram only takes 50 samples, each measuring “first rollup observation → first matching machine Configured” across a 1 000-CR ramp window. That isn’t what a user feels per CR; it’s how long it took to provision the first machine of a brand-new fingerprint into a previously-empty pool.

Two things are conflated:

User-visible binding latency — per-CR, “from CR (or Pod) creation to the moment my workload can run on a configured machine.” Sub-second on a Pod-mode kind run; minutes on a real cloud-provider with cold provisioning.
Fingerprint fan-out latency — per-(cluster, fingerprint), “from first observation of a brand-new fingerprint to the first machine of that fingerprint reaching Configured.” Useful for capacity-planning conversations, irrelevant to release gating.

These are different metrics with different SLO targets. Treating one as a stand-in for the other gates releases on numbers that don’t represent what we promise.

Decision

Three changes:

1. New metric: per-Pod binding latency

A dedicated histogram measures per-Pod binding latency in Pod-mode runs. The bigfleet-scaletest-pod-shim observes both endpoints (Pod creation timestamp via the metadata.creationTimestamp field, Pod binding via its own clientset.CoreV1().Pods(ns).Bind call) and records the difference at the moment of binding:

bigfleet_scaletest_pod_bind_latency_seconds
  Help: Wall-clock from Pod.metadata.creationTimestamp to the
  moment the bigfleet-scaletest-pod-shim issues the binding
  subresource Create on a fake Node. Per-Pod granularity. This
  is the metric ADR-0014 names "binding-latency p99" — what
  users feel from "I asked for capacity" to "my Pod is running."
  Bucket layout: exponential 0.05 s → 102 s.

The runner’s bindingLatencyP99Seconds query prefers this metric. When it’s unavailable (legacy CR-mode profiles that don’t run the pod-shim), the runner falls back to the existing fingerprint histogram and the profile is expected to declare a profile-level slo.bindingLatencyP99Seconds override that reflects the per-fingerprint shape.
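
The prefer-then-fall-back selection described above could look roughly like this. It is a sketch under stated assumptions: p99Query and fakeQuery are hypothetical stand-ins for the runner's Prometheus client, and "no samples" is modelled as NaN.

```go
package main

import (
	"fmt"
	"math"
)

const (
	podBindMetric = "bigfleet_scaletest_pod_bind_latency_seconds"
	legacyMetric  = "bigfleet_shard_provisioning_latency_seconds"
)

// p99Query is a hypothetical stand-in for the runner's Prometheus query.
// It returns NaN when the metric has no samples.
type p99Query func(metric string) float64

// fakeQuery builds a p99Query from canned values; missing metrics yield NaN.
func fakeQuery(samples map[string]float64) p99Query {
	return func(metric string) float64 {
		if v, ok := samples[metric]; ok {
			return v
		}
		return math.NaN()
	}
}

// pickBindingP99 prefers the per-Pod histogram, falls back to the fingerprint
// histogram, and reports which source produced the value.
func pickBindingP99(q p99Query) (value float64, source string) {
	for _, metric := range []string{podBindMetric, legacyMetric} {
		if v := q(metric); !math.IsNaN(v) {
			return v, metric
		}
	}
	return math.NaN(), ""
}

func main() {
	// A CR-mode run: no pod-shim, so only the legacy histogram has samples.
	crMode := fakeQuery(map[string]float64{legacyMetric: 6.0})
	v, src := pickBindingP99(crMode)
	fmt.Printf("p99=%.2fs source=%s\n", v, src)
}
```

Returning the source alongside the value is what lets the summary record which histogram backed the verdict.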

2. Existing fingerprint histogram is renamed in spirit

bigfleet_shard_provisioning_latency_seconds keeps its name but its role changes from “binding latency proxy” to “fingerprint fan-out diagnostic.” The Help text is amended to make this explicit. The runner’s summary still surfaces it (informational), but the release gate doesn’t use it directly when a per-Pod metric is available.

3. CR-mode profiles use profile-level SLO overrides

Profiles that exercise CR-mode (load-driver creates CRs directly, no Pod-shim, no per-Pod histogram) declare a profile-level slo.bindingLatencyP99Seconds override that reflects the fingerprint-fan-out shape they actually measure. For scaleway-500k:

slo:
  bindingLatencyP99Seconds: 60   # fingerprint fan-out ≤ ramp window

This is honest: the profile’s binding latency IS fingerprint-grained, and the SLO target reflects that.

Consequences

Pod-mode profiles get an honest user-facing release gate. dev-5k-pods-loopback (and any future Pod-mode runs) measure the actual Pod-creation-to-Pod-bound latency. Sub-second on the fake provider; whatever the real provider’s bring-up takes in production.
CR-mode legacy profiles keep working with their existing fingerprint histogram, but the SLO target reflects what they actually measure. scaleway-500k’s slo.bindingLatencyP99Seconds: 60 is a documented profile shape, not a free pass.
Runner picks the right metric automatically. It tries bigfleet_scaletest_pod_bind_latency_seconds first; if Prometheus returns no samples (no pod-shim → no histogram), it falls back to the legacy provisioning histogram. The summary records which source was used so the verdict is reproducible.
The provisioning histogram becomes a planning tool. Operators reading kubectl get availablecapacity plus the histogram now have an honest number for “how long does it take to fan out a new fingerprint?” — the question they actually wanted answered.
scaleway-500k re-runs pass. The 327.68-s bucket is no longer treated as a release-blocking number; the profile-level override codifies “fan-out at this profile shape ≤ 60 s”, which is a defensible promise.
Future ADRs. When a per-CR (not per-Pod) latency becomes interesting — e.g. when measuring CR-mode profiles directly without Pod-mode infrastructure — the obvious next step is an analogous histogram in the unschedulable-pod-controller (CR creation → CR-Acknowledged) or in the operator (rollup ack timestamp → first matching machine Configured). Out of scope for this ADR; the per-Pod metric covers the production-shaped Pod-mode path which is what we recommend running.

Implementation notes

The per-Pod histogram is recorded inside the pod-shim, NOT inside BigFleet itself — it’s harness instrumentation. Real production deployments don’t run the pod-shim; their per-Pod latency comes from kube-scheduler’s own metrics (scheduler_pod_scheduling_duration_seconds etc.) plus the time to provision the underlying capacity. The harness’s pod-shim is a stand-in for the kube-scheduler chain, and so is the right place to measure stand-in latency.

Addendum (2026-05-07): the legacy histogram is a diagnostic, not a gate

The first scaleway-500k re-run after this ADR landed surfaced a stricter problem with the legacy fingerprint histogram: it isn’t just coarse-grained, it’s unreliable for SLO gating during steady-state soaks. Mechanism:

bigfleet_shard_provisioning_latency_seconds is observed once per (cluster, fingerprint) at the moment the first machine of that fingerprint reaches Configured. In scaleway-500k that’s 50 observations total, all during the ramp.
During the 30-min soak that follows, no new fingerprints are introduced, so no new observations land.
The runner’s PromQL uses rate(...[5m]) over the last 5 minutes of the soak, where the rate is zero. histogram_quantile over a zero-rate stable cumulative count returns the boundary of the bucket holding all prior observations — Prometheus’s float interpolation in this case settles on the +Inf-bucket boundary regardless of where the actual observations live.
Result: p99 reads as 0.01 × 2^15 = 327.68 s even though p50 still tracks the real ramp-time distribution (~6 s). The gate fires on a number that has no relationship to real latency.
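
To show where the 327.68 s figure comes from, here is a simplified model of classic-histogram quantile estimation. Whenever the target rank cannot be placed in a finite bucket, the estimate is pinned to the highest finite bound, 0.01 × 2^15. The bucket count (16 finite bounds) is inferred from that top bound, and the observations are hypothetical, not taken from the run. The runner's actual query and its window arithmetic are not reproduced.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one cumulative bucket: the count of observations <= upperBound.
type bucket struct {
	upperBound float64 // math.Inf(1) for the +Inf bucket
	cumulative float64
}

// legacyBounds returns 16 doubling bounds from 0.01 s (assumed layout).
func legacyBounds() []float64 {
	bounds := make([]float64, 16)
	for i, b := 0, 0.01; i < len(bounds); i, b = i+1, b*2 {
		bounds[i] = b
	}
	return bounds
}

// histogram counts observations into cumulative buckets and appends +Inf.
func histogram(bounds, observations []float64) []bucket {
	all := append(append([]float64{}, bounds...), math.Inf(1))
	out := make([]bucket, 0, len(all))
	for _, ub := range all {
		count := 0.0
		for _, o := range observations {
			if o <= ub {
				count++
			}
		}
		out = append(out, bucket{upperBound: ub, cumulative: count})
	}
	return out
}

// quantile estimates the q-quantile (0 < q <= 1) by linear interpolation
// inside the bucket that contains the target rank.
func quantile(q float64, buckets []bucket) float64 {
	n := len(buckets)
	if n < 2 || buckets[n-1].cumulative == 0 {
		return math.NaN() // nothing observed: the estimate is undefined
	}
	rank := q * buckets[n-1].cumulative
	i := sort.Search(n-1, func(i int) bool { return buckets[i].cumulative >= rank })
	if i == n-1 {
		// Rank lands in the +Inf bucket: clamp to the highest finite bound.
		return buckets[n-2].upperBound
	}
	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = buckets[i-1].upperBound, buckets[i-1].cumulative
	}
	inBucket := buckets[i].cumulative - below
	return lower + (buckets[i].upperBound-lower)*(rank-below)/inBucket
}

func main() {
	// Hypothetical data: 49 fan-outs near 6 s and one beyond the top bound.
	obs := make([]float64, 0, 50)
	for i := 0; i < 49; i++ {
		obs = append(obs, 6.0)
	}
	obs = append(obs, 400.0)
	h := histogram(legacyBounds(), obs)
	fmt.Printf("p50=%.2fs p99=%.2fs\n", quantile(0.50, h), quantile(0.99, h))
}
```

With only 50 observations, a single sample past the top bound is enough to push the p99 rank into the +Inf bucket, while p50 stays inside the bucket that holds the bulk of the data.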

The legacy histogram is fundamentally a fingerprint fan-out diagnostic — it answers “how long did it take to provision the first machine of a brand-new fingerprint?” That’s a useful planning signal but it can’t continuously measure binding latency, because in steady state there are no fresh first-machine-Configured events to observe.

Decision: the runner stops using the legacy histogram as a release-gate fallback. bindingLatencyP99Seconds queries only bigfleet_scaletest_pod_bind_latency_seconds (the M43c per-Pod histogram). In Pod-mode runs the gate fires on real per-Pod data. In CR-mode runs the metric is unavailable (NaN → -1), and pass() already treats -1 as “metric unavailable, skip the gate.” CR-mode runs are then gated only on the algorithmic SLOs that BigFleet actually controls:

shardCycleDurationP99Seconds — throughput envelope per ADR-0014.
operatorRollupP99Seconds ≤ 1 s.
operatorAckP99Seconds ≤ 12 s.
shardShortfalls = 0.
coordinatorApplyErrorRate ≤ 0.001.
operatorOutboxDropsPerSec = 0.
loadgenCRsActive ≥ 99.9 % of target throughout soak.
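
The skip-on-unavailable behaviour can be sketched as follows. This is not the runner's pass() implementation: the gate type, sanitize, and failedGates are hypothetical names, and only upper-bound gates are modelled (the loadgenCRsActive lower bound is left out for brevity).

```go
package main

import (
	"fmt"
	"math"
)

// unavailable is the sentinel recorded when a query returns NaN.
const unavailable = -1.0

// sanitize maps a NaN query result to the -1 sentinel.
func sanitize(v float64) float64 {
	if math.IsNaN(v) {
		return unavailable
	}
	return v
}

// gate is one upper-bound SLO check: it fails when observed > limit.
type gate struct {
	name     string
	observed float64
	limit    float64
}

// failedGates returns the names of gates that fail. A gate whose observed
// value is the sentinel is skipped rather than counted as a failure.
func failedGates(gates []gate) []string {
	var failed []string
	for _, g := range gates {
		if g.observed == unavailable {
			continue // metric unavailable: skip the gate
		}
		if g.observed > g.limit {
			failed = append(failed, g.name)
		}
	}
	return failed
}

func main() {
	// A CR-mode run: the per-Pod histogram has no samples, so the query is NaN.
	crMode := []gate{
		{"bindingLatencyP99Seconds", sanitize(math.NaN()), 60},
		{"operatorRollupP99Seconds", 0.3, 1},
		{"operatorAckP99Seconds", 4, 12},
		{"shardShortfalls", 0, 0},
	}
	fmt.Printf("failed gates: %d\n", len(failedGates(crMode)))
}
```

Keeping the sentinel distinct from any real latency means a skipped gate can never be mistaken for a passing measurement in the summary.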

The legacy histogram is still exposed as shardProvisioningLatencyP{50,99}Seconds for diagnostic reading: its p50 remains a real signal, while its p99 is an artefact in steady state. Profile comments and dashboard panels label it as diagnostic.

Future direction. When all profiles run Pod-mode (the realistic harness from M31/M33 + the dev-5k-pods-loopback validation from M43d), the per-Pod metric covers every release-gate path. The legacy histogram retires as a deprecated diagnostic, and the runner can drop the metric entirely.

We don’t bump CR-mode profile overrides to game the legacy metric. Gaming an unreliable metric to make a gate pass is worse than not gating on it at all — it normalises the artefact and disguises future regressions.

Addendum (2026-05-07): M44 — Pod-mode is the default

The previous addendum left CR-mode as the default and Pod-mode as opt-in. After running scaleway-500k under that arrangement we flipped it: Pod-mode is the realistic shape, it’s what users feel, so it’s the default. CR-mode becomes the explicit opt-out for profiles where the per-cluster Pod scale doesn’t fit the kwok kine budget without separate sizing work.

Concrete changes:

Load-driver default: an empty loadProfile.mode now normalises to "pods" in loadProfile; it previously normalised to "cr" (the legacy shape).
Chart kwok defaults bumped to the dev-5k-pods-loopback floor (apiserver + workload at 500m/1Gi req, 2/2Gi lim, 1Gi tmpfs). dev-5k-pods proved that’s the minimum where kine sqlite stops warning under combined CR + Pod + UpcomingNode + Node write load.
entrypoint-workload.sh defaults POD_MODE env var to pods; the chart only emits the env when loadProfile.mode is non-empty, so unset → default-Pod-mode. Both pod-shim and unschedulable-pod-controller start by default.
Cloud profiles resized where the new floor doesn’t fit the prior pool: scaleway-{50k,500k} 2× PRO2-M → 2× PRO2-L; failover-* 2× PRO2-M → 2× PRO2-L. Cost ~doubles, buys the user-facing binding-latency gate at scale.
CR-mode opt-outs kept on profiles where Pod-mode at the per-cluster scale would need separate sizing work: scaleway-{1m,5m} (10K Pods/cluster), scaleway-{1m,5m}-reprovision (1:1 reprovisioning regime, gated on convergence rate not binding latency), homelab-500k (homelab can’t fit 500-cluster Pod-mode floor), cloud-5m (5000 clusters), thundering-herd (peak 5K Pods/cluster), local-50k (M5 Max can’t fit Pod-mode floor at 50 clusters).
bindingLatencyP99Seconds SLO override added on every Pod-mode profile so pass() actively gates on it. CR-mode profiles still skip the gate via the -1 sentinel.
scaleway-500k-pods.yaml deleted (folded into scaleway-500k.yaml — the regular profile is now Pod-mode by default).
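
The default flip in the load-driver amounts to a small normalisation step, sketched below. The function name and the rejection of unknown values are assumptions of this sketch; only the mapping of empty to "pods" and the explicit "cr" opt-out come from the changes listed above.

```go
package main

import "fmt"

const (
	modePods = "pods"
	modeCR   = "cr"
)

// normalizeMode resolves loadProfile.mode: empty selects Pod-mode (the new
// default), explicit values are kept, and anything else is rejected.
func normalizeMode(mode string) (string, error) {
	switch mode {
	case "":
		return modePods, nil
	case modePods, modeCR:
		return mode, nil
	default:
		return "", fmt.Errorf("unknown loadProfile.mode %q", mode)
	}
}

func main() {
	for _, in := range []string{"", "pods", "cr"} {
		mode, err := normalizeMode(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> %q\n", in, mode)
	}
}
```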

The legacy fingerprint histogram stays exposed as a diagnostic until all profiles are Pod-mode; once the 1m/5m reshape lands the metric retires.