ADR-0020: Internal binding-latency SLO must respect the rollup interval
Status: Accepted
Date: 2026-05-08
Context
The scaleway-50k Pod-mode cloud run (M44 default; 50 kwok clusters × 1K Pods, 30-min soak) failed pass() with internalBindingLatencyP99Seconds 50.864 s > 5 s SLO. Two rounds of optimisation followed:
- Drop A (5bbae86…7668f50): fresh inv.Snapshot() per cycle, needs.Table.Snapshot index-sort, pod-shim binder MaxConcurrentReconciles 16 → 64, operator/pod-shim/upc QPS plumbed from one knob.
- These collapsed shard cycle p99 from 8.192 s to 0.294 s and demand cleared at 49996/49950 active. Internal binding latency stayed at 50.86 s.
The remaining latency is dominated by a structural floor the harness baked in but the SLO never accounted for: the operator’s rollupInterval defaults to 10 s (pkg/operator/operator.go:131). A Pod that arrives just after a rollup tick waits up to one full interval before the shard sees its CR, so the 10 s rollup interval is a hard p99 floor on internal binding latency.
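As a rough sanity check on that floor, assume (a modelling assumption, not something measured in the run) that a CR's arrival time is uniformly distributed within the rollup period T. The wait W until the next tick is then uniform on (0, T], which gives:

$$
P(W \le w) = \frac{w}{T}, \qquad w_{0.99} = 0.99\,T = 9.9\ \text{s} \quad \text{for } T = 10\ \text{s}.
$$

Under that assumption the rollup wait alone contributes roughly 9.9 s at p99 before any other stage is counted. A burst that lands entirely just after a tick would push the whole population toward the full 10 s.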
Stage-by-stage budget (post-fix, scaleway-50k shape):
| Stage | p99 contribution | Notes |
|---|---|---|
| Pod create → Unschedulable patch (pod-shim) | ~50 ms | trivial under any QPS |
| Unschedulable → CR (unschedulable-pod-controller) | ~50 ms | trivial |
| CR → next rollup tick | up to 10 s | rollupInterval default |
| Rollup processing (operator) | 60 ms | measured |
| Shard cycle (Phase 1 → Bootstrap) | 0.3 s | post-Drop-A |
| NodeStateUpdate → UpcomingNode | ~50 ms | apiserver write |
| Binder burst-drain tail (post-fix) | 4–5 s | M44.4 binder fix |
| Bind subresource | ~50 ms | trivial |
| Sum | ~15 s | |
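
The arithmetic behind the ~15 s total can be reproduced with a short sketch. The type and function names below (stage, budget, sum) are hypothetical and exist only for illustration; the figures are transcribed from the table, with the rollup wait entered at its "up to 10 s" worst case. Adding per-stage p99 figures is the same simplification the table makes and is not an exact end-to-end p99.

```go
package main

import (
	"fmt"
	"time"
)

// stage is one row of the budget table. lo and hi are equal when the
// table gives a single figure rather than a range.
type stage struct {
	name   string
	lo, hi time.Duration
}

// budget transcribes the table above (hypothetical variable name).
var budget = []stage{
	{"Pod create -> Unschedulable patch", 50 * time.Millisecond, 50 * time.Millisecond},
	{"Unschedulable -> CR", 50 * time.Millisecond, 50 * time.Millisecond},
	{"CR -> next rollup tick", 10 * time.Second, 10 * time.Second},
	{"Rollup processing", 60 * time.Millisecond, 60 * time.Millisecond},
	{"Shard cycle", 300 * time.Millisecond, 300 * time.Millisecond},
	{"NodeStateUpdate -> UpcomingNode", 50 * time.Millisecond, 50 * time.Millisecond},
	{"Binder burst-drain tail", 4 * time.Second, 5 * time.Second},
	{"Bind subresource", 50 * time.Millisecond, 50 * time.Millisecond},
}

// sum adds the per-stage figures and returns the low and high totals.
func sum(stages []stage) (lo, hi time.Duration) {
	for _, s := range stages {
		lo += s.lo
		hi += s.hi
	}
	return lo, hi
}

func main() {
	lo, hi := sum(budget)
	fmt.Printf("chain budget: %v to %v\n", lo, hi)
}
```

The totals come out at 14.56 s and 15.56 s, which the table rounds to ~15 s. The rollup wait accounts for about two thirds of either figure.
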
A 5 s SLO is therefore unachievable for any profile that runs the operator with the production-default 10 s rollup interval, regardless of how fast the rest of the chain becomes. The choice is binary: lower rollupInterval per-profile (treating it as a tunable rather than a production posture), or raise the SLO to acknowledge it as a floor.
Decision
The internal-binding-latency SLO is set to 15 s. The 10 s rollup interval is treated as a deliberate production posture, and the SLO is sized as rollupInterval + 5 s of chain-execution headroom.
Specifically:
- The runner’s default internalBindingLatencyP99Seconds becomes 15 s (was 5 s).
- The four cloud profiles (scaleway-{50k,500k,1m,5m}) carry an explicit internalBindingLatencyP99Seconds: 15 override rather than falling through to the default. Explicit is more self-documenting: anyone reading the profile sees the active SLO without having to know the default.
- A profile that lowers rollupInterval may also lower the SLO override accordingly. The relationship is documented as a guideline: internalBindingLatencyP99 ≤ rollupInterval + 5 s is the recommended ceiling at the cluster sizes this harness covers.
- ADR-0014’s tiered targets (5 s / 60 s / 300 s for critical / services / batch) describe the user-facing ceiling: the floor a deployer can promise to their workloads. ADR-0018 already established that the harness’s internalBindingLatency is internal-only, contributing to but not equal to the user-facing number. This ADR closes the loop: the harness’s internal SLO must respect the harness’s own structural floor.
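
The guideline in the list above could be expressed as a small check. This is a minimal sketch, not the runner's actual code: recommendedCeiling, checkSLO, and the error values are hypothetical names, and treating an SLO below the rollup interval as an error is this sketch's reading of the "hard p99 floor" argument from the Context section.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// chainHeadroom is the 5 s of chain-execution headroom from the Decision.
const chainHeadroom = 5 * time.Second

// Values taken from the ADR text, named here for illustration only.
var (
	defaultRollup = 10 * time.Second // operator rollupInterval default
	previousSLO   = 5 * time.Second  // SLO before this ADR
	cloudSLO      = 15 * time.Second // SLO after this ADR
)

var (
	errBelowFloor   = errors.New("SLO is below rollupInterval and cannot be met")
	errAboveCeiling = errors.New("SLO is looser than rollupInterval + headroom")
)

// recommendedCeiling returns rollupInterval + 5 s, the guideline ceiling.
func recommendedCeiling(rollupInterval time.Duration) time.Duration {
	return rollupInterval + chainHeadroom
}

// checkSLO reports whether an SLO override is consistent with the
// rollup interval the profile runs with.
func checkSLO(slo, rollupInterval time.Duration) error {
	switch {
	case slo < rollupInterval:
		return errBelowFloor
	case slo > recommendedCeiling(rollupInterval):
		return errAboveCeiling
	}
	return nil
}

func main() {
	fmt.Println("ceiling:", recommendedCeiling(defaultRollup))
	fmt.Println("previous SLO:", checkSLO(previousSLO, defaultRollup))
	fmt.Println("cloud SLO:", checkSLO(cloudSLO, defaultRollup))
}
```

With the 10 s default, the previous 5 s SLO is rejected as below the floor and the 15 s override passes, which matches the reasoning of this ADR.
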
Why not lower rollupInterval instead?
Lowering rollupInterval is the obvious alternative. We considered it and chose the SLO change because:
- 10 s is what production should default to. The rollup carries the cluster’s complete desired state to the shard; it’s a fixed-cost operation independent of churn. Firing it every 10 s instead of every 1 s gives 10× fewer messages on the operator → shard stream and 10× less aggregation work in the operator without meaningfully degrading user-facing latency (real-provider capacity-create time dwarfs 10 s rollup batching). The CLAUDE.md ## Hard rules enumerate the operator-side defaults audit (M39); 10 s rollup is one of those audited defaults.
- Sub-second rollups would mask regressions elsewhere. If we lowered rollup to 1 s the SLO could land at 5 s and the test would pass, but the harness would no longer exercise the production rollup-batching shape. Future regressions in chain stages other than rollup wouldn’t surface until they exceeded ~5 s individually rather than being visible against a 15 s envelope.
- The SLO is an internal regression detector, not a user promise (ADR-0018). A 15 s ceiling is more than tight enough to catch the kinds of regressions this harness was built to catch (Drop A’s 50 s blowup, Drop B’s worker-pool saturation, M11.x-class pathologies). User-facing latency lives in provider-conformance + production canaries (ADR-0018 §Real-provider validation).
What this ADR is not
- It does not raise ADR-0014’s user-facing tiers. Critical workloads still need ~5 s end-to-end binding for users to feel the system as responsive, but that 5 s budget is split across internal + provider, with provider eating most of it for cold-starts and minimal for warm-pool attaches.
- It does not lower rollupInterval per profile. If a future profile’s posture warrants sub-10 s rollup, that’s a separate decision recorded in that profile’s preamble.
- It does not weaken the regression-detector function of pass(). A 15 s ceiling still catches every Drop A / Drop B class issue we’ve encountered; what it doesn’t catch (sub-15 s drift) was already invisible to the shard-cycle envelope.
Consequences
- pass() default: internalBindingLatencyP99Seconds: 5 → 15.
- Profile overrides: dev-5k.yaml keeps its 10 (kind-local; in-process; should comfortably stay under). The four cloud profiles (scaleway-{50k,500k,1m,5m}) update their explicit 5 to 15.
- ADR-0018 rendering of “harness internal-only” gains a concrete floor: 15 s = ~10 s rollup + ~5 s chain. Linked from ADR-0014.
- The runner’s pass() comment block updates the bullet for internalBindingLatencyP99 ≤ 5 s to ≤ 15 s, with the rollup-interval rationale.
- The Grafana dashboard’s “Internal binding latency p99” stat panel threshold mapping updates from 5 to 15.
Validation
The next scaleway-50k cloud run (post-Drop-A, post-binder-fix) is expected to land at 10–18 s p99 if all stages behave as modelled. Anything outside that band — significantly under, or over — is a signal worth investigating: under means the harness isn’t exercising the rollup ceiling (rollupInterval was inadvertently shorter than 10 s), over means a downstream stage other than rollup is contributing more than its 5 s budget.
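
The triage rule above could be sketched as a three-way classification. The names here (classify and the verdict constants) are hypothetical, and the hard 10 s and 18 s cut-offs simplify the "significantly under, or over" wording. Only 50.864 comes from the original failing run; the other sample values are made up for illustration. The band is the modelled range and is separate from the 15 s pass() threshold.

```go
package main

import "fmt"

// Band edges in seconds, taken from the expected 10-18 s p99 range.
const (
	bandLowSeconds  = 10.0
	bandHighSeconds = 18.0
)

// Verdict strings (hypothetical wording) for the three outcomes.
const (
	verdictUnder  = "under: check that rollupInterval was not shorter than 10 s"
	verdictInBand = "within modelled band"
	verdictOver   = "over: a stage other than rollup exceeded its budget"
)

// classify maps a measured p99 (in seconds) onto the triage rule.
func classify(p99Seconds float64) string {
	switch {
	case p99Seconds < bandLowSeconds:
		return verdictUnder
	case p99Seconds > bandHighSeconds:
		return verdictOver
	default:
		return verdictInBand
	}
}

func main() {
	// 3.2 and 12.5 are illustrative; 50.864 is the original failing run.
	for _, v := range []float64{3.2, 12.5, 50.864} {
		fmt.Printf("%.3f s -> %s\n", v, classify(v))
	}
}
```
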