ADR-0021: Persistent execute pool — decouple action execution from cycle barrier
Status: Accepted
Date: 2026-05-08
Context
The shard’s cycle (pkg/shard/shard.go’s cycle function) computes a snapshot, runs Phase 1/2/3 to emit actions, then dispatches those actions across N goroutines (executeConcurrency) and waits for all of them to finish via wg.Wait() before moving on. The cycle’s context (cycleCtx, deadlined at s.cfg.CycleInterval) is the parent of every action’s context, so when the deadline fires, in-flight RPCs are cancelled mid-call.
Three things follow from that single barrier:
-
Cycle wall-clock = max(action_latency). With 256 workers and operator handler mean ~3 s, p99 ~7 s, the cycle’s slowest worker drags the whole cycle to ~7 s. Throughput becomes
executeConcurrency / max(action_latency)rather thanexecuteConcurrency / mean(action_latency). -
Cycle ctx cancellation cascades into machine state. Until M44.4 we marked machines
Failed(terminal) oncontext.Canceled; we softened it to aConfiguring → Idlerollback, but the cancellation still wastes the in-flight RPC and the action gets re-emitted next cycle. At burst, this thrashes. -
Phase emissions are coupled to drain rate, not arrival rate. A cycle can’t emit more actions until the previous batch completes. If the operator side is briefly slow (apiserver latency spike), the entire shard pauses.
The scaleway-50k cloud runs in this thread surfaced this hard. With every operator-side and harness-side optimisation in place — coalescer, recv-pump async, dispatch sem tuned, sand smoothing — we still hit a ~25 binds/sec ceiling on the chain because each cycle waits for the slow tail of its 256-action batch. That’s below the steady-state churn rate of ~41 CRs/sec. The chain converges (Failed cascade is dead), but it can’t sustain steady state.
Decision
Replace the per-cycle dispatch + wg.Wait pattern with a shard-scoped persistent worker pool that:
- Spawns
executeConcurrencygoroutines once, at shard start. - Workers loop forever, draining a single shard-scoped buffered channel
actionQueue. - Each worker derives its own per-action context (
shardCtx + ExecuteTimeout) — independent of any cycle deadline. - The cycle now just enqueues actions and returns immediately. No
wg.Wait. - The queue is bounded; if it’s full, the cycle drops new emissions and increments a
droppedcounter. Phase emissions are idempotent against an unchanged snapshot, so the next cycle re-derives.
Concretely:
type Shard struct { // ... existing fields ... actionQueue chan decision.Action // size: executeConcurrency * 2}
func (s *Shard) Run(ctx context.Context) error { s.actionQueue = make(chan decision.Action, s.cfg.ExecuteConcurrency*2) for i := 0; i < s.cfg.ExecuteConcurrency; i++ { go s.executeWorker(ctx) } // ... existing cycle loop ...}
func (s *Shard) executeWorker(ctx context.Context) { for { select { case <-ctx.Done(): return case a := <-s.actionQueue: execCtx, cancel := context.WithTimeout(ctx, s.cfg.ExecuteTimeout) _ = s.execute(execCtx, a) // err already logged in classifyExecuteError cancel() } }}
// in cycle, replace the dispatch+wait with:for _, a := range all { select { case s.actionQueue <- a: default: metrics.ShardActionsDropped.Inc() }}// no wg.WaitExecuteTimeout is a new config field (default 30 s; matches the existing BootstrapTimeout). Each action is independent; cycle deadlines do not gate them.
Consequences
What this changes (correctness)
-
Stale-snapshot risk increases. An action emitted at cycle T might run during cycle T+5 (handler-mean × queue-depth seconds later). By then the world may have moved. Mitigation already in place: every executor (
executeBootstrap,executeProvision,executeDrain) starts with a state-machine validation against the current inventory (vias.inv.Get), andapplyTransitionrejects invalid transitions. A stale action gets a soft-fail and the worker logs and moves on. Phase 1 re-emits next cycle. -
Duplicate emissions across cycles increase. If cycle T emits Bootstrap for machine M, and cycle T+1 fires before M has been transitioned to Configuring, T+1’s Phase 1 sees M as Idle and emits Bootstrap again. Both queue. The first worker to win the state-machine race transitions M; the second sees M is no longer Idle and soft-fails. Same mitigation as above. We could also add a per-machine in-flight set to prevent the duplicate emission, but it adds complexity without changing the outcome.
-
Cycle ctx cancellation no longer affects in-flight executes. The Configuring → Idle rollback transition added in M44.4 stays, but its trigger (cycle ctx fire) becomes rare — only an operator RPC that takes longer than
ExecuteTimeout(30 s default) cancels.
What this changes (observability)
-
ShardCyclePhaseDuration{phase="execute"}becomes near-instant — it now measures only the time to enqueue actions. The wall-clock of action execution moves to a new histogramShardActionDuration{kind=...}. -
ShardActionsDeferredsemantics change. Previously: actions truncated byMaxActionsPerCyclecap. Now: actions dropped because the queue is full. Same name, same intent (something the cycle wanted to emit but couldn’t), different mechanism. Documented in the metric’s help text. -
ShardActionQueueDepth(new gauge): current depth of the action queue. Useful for spotting back-pressure: queue depth steadily climbing means workers can’t keep up with cycle emissions. -
ShardExecuteInflight(existing gauge) keeps working — it’s now bounded byexecuteConcurrencyregardless of cycle phase.
What this doesn’t change
- The shard’s overall API and the CapacityProvider contract are untouched.
executeConcurrencyandMaxActionsPerCycleprofile knobs keep their meanings.BootstrapTimeoutkeeps its meaning (per-RPC blob fetch deadline).- All existing executors (
executeBootstrap,executeProvision,executeDrain) are unmodified — they already handle the “world has moved” case. The change is upstream, in how they’re invoked. - The static-stability invariant: clusters keep running with BigFleet entirely down. (Workers exit cleanly on
shardCtx.Done(); in-flight actions abort; nothing is in an undefined state.)
Throughput implication
Pre-change:
throughput = executeConcurrency / max(action_latency_p99) = 256 / 7 s = 36 actions/secPost-change:
throughput = executeConcurrency / mean(action_latency) = 256 / 3 s = 85 actions/secThat’s roughly a 2.4× lift assuming today’s operator handler latency. With the steady-state churn rate of 41 CRs/sec on scaleway-50k, that comfortably clears.
Alternatives considered
-
Just bump cycle ctx to 60 s. Reduces ctx_canceled but doesn’t fix the wg.Wait barrier — cycle still waits for slowest worker, just takes longer to do it. Throughput stays max-bound. Plus the cycle is now sluggish to react to new demand.
-
Bump executeConcurrency to 1024. With handler mean 3 s, queueing on the apiserver-side QPS limit grows. Each handler waits longer in the rate-limit queue. Saw this empirically when we bumped to 256 — handler p99 went 4 s → 32 s. More concurrency makes individual handlers slower under contention.
-
Hand-rolled priority queue per cluster. Doesn’t address the cycle barrier. Adds complexity.
-
Remove
wg.Waitbut keep per-cycle workers. Each cycle still creates and destroys 256 goroutines. Avoidable scheduler / GC overhead. Persistent pool is strictly better.
Validation
- Local
go test ./pkg/shard/...andmake scalepass. dev-5kkind run reaches steady-state SLO (single-cluster local check that the queue/idempotency path works at small scale).scaleway-50kcloud re-run: expect chain throughput ≥ 80 binds/sec sustained, steady-state internal binding latency p99 ≤ 15 s (ADR-0020 budget).