ADR-0059: Async-provider drain finalizes via reconcile — the post-Drain clear lands on terminal Idle, not the transitional Draining ack
Status
Accepted, 2026-06-23 (author decision, Lucy Sweet). P0 fix.
Context
Against a real async (providerkit) provider on --provider-addr, every Phase-3 Reclaim and Phase-2 Preempt drain failed, and cloud capacity could never be released — after demand dropped, the fleet stayed stuck holding nodes (action_execute_outcomes_total{kind="Reclaim",outcome="transition_error"} climbing, machines stuck Draining, Idle/Speculative never refilling). The shard log repeated drain: post-Drain transition: ... Draining state must have a cluster.
This is the third async-provider gap the bigfleet-demo real out-of-tree provider exposed (after ADR-0057’s reconcile→operator notify and ADR-0058’s concurrency↔fencing), and the same root blind spot each time: the in-process scaletest fake is synchronous, so the real providerkit path is the first to hit it.
The mechanism (pkg/shard/execute.go executeDrain):
1. applyTransition(StateDraining, nil): Configured → Draining, cluster preserved (valid: Draining must have a cluster, pkg/machine/machine.go).
2. Provider.Drain(...): the async kit returns the TransitionAck in its transitional state (Draining) immediately and runs teardown out-of-band; ack.Machine.State == Draining, not terminal Idle.
3. applyTransition(drained.State, func(m){ m.Cluster=""; m.Assigned*=0 }): with drained.State == Draining, this set Draining-without-a-cluster, tripping the invariant → the transition errored → the drain never completed; the machine was left Draining.
The terminal binding-clear is correct only for the terminal Idle state. The synchronous fake masked it: its Drain ack carries State=Idle, so step 3 ran applyTransition(Idle, clear), which is valid (Idle needs no cluster). executeDelete had already got this right — it gates its clear on the terminal Speculative state — so executeDrain was the lone unported site. Both Reclaim and Preempt route through executeDrain, so both failed.
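The failure can be reproduced in a few lines. The sketch below is a simplified model, not the code in pkg/shard/execute.go: the Machine type, the single AssignedPriority field standing in for Assigned*, and the helper names validate, clearBinding, and ungatedDrain are all hypothetical, and only the Draining-needs-a-cluster invariant is modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a simplified stand-in for the lifecycle states named in this ADR.
type State string

const (
	StateConfigured State = "Configured"
	StateDraining   State = "Draining"
	StateIdle       State = "Idle"
)

// Machine is a hypothetical, reduced record; the real one lives in pkg/machine.
type Machine struct {
	State            State
	Cluster          string
	AssignedPriority int // stands in for the Assigned* fields
}

var errNoCluster = errors.New("Draining state must have a cluster")

// validate models only the one invariant discussed here.
func validate(m Machine) error {
	if m.State == StateDraining && m.Cluster == "" {
		return errNoCluster
	}
	return nil
}

// applyTransition applies the new state and mutation to a copy and commits
// the copy only if the result passes validation.
func applyTransition(m *Machine, to State, mutate func(*Machine)) error {
	next := *m
	next.State = to
	if mutate != nil {
		mutate(&next)
	}
	if err := validate(next); err != nil {
		return err
	}
	*m = next
	return nil
}

// clearBinding is the terminal binding-clear from step 3.
func clearBinding(m *Machine) {
	m.Cluster = ""
	m.AssignedPriority = 0
}

// ungatedDrain mirrors the pre-fix steps 1 and 3. ackState is the state
// carried by the provider's Drain ack (step 2).
func ungatedDrain(m *Machine, ackState State) error {
	if err := applyTransition(m, StateDraining, nil); err != nil {
		return err
	}
	return applyTransition(m, ackState, clearBinding)
}

func main() {
	syncM := Machine{State: StateConfigured, Cluster: "c1", AssignedPriority: 5}
	fmt.Println(ungatedDrain(&syncM, StateIdle), syncM.State)

	asyncM := Machine{State: StateConfigured, Cluster: "c1", AssignedPriority: 5}
	fmt.Println(ungatedDrain(&asyncM, StateDraining), asyncM.State)
}
```

With an Idle ack the clear is valid and the machine lands at Idle. With a Draining ack the same clear is rejected and the machine stays Draining with its cluster, which is the stuck state described above.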
A second, quieter half: applyReconciledMachine (pkg/shard/reconcile.go) unconditionally preserves Assigned* from the existing record across a state change (the ADR-0057 / restart-rebuild path — the provider view carries assignment only as the shard_metadata echo, which reconcile nils). So a drain reaching Idle via reconcile would keep stale priority/penalty on the now-unbound machine.
Decision
Make the drain finalize on the terminal state, exactly as executeDelete and the ADR-0057 async-configure path do:
- executeDrain clears only on Idle. Gate the binding-clear on drained.State == machine.StateIdle. A synchronous provider lands directly at Idle and clears as before. An async provider returns a Draining ack; the machine is already Draining-with-cluster (step 1), so we leave it and let the terminal Idle, observed via applyReconciledMachine, finalize the drain (the same reconcile path ADR-0057 taught to emit the node-state update, so the operator learns the node drained).
- Reconcile clears assignment on a transition to an unbound state. In applyReconciledMachine, preserve Assigned* only while the machine stays bound (Configured/Configuring/Draining); on a transition to Idle/Speculative (an async drain or delete completion) leave the incoming zero values, so the binding and its assignment clear together. Bound→bound transitions (the async-configure Configuring → Configured) still preserve, unchanged.
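A minimal sketch of the two rules together, under stated assumptions: finishDrain and mergeReconciled are hypothetical reductions of executeDrain's post-ack step and applyReconciledMachine, AssignedPriority stands in for Assigned*, and the provider view of an Idle machine is assumed to arrive with an empty cluster and zero assignment.

```go
package main

import "fmt"

type State string

const (
	StateSpeculative State = "Speculative"
	StateIdle        State = "Idle"
	StateConfiguring State = "Configuring"
	StateConfigured  State = "Configured"
	StateDraining    State = "Draining"
)

// Machine is a hypothetical, reduced record.
type Machine struct {
	State            State
	Cluster          string
	AssignedPriority int // stands in for the Assigned* fields
}

// isBound reports whether a state carries a cluster binding.
func isBound(s State) bool {
	switch s {
	case StateConfiguring, StateConfigured, StateDraining:
		return true
	}
	return false
}

// finishDrain runs after the provider's Drain ack. m is already
// Draining-with-cluster; the clear happens only on a terminal Idle ack.
func finishDrain(m *Machine, ackState State) {
	if ackState != StateIdle {
		return // async ack: leave the record alone, reconcile finalizes
	}
	m.State = StateIdle
	m.Cluster = ""
	m.AssignedPriority = 0
}

// mergeReconciled folds the provider view (incoming, assignment zeroed)
// into the existing record. Assignment survives only while the machine
// stays bound.
func mergeReconciled(existing, incoming Machine) Machine {
	if isBound(incoming.State) {
		incoming.AssignedPriority = existing.AssignedPriority
	}
	return incoming
}

func main() {
	// Async drain: the Draining ack is a no-op, reconcile lands the Idle.
	m := Machine{State: StateDraining, Cluster: "c1", AssignedPriority: 5}
	finishDrain(&m, StateDraining)
	fmt.Println(m.State, m.Cluster, m.AssignedPriority)
	m = mergeReconciled(m, Machine{State: StateIdle})
	fmt.Println(m.State, m.AssignedPriority)

	// Async configure: bound→bound keeps the assignment.
	c := Machine{State: StateConfiguring, Cluster: "c1", AssignedPriority: 7}
	c = mergeReconciled(c, Machine{State: StateConfigured, Cluster: "c1"})
	fmt.Println(c.State, c.AssignedPriority)
}
```

Keying the preserve decision on the incoming state keeps the rule a single check: any arrival at Idle or Speculative clears, whichever action started it.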
The fake gains a DrainStaged option (the teardown twin of ConfigureStaged/CreateStaged): Drain returns Draining and the binding clears only at the CompleteStaged-driven terminal Idle. This lets the in-process fake model the async-drain path the synchronous default masked, so the class is now unit-testable in-process.
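To make the staged behavior concrete, here is a toy fake. Only the names DrainStaged and CompleteStaged come from this ADR; the Fake type, its constructor, AddConfigured, and every signature shown are assumptions for illustration and may not match the real scaletest fake.

```go
package main

import "fmt"

type State string

const (
	StateConfigured State = "Configured"
	StateDraining   State = "Draining"
	StateIdle       State = "Idle"
)

// Machine is a hypothetical, reduced provider-side record.
type Machine struct {
	State   State
	Cluster string
}

// Fake is a toy in-process provider. With DrainStaged set, Drain acks in
// the transitional Draining state and waits for CompleteStaged.
type Fake struct {
	DrainStaged bool
	machines    map[string]*Machine
}

func NewFake(drainStaged bool) *Fake {
	return &Fake{DrainStaged: drainStaged, machines: map[string]*Machine{}}
}

// AddConfigured seeds a machine bound to the given cluster.
func (f *Fake) AddConfigured(id, cluster string) {
	f.machines[id] = &Machine{State: StateConfigured, Cluster: cluster}
}

// Drain returns the ack. Synchronous mode goes straight to Idle and clears
// the binding; staged mode returns Draining with the cluster still set.
func (f *Fake) Drain(id string) (Machine, error) {
	m, ok := f.machines[id]
	if !ok {
		return Machine{}, fmt.Errorf("unknown machine %q", id)
	}
	if f.DrainStaged {
		m.State = StateDraining
		return *m, nil
	}
	m.State, m.Cluster = StateIdle, ""
	return *m, nil
}

// CompleteStaged finishes a staged drain: terminal Idle, binding cleared.
func (f *Fake) CompleteStaged(id string) (Machine, error) {
	m, ok := f.machines[id]
	if !ok || m.State != StateDraining {
		return Machine{}, fmt.Errorf("machine %q has no staged drain", id)
	}
	m.State, m.Cluster = StateIdle, ""
	return *m, nil
}

func main() {
	f := NewFake(true)
	f.AddConfigured("m1", "c1")
	ack, _ := f.Drain("m1")
	fmt.Println(ack.State, ack.Cluster) // Draining c1
	done, _ := f.CompleteStaged("m1")
	fmt.Println(done.State, done.Cluster == "") // Idle true
}
```

A test can then call Drain, assert that the shard record is still Draining-with-cluster, call CompleteStaged, and assert that reconcile lands the cleared Idle.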
Consequences
- Phase 2/3 can release capacity against async out-of-tree providers. Reclaim and Preempt drains complete (Configured → Draining → [reconcile] → Idle); Idle/Speculative refill; the autoscaler can shrink, not just grow.
- Symmetric with the async lifecycle. Drain now mirrors async Configure (Idle → Configuring → [reconcile] → Configured) and Delete (terminal-gated clear). The worker dispatches; reconcile finalizes the out-of-band terminal transition and notifies the operator.
- No stale assignment on drained slots. A drained Idle machine carries zeroed Assigned*, so the per-penalty-bucket inventory metric and any Assigned*-keyed logic see it as the unbound slot it is.
- Static stability preserved. Both changes are shard-local (execute path + reconcile ingest); no pkg/shard → pkg/coordinator dependency. The synchronous path is byte-identical (instant Drain → Idle → clears, exactly as before), so every existing scaletest/sim is unaffected.
- Test-coverage gap closed at the harness layer. DrainStaged + the new executeDrain async integration test and the reconcile-finalize unit tests exercise the async-drain path the fake previously couldn’t. Combined with ADR-0057/ADR-0058, this is the third instance of the same lesson: a real async provider in CI would have caught all three; that integration test remains the worthwhile standing follow-up.