ADR-0003: Shard inventory snapshots are eventually consistent on the cycle hot path

Status: Superseded by M44.4 Drop A — the shard cycle switched to synchronous Snapshot(), making the background fold goroutine redundant; fold goroutine and live triple-indexes removed at M66.1.

Date: 2026-05-02

Context

The shard’s runCycle reads an *inventory.Snapshot once per cycle and feeds it to reconcile, Phase 1, Phase 2, and Phase 3. Pre-M11.19, every cycle synthesised a fresh snapshot under the inventory’s read lock by walking byID and rebuilding the per-state and per-(state, instance-type) index slices. At 500K machines that walk dominated the per-cycle compute (BenchmarkShardCycle_Steady showed that the snapshot build accounted for ≈700 ms of each cycle on an M5 Max).

The snapshot is read-mostly. The cycle reads it; writes happen on the hot path through Insert / Apply / Remove (driven by reconcile, execute, and the post-RPC applyTransition). The build cost is O(N) regardless of how few machines actually changed since the previous snapshot.

Three candidate shapes were considered when this decision was made:

1. Synchronous fold on every read. The pre-M11.19 status quo: Snapshot() builds a fresh O(N) view on every call. Always fresh; always pays full cost.
2. Lazy fold with threshold. Track a “dirty count” of writes since the last fold; rebuild only when it crosses a threshold; otherwise return the cached snapshot. Bounded staleness; threshold has to be tuned.
3. Background fold with debounce. A goroutine watches a signal channel that each write pings; it folds on debounce intervals and publishes the result through atomic.Pointer[Snapshot]. The cycle reads the pointer in O(1). Staleness is bounded by foldDebounce + buildTime; no tuning required beyond the debounce.

We chose Option 3, the background fold with debounce.

The correctness question is whether eventual consistency on the cycle is safe. The safety net is inv.Apply’s state-machine validation: if the cycle decides on a stale snapshot and emits an action against a machine that has already moved, applyTransition rejects the re-attempt. Phase 1, 2, and 3 are all idempotent against an unchanged snapshot — re-deriving them on the next cycle costs nothing if the prior cycle’s actions already landed. So the cycle tolerates any finite-bounded staleness, not just small staleness.
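
The safety net described above can be illustrated with a minimal sketch. It is not the real inventory code: the State values, the legal transition table, and ErrIllegalTransition are hypothetical names chosen for the example, and the real Apply signature is not shown in this ADR. The point is only that validation runs against the machine's current state, so an action derived from a stale snapshot is rejected rather than applied twice.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical lifecycle state; the real state set is not part of this ADR.
type State int

const (
	Pending State = iota
	Running
	Draining
	Terminated
)

// legal lists the allowed transitions of this illustrative state machine.
var legal = map[State][]State{
	Pending:  {Running, Terminated},
	Running:  {Draining},
	Draining: {Terminated},
}

// ErrIllegalTransition is an assumed sentinel error for rejected transitions.
var ErrIllegalTransition = errors.New("illegal transition")

// Inventory is a toy stand-in that tracks only the current state per machine ID.
type Inventory struct {
	byID map[string]State
}

// Apply validates the move against the machine's current state,
// not against whatever state the caller saw in its snapshot.
func (inv *Inventory) Apply(id string, to State) error {
	cur, ok := inv.byID[id]
	if !ok {
		return fmt.Errorf("machine %q unknown: %w", id, ErrIllegalTransition)
	}
	for _, next := range legal[cur] {
		if next == to {
			inv.byID[id] = to
			return nil
		}
	}
	return fmt.Errorf("machine %q: %d -> %d: %w", id, cur, to, ErrIllegalTransition)
}

func main() {
	inv := &Inventory{byID: map[string]State{"m1": Pending}}

	// A write lands after the cycle's snapshot was taken.
	_ = inv.Apply("m1", Running)

	// The cycle still sees Pending in its stale snapshot and emits the same action again.
	err := inv.Apply("m1", Running)
	fmt.Println(errors.Is(err, ErrIllegalTransition)) // true
}
```

The rejected call leaves the machine in Running, so the next cycle, reading a fresher snapshot, has nothing left to do for it.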

Decision

The inventory exposes two snapshot APIs with different freshness contracts:

Inventory.Snapshot(): synchronous, fresh, O(N). Builds a new snapshot under the read lock and returns it. Updates the cached pointer as a side effect. Tests and any caller that needs strict consistency with the most recent write use this.
Inventory.CycleSnapshot(): atomic.Pointer load, O(1). Returns the most-recently-folded cached snapshot. May be stale by up to foldDebounce + buildTime (default 250 ms + buildTime). The shard’s cycle hot path is the only intended caller.
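
A minimal sketch of the two contracts, under stated assumptions: the Snapshot type is reduced to a slice of IDs, the field names (mu, byID, cached) are illustrative, and the background fold is deliberately left out so the staleness of the cached pointer is easy to see. Only the method names Snapshot, CycleSnapshot, and Insert come from this ADR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Snapshot is a simplified stand-in for the real per-state indexes.
type Snapshot struct {
	IDs []string
}

// Inventory holds the live data plus the most recently published snapshot.
type Inventory struct {
	mu     sync.RWMutex
	byID   map[string]struct{}
	cached atomic.Pointer[Snapshot]
}

// New returns an inventory whose cached snapshot starts out empty.
func New() *Inventory {
	inv := &Inventory{byID: make(map[string]struct{})}
	inv.cached.Store(&Snapshot{})
	return inv
}

// Insert adds a machine. In this sketch nothing refreshes the cache on write.
func (inv *Inventory) Insert(id string) {
	inv.mu.Lock()
	inv.byID[id] = struct{}{}
	inv.mu.Unlock()
}

// Snapshot builds a fresh O(N) view under the read lock and
// publishes it to the cache as a side effect.
func (inv *Inventory) Snapshot() *Snapshot {
	inv.mu.RLock()
	s := &Snapshot{IDs: make([]string, 0, len(inv.byID))}
	for id := range inv.byID {
		s.IDs = append(s.IDs, id)
	}
	inv.mu.RUnlock()
	inv.cached.Store(s)
	return s
}

// CycleSnapshot is an O(1) atomic load of whatever was last published.
func (inv *Inventory) CycleSnapshot() *Snapshot {
	return inv.cached.Load()
}

func main() {
	inv := New()
	inv.Insert("m1")

	stale := len(inv.CycleSnapshot().IDs)  // cache not refreshed yet
	fresh := len(inv.Snapshot().IDs)       // synchronous build sees the write
	seeded := len(inv.CycleSnapshot().IDs) // cache was updated by Snapshot()

	fmt.Println(stale, fresh, seeded) // 0 1 1
}
```

The first read in main is the write-then-CycleSnapshot() pattern that looks broken in a test; calling Snapshot() first avoids it.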

A background goroutine (foldLoop) is started in inventory.New. Writes through Insert / Apply / Remove send a non-blocking signal on a buffered foldChan after releasing the inventory lock. The fold goroutine drains the channel with a debounce window, builds a fresh snapshot under the read lock, and publishes it through atomic.Pointer.

inventory.Stop() shuts the goroutine down deterministically. Test setup that needs strict consistency in the same cycle as a write calls the synchronous Snapshot() instead.
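
The fold mechanism described above might look roughly like the following. This is a sketch under assumptions, not the removed implementation: the Snapshot type is reduced to a count, New takes the debounce as a parameter for the sake of the example, and the stop and done channels are one plausible way to get a deterministic shutdown. The names foldLoop, foldChan, Insert, Stop, and CycleSnapshot follow the text; everything else is illustrative.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Snapshot is reduced to a machine count for this sketch.
type Snapshot struct{ Count int }

type Inventory struct {
	mu       sync.RWMutex
	byID     map[string]struct{}
	cached   atomic.Pointer[Snapshot]
	foldChan chan struct{} // buffered: at most one pending fold request
	stop     chan struct{} // closed by Stop
	done     chan struct{} // closed when foldLoop exits
	debounce time.Duration
}

// New starts the fold goroutine. Taking debounce as an argument is an assumption.
func New(debounce time.Duration) *Inventory {
	inv := &Inventory{
		byID:     make(map[string]struct{}),
		foldChan: make(chan struct{}, 1),
		stop:     make(chan struct{}),
		done:     make(chan struct{}),
		debounce: debounce,
	}
	inv.cached.Store(&Snapshot{})
	go inv.foldLoop()
	return inv
}

// Insert writes under the lock, then signals without blocking.
// A full buffer means a fold is already pending, so the signal is dropped.
func (inv *Inventory) Insert(id string) {
	inv.mu.Lock()
	inv.byID[id] = struct{}{}
	inv.mu.Unlock()
	select {
	case inv.foldChan <- struct{}{}:
	default:
	}
}

func (inv *Inventory) foldLoop() {
	defer close(inv.done)
	for {
		// Wait for the first write signal (or shutdown).
		select {
		case <-inv.stop:
			return
		case <-inv.foldChan:
		}
		// Debounce window: let a burst of writes share one fold.
		select {
		case <-inv.stop:
			return
		case <-time.After(inv.debounce):
		}
		// Drop a signal that arrived during the window; the fold below covers it.
		select {
		case <-inv.foldChan:
		default:
		}
		inv.mu.RLock()
		s := &Snapshot{Count: len(inv.byID)}
		inv.mu.RUnlock()
		inv.cached.Store(s)
	}
}

// CycleSnapshot is the O(1) read used by the cycle.
func (inv *Inventory) CycleSnapshot() *Snapshot { return inv.cached.Load() }

// Stop ends the fold goroutine and waits for it to exit. Call it once.
func (inv *Inventory) Stop() {
	close(inv.stop)
	<-inv.done
}

func main() {
	inv := New(10 * time.Millisecond)
	defer inv.Stop()
	for _, id := range []string{"a", "b", "c"} {
		inv.Insert(id)
	}
	time.Sleep(100 * time.Millisecond) // comfortably longer than debounce + build
	fmt.Println(inv.CycleSnapshot().Count)
}
```

Because the signal is sent only after the write lock is released, any write whose signal was dropped or drained is already visible to the fold that follows, which is why losing signals does not lose updates.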

The shard’s runCycleCapturing uses CycleSnapshot(). No other production caller currently uses it, and adding new callers should be a documented decision.

Consequences

Per-cycle snapshot read is O(1). The ~700 ms snapshot build at 500K machines is no longer paid per cycle on the hot path.
The cycle is eventually consistent against inventory writes. Acceptable because every action emitted by Phase 1/2/3 is idempotent and applyTransition rejects illegal re-attempts. The next cycle re-derives anything missed.
Two APIs, deliberate split. Tests and code that read after a write keep using Snapshot(). Mixing the APIs is a footgun (a test that writes and then calls CycleSnapshot() will see stale data and look broken); the split is documented and the test convention is to call Snapshot() once before any cycle invocation.
Foreground responsibilities are minimal. Writers send a non-blocking signal; the goroutine does the work. No write path can be blocked by the fold.
foldDebounce is a knob. 250 ms is the default; under sustained churn it caps the fold rate at 4 folds/sec (1 / 250 ms) at the cost of up to 250 ms of staleness on top of buildTime. If a future workload needs tighter freshness, lower the debounce; if fold CPU becomes a concern, raise it. The knob is not exposed via Config today, and production callers don’t tune it.
Restart behaviour is the obvious thing. A fresh Inventory starts with an empty cached snapshot; the first Snapshot() call seeds the cache. Tests rely on this for warm-up.
The synchronous API is not deprecated. It is the freshness-strict path and remains a first-class API. This ADR does not make it less canonical — it adds a second API for a different consumer.