ADR-0005: The provider boundary is the validation point; reconcile trusts domain types
Status: Accepted
Date: 2026-05-02
Context
The shard’s provider.Provider interface returns domain types (pkg/machine.Machine), not proto messages. Two implementations live behind the interface today: the in-tree pkg/provider/fake (constructs machine.Machine values directly in-process) and pkg/provider/grpcadapter (wraps a real gRPC CapacityProvider client and converts proto → domain via pkg/conv on every call).
Pre-M11.24a, pkg/shard/reconcile.go re-routed every machine returned by provider.List through conv.MachineToProto + conv.MachineFromProto before applying it to inventory. The comment justifying this read:
MachineFromProto round-trips through the proto-shaped Machine from the provider; here the provider is already returning domain types, so the conversion is a no-op clone (we still route through MachineToProto + MachineFromProto to get the validation paths exercised on every reconcile).
In other words: belt-and-braces validation. The comment was honest about it being redundant.
The cost: at 500K inventory and a post-execute cycle pulling 1000 deltas, that’s 1000 × {proto allocation, label-map clone, resource-map clone, MachineFromProto allocation, second label/resource clone, validation walk}. Per-machine wall-clock is small but per-cycle aggregate showed up as ~70 ms even after the M11.22 incremental reconcile reduced the working set to deltas.
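As a rough illustration of the per-machine work listed above, the sketch below models the removed round trip. The Machine and protoMachine types, the validStates set, and the lowercase machineToProto / machineFromProto helpers are simplified stand-ins invented for this example; they are not the real pkg/machine or pkg/conv definitions.

```go
package main

import "fmt"

// Machine is a simplified, hypothetical stand-in for pkg/machine.Machine.
type Machine struct {
	ID        string
	State     string
	Labels    map[string]string
	Resources map[string]int64
}

// protoMachine is a hypothetical stand-in for the generated proto message.
type protoMachine struct {
	ID        string
	State     string
	Labels    map[string]string
	Resources map[string]int64
}

// validStates is an assumed state set used only by this sketch.
var validStates = map[string]bool{"pending": true, "ready": true, "draining": true}

// cloneMap returns a shallow copy of in.
func cloneMap[K comparable, V any](in map[K]V) map[K]V {
	out := make(map[K]V, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// machineToProto allocates a proto message and clones both maps.
func machineToProto(m Machine) *protoMachine {
	return &protoMachine{
		ID:        m.ID,
		State:     m.State,
		Labels:    cloneMap(m.Labels),
		Resources: cloneMap(m.Resources),
	}
}

// machineFromProto runs the validation walk, then clones both maps again.
func machineFromProto(p *protoMachine) (Machine, error) {
	if p.ID == "" {
		return Machine{}, fmt.Errorf("machine has empty ID")
	}
	if !validStates[p.State] {
		return Machine{}, fmt.Errorf("machine %q has unknown state %q", p.ID, p.State)
	}
	return Machine{
		ID:        p.ID,
		State:     p.State,
		Labels:    cloneMap(p.Labels),
		Resources: cloneMap(p.Resources),
	}, nil
}

// roundTrip is the shape of the work the old reconcile path did per machine:
// one proto allocation, four map clones, and one validation walk, all to
// produce a value equal to its input.
func roundTrip(m Machine) (Machine, error) {
	return machineFromProto(machineToProto(m))
}
```

For a valid input, roundTrip returns an equal Machine backed by fresh maps, which is why the original comment could describe it as a no-op clone.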
The validation re-run was load-bearing for two scenarios that no longer apply:
- A future in-tree provider that constructed invalid machine.Machine values directly. Defensive. We have no such provider, and the provider.Provider interface contract is “return valid domain types.” Adding a misbehaving in-tree provider is a code-review concern, not a runtime concern.
- Catching grpcadapter regressions where the proto→domain conversion produced an invalid machine. Defensive. The grpcadapter validates exactly once at the conversion boundary; that’s the right place. Re-validating in reconcile catches the same bugs the unit tests for pkg/conv already catch.
Decision
The provider boundary is the validation point. reconcile trusts that provider.Provider.List returns valid machine.Machine values and applies them directly to inventory without round-tripping through proto.
Validation responsibilities are explicit:
- pkg/provider/fake validates by construction (the in-process fake creates machine.Machine values from typed inputs; invalid combinations are caught at write time, not at read time).
- pkg/provider/grpcadapter validates at the proto→domain conversion via pkg/conv (MachineFromProto performs state validation and shape checks). This conversion happens once per RPC response, not once per reconciled machine on the shard side.
- pkg/shard/reconcile assumes the input is valid and applies it to the inventory. State-machine correctness on the apply path is enforced by inventory.Apply (which validates state transitions through machine.CanTransition); that’s the safety net for “bad data slipped through anyway.”
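The division of responsibility above can be sketched roughly as follows. Every name here (wireMachine, fromWire, adapter, reconcile, errInvalidMachine) is hypothetical and greatly simplified; in particular, this ADR does not say whether the real adapter rejects a whole response or only the offending entry, so rejecting the whole response is an assumption of the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Machine is a simplified, hypothetical stand-in for pkg/machine.Machine.
type Machine struct {
	ID    string
	State string
}

// wireMachine is a hypothetical stand-in for the proto message.
type wireMachine struct {
	ID    string
	State string
}

// Provider mirrors the idea of provider.Provider: it returns domain types.
type Provider interface {
	List() ([]Machine, error)
}

var errInvalidMachine = errors.New("invalid machine")

// fromWire is the single validation point for data crossing the gRPC boundary.
func fromWire(w wireMachine) (Machine, error) {
	if w.ID == "" {
		return Machine{}, fmt.Errorf("%w: empty ID", errInvalidMachine)
	}
	switch w.State {
	case "pending", "ready", "draining": // assumed state set
	default:
		return Machine{}, fmt.Errorf("%w: %q has unknown state %q", errInvalidMachine, w.ID, w.State)
	}
	return Machine{ID: w.ID, State: w.State}, nil
}

// adapter converts and validates once per RPC response.
type adapter struct {
	fetch func() ([]wireMachine, error) // stands in for the gRPC client call
}

func (a adapter) List() ([]Machine, error) {
	wire, err := a.fetch()
	if err != nil {
		return nil, err
	}
	out := make([]Machine, 0, len(wire))
	for _, w := range wire {
		m, err := fromWire(w)
		if err != nil {
			return nil, err // assumption: one bad entry rejects the response
		}
		out = append(out, m)
	}
	return out, nil
}

// reconcile trusts the provider contract and applies machines directly,
// with no proto round trip and no second validation pass.
func reconcile(p Provider, inv map[string]Machine) error {
	machines, err := p.List()
	if err != nil {
		return err
	}
	for _, m := range machines {
		inv[m.ID] = m
	}
	return nil
}
```

In this shape, an invalid machine from an out-of-tree provider surfaces as an error from List and never reaches the inventory, while reconcile itself contains no validation code.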
The early state-match short-circuit in applyReconciledMachine (M11.24a) is the other half of this change: when the local inventory already has the machine in the same state as the provider returned, do nothing. This is the common case after executeBootstrap / executeProvision / executeDrain apply the post-RPC ack.Machine state to inventory locally — the next reconcile sees state-match and returns immediately, paying only an inv.Get per delta machine.
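A minimal sketch of the short-circuit, together with the transition check that acts as the safety net, might look like the following. The Inventory type, its applies counter, the state names, and the transition table in canTransition are assumptions made for illustration; the real applyReconciledMachine, inventory.Apply, and machine.CanTransition may have different signatures and rules.

```go
package main

import "fmt"

type State string

const (
	Pending  State = "pending"
	Ready    State = "ready"
	Draining State = "draining"
)

// Machine is a simplified, hypothetical stand-in for pkg/machine.Machine.
type Machine struct {
	ID    string
	State State
}

// canTransition stands in for machine.CanTransition; the table is assumed.
func canTransition(from, to State) bool {
	switch from {
	case Pending:
		return to == Ready
	case Ready:
		return to == Draining
	}
	return false
}

// Inventory is a toy in-memory inventory; applies counts calls that
// reached the write path, so the short-circuit is observable.
type Inventory struct {
	machines map[string]Machine
	applies  int
}

func (inv *Inventory) Get(id string) (Machine, bool) {
	m, ok := inv.machines[id]
	return m, ok
}

// Apply enforces state-machine correctness for known machines.
func (inv *Inventory) Apply(m Machine) error {
	if cur, ok := inv.machines[m.ID]; ok && !canTransition(cur.State, m.State) {
		return fmt.Errorf("machine %q: illegal transition %s -> %s", m.ID, cur.State, m.State)
	}
	inv.machines[m.ID] = m
	inv.applies++
	return nil
}

// applyReconciledMachine returns early when the local state already matches
// the provider's view, paying only a Get; otherwise it defers to Apply.
func applyReconciledMachine(inv *Inventory, m Machine) error {
	if cur, ok := inv.Get(m.ID); ok && cur.State == m.State {
		return nil
	}
	return inv.Apply(m)
}
```

With an inventory that already holds a machine as Ready, reconciling the same machine as Ready leaves applies at zero, reconciling it as Draining goes through Apply, and reconciling it back to Pending is rejected by the transition check.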
Consequences
- Reconcile per-delta cost dropped from “two map clones + struct copies” to “one inv.Get + state compare”. Phase-dump on M5 Max showed the total cycle mean drop from 153 ms to 116 ms (−24 %) at 500K, dominated by the round-trip elimination.
- Single source of truth for machine validation. New providers register their validation at the boundary they own. The shard does not double-check.
- Adding an in-tree provider that returns malformed machines is now a real bug, not a no-op. The defence-in-depth was implicit; removing it makes the interface contract explicit. New in-tree providers (we expect zero — real providers live in separate repos) must validate inputs themselves.
- pkg/conv is unchanged and still used in the execute path. Reconcile is the only caller that dropped the validation round-trip; execute (pkg/shard/execute.go) still routes through conv.MachineFromProto(conv.MachineToProto(ack.Machine)) to materialise the post-RPC machine snapshot. That call site is rare (one per executed action, not one per inventory entry) and the per-RPC clone is a useful boundary for the assigned-* field merge that follows. Future cleanup may consolidate that as well; this ADR does not pre-empt it.
- Conformance suite implication. The provider.Provider Go-interface implementations that satisfy test/conformance/ must validate at the boundary; the conformance suite should grow a positive test that an out-of-tree provider’s invalid output is rejected at the gRPC adapter, not silently propagated.