ADR-0061: A shard-side read-only needs-inspection RPC (per-Need last-cycle verdict)
Status
Accepted, 2026-06-26 (author decision, Lucy Sweet). Author sign-off resolved the two open scope questions: build the full reason taxonomy (Tier 1 + Tier 2) now, and place the reads on a dedicated read-only service on the shard’s gRPC server. Builds on ADR-0027, ADR-0042, ADR-0048, ADR-0060; relates to ADR-0029.
Context
Operator tooling — a CLI, alerting, capacity planning, and the bigfleet-web-dashboard — wants to answer, for one cluster: which of my needs are satisfied vs unmet, by how much, and why? Today nothing can.
A cluster’s demand lives as the shard’s in-memory NeedsTable (pkg/needs): aggregated, bucketed, full-replacement-per-cluster Need rows ({ClusterID, Profile, AggregateResources vector, MinUnit, Group, AcquisitionParked}; the Profile is the (requirements, spread, priority, interruption_penalty_bucket, reclamation_penalty_bucket) aggregation key — ADR-0027). Every cycle the decision engine computes, per Need, the full verdict — satisfied/unmet, residual deficit vector, claimed machines, chosen Same domain, satisfiability, parked state — and then discards it. What leaves the shard is deliberately lossy: a fingerprint-aggregated, cluster-anonymous, requirements-stripped top-100 shortfall ledger plus shard-aggregate Prometheus gauges, surfaced to the leader via ReportShard and re-exposed by ListShardReports (ADR-0060).
So the three read surfaces an out-of-tree, read-only consumer has each fall short of “explore a cluster’s needs”:
- CapacityRequest CRDs (per managed cluster): the operator’s raw pre-aggregation input (one CR per pod), declared demand only, status Pending|Acknowledged (receipt, not satisfaction). Shows what a cluster asked for, never what the shard did.
- Prometheus: shard-aggregate, no cluster_id on any demand series.
- ListShardReports: unmet-only, requirements stripped, no cluster_id, fingerprint-aggregated.
The satisfied majority of demand, and the engine’s per-Need verdict, are observable nowhere. This is a genuine, general operator question (not a dashboard artifact), so the fix belongs in core as general-purpose tooling — consistent with ADR-0060, which put the read-only role and read RPCs in core and treated the dashboard as one consumer.
Decision
Add a shard-side, read-only, streaming needs-inspection RPC that returns the per-Need last-cycle verdict the engine already computes, plus the rows it currently discards.
1. Owner: the shard, never the coordinator
The data lives in the shard, and pkg/shard must not import pkg/coordinator (static stability, enforced by TestStaticStability_ShardDoesNotImportCoordinator). A coordinator route is doubly wrong: it would couple the shard upward, and the coordinator only retains the aggregated/anonymous/stripped top-100 shortfall ledger — the satisfied majority and per-cluster attribution were never sent up and cannot be derived leader-side (the proto already says “query the shard directly for the full requirement set”). So this is a new shard read surface, not a leader-local derivation.
2. Transport, gating, discovery
- A new read-only service registered on the shard’s existing gRPC server (which today serves only the operator-initiated Shard.Session bidi stream). A second RegisterXServer on the same grpc.Server is straightforward; a separate service keeps the cluster-SAN-gated Session and the readonly-gated reads cleanly delineated.
- Gated by bigfleet://readonly (or bigfleet://admin), mirroring the coordinator’s requireReadIdentity (ADR-0060); skipped on plaintext (the trust-the-network default, ADR-0048). The shard’s mTLS server already does RequireAndVerifyClientCert; identity is enforced per-RPC in the handler (a small new shard-side helper, mirroring the coordinator’s).
- Discovery: a client lists shards on the coordinator (ListShards, which carries each shard’s advertised address and is already readonly-gated) and dials the shard’s read RPC directly with the same readonly cert.
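The per-RPC gate described above can be sketched as a small pure function. This is a minimal illustration, not the shard’s actual helper: the name requireShardReadIdentity, the error value, and the assumption that the identity arrives as a URI SAN string extracted from the verified client certificate are all hypothetical. Only the two identity URIs and the skip-on-plaintext rule come from this ADR.

```go
package main

import (
	"errors"
	"fmt"
)

// Identity URIs named in the ADR. How they are carried in the client
// certificate (assumed here: URI SANs) is an assumption of this sketch.
const (
	readonlyIdentity = "bigfleet://readonly"
	adminIdentity    = "bigfleet://admin"
)

// errNotReadIdentity is a hypothetical error value for rejected callers.
var errNotReadIdentity = errors.New("caller lacks a read identity")

// requireShardReadIdentity is a hypothetical shard-side helper. plaintext is
// true when the server runs without mTLS, in which case the check is skipped
// (the trust-the-network default). uriSANs are the URI SANs taken from the
// already-verified client certificate.
func requireShardReadIdentity(plaintext bool, uriSANs []string) error {
	if plaintext {
		return nil // nothing to verify on a plaintext listener
	}
	for _, u := range uriSANs {
		if u == readonlyIdentity || u == adminIdentity {
			return nil
		}
	}
	return errNotReadIdentity
}

func main() {
	// A readonly cert passes; an unrelated placeholder identity is rejected.
	fmt.Println(requireShardReadIdentity(false, []string{"bigfleet://readonly"}))
	fmt.Println(requireShardReadIdentity(false, []string{"example://some-other-identity"}))
}
```

Keeping the check a pure function of (plaintext, SANs) lets each read handler call it first and makes it testable without a TLS handshake.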
3. Shape: streaming, per-cluster-filtered, trimmed
- Server-streaming, with a mandatory-or-cheap per-cluster filter (the NeedsTable is already byCluster-indexed, so the filter is free) and pagination. A large shard holds ~50K Need rows alongside its ~500K-machine inventory (the shard scale ceiling, docs/plan.md §3.4); a full unfiltered dump is the wrong default.
- Each row is a NeedView: the existing CapacityNeed wire message (it already carries requirements / aggregate_resources / min_unit / priority / spread / both penalty buckets / group) + cluster_id, arrival_unix_nanos, profile_fingerprint, and the last-cycle verdict below. It carries a cycle number + timestamp; an empty result means “not yet computed / rebuilding”, not “no demand”.
- Trimmed projection only. The verdict excludes the per-Need []machine.ID lists (ClaimedMachines etc.), because across ~50K needs those approach the shard’s whole claimed set and would be a material allocation/GC load. We keep counts, not ID lists. (An opt-in “include machine IDs for this one cluster” flag is a possible later extension.)
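To make the shape concrete, here is a minimal sketch of a per-cluster-filtered, paginated read over a trimmed snapshot. The types and the offset-based paging are assumptions for illustration (this ADR leaves page size and the paging mechanism to implementation), and the NeedView below carries only a few representative fields, not the full wire message.

```go
package main

import "fmt"

// NeedView is a hypothetical trimmed row: counts only, no machine ID lists.
type NeedView struct {
	ClusterID           string
	ProfileFingerprint  string
	Satisfied           bool
	ClaimedMachineCount int
}

// Snapshot is a hypothetical last-cycle snapshot, indexed by cluster so the
// per-cluster filter is a map lookup rather than a scan.
type Snapshot struct {
	Cycle     uint64
	ByCluster map[string][]NeedView
}

// Page returns up to pageSize rows for one cluster starting at offset, plus
// the next offset, or -1 when that cluster's rows are exhausted.
func (s *Snapshot) Page(clusterID string, offset, pageSize int) ([]NeedView, int) {
	rows := s.ByCluster[clusterID]
	if offset < 0 || offset >= len(rows) || pageSize <= 0 {
		return nil, -1
	}
	end := offset + pageSize
	if end >= len(rows) {
		return rows[offset:], -1
	}
	return rows[offset:end], end
}

func main() {
	s := &Snapshot{Cycle: 7, ByCluster: map[string][]NeedView{
		"cluster-a": {
			{ClusterID: "cluster-a", ProfileFingerprint: "p1", Satisfied: true, ClaimedMachineCount: 3},
			{ClusterID: "cluster-a", ProfileFingerprint: "p2", Satisfied: false},
			{ClusterID: "cluster-a", ProfileFingerprint: "p3", Satisfied: true, ClaimedMachineCount: 1},
		},
	}}
	// Walk the cluster's rows two at a time, as a streaming handler might.
	for off := 0; off != -1; {
		var page []NeedView
		page, off = s.Page("cluster-a", off, 2)
		fmt.Println("cycle", s.Cycle, "rows", len(page), "next", off)
	}
}
```

Each page is stamped with the snapshot’s cycle number, so a client can tell which cycle the rows belong to and distinguish an empty page from a snapshot that has not been computed yet.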
4. The verdict — and the reason taxonomy (the load-bearing decision)
The per-Need verdict: satisfied, residual_deficit (vector), claimed_machine_count, bootstrap_count, provision_count, same_domain, same_satisfiable, acquisition_parked, parked_age_cycles, age_cycles_unmet, and unmet_reason.
A feasibility audit established that the reason splits into two tiers, and the ADR will not paper over the difference:
Tier 1 — pure retain + serve (no decision-logic change). Derivable from state the engine already computes and discards:
- SATISFIED: NeedResult.Unsatisfied == false.
- TOPOLOGY_UNSATISFIABLE (for Same/co-location): Unsatisfied && hasSame && !SameSatisfiable (NeedResult.SameSatisfiable, corroborated by AcquisitionParked, ADR-0042).
- UNMET_OTHER: unsatisfied and not Same-unsatisfiable; a single coarse bucket covering “ran out of supply / starved by higher priority / preemption fell short”.
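The three Tier 1 rules reduce to one switch over facts the engine already has. The sketch below assumes a hypothetical function tier1Reason whose three boolean parameters stand in for NeedResult.Unsatisfied, “the Need has a Same constraint”, and NeedResult.SameSatisfiable; the real struct layout is not shown in this ADR.

```go
package main

import "fmt"

// UnmetReason holds the Tier 1 reason values named in this ADR.
type UnmetReason string

const (
	Satisfied             UnmetReason = "SATISFIED"
	TopologyUnsatisfiable UnmetReason = "TOPOLOGY_UNSATISFIABLE"
	UnmetOther            UnmetReason = "UNMET_OTHER"
)

// tier1Reason is a hypothetical helper that applies the Tier 1 rules in order.
func tier1Reason(unsatisfied, hasSame, sameSatisfiable bool) UnmetReason {
	switch {
	case !unsatisfied:
		return Satisfied
	case hasSame && !sameSatisfiable:
		return TopologyUnsatisfiable
	default:
		// Coarse bucket: supply ran out, starved by priority, or preemption fell short.
		return UnmetOther
	}
}

func main() {
	fmt.Println(tier1Reason(false, true, true))  // a satisfied gang
	fmt.Println(tier1Reason(true, true, false))  // Same constraint cannot be met
	fmt.Println(tier1Reason(true, false, false)) // plain unmet Need
}
```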
Tier 1 also requires one additive, no-logic-change plumbing edit: surface occ.CycleResult.Results on Phase1Result so satisfied plain Needs (not just gangs and unsatisfied rows) get a verdict row. The retention itself is a cheap capture at the existing recordShortfalls point in runCycle, where p1/p2 are already live.
Tier 2 — split UNMET_OTHER into PRIORITY_STARVED / NO_MATCHING_SUPPLY / PREEMPTION_EXHAUSTED. These three are not separable from any retained struct today; the distinguishing facts are computed as locals inside FindBasic/FindSame (did any MatchProfile+MinUnit-passing machine exist in any state, regardless of claim/precedence?) and inside Phase 2 (was the victim pool provably empty? did picks fall short of the deficit?) and then dropped. Surfacing them is cheap, behaviour-preserving instrumentation — new observational booleans/counts that do not change which machines get claimed — but it is still a hot-path code edit to the OCC find path and Phase 2, and it needs scale re-validation (allocation/GC inside the cycle SLO). Author decision (2026-06-26): build Tier 2 now — the full taxonomy ships in this ADR, with the Tier-2 instrumentation guarded as observation-only (no change to claim selection) and re-validated on a scale rung before release.
(Known limitation either way: a DoNotSchedule TopologySpread that can’t be placed isn’t separately derivable today — it would fall into a supply reason. Documented; foldable into a SPREAD_UNSATISFIABLE reason later if needed.)
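One way the Tier 2 split could consume the new observational facts is sketched below. This is an assumption-heavy illustration: the tier2Obs struct, its field names, and especially the precedence between the three reasons are not specified by this ADR. The only grounded facts are that NO_MATCHING_SUPPLY corresponds to no matching machine existing in any state, and that PREEMPTION_EXHAUSTED relates to Phase 2 falling short.

```go
package main

import "fmt"

// UnmetReason holds the Tier 2 reason values named in this ADR.
type UnmetReason string

const (
	NoMatchingSupply    UnmetReason = "NO_MATCHING_SUPPLY"
	PreemptionExhausted UnmetReason = "PREEMPTION_EXHAUSTED"
	PriorityStarved     UnmetReason = "PRIORITY_STARVED"
)

// tier2Obs is a hypothetical bundle of observation-only facts for one Need
// that is already known to be unmet and not Same-unsatisfiable.
type tier2Obs struct {
	// Any MatchProfile+MinUnit-passing machine existed, in any state,
	// regardless of claim/precedence.
	MatchingSupplyExists bool
	// Phase 2 considered this Need as a preemptor (assumed field).
	PreemptionAttempted bool
}

// tier2Reason applies an ASSUMED precedence: no supply first, then
// preemption that ran but left the Need unmet, then priority starvation.
func tier2Reason(o tier2Obs) UnmetReason {
	switch {
	case !o.MatchingSupplyExists:
		return NoMatchingSupply
	case o.PreemptionAttempted:
		return PreemptionExhausted
	default:
		return PriorityStarved
	}
}

func main() {
	fmt.Println(tier2Reason(tier2Obs{MatchingSupplyExists: false}))
	fmt.Println(tier2Reason(tier2Obs{MatchingSupplyExists: true, PreemptionAttempted: true}))
	fmt.Println(tier2Reason(tier2Obs{MatchingSupplyExists: true}))
}
```

Because the function only reads recorded observations, it can run after the cycle barrier and has no way to influence which machines get claimed.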
5. Retention mechanics
A needViewLedger rebuilt once per cycle at the recordShortfalls capture point, behind an RWMutex with build-then-swap (pre-build the snapshot, swap the pointer under the lock) — not the shortfall ledger’s rebuild-under-lock, to keep the per-cycle write critical section O(1) against ~50K rows. The ledger stores the trimmed projection (copying projected fields, not retaining NeedResult.Need pointers, which would pin the whole demand slice).
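The build-then-swap pattern can be sketched as follows. The ledger name matches this ADR, but the row fields, method names, and snapshot type are hypothetical. The point is that the O(n) copy happens before the lock is taken, so the write critical section is a single pointer assignment.

```go
package main

import (
	"fmt"
	"sync"
)

// needRow is a hypothetical trimmed projection: copied fields only, no
// pointers back into the engine's demand slice.
type needRow struct {
	ClusterID           string
	Satisfied           bool
	ClaimedMachineCount int
}

// snapshot is immutable once published.
type snapshot struct {
	cycle uint64
	rows  []needRow
}

type needViewLedger struct {
	mu   sync.RWMutex
	snap *snapshot // nil until the first cycle completes
}

// publish builds the next snapshot without holding the lock, then swaps the
// pointer under the write lock.
func (l *needViewLedger) publish(cycle uint64, rows []needRow) {
	next := &snapshot{cycle: cycle, rows: append([]needRow(nil), rows...)} // O(n), lock not held
	l.mu.Lock()
	l.snap = next // O(1) critical section
	l.mu.Unlock()
}

// current returns the last published snapshot; ok is false before the first
// cycle, which readers should report as "not yet computed", not "no demand".
func (l *needViewLedger) current() (s *snapshot, ok bool) {
	l.mu.RLock()
	defer l.mu.RUnlock()
	return l.snap, l.snap != nil
}

func main() {
	l := &needViewLedger{}
	if _, ok := l.current(); !ok {
		fmt.Println("not yet computed")
	}
	l.publish(1, []needRow{{ClusterID: "cluster-a", Satisfied: true, ClaimedMachineCount: 2}})
	s, _ := l.current()
	fmt.Println("cycle", s.cycle, "rows", len(s.rows))
}
```

Readers that already hold an older snapshot pointer keep a consistent view while the next cycle publishes, since a published snapshot is never mutated.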
Consequences
- Operators can finally explore per-cluster needs. Satisfied-vs-unmet, residual deficit, parked gangs, and (Tier 1) the Same-unsatisfiable reason are directly attributable to a cluster, from a CLI / alerting / the dashboard, with a bigfleet://readonly cert that cannot mutate the fleet.
- General-purpose, not dashboard-specific. The surface is a shard read RPC any operator tool consumes; the dashboard is one client. (Same posture as ADR-0060.)
- Static-stability-safe. A read of retained shard-local state: no pkg/coordinator import, no coordinator hot-path dependency, no write path. Tier 1 adds no new hot-path computation; Tier 2 adds observational field writes only (behaviour-preserving).
- Honest costs the build must own: (a) a trimmed per-Need snapshot at ~50K rows is a few MB rebuilt each cycle, a new steady-state allocation to reflect in the scale profiles and watch on uber-50k/uber-500k; (b) the read RPC must paginate/stream + per-cluster-filter; (c) it is last-cycle soft state: stale by one cycle, empty before the first cycle and (if ever leader-coupled, which it is not) after a restart.
- Reason completeness is a choice. Ship Tier 1 (coarse UNMET_OTHER) with zero engine-logic change, or Tier 1 + Tier 2 (full taxonomy) accepting the cheap hot-path instrumentation + scale re-validation. Resolved at acceptance in favour of Tier 1 + Tier 2; see “Decisions (resolved at acceptance)”.
Decisions (resolved at acceptance) + implementation notes
- Reason scope — RESOLVED: Tier 1 + Tier 2 (full taxonomy) now. The Tier-2 OCC/Phase-2 instrumentation is observation-only and must be proven not to change claim selection (golden-output test against the existing engine fixtures).
- Service placement (RESOLVED): a dedicated read-only service on the shard’s gRPC server, separate from Shard.Session, readonly-gated.
- Demand realism (ADR-0043): the satisfied-vs-unmet-with-reason view is the explicit operator need; recorded.
- Staleness/cost contract (implementation) — last-cycle (one-cycle-stale) is the contract; the RPC stamps cycle number + timestamp and labels “empty == not yet computed”. Pagination page size + read cadence settled during implementation against the 50K-rows/shard regime.
- Scale gating (implementation): the retained trimmed snapshot is a new steady-state allocation; reflect it in the scale profiles and re-measure on a rung (uber-50k minimum) before release, per “scale ceilings as we go”. Tier-2 instrumentation re-validated on the same rung.
Alternatives considered
- Reconstruct the NeedsTable in the dashboard from CapacityRequest CRDs by replaying the operator’s aggregation transform. Rejected as the primary mechanism: it is declared pre-aggregation demand with skew, shows zero shard outcome (no satisfied/unmet), and version-locks the dashboard to the operator’s bucketing. (Still worth shipping as a complementary “declared demand” view, since it needs no core change, but it is not the needs explorer.)
- Extend ListShardReports on the coordinator. Rejected: the coordinator never receives the satisfied rows, per-cluster attribution, or requirements; deriving them leader-side is impossible.
Amendment (2026-06-28): decision-context fields for the debugging surface
The dashboard’s Needs view is the most important table in the project; surfacing
only the verdict + counts told an operator that a need was unmet, but not enough
to debug why. This amendment adds three observation-only fields to NeedView,
each retaining a small bounded summary that the engine already computed each
cycle but discarded at the barrier (the same posture as the original
MatchingSupplyExists):
- matching_supply (per-state cardinality). occ.MatchingSupplyExists was a bool from a scan that early-returned on the first match; it now also counts matching machines per state (idle/configured/speculative), capped at matchingSupplyCap per state (capped flags a hit). Cheap for the common pinned shape (instance-type bucket sums); capped walk for unpinned. Quantifies the NO_MATCHING_SUPPLY (all zero) vs PRIORITY_STARVED (exist but held) split.
- preemption (victim summary). Phase 2 already walks/picks victims for an unmet preemptor; it now retains {victims_found, capacity_freed} on the unresolved need (still-short is the existing deficit): the PREEMPTION_EXHAUSTED detail.
- same_candidates (top-K domain coverage). The Same pre-pass already builds per-domain coverage buckets and chooses one; it now retains the top sameCandidateCap {domain, coverage_per_mille, satisfiable} (best-first) so the TOPOLOGY_UNSATISFIABLE verdict shows why the chosen domain won and how short the runners-up were.
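The capped per-state count behind matching_supply can be sketched as below. The constant name matchingSupplyCap comes from this amendment, but its value here, the state enum, the result struct, and the exact moment at which capped is set (in this sketch, when a machine is skipped because its state already reached the cap) are assumptions.

```go
package main

import "fmt"

// machineState is a hypothetical enum for the three states named above.
type machineState int

const (
	stateIdle machineState = iota
	stateConfigured
	stateSpeculative
	numStates
)

// matchingSupplyCap is named in the ADR; the value 3 is illustrative only.
const matchingSupplyCap = 3

// matchingSupply is a hypothetical bounded summary for one Need.
type matchingSupply struct {
	Counts [numStates]int
	Capped bool
}

// countMatchingSupply tallies already-matching machines per state, never
// letting a state's count exceed limit. Capped records that at least one
// machine was left uncounted, so the counts are lower bounds.
func countMatchingSupply(matching []machineState, limit int) matchingSupply {
	var out matchingSupply
	for _, st := range matching {
		if out.Counts[st] >= limit {
			out.Capped = true
			continue
		}
		out.Counts[st]++
	}
	return out
}

func main() {
	states := []machineState{stateIdle, stateIdle, stateIdle, stateIdle, stateConfigured}
	s := countMatchingSupply(states, matchingSupplyCap)
	fmt.Println(s.Counts, s.Capped)
}
```

All-zero counts correspond to the NO_MATCHING_SUPPLY case, while non-zero counts on an unmet Need point toward supply that exists but is held.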
It also populates the previously-defined-but-unwired parked_age_cycles.
Hot-path / static-stability review: all three are computed at/after the
existing cycle barrier, bounded by caps, feed no claim/Action decision, and add
no pkg/coordinator dependency — identical to the instrumentation this ADR
already accepted. The “which higher-precedence Need took my machine” displacement
chain is deliberately not retained (rebuilt each cycle, discarded at the
barrier, a per-machine audit trail the roadmap rules out); the dashboard
approximates it client-side from the full needs set.