Architecture Decision Records

ADR index

#	Status	Title
1	Accepted	Record architecture decisions
2	Accepted	Coordinator topology: single region
3	Superseded	Shard snapshot: eventual consistency on the cycle hot path
4	Accepted	Incremental reconcile via since-revision
5	Accepted	Provider boundary is the validation point
6	Accepted	Shard self-registers via heartbeat
7	Accepted	Cluster-to-shard binding is operator-chosen at deploy time
8	Amended by ADR-0048	Coordinator admin RPCs are leader-only and unauthenticated in v1
9	Accepted	Reclaim uses policy/v1 eviction and async drain
10	Accepted	Minimum Kubernetes version 1.31
11	Accepted	Bootstrap template is Helm values text template
12	Accepted	Helm charts published to GHCR as OCI artefacts
13	Accepted	Demand-to-inventory regimes and SLOs
14	Accepted	SLO posture: binding latency, not cycle wall-clock
15	Accepted	Realistic archetype improvements
16	Accepted	NodeStateUpdate carries node identity
17	Accepted	Per-CR binding latency vs fingerprint fanout
18	Accepted	Internal vs user-facing binding latency
19	Accepted	Phase 1 cloud vs bench discrepancy
20	Accepted	Internal binding latency SLO respects rollup interval
21	Accepted	Persistent execute pool
22	Accepted	`Need.Count` semantics — Pod count vs machine count, and where packing lives
23	Accepted	Real kube-scheduler in the scaletest harness, retire pod-shim’s binding role
24	Accepted	Co-location via podAffinity — the `CoLocation` CR field, roll-up aggregates
25	Accepted	The load-driver anchors sameRack groups — a gang-scheduler stand-in
26	Accepted	The scaletest harness must model the Speculative tier
27	Accepted	Roll-up demand is a constrained aggregate resource request, not `(per-pod-shape, count)`
28	Accepted	Cycle-p99 SLO is regime-parametric; the realistic catalog scales with Need cardinality
29	Accepted	Phase 1 Omega-style OCC — shared-state, commit-broker priority, dual-mode commits
30	Proposed	Incremental Phase 1 — delta-only processing as a layered optimization
31	Proposed	ParSync-style partitioned synchronization — conditional follow-on for raised per-shard ceilings
32	Accepted	Realistic catalog production-calibrated workload distribution
33	Rejected	Phase 1 supply-credit must respect bind readiness, not just provider state — superseded by ADR-0035
34	Accepted	Scaletest is bring-your-own-substrate
35	Accepted	Scaletest SLOs are measured at steady state under churn, not at ramp
36	Accepted	Phase 3 reclaim must not fire before a cluster’s first rollup has arrived
37	Accepted	Scaletest catalog node-affinity dimensions must be realistic — drop synthetic team/app label axes
38	Accepted	Scaletest workloads are controller-managed objects (Deployment / StatefulSet), not bare Pods
39	Accepted	One CapacityRequest per Pod — not per unschedulable Pod; the demand signal must be total, not unmet
40	Accepted	`Same`-domain attribution is unified — every supply-crediting site is domain-aware
41	Accepted	Sub-machine `Same`-Needs fold into atomic aggregates — `Same` is for cross-machine topology
42	Accepted	Unsatisfiable-regime `Same`-domain choice is sticky at equal coverage — switch only for strictly greater
42a	Accepted	ADR-0042 Addendum: aged acquisition parking — group identity on the wire, park after 8 unsatisfiable cycles, re-probe every 32
43	Accepted	Harness-observed triggers get a demand-realism check before mechanism ships
44	Accepted	Seed machine pools are sized by machine demand (pod share ÷ packing density, gang-aware per-zone floors), not workload weight
45	Accepted	Capacity counts for a cluster iff bound — Phase 3 reclaims on demand shrinkage only; BigFleet never models packing (author decision; supersedes its own first draft)
46	Accepted	Actuation safety rails — per-cluster reclaim blast-radius cap, empty-roll-up quarantine, global kill switch
47	Accepted	Coordinator quorum formation by ordinal join; offline snapshot restore as single-voter recovery
48	Accepted	Opt-in file-based mTLS with bigfleet:// URI SAN identity binding — supersedes ADR-0008’s transport posture
49	Accepted	Idle→Speculative release (paper §8’s other half) — per-CapacityType idle holds inside Phase 3; the hold window is the rail, not a cap
50	Accepted	Realism catalog (realistic.yaml) calibrated to a realistic MACHINE fleet via per-archetype node-packing density; GPU inference densified (8/node), training whole-machine (1); amends M66.2 + ADR-0044 (author decision)
51	Accepted	Same-domain choice follows THIS gang’s bindings (gang-granular attribution) — record Need.Group on the binding, break capped-coverage ties on gang-own coverage; refines ADR-0045, fixes M77g (author decision)
52	Accepted	The shard counts its own in-flight provision commitment against the deficit — credit attributed Creating machines in the coverage walk; amends ADR-0045’s “no in-flight discounting” one state earlier, fixes the #66/#74 pre-Configuring runway over-acquire (author decision)
53	Deferred	Two-axis machine-state model (provisioned × bound + op annotation) — scouted as an alternative to ADR-0052 and judged worse for the over-acquire (doesn’t fix it; 149-ref blast; raises correctness surface); deferred as a standalone future ergonomics initiative, wire-frozen, post-ladder (author decision)
54	Accepted	Steady pod-bind SLO reframe under an uncapped real scheduler — release gate moves off the end-to-end pod-bind p99 (uncapped-scheduler / reprovision-bound, not BigFleet’s deliverable) onto BigFleet’s capacity-delivery hops (configure-phase p99, Bootstrap success ratio, node-state-update p99, shortfalls==0) plus a loose end-to-end p50 liveness floor; the end-to-end p99 becomes informational (author decision)
55	Proposed	Coordinator-driven cross-shard rebalancing (realises bigfleet.md §9: transfer idle → reassign quota → cross-shard preempt) — a leader-only tiered rebalancer + the three stub handlers made real, reusing the M20/M69 drain path; anti-oscillation via cooldown + demand-pull invariant; machine-ids donor-resolved, ownership via shard-local persisted owned-set (author decided to BUILD not remove, 2026-06-19; Proposed pending staged-build greenlight)
56	Accepted	Coverage credit gated on observed node readiness — Option A (provider-contract obligation): Configure must not report Configured until the node is observed Ready, enforced by a new conformance cluster-join scenario (no shard change); closes the S1 silent false-Configured → phantom-capacity hole that bootstrapSuccessRatio (reported failures) and ADR-0033 (ramp throughput) do not cover (author decision)
57	Accepted	P0: shard emits NodeStateUpdate on reconcile-observed transitions + resyncs node state on operator (re)connect — notifyNodeState fired only from the worker/applyTransition path, so async (providerkit) providers, which reach terminal Configured via reconcile, were invisible to the operator (workload never schedules); the in-process fake masked it and the assumed reconnect resync was never built. Shard→operator only, static stability preserved (author decision)
58	Accepted	Shard→provider fencing high-water mark is per (shard_id, machine_id), not per shard_id — a single live shard’s concurrent execute pool draws monotonic sequence numbers but races the sends, so a per-shard mark fenced the shard against its own out-of-order arrivals on different machines (false zombie → ~30/120 machines bricked at execute-concurrency 32). Per-machine keying stays monotonic (shard serializes per machine) while letting concurrent cross-machine ops proceed; a true zombie is still caught on epoch. Dir 3 (serialize stamp+send) refuted (server-side goroutine race). Contract + conformance (B302 broadened) + snapshot-format change; surfaced by bigfleet-demo (author decision)
59	Accepted	P0: async-provider drain finalizes via reconcile — executeDrain applied the terminal binding-clear (Cluster/Assigned* = "") onto the transitional Draining ack an async (providerkit) provider returns, setting Draining-without-a-cluster and tripping the invariant → every Reclaim/Preempt drain failed, capacity never released. Fix: clear only on terminal Idle (mirroring executeDelete); the async Draining ack is left Draining-with-cluster and finalized via the ADR-0057 reconcile path, which also clears Assigned* on a transition to an unbound state. Fake gains DrainStaged to model it. Shard-local, sync path byte-identical; third bigfleet-demo async gap (author decision)
60	Accepted (ListQuotas/ListProviders later removed as dormant scaffolding)	A read-only coordinator SAN role (bigfleet://readonly) + general-purpose read RPCs — splits the coordinator’s authenticated surface so read RPCs (ListShards/ListDomainAssignments/ListQuotas/ListProviders/ListShardReports) accept bigfleet://readonly OR admin while mutating RPCs stay admin-only; a read-only dashboard/CLI cert then can’t change the fleet (closes the K8s-Dashboard over-privileged-read footgun). Adds ListShardReports (leader-local soft-state snapshot per shard: ShardSummary + top-N Shortfall, carries received_at) and ListProviders. General-purpose, no hot-path dependency; amends ADR-0048; motivated by bigfleet-web-dashboard (author decision)
61	Accepted (amended 2026-06-28: matching-supply cardinality + preemption-summary + same-candidate decision-context fields)	A shard-side read-only needs-inspection RPC — the only surface that can answer “which of a cluster’s needs are satisfied vs unmet, and why”, because the live NeedsTable lives in the shard and the coordinator only holds an aggregated/anonymous/requirements-stripped top-100 shortfall ledger. New readonly-gated (bigfleet://readonly, mirrors ADR-0060) streaming, per-cluster-filtered RPC on a dedicated read-only service on the shard’s gRPC server, returning per-Need last-cycle verdicts (satisfied / residual-deficit vector / claimed counts / Same domain + satisfiability / acquisition-parked / unmet_reason); retained as a trimmed projection behind a build-then-swap RWMutex at the existing recordShortfalls capture point. Static-stability-safe (read of retained shard-local state, no coordinator import). Reason taxonomy is two-tier: SATISFIED + TOPOLOGY_UNSATISFIABLE(Same) are pure retain (no engine change); PRIORITY_STARVED/NO_MATCHING_SUPPLY/PREEMPTION_EXHAUSTED need cheap behaviour-preserving OCC/Phase-2 instrumentation — author built both tiers now. General-purpose (CLI + dashboard consumers); motivated by the bigfleet-web-dashboard needs explorer (author decision)
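
The keying change in the ADR-0058 row can be illustrated with a minimal sketch. It is not BigFleet's implementation: the `Op`, `PerShardFence`, `PerMachineFence`, and `replay` names are hypothetical, and the way the epoch check is tracked (one highest-seen epoch per shard, kept separately from the per-machine sequence marks) is an assumption made so that "a true zombie is still caught on epoch" holds in the sketch. It shows two operations stamped in order by one live shard but arriving out of order on different machines: a per-shard mark rejects the late one, while a per-(shard_id, machine_id) mark admits both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    shard_id: str
    machine_id: str
    epoch: int  # shard incarnation (hypothetical field name)
    seq: int    # monotonic per shard, drawn before the racy send


class PerShardFence:
    """Old keying: one (epoch, seq) high-water mark per shard_id."""

    def __init__(self):
        self._marks = {}

    def admit(self, op):
        stamp = (op.epoch, op.seq)
        if op.shard_id in self._marks and stamp <= self._marks[op.shard_id]:
            return False  # at or below the shard-wide mark: fenced
        self._marks[op.shard_id] = stamp
        return True


class PerMachineFence:
    """New keying: high-water mark per (shard_id, machine_id).

    Assumption: the epoch is still tracked per shard, so an op from an
    older incarnation is rejected even on a machine never seen before.
    """

    def __init__(self):
        self._epochs = {}  # shard_id -> highest epoch seen
        self._marks = {}   # (shard_id, machine_id) -> highest (epoch, seq)

    def admit(self, op):
        if op.epoch < self._epochs.get(op.shard_id, op.epoch):
            return False  # older incarnation: a true zombie
        self._epochs[op.shard_id] = op.epoch
        key = (op.shard_id, op.machine_id)
        stamp = (op.epoch, op.seq)
        if key in self._marks and stamp <= self._marks[key]:
            return False  # replayed or reordered op on the same machine
        self._marks[key] = stamp
        return True


def replay(fence, ops):
    """Feed ops to a fence in arrival order; return the admit decisions."""
    return [fence.admit(op) for op in ops]


# One live shard stamps seq 1 (machine m-a) then seq 2 (machine m-b),
# but the concurrent sends race and seq 2 arrives first.
arrivals = [
    Op("shard-0", "m-b", epoch=1, seq=2),
    Op("shard-0", "m-a", epoch=1, seq=1),
]

if __name__ == "__main__":
    print(replay(PerShardFence(), arrivals))    # [True, False]
    print(replay(PerMachineFence(), arrivals))  # [True, True]
```

In this sketch the per-shard fence treats the late `seq=1` arrival as a zombie even though it targets a different machine, which is the false-zombie behaviour the row describes. The per-machine fence relies on the stated property that the shard serializes operations per machine, so ordering only needs to hold within a single `(shard_id, machine_id)` key.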