| 1 | Accepted | Record architecture decisions |
| 2 | Accepted | Coordinator topology: single region |
| 3 | Superseded | Shard snapshot: eventual consistency on the cycle hot path |
| 4 | Accepted | Incremental reconcile via since-revision |
| 5 | Accepted | Provider boundary is the validation point |
| 6 | Accepted | Shard self-registers via heartbeat |
| 7 | Accepted | Cluster-to-shard binding is operator-chosen at deploy time |
| 8 | Amended by ADR-0048 | Coordinator admin RPCs are leader-only and unauthenticated in v1 |
| 9 | Accepted | Reclaim uses policy/v1 eviction and async drain |
| 10 | Accepted | Minimum Kubernetes version 1.31 |
| 11 | Accepted | Bootstrap template is Helm values text template |
| 12 | Accepted | Helm charts published to GHCR as OCI artefacts |
| 13 | Accepted | Demand-to-inventory regimes and SLOs |
| 14 | Accepted | SLO posture: binding latency, not cycle wall-clock |
| 15 | Accepted | Realistic archetype improvements |
| 16 | Accepted | NodeStateUpdate carries node identity |
| 17 | Accepted | Per-CR binding latency vs fingerprint fanout |
| 18 | Accepted | Internal vs user-facing binding latency |
| 19 | Accepted | Phase 1 cloud vs bench discrepancy |
| 20 | Accepted | Internal binding latency SLO respects rollup interval |
| 21 | Accepted | Persistent execute pool |
| 22 | Accepted | Need.Count semantics — Pod count vs machine count, and where packing lives |
| 23 | Accepted | Real kube-scheduler in the scaletest harness, retire pod-shim’s binding role |
| 24 | Accepted | Co-location via podAffinity — the CoLocation CR field, roll-up aggregates |
| 25 | Accepted | The load-driver anchors sameRack groups — a gang-scheduler stand-in |
| 26 | Accepted | The scaletest harness must model the Speculative tier |
| 27 | Accepted | Roll-up demand is a constrained aggregate resource request, not (per-pod-shape, count) |
| 28 | Accepted | Cycle-p99 SLO is regime-parametric; the realistic catalog scales with Need cardinality |
| 29 | Accepted | Phase 1 Omega-style OCC — shared-state, commit-broker priority, dual-mode commits |
| 30 | Proposed | Incremental Phase 1 — delta-only processing as a layered optimization |
| 31 | Proposed | ParSync-style partitioned synchronization — conditional follow-on for raised per-shard ceilings |
| 32 | Accepted | Realistic catalog production-calibrated workload distribution |
| 33 | Rejected | Phase 1 supply-credit must respect bind readiness, not just provider state — superseded by ADR-0035 |
| 34 | Accepted | Scaletest is bring-your-own-substrate |
| 35 | Accepted | Scaletest SLOs are measured at steady state under churn, not at ramp |
| 36 | Accepted | Phase 3 reclaim must not fire before a cluster’s first rollup has arrived |
| 37 | Accepted | Scaletest catalog node-affinity dimensions must be realistic — drop synthetic team/app label axes |
| 38 | Accepted | Scaletest workloads are controller-managed objects (Deployment / StatefulSet), not bare Pods |
| 39 | Accepted | One CapacityRequest per Pod — not per unschedulable Pod; the demand signal must be total, not unmet |
| 40 | Accepted | Same-domain attribution is unified — every supply-crediting site is domain-aware |
| 41 | Accepted | Sub-machine Same-Needs fold into atomic aggregates — Same is for cross-machine topology |
| 42 | Accepted | Unsatisfiable-regime Same-domain choice is sticky at equal coverage — switch only for strictly greater |
| 42a | Accepted | ADR-0042 Addendum: aged acquisition parking — group identity on the wire, park after 8 unsatisfiable cycles, re-probe every 32 |
| 43 | Accepted | Harness-observed triggers get a demand-realism check before mechanism ships |
| 44 | Accepted | Seed machine pools are sized by machine demand (pod share ÷ packing density, gang-aware per-zone floors), not workload weight |
| 45 | Accepted | Capacity counts for a cluster iff bound — Phase 3 reclaims on demand shrinkage only; BigFleet never models packing (author decision; supersedes its own first draft) |
| 46 | Accepted | Actuation safety rails — per-cluster reclaim blast-radius cap, empty-roll-up quarantine, global kill switch |
| 47 | Accepted | Coordinator quorum formation by ordinal join; offline snapshot restore as single-voter recovery |
| 48 | Accepted | Opt-in file-based mTLS with bigfleet:// URI SAN identity binding — supersedes ADR-0008’s transport posture |
| 49 | Accepted | Idle→Speculative release (paper §8’s other half) — per-CapacityType idle holds inside Phase 3; the hold window is the rail, not a cap |
| 50 | Accepted | Realism catalog (realistic.yaml) calibrated to a realistic MACHINE fleet via per-archetype node-packing density; GPU inference densified (8/node), training whole-machine (1); amends M66.2 + ADR-0044 (author decision) |
| 51 | Accepted | Same-domain choice follows THIS gang’s bindings (gang-granular attribution) — record Need.Group on the binding, break capped-coverage ties on gang-own coverage; refines ADR-0045, fixes M77g (author decision) |
| 52 | Accepted | The shard counts its own in-flight provision commitment against the deficit — credit attributed Creating machines in the coverage walk; amends ADR-0045’s “no in-flight discounting” one state earlier, fixes the #66/#74 pre-Configuring runway over-acquire (author decision) |
| 53 | Deferred | Two-axis machine-state model (provisioned × bound + op annotation) — scouted as an alternative to ADR-0052 and judged worse for the over-acquire (doesn’t fix it; 149-ref blast; raises correctness surface); deferred as a standalone future ergonomics initiative, wire-frozen, post-ladder (author decision) |
| 54 | Accepted | Steady pod-bind SLO reframe under an uncapped real scheduler — release gate moves off the end-to-end pod-bind p99 (uncapped-scheduler / reprovision-bound, not BigFleet’s deliverable) onto BigFleet’s capacity-delivery hops (configure-phase p99, Bootstrap success ratio, node-state-update p99, shortfalls==0) plus a loose end-to-end p50 liveness floor; the end-to-end p99 becomes informational (author decision) |
| 55 | Proposed | Coordinator-driven cross-shard rebalancing (realises bigfleet.md §9: transfer idle → reassign quota → cross-shard preempt) — a leader-only tiered rebalancer + the three stub handlers made real, reusing the M20/M69 drain path; anti-oscillation via cooldown + demand-pull invariant; machine-ids donor-resolved, ownership via shard-local persisted owned-set (author decided to BUILD not remove, 2026-06-19; Proposed pending staged-build greenlight) |
| 56 | Accepted | Coverage credit gated on observed node readiness — Option A (provider-contract obligation): Configure must not report Configured until the node is observed Ready, enforced by a new conformance cluster-join scenario (no shard change); closes the S1 silent false-Configured → phantom-capacity hole that bootstrapSuccessRatio (reported failures) and ADR-0033 (ramp throughput) do not cover (author decision) |
| 57 | Accepted | P0: shard emits NodeStateUpdate on reconcile-observed transitions + resyncs node state on operator (re)connect — notifyNodeState fired only from the worker/applyTransition path, so async (providerkit) providers, which reach terminal Configured via reconcile, were invisible to the operator (workload never schedules); the in-process fake masked it and the assumed reconnect resync was never built. Shard→operator only, static stability preserved (author decision) |
| 58 | Accepted | Shard→provider fencing high-water mark is per (shard_id, machine_id), not per shard_id — a single live shard’s concurrent execute pool draws monotonic sequence numbers but races the sends, so a per-shard mark fenced the shard against its own out-of-order arrivals on different machines (false zombie → ~30/120 machines bricked at execute-concurrency 32). Per-machine keying stays monotonic (shard serializes per machine) while letting concurrent cross-machine ops proceed; a true zombie is still caught on epoch. Dir 3 (serialize stamp+send) refuted (server-side goroutine race). Contract + conformance (B302 broadened) + snapshot-format change; surfaced by bigfleet-demo (author decision) |
| 59 | Accepted | P0: async-provider drain finalizes via reconcile — executeDrain applied the terminal binding-clear (Cluster/Assigned* = "") onto the transitional Draining ack an async (providerkit) provider returns, setting Draining-without-a-cluster and tripping the invariant → every Reclaim/Preempt drain failed, capacity never released. Fix: clear only on terminal Idle (mirroring executeDelete); the async Draining ack is left Draining-with-cluster and finalized via the ADR-0057 reconcile path, which also clears Assigned* on a transition to an unbound state. Fake gains DrainStaged to model it. Shard-local, sync path byte-identical; third bigfleet-demo async gap (author decision) |
| 60 | Accepted (ListQuotas/ListProviders later removed as dormant scaffolding) | A read-only coordinator SAN role (bigfleet://readonly) + general-purpose read RPCs — splits the coordinator’s authenticated surface so read RPCs (ListShards/ListDomainAssignments/ListQuotas/ListProviders/ListShardReports) accept bigfleet://readonly OR admin while mutating RPCs stay admin-only; a read-only dashboard/CLI cert then can’t change the fleet (closes the K8s-Dashboard over-privileged-read footgun). Adds ListShardReports (leader-local soft-state snapshot per shard: ShardSummary + top-N Shortfall, carries received_at) and ListProviders. General-purpose, no hot-path dependency; amends ADR-0048; motivated by bigfleet-web-dashboard (author decision) |
| 61 | Accepted (amended 2026-06-28: matching-supply cardinality + preemption-summary + same-candidate decision-context fields) | A shard-side read-only needs-inspection RPC — the only surface that can answer “which of a cluster’s needs are satisfied vs unmet, and why”, because the live NeedsTable lives in the shard and the coordinator only holds an aggregated/anonymous/requirements-stripped top-100 shortfall ledger. New readonly-gated (bigfleet://readonly, mirrors ADR-0060) streaming, per-cluster-filtered RPC on a dedicated read-only service on the shard’s gRPC server, returning per-Need last-cycle verdicts (satisfied / residual-deficit vector / claimed counts / Same domain + satisfiability / acquisition-parked / unmet_reason); retained as a trimmed projection behind a build-then-swap RWMutex at the existing recordShortfalls capture point. Static-stability-safe (read of retained shard-local state, no coordinator import). Reason taxonomy is two-tier: SATISFIED + TOPOLOGY_UNSATISFIABLE(Same) are pure retain (no engine change); PRIORITY_STARVED/NO_MATCHING_SUPPLY/PREEMPTION_EXHAUSTED need cheap behaviour-preserving OCC/Phase-2 instrumentation — author built both tiers now. General-purpose (CLI + dashboard consumers); motivated by the bigfleet-web-dashboard needs explorer (author decision) |
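
The ADR-0058 row above describes moving the shard→provider fencing high-water mark from per `shard_id` to per `(shard_id, machine_id)`. The following is a minimal sketch of that idea, not the project's implementation: the `Fence` class, the `admit` method, the `(epoch, seq)` stamp layout, and the use of Python are all assumptions made for illustration. It only shows why a per-shard mark can reject a live shard's own out-of-order arrivals while a per-machine mark does not, and how a stale epoch is still rejected.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Tuple


@dataclass
class Fence:
    """Hypothetical provider-side fence; all names here are illustrative."""

    per_machine: bool = True
    # Highest epoch seen per shard (assumed: a restarted shard gets a higher epoch).
    epochs: Dict[str, int] = field(default_factory=dict)
    # High-water mark per key, stored as an (epoch, seq) stamp.
    marks: Dict[Hashable, Tuple[int, int]] = field(default_factory=dict)

    def admit(self, shard_id: str, machine_id: str, epoch: int, seq: int) -> bool:
        # A request from an older epoch is treated as a true zombie.
        if epoch < self.epochs.get(shard_id, epoch):
            return False
        self.epochs[shard_id] = epoch

        # The keying choice is the whole difference between the two schemes.
        key: Hashable = (shard_id, machine_id) if self.per_machine else shard_id
        stamp = (epoch, seq)
        if key in self.marks and stamp <= self.marks[key]:
            return False  # at or below the high-water mark for this key
        self.marks[key] = stamp
        return True


def out_of_order_arrival(fence: Fence) -> Tuple[bool, bool]:
    """Seq 2 (machine m2) reaches the provider before seq 1 (machine m1)."""
    first = fence.admit("shard-a", "m2", epoch=1, seq=2)
    second = fence.admit("shard-a", "m1", epoch=1, seq=1)
    return first, second
```

With `Fence(per_machine=False)`, `out_of_order_arrival` admits the first request and rejects the second, which is the false-zombie case the row describes. With the default per-machine keying both are admitted, because each machine has its own mark and the row states that the shard serializes operations per machine.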