Decision map — every ADR, where it lives in code, and what guards it
This is the implementation map for BigFleet’s architecture decisions. It is a single navigable page
that takes each ADR and answers two questions: where does this decision actually live in the tree,
and which test keeps it honest? It is the maintainer’s companion to the ADRs themselves. Open it when
you are about to change code that an ADR constrains, and it will point you at the files and the
*_test.go files that fail if you get it wrong.
It is not the canonical status table. That is ../adr/index.md — the single
source of truth for each ADR’s status line (Accepted / Proposed / Rejected / Superseded / Amended).
This page copies status lines faithfully but defers to the index; if they ever disagree, the index
wins and this page is stale. Likewise, the prose deep-dives under README.md explain the
why function-by-function; this page is the index into them, organised by decision rather than by
subsystem.
When a code path here disagrees with a higher authority, the higher authority wins and the
divergence is documented (not papered over). The source-of-truth ordering (see
../index.md’s “When the docs disagree”) is:
- The two papers: ../papers/bigfleet.md, ../papers/fleet-scale-kubernetes.md.
- Author decisions in ../adr/.
- ../plan.md.
- The code.
A handful of decisions are spec-only: the ADR was Accepted as a framing or was Proposed/Rejected and never shipped mechanism, or its mechanism was later removed by a superseding ADR. Those rows say spec-only / superseded explicitly rather than inventing a code path. Trust that label — it means the code genuinely is not there.
Decision lineage
Most decisions stand alone. The ones that don’t form chains; reading the chain is the only way to understand why the current code looks the way it does (a row’s “implemented in” often reflects the last link, not the ADR you started from).
- ADR-0003 (shard snapshot eventual consistency): superseded, mechanism removed. The original background-fold goroutine + CycleSnapshot() was reversed at M44.4 Drop A (synchronous Snapshot()) and the fold loop / live triple-indexes were deleted at M66.1. The ADR’s mechanism no longer exists in the tree; pkg/inventory/inventory.go and pkg/shard/shard.go carry supersession comments.
- ADR-0008 transport posture: superseded by ADR-0048. ADR-0008’s leader-only RPC contract stands. Its “v1 ships unauthenticated, wrap it in a sidecar” posture is replaced by ADR-0048’s opt-in file-based mTLS with bigfleet:// URI-SAN identity binding.
- ADR-0013 → ADR-0014 → ADR-0017 → ADR-0018 (the SLO arc). ADR-0013’s three-regime / cycle-p99 release gate was reframed by ADR-0014 (binding latency becomes the gate, cycle wall-clock a tracked envelope); ADR-0014 was built on by ADR-0017 (per-Pod histogram as the gate source) and amended by ADR-0018 (the harness metric is internal-only; renamed internalBindingLatencyP99Seconds). ADR-0013’s named three-regime/convergence-rate scheme was never built, so it is spec-only.
- ADR-0033: Rejected, superseded by ADR-0035. The bind plateau ADR-0033 targeted was a kube-scheduler ramp property, not a BigFleet steady-state bug. No code shipped; ADR-0035 moved the fix to the harness (“measure SLOs at steady state under churn, not at ramp”) and was itself amended 2026-06-14 (reclaim settle-window + bounded floor, accepting the ADR-0021 async-actuation floor).
- ADR-0045 supersedes its own first draft. The withdrawn first draft proposed operator-reported per-machine consumption (rejected as scheduler-shadowing). The accepted rule is the single attribution contract: capacity counts for a cluster iff it is bound to it. M68 (“single attribution”) dissolves into it.
- The domain-attribution arc: ADR-0040 → 0041 → 0042 → 0042-addendum → 0045 → 0051. This is the longest chain and the one most worth reading end-to-end before touching pkg/decision/occ/samebucket.go.
  - 0040 makes every supply-crediting site Same-domain-aware; its Addendum chooses the Same domain once per Need per cycle jointly over creditable + acquirable supply.
  - 0041 folds sub-machine Same-Needs into atomic aggregates (Same is for cross-machine topology only).
  - 0042 makes unsatisfiable-regime domain choice sticky at equal coverage.
  - 0042-addendum engages the named escalation path: aged acquisition parking (group ID on the wire, parkAfterCycles=8, reprobeEveryCycles=32).
  - 0045 then changes the accounting rule underneath all of it (bound-vs-demand), removing the Bootstrap≈Reclaim oscillation class by construction.
  - 0051 refines 0045 to gang granularity (bigfleet.lucy.sh/assigned-group), pinning the domain choice and the claimed set to a fixed point through the bootstrap dwell.

  The deep-dive prose for this arc is domain-attribution.md (companion page; see also phase1-occ.md and needs-table.md).
- ADR-0042-addendum is the parking cautionary tale. It is the single largest pool of unforced engine complexity, built rigorously against a demand shape one catalog archetype fabricated. That is exactly what ADR-0043 (“harness-observed triggers get a demand-realism check before mechanism ships”) exists to catch; ADR-0043 is codified as a working-discipline rule, and ADR-0050 / ADR-0044 are downstream “fix the harness and re-measure” applications of it.
- The Phase-1 engine arc: ADR-0022 → 0027 → 0028 → 0029 (→ 0030, 0031 proposed). Need.Count is pod count (0022) → roll-up demand is a constrained aggregate resource request (0027) → cycle-p99 is regime-parametric (0028) → Phase 1 becomes Omega-style OCC (0029, which supersedes 0028’s OCC-deferral). ADR-0030 (incremental delta-only) and ADR-0031 (ParSync partitioning) are Proposed conditional follow-ons to 0029, spec-only until measurement promotes them.
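
As a reading aid for the domain-attribution arc, the ADR-0042 rule (switch domains only for strictly greater coverage; at equal coverage the incumbent wins before the count and lexicographic tie-breaks) can be sketched as a stateless comparison. This is a minimal sketch, not the code in pkg/decision/occ/samebucket.go: the `domain` type, its field names, and the direction of the machine-count tie-break are assumptions made for illustration.

```go
package main

import "fmt"

// domain is a hypothetical per-cycle summary of one topology domain (for example, a rack).
type domain struct {
	Name       string
	Coverage   int  // units of the gang's demand this domain could cover
	Creditable bool // true when creditable supply is present (the incumbent)
	Machines   int  // input to the count tie-break (direction assumed: more wins)
}

// preferred reports whether a should be chosen over b.
func preferred(a, b domain) bool {
	if a.Coverage != b.Coverage {
		return a.Coverage > b.Coverage // only strictly greater coverage switches domains
	}
	if a.Creditable != b.Creditable {
		return a.Creditable // equal coverage: the incumbent wins
	}
	if a.Machines != b.Machines {
		return a.Machines > b.Machines // assumed count tie-break
	}
	return a.Name < b.Name // lexicographic tie-break keeps the choice deterministic
}

// chooseDomain is stateless: the result depends only on this cycle's candidates.
func chooseDomain(candidates []domain) (domain, bool) {
	if len(candidates) == 0 {
		return domain{}, false
	}
	best := candidates[0]
	for _, d := range candidates[1:] {
		if preferred(d, best) {
			best = d
		}
	}
	return best, true
}

func main() {
	racks := []domain{
		{Name: "rack-a", Coverage: 4, Machines: 9},
		{Name: "rack-b", Coverage: 4, Creditable: true, Machines: 2},
	}
	best, _ := chooseDomain(racks)
	fmt.Println("chosen:", best.Name)
}
```

In the example the two racks tie on coverage, so the incumbent rack is kept even though the other rack has more machines.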
The map
Code and test paths below are reproduced from the grounded digest. Where a decision shipped nothing, the cell reads spec-only (see the lineage above for why). ADR numbers link to the record; implementation paths are exact.
Architecture & topology
| ADR | Title | Status | Decision | Why | Implemented in | Guarded by |
|---|---|---|---|---|---|---|
| 0002 | Coordinator topology — single Raft group, single region (v1) | Accepted | v1 coordinator is one 3-replica Raft group in one region (3 AZs), hashicorp/raft + BoltDB on local disk; its region is a documented SPOF for cross-shard rebalancing only. | Simplest well-trodden shape; static stability makes a regional coordinator outage degraded, not lost, so the SPOF is defensible. | pkg/coordinator/coordinator.go, pkg/coordinator/fsm.go, pkg/coordinator/join.go, go.mod | pkg/shard/no_coordinator_dep_test.go, pkg/coordinator/coordinator_test.go, pkg/coordinator/join_test.go |
| 0006 | Shards self-register via the ReportShard heartbeat | Accepted | ShardReport gains optional shard_address (field 8); first report from an unknown shard Raft-Applies AddShard{ID,Address} synchronously, then the cheap MarkHeartbeat path; ErrShardExists swallowed. | Folding registration into the heartbeat avoids a second RPC, auth surface, and startup retry path. | pkg/coordinator/grpc_server.go, pkg/coordinator/fsm.go, pkg/coordinator/state.go, api/proto/bigfleet/v1alpha1/coordinator.proto | pkg/coordinator/grpc_server_test.go, pkg/coordinator/coordinator_test.go, pkg/coordinator/grpc_server_identity_test.go |
| 0007 | Cluster-to-shard binding is operator-chosen at deploy time | Accepted | The operator’s --shard-addr flag (from the chart’s shardAddress) is the canonical binding; first-contact-wins, re-bind needs a chart upgrade + restart; reconnects dial the same static StatefulSet DNS. | Static addressing keeps the coordinator out of the data-plane dial/reconnect path, preserving static stability. | cmd/operator/main.go, deploy/helm/bigfleet-operator/values.yaml, deploy/helm/bigfleet-operator/templates/deployment.yaml, pkg/operator/operator.go, pkg/operator/stream.go, pkg/shard/session.go | pkg/operator/operator_test.go |
| 0047 | Coordinator quorum by ordinal join; offline snapshot restore | Accepted (M75) | StatefulSet pattern: ordinal 0 honours --bootstrap, ordinals >0 join the leader via leader-only JoinRaftCluster; idempotent re-join; bigfleetctl snapshot save/restore rebuilds a stopped node as a single-voter, others re-form quorum via join. | The “3-replica HA” install actually bootstrapped three independent single-node clusters (AddVoter had zero callers) and DR had no restore tool. | pkg/coordinator/join.go, pkg/coordinator/coordinator.go, pkg/coordinator/grpc_server.go, pkg/coordinator/snapshot_restore.go, pkg/coordinator/snapshot_export.go, cmd/bigfleet/coordinator.go, cmd/bigfleetctl/main.go, deploy/helm/bigfleet/templates/coordinator-statefulset.yaml, deploy/helm/bigfleet/values.yaml | pkg/coordinator/join_test.go, test/integration/raft_quorum_test.go |
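
The ADR-0047 row describes an ordinal-driven startup: ordinal 0 honours --bootstrap and higher ordinals join the leader. A minimal sketch of that decision follows. The pod names, the function names, and the "start" outcome for ordinal 0 without --bootstrap are assumptions for illustration, not the contents of pkg/coordinator/join.go.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ordinalFromPodName extracts the StatefulSet ordinal from a pod name,
// which Kubernetes forms as "<statefulset-name>-<ordinal>".
func ordinalFromPodName(name string) (int, error) {
	i := strings.LastIndex(name, "-")
	if i < 0 || i == len(name)-1 {
		return 0, fmt.Errorf("pod name %q has no ordinal suffix", name)
	}
	n, err := strconv.Atoi(name[i+1:])
	if err != nil {
		return 0, fmt.Errorf("pod name %q has a non-numeric suffix: %w", name, err)
	}
	return n, nil
}

// startupAction returns what a replica does at startup.
// Assumption: ordinal 0 without --bootstrap simply starts from its existing state.
func startupAction(ordinal int, bootstrapFlag bool) string {
	switch {
	case ordinal > 0:
		return "join" // ordinals >0 always join the leader, whatever the flag says
	case bootstrapFlag:
		return "bootstrap"
	default:
		return "start"
	}
}

func main() {
	// Hypothetical pod names for a 3-replica coordinator StatefulSet.
	for _, name := range []string{"bigfleet-coordinator-0", "bigfleet-coordinator-2"} {
		ord, err := ordinalFromPodName(name)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(name, "->", startupAction(ord, true))
	}
}
```

Because the join is described as idempotent, a replica that restarts can take the "join" branch again without a separate first-boot flag.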
Decision engine & cost
| ADR | Title | Status | Decision | Why | Implemented in | Guarded by |
|---|---|---|---|---|---|---|
| 0003 | Shard inventory snapshots eventually consistent on the cycle hot path | Superseded (M44.4 Drop A; fold goroutine + live triple-indexes removed at M66.1) | Originally: cycle read an O(1) CycleSnapshot() from a background debounced fold goroutine. Reversed: cycle now uses synchronous Snapshot(); the fold loop and CycleSnapshot were removed. | Eventual consistency was safe (idempotent actions, stale-reject Apply), but synchronous Snapshot() was simpler and made the fold goroutine redundant. | spec-only — mechanism removed; supersession comments at pkg/inventory/inventory.go, pkg/shard/shard.go; surviving Inventory.Snapshot() is what the cycle now uses | — |
| 0019 | Phase 1 cloud-vs-bench discrepancy — instrument before optimising | Accepted | Add per-sub-path Phase 1 instrumentation before touching pkg/decision/; rewrite the M38 failure injector to ConfiguredCount()×ratePerSec Poisson mean (default 1.16e-7). | Cloud and bench disagreed 6000×; optimising on bench would optimise the wrong code. | pkg/metrics/metrics.go, cmd/bigfleet/shard.go, pkg/provider/fake/fake.go, test/scaletest/chart/values.yaml, test/scaletest/chart/templates/shard.yaml, pkg/decision/phase1_realistic_bench_test.go (note: the Phase 1 sub-path histograms are defined but no longer observed — the allocator they targeted became the OCC broker, ADR-0027/0029) | pkg/decision/phase1_realistic_bench_test.go, pkg/decision/phase1_takecolocated_bench_test.go, pkg/needs/snapshot_bench_test.go |
| 0021 | Persistent execute pool — decouple action execution from the cycle barrier | Accepted | Replace per-cycle dispatch + wg.Wait with a shard-scoped persistent worker pool draining a bounded actionQueue, each action capped by ExecuteTimeout (30 s); cycle enqueues and returns, dropping (and counting) on full queue. | wg.Wait made wall-clock = max(action latency) and cascaded cycle-ctx cancellation into machine state, capping throughput. | pkg/shard/shard.go, pkg/metrics/metrics.go | pkg/shard/shard_test.go, pkg/shard/execute_drain_test.go, pkg/shard/reconcile_test.go |
| 0022 | Need.Count is Pod count, not machine count | Accepted (predecessor of ADR-0027) | Treat Need.Count as Pod count; Phase 1/3 compute machines by diffing aggregate demand (Profile.Resources×Count) against Σ Machine.Allocatable in resource-vector space, taking the bottleneck dimension. | The impl had drifted to one Bootstrap per Pod, over-provisioning by the density factor when Pods shared a Profile. | pkg/decision/phase1_assign.go, pkg/needs/needs.go, pkg/decision/phase3_reclaim.go (final code is the ADR-0027 form — PodsPerMachine/densityFor removed) | pkg/decision/phase1_same_test.go |
| 0027 | Roll-up demand is a constrained aggregate resource request | Accepted | CapacityNeed redefined: aggregate_resources replaces per-pod resources, min_unit is the atomic schedulable unit, count removed; Phase 1 supply is Σ Machine.Allocatable counted once per machine (no density projection). | Per-fingerprint dedicated-density accounting over-credited phantom capacity when fingerprints shared physical eligibility, masking real deficits (shortfalls=0 while pods stuck). | api/proto/bigfleet/v1alpha1/capacity.proto, pkg/proto/bigfleet/v1alpha1/capacity.pb.go, pkg/decision/phase1_assign.go, pkg/decision/occ/cycle.go, pkg/decision/occ/seed.go, pkg/decision/occ/state.go, pkg/decision/match.go | pkg/decision/phase1_test.go, pkg/decision/phase1_realistic_test.go, pkg/decision/phase1_same_test.go, pkg/decision/phase1_spread_test.go, pkg/decision/phase3_test.go, pkg/decision/integration_test.go, pkg/decision/occ/cycle_test.go |
| 0028 | Cycle-p99 SLO is regime-parametric; realistic catalog scales with Need cardinality | Accepted (OCC-deferral superseded by ADR-0029) | The 100 ms cycle-p99 bar applies only to the aggregated regime; the realistic regime is graded on a per-Need Phase 1 p99 bar (≤200 µs, later demoted to aspirational) plus relaxed envelopes scaling with Need cardinality. | Each sameRack group becomes its own Need, so Phase 1 wall-clock scales with Need cardinality; the absolute cycle bar grades the workload, not BigFleet. | spec-only (SLO framing; no constants in tree) | pkg/decision/phase1_uber5k_bench_test.go, pkg/decision/phase1_realistic_bench_test.go, pkg/decision/phase1_takecolocated_bench_test.go |
| 0029 | Phase 1 Omega-style OCC — shared-state, commit-broker priority, dual-mode commits | Accepted | Phase 1 redesigned as Omega-style OCC: shared immutable snapshot, single unordered Need queue served by a worker pool, single mutex-guarded commit broker doing per-bucket seqno CAS; priority enforced reactively at the broker (displacement + re-queue); ModeIncremental/ModeAllOrNothing; bounded retries → shortfall. | Constant-factor optimisation of the single-threaded sorted loop can’t reach ADR-0028’s envelope; the only lever is iteration-count reduction via intra-shard concurrency. | pkg/decision/occ/cycle.go, pkg/decision/occ/broker.go, pkg/decision/occ/state.go, pkg/decision/occ/types.go, pkg/decision/occ/seed.go, pkg/decision/occ/candidates.go, pkg/decision/occ/samebucket.go, pkg/decision/occ/samesupply.go, pkg/decision/occ/poolcache.go, pkg/decision/occ/match.go, pkg/decision/phase1_assign.go | pkg/decision/occ/broker_test.go, pkg/decision/occ/cycle_test.go, pkg/decision/occ/displacement_test.go, pkg/decision/occ/state_test.go, pkg/decision/occ/candidates_test.go, pkg/decision/occ/samebucket_test.go, pkg/decision/occ/incumbency_repro_test.go, pkg/decision/occ/samesupply_bench_test.go |
| 0030 | Incremental Phase 1 — delta-only processing | Proposed (conditional follow-on to 0029) | Layer a delta-only fast path over OCC: per-Need digest + inventory transition events detect changed Needs/machines; only the delta is OCC-processed, drift caught by digest check + periodic full re-sync. | Single-pass OCC is still O(NeedsTable); in steady state most Needs don’t change, so re-processing them is wasted work. | spec-only (the unrelated --incremental-reconcile flag is ADR-0004’s since_revision cursor, not this) | — |
| 0031 | ParSync-style partitioned synchronization | Proposed (conditional follow-on to 0029) | Record (don’t build) a ParSync design partitioning the OCC claimed-set into P partitions, refreshing one per worker per cycle; promotion gated on measured conflict-rate ≥0.3 or a >500K ceiling raise. | Above 500K BigFleet scales by adding shards, not bigger ones, so the contention win is zero until a ceiling raise exists — YAGNI. | spec-only | — |
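
The ADR-0021 row describes a persistent worker pool draining a bounded queue, where the cycle enqueues without blocking and counts drops when the queue is full. The sketch below illustrates that shape under stated assumptions: `executePool`, `enqueue`, and `shutdown` are hypothetical names, and only the 30 s per-action timeout comes from the table.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

type action func(ctx context.Context)

// executePool is a hypothetical shard-scoped pool; it outlives any single cycle.
type executePool struct {
	dropped int64 // actions rejected because the queue was full
	queue   chan action
	timeout time.Duration
	wg      sync.WaitGroup
}

func newExecutePool(workers, queueSize int, timeout time.Duration) *executePool {
	p := &executePool{queue: make(chan action, queueSize), timeout: timeout}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go p.worker()
	}
	return p
}

func (p *executePool) worker() {
	defer p.wg.Done()
	for a := range p.queue {
		// Each action gets its own deadline, independent of any cycle context.
		ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
		a(ctx)
		cancel()
	}
}

// enqueue never blocks the cycle: a full queue drops the action and counts it.
func (p *executePool) enqueue(a action) bool {
	select {
	case p.queue <- a:
		return true
	default:
		atomic.AddInt64(&p.dropped, 1)
		return false
	}
}

func (p *executePool) shutdown() {
	close(p.queue)
	p.wg.Wait()
}

// demoDrops fills a 2-slot queue that has no workers, so the third enqueue is dropped.
func demoDrops() (accepted, dropped int) {
	p := newExecutePool(0, 2, 30*time.Second)
	for i := 0; i < 3; i++ {
		if p.enqueue(func(context.Context) {}) {
			accepted++
		}
	}
	dropped = int(atomic.LoadInt64(&p.dropped))
	p.shutdown()
	return accepted, dropped
}

// demoRun enqueues five actions on a pool with workers and waits for them to drain.
func demoRun() int {
	var ran int64
	p := newExecutePool(2, 8, 30*time.Second)
	for i := 0; i < 5; i++ {
		p.enqueue(func(context.Context) { atomic.AddInt64(&ran, 1) })
	}
	p.shutdown()
	return int(atomic.LoadInt64(&ran))
}

func main() {
	a, d := demoDrops()
	fmt.Println("accepted:", a, "dropped:", d)
	fmt.Println("ran:", demoRun())
}
```

The non-blocking `select` is the point of the design: the cycle's wall-clock no longer depends on the slowest action, at the cost of having to count and surface dropped actions.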
Capacity model & attribution
| ADR | Title | Status | Decision | Why | Implemented in | Guarded by |
|---|---|---|---|---|---|---|
| 0036 | Phase 3 reclaim must not fire before a cluster’s first rollup | Accepted | Shard tracks firstRollupReceived[ClusterID] (set true on first RollupReport, even an empty one); Phase 3 early-returns for any cluster whose flag is false. | An empty NeedsTable at startup is indistinguishable from “no demand”; without the gate Phase 3 would reclaim the whole fleet before operators reconnect (static-stability violation). | pkg/shard/shard.go, pkg/decision/phase3_reclaim.go | pkg/decision/phase3_test.go, pkg/shard/safety_test.go |
| 0040 | Same-domain attribution is unified — every supply-crediting site is domain-aware | Accepted | Every supply-crediting site becomes domain-aware for Same-Profiles, mirroring FindSame’s single-best-bucket rule; Addendum: the Same domain is chosen once per Need per cycle, jointly over creditable + acquirable supply, with Phase 3 mirroring identical scoring. | Crediting was vacuous (across domains) while acquisition was strict — Phase 1 chased un-finishable gangs, Phase 3 reclaimed the over-provision: a self-sustaining Bootstrap≈Reclaim equilibrium. | pkg/decision/occ/seed.go, pkg/decision/occ/samebucket.go, pkg/decision/occ/cycle.go, pkg/decision/phase3_reclaim.go, pkg/decision/phase1_assign.go, pkg/decision/occ/types.go | pkg/decision/samebucket_test.go, pkg/decision/occ/samebucket_test.go, pkg/decision/occ/candidates_test.go, pkg/decision/integration_test.go |
| 0041 | Sub-machine Same-Needs fold into atomic aggregates | Accepted | NormalizeDemand: a Same-Need that fits one matching machine folds into one plain Need (min_unit = one gang’s aggregate); Needs that fit no machine keep their per-gang Same Need. Riders: Phase 3 acquirable fold; ChooseSameBucket prefers creditable in the satisfiable regime. | ADR-0024+0039 reshaped demand into ~2,400 sub-machine gang Needs each up-rounding to a whole machine; a gang that fits one machine needs no Same machinery. | pkg/decision/normalize.go, pkg/decision/occ/seed.go, pkg/decision/occ/samebucket.go, pkg/decision/phase3_reclaim.go | pkg/decision/normalize_test.go, sim/closedloop_test.go |
| 0042 | Unsatisfiable-regime domain choice is sticky at equal coverage | Accepted | In ChooseSameBucket’s unsatisfiable regime, switch domains only for strictly greater coverage; at equal coverage the incumbent domain (creditable supply present) wins before count/lexicographic tie-breaks. Stateless. | Multi-machine GPU gangs no rack can host re-derived the joint domain from scratch each cycle; identical-total racks tied constantly, so claim-walk perturbations flipped the tie and drove ~27/sec Bootstrap↔Reclaim churn. | pkg/decision/occ/samebucket.go | sim/m61_repro_test.go |
| 0042-addendum | Aged acquisition parking — the escalation path, engaged | Accepted (extends 0042 after PARTIAL cloud validation) | (1) group ID on the wire (CapacityNeed.group, field 9); (2) aged acquisition parking — at parkAfterCycles=8 a persistently-unsatisfied class goes creditable-only; (3) re-probe every reprobeEveryCycles=32. Per-class age ledger on the shard only, no coordinator. | 0042’s exact-tie pinning was too narrow: per-domain acquirable totals shift slightly each cycle so coverage is rarely exactly equal and the strictly-greater branch keeps firing on marginal deltas. This is the parking cautionary tale flagged by ADR-0043. | api/proto/bigfleet/v1alpha1/capacity.proto, pkg/proto/bigfleet/v1alpha1/capacity.pb.go, pkg/needs/needs.go, pkg/shard/shard.go, pkg/decision/occ/cycle.go, pkg/decision/occ/seed.go, pkg/decision/occ/types.go, pkg/decision/phase1_assign.go, pkg/decision/phase2_inversions.go, pkg/decision/phase3_reclaim.go | pkg/shard/parking_test.go, pkg/decision/phase2_test.go, sim/m61_repro_test.go |
| 0045 | Capacity counts for a cluster iff it is bound — BigFleet never models packing | Accepted (supersedes its own first draft; M68 dissolves in) | One rule: capacity counts iff bound (Configure is atomic fulfillment; the machine state machine is the only ledger). Phase 1 fulfills demand−bound; Phase 3 reclaim is triggered by demand shrinkage only; satisfied-but-stuck is the cluster’s problem. Per-machine consumed vectors / residual-fit / bound-open splits rejected by name. | Any arithmetic anticipating whether the cluster’s scheduler can use bound capacity shadows the scheduler (“not a scheduler” hard rule); the bound-vs-demand contract removes the Bootstrap≈Reclaim class by construction. | pkg/decision/phase3_reclaim.go, pkg/decision/phase1_assign.go | sim/m67_repro_test.go, pkg/decision/phase3_test.go, pkg/decision/integration_test.go, sim/m73_release_test.go |
| 0051 | Same-domain choice follows this gang’s bindings (gang-granular attribution) + M77h machine-selection | Accepted (refines 0045, does not reverse it; M77g + M77h) | Record the serving gang on each binding via additive bigfleet.lucy.sh/assigned-group (from Need.Group at Configure-time; machine gains AssignedGroup). ChooseSameBucket breaks capped-coverage ties on the gang’s OWN creditable coverage; M77h: incumbentFirst stably partitions a gang’s incumbents ahead of non-incumbents under stop-when-covered. | Cluster-granular coverage cannot tell “this domain holds my gang’s machines” from “an equal number of unrelated machines”; under ADR-0050’s bootstrap dwell the tie fell through to moving acquirable slack, causing a sustained domain-flap lockstep. | pkg/machine/shardmetadata.go, pkg/machine/machine.go, pkg/decision/occ/samebucket.go, pkg/decision/occ/seed.go, pkg/decision/action.go, pkg/provider/fake/fake.go | pkg/decision/occ/samebucket_test.go, pkg/decision/occ/incumbency_repro_test.go, pkg/machine/shardmetadata_test.go, sim/incumbency_repro_test.go, sim/gang_dwell_test.go, test/conformance/metadata_test.go |
The full prose walkthrough of this group is domain-attribution.md (companion deep-dive), with supporting detail in phase1-occ.md and needs-table.md.
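
The ADR-0036 gate in the table above is small enough to sketch directly: a per-cluster flag set on the first roll-up (even an empty one), and an early return in reclaim while the flag is false. The `shard` type, the method names, and the `surplus` slice are assumptions for illustration; only the `firstRollupReceived` name comes from the table.

```go
package main

import "fmt"

type clusterID string

// shard is a hypothetical stand-in holding only the state the gate needs.
type shard struct {
	firstRollupReceived map[clusterID]bool
}

func newShard() *shard {
	return &shard{firstRollupReceived: map[clusterID]bool{}}
}

// onRollupReport marks the cluster as heard-from, even when needs is empty.
func (s *shard) onRollupReport(c clusterID, needs []string) {
	s.firstRollupReceived[c] = true
}

// reclaimable returns the machines reclaim may act on for cluster c.
// Before the first roll-up, "no demand" cannot be told apart from
// "not reconnected yet", so nothing is reclaimable.
func (s *shard) reclaimable(c clusterID, surplus []string) []string {
	if !s.firstRollupReceived[c] {
		return nil
	}
	return surplus
}

func main() {
	s := newShard()
	surplus := []string{"m-1", "m-2"}
	fmt.Println("before first rollup:", len(s.reclaimable("cluster-a", surplus)))
	s.onRollupReport("cluster-a", nil) // an empty report still opens the gate
	fmt.Println("after first rollup:", len(s.reclaimable("cluster-a", surplus)))
}
```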
Provider boundary
| ADR | Title | Status | Decision | Why | Implemented in | Guarded by |
|---|---|---|---|---|---|---|
| 0004 | Incremental reconcile via since_revision — opt-in, deltas only | Accepted | Config.IncrementalReconcile (default false). Off = always-correct full List() + removal walk. On = pass reconcileCursor as ListFilter.SinceRevision, apply deltas, advance cursor, skip removal walk. Cursor is process-state; tombstones deferred. | Unfiltered List dominated the cycle (~87% at 500K); cursor deltas cut shard cycle p99 ~81% while the opt-in flag keeps the safe full-list default. | pkg/shard/reconcile.go, pkg/shard/shard.go, pkg/provider/fake/fake.go, api/proto/bigfleet/v1alpha1/provider.proto, pkg/provider/provider.go | test/conformance/conformance_test.go, pkg/shard/cycle_phasedump_test.go |
| 0005 | The provider boundary is the validation point; reconcile trusts domain types | Accepted (amended by ADR-0046 Addendum / M70) | reconcile applies provider machines directly to inventory without the MachineToProto+MachineFromProto round-trip; validation sits at each provider boundary (pkg/conv) and inventory.Apply is the apply-path net. (M70 re-added cost-field validation via validateProviderMachine on the slow path.) | The per-reconcile round-trip re-validated the same enum twice and dominated post-burst cycles; moving validation to the boundary dropped cycle mean ~24% at 500K. | pkg/shard/reconcile.go, pkg/conv/conv.go, pkg/provider/grpcadapter/grpcadapter.go, pkg/provider/fake/fake.go, pkg/machine/machine.go | pkg/shard/reconcile_test.go, pkg/machine/machine_test.go, pkg/conv/conv_test.go |
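
The two reconcile modes in the ADR-0004 row differ in one visible way: the full path walks removals, the incremental path does not. The sketch below shows that difference with an in-memory provider. All types here (`machine`, `listFilter`, `fakeProvider`, `reconciler`) are simplified assumptions, not the definitions in pkg/provider or pkg/shard.

```go
package main

import "fmt"

type machine struct {
	ID       string
	Revision int64 // assumed to be positive and to increase on every change
}

type listFilter struct{ SinceRevision int64 }

// fakeProvider returns machines whose revision is newer than the filter cursor.
type fakeProvider struct{ machines []machine }

func (p *fakeProvider) List(f listFilter) []machine {
	var out []machine
	for _, m := range p.machines {
		if m.Revision > f.SinceRevision {
			out = append(out, m)
		}
	}
	return out
}

type reconciler struct {
	incremental bool
	cursor      int64 // process state only; lost on restart
	inventory   map[string]machine
}

func (r *reconciler) reconcile(p *fakeProvider) {
	if !r.incremental {
		// Full path: list everything, then remove anything the provider no longer reports.
		seen := map[string]bool{}
		for _, m := range p.List(listFilter{}) {
			r.inventory[m.ID] = m
			seen[m.ID] = true
		}
		for id := range r.inventory {
			if !seen[id] {
				delete(r.inventory, id)
			}
		}
		return
	}
	// Incremental path: apply deltas, advance the cursor, skip the removal walk.
	for _, m := range p.List(listFilter{SinceRevision: r.cursor}) {
		r.inventory[m.ID] = m
		if m.Revision > r.cursor {
			r.cursor = m.Revision
		}
	}
}

func main() {
	p := &fakeProvider{machines: []machine{{"m-1", 1}, {"m-2", 2}}}
	r := &reconciler{incremental: true, inventory: map[string]machine{}}
	r.reconcile(p)
	p.machines = p.machines[1:] // m-1 disappears from the provider
	r.reconcile(p)
	fmt.Println("incremental inventory size:", len(r.inventory))
	r.incremental = false
	r.reconcile(p)
	fmt.Println("full inventory size:", len(r.inventory))
}
```

The stale entry that survives the incremental pass is the cost of deferring tombstones, which is why the table describes the full list as the always-correct default.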
Operator & CRDs
| ADR | Title | Status | Decision | Why | Implemented in | Guarded by |
|---|---|---|---|---|---|---|
| 0009 | ReclaimInstruction uses policy/v1 Eviction and acks before drain completes | Accepted | Operator cordons each node synchronously, patches UpcomingNode to Draining, sends ReclaimAck (started semantics), then drains async: skip DaemonSet pods, post policy/v1 Eviction, retry 429/PDB with 2 s backoff bounded by grace_period_seconds, walk to Drained/Failed. | policy/v1 Eviction makes the apiserver enforce PDBs; ack-on-cordon is the honest static-stability post-condition — a multi-minute drain must not hold the session recv-loop hostage. | pkg/operator/reclaim.go | pkg/operator/reclaim_internal_test.go |
| 0010 | Minimum Kubernetes version 1.31 | Accepted | All three charts declare kubeVersion: ">= 1.31.0-0"; no back-compat shim for rendering the CRD without selectableFields. | The CapacityRequest CRD’s selectableFields (powering kubectl --field-selector=status.phase) only went GA in 1.31. | deploy/helm/bigfleet/Chart.yaml, deploy/helm/bigfleet-operator/Chart.yaml, deploy/helm/bigfleet-unschedulable-pod-controller/Chart.yaml, api/crd/bigfleet.lucy.sh_capacityrequests.yaml | (enforced at helm-install / helm template --kube-version; no Go test) |
| 0011 | BootstrapTemplate is a helm-values text/template, not a CRD or webhook | Accepted | Configured via a bootstrapTemplate values block rendered into a ConfigMap, mounted at /etc/bigfleet/bootstrap.tmpl, parsed at startup; Go callback retained for embedders (callback wins). No CRD, webhook, or Sprig — stdlib text/template only. | File-mounted parse-once template keeps the BootstrapRequest hot path free of any runtime apiserver/webhook coupling. | pkg/operator/bootstrap_template.go, pkg/operator/bootstrap.go, pkg/operator/operator.go, cmd/operator/main.go, deploy/helm/bigfleet-operator/values.yaml, deploy/helm/bigfleet-operator/templates/deployment.yaml | pkg/operator/bootstrap_template_test.go |
| 0012 | Helm charts published to GHCR as OCI artefacts on every push to main | Accepted | New charts.yml mirrors images.yml: on push, helm package + helm push to oci://ghcr.io/<owner>/charts/<chart> tagged with the Chart version (immutable, no floating latest); on PR, helm lint/package/template --kube-version=1.31.0. | OCI-via-GHCR piggy-backs on the existing image-publishing auth/flow, letting users install without cloning. | .github/workflows/charts.yml | (CI workflow; no Go test) |
| 0016 | NodeStateUpdate carries node identity (labels, resources, taints) | Accepted | NodeStateUpdate gains labels (9), resources (10), taints (11, new Taint message); shard populates from the machine Profile on every emit, operator copies into UpcomingNode.Spec.{Labels,Resources,Taints}. | Any controller pre-allocating against an upcoming node needs its shape before kubelet joins; the shard already holds it. (Taints plumbed but not exercised by a synthetic emitter — labels+resources only.) | api/proto/bigfleet/v1alpha1/shard.proto, pkg/shard/shard.go, pkg/operator/upcoming.go | pkg/apis/bigfleet/v1alpha1/roundtrip_test.go |
| 0024 | Co-location via podAffinity — the CoLocation CR field, roll-up aggregates | Accepted (builds on ADR-0022) | Derive co-location from required podAffinity, carried as a structured CoLocationTerm {LabelSelector, TopologyKey}; UPC translates podAffinity→CoLocation, operator derives aggregation group + Same key at roll-up, retiring CoLocationKey. Companion: 256 MiB gRPC ceiling in pkg/grpcutil. | The old owner-UID key put every pod in its own group so the roll-up never aggregated (O(unschedulable-pods)); podAffinity is the native, zero-user-change signal. | pkg/apis/bigfleet/v1alpha1/capacityrequest_types.go, pkg/apis/bigfleet/v1alpha1/zz_generated.deepcopy.go, pkg/controller/cr/controller.go, pkg/operator/rollup.go, pkg/grpcutil/grpcutil.go | pkg/controller/cr/controller_test.go, pkg/operator/rollup_topology_test.go, pkg/apis/bigfleet/v1alpha1/roundtrip_test.go, pkg/operator/rollup_colocated_bench_test.go |
| 0039 | One CapacityRequest per Pod — not per unschedulable Pod | Accepted | The reference UPC creates a CR for every Pod, not only reason=Unschedulable, honouring Fleet-Scale Kubernetes §6.1; CR stays owner-referenced and GC’s on deletion. | ~84% of bound Pods carried no CR (pre-bind fast-path + ADR-0038 recreated Pods bypass Unschedulable), undercounting demand ~6× and giving Phase 3 a phantom surplus. | pkg/controller/cr/controller.go | pkg/controller/cr/controller_test.go |
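
ADR-0011's parse-once template with a callback override can be illustrated with the standard library alone. This is a sketch under assumptions: the `bootstrapData` fields, the `renderer` type, and the template text are hypothetical, and the real template is read from the mounted file rather than from a string literal.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// bootstrapData is a hypothetical set of values exposed to the template.
type bootstrapData struct {
	ClusterID string
	NodeName  string
}

// renderer holds a template parsed once at startup and an optional Go callback.
type renderer struct {
	tmpl     *template.Template
	callback func(bootstrapData) (string, error) // for embedders; wins when set
}

// newRenderer parses the template source once; a malformed template fails startup.
func newRenderer(src string) (*renderer, error) {
	t, err := template.New("bootstrap").Parse(src)
	if err != nil {
		return nil, fmt.Errorf("parse bootstrap template: %w", err)
	}
	return &renderer{tmpl: t}, nil
}

// render runs on the request path and touches no apiserver or webhook.
func (r *renderer) render(d bootstrapData) (string, error) {
	if r.callback != nil {
		return r.callback(d)
	}
	var buf bytes.Buffer
	if err := r.tmpl.Execute(&buf, d); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	r, err := newRenderer("join --cluster={{.ClusterID}} --node={{.NodeName}}")
	if err != nil {
		panic(err)
	}
	out, err := r.render(bootstrapData{ClusterID: "cluster-a", NodeName: "node-1"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```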
Security & fencing
| ADR | Title | Status | Decision | Why | Implemented in | Guarded by |
|---|---|---|---|---|---|---|
| 0008 | Coordinator admin RPCs — leader-only, unauthenticated in v1, sidecar for external | Accepted; transport/authn posture superseded by ADR-0048 (leader-only contract stands) | All admin RPCs are leader-only (followers reject FailedPrecondition); reads go through the leader’s State RLock. v1 ships unauthenticated (NetworkPolicy / external sidecar); bigfleetctl is the canonical insecure-by-default client; SetQuota deferred. | Leader-only avoids stale-read footguns + client-side leader-cache logic; shipping no in-tree authn avoids picking an identity winner. | pkg/coordinator/grpc_server.go, cmd/bigfleetctl/main.go | pkg/coordinator/grpc_server_test.go |
| 0048 | Opt-in file-based mTLS with bigfleet:// URI SAN identity binding | Accepted (M74) (supersedes ADR-0008 transport posture) | Symmetric --tls-cert/--tls-key/--tls-ca on every server+client (once in pkg/grpcutil): all three = mTLS (TLS 1.3, mutual verify), none = plaintext, partial = startup error; hot-reload certs. Identity is exactly one bigfleet:// URI SAN per cert; shard Session binds SAN to Hello.cluster_id, admin surface requires bigfleet://admin, mismatch → PermissionDenied. | Every surface was plaintext and the shard trusted the client-asserted Hello.cluster_id, so any reachable client could impersonate any cluster; identity binding demotes ADR-0046’s roll-up guard to defence-in-depth. | pkg/grpcutil/tls.go, pkg/grpcutil/grpcutil.go, pkg/shard/session.go, pkg/coordinator/grpc_server.go, pkg/coordinator/join.go, pkg/shard/coordclient/coordclient.go, pkg/provider/grpcclient/grpcclient.go, pkg/operator/operator.go, cmd/bigfleet/shard.go, cmd/bigfleet/coordinator.go, cmd/operator/main.go, cmd/bigfleetctl/main.go | pkg/grpcutil/tls_test.go, pkg/shard/session_identity_test.go, pkg/coordinator/grpc_server_identity_test.go, pkg/grpcutil/tlstest/tlstest.go |
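
Two rules from the ADR-0048 row lend themselves to a short sketch: the all-or-nothing flag triad, and the single bigfleet:// URI SAN that is then bound to the claimed cluster ID. The code below is illustrative only. The function names are hypothetical, SANs are passed as strings rather than read from x509.Certificate.URIs, and treating the URI host as the identity and ignoring non-bigfleet URIs are both assumptions.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveTLS maps the three flags to a mode: all set = mTLS, none = plaintext.
func resolveTLS(cert, key, ca string) (mtls bool, err error) {
	set := 0
	for _, v := range []string{cert, key, ca} {
		if v != "" {
			set++
		}
	}
	switch set {
	case 0:
		return false, nil
	case 3:
		return true, nil
	default:
		return false, fmt.Errorf("partial TLS config: %d of 3 flags set", set)
	}
}

// identityFromSANs requires exactly one bigfleet:// URI among the SANs.
func identityFromSANs(sans []string) (string, error) {
	var ids []string
	for _, raw := range sans {
		u, err := url.Parse(raw)
		if err != nil {
			return "", fmt.Errorf("bad URI SAN %q: %w", raw, err)
		}
		if u.Scheme == "bigfleet" && u.Host != "" {
			ids = append(ids, u.Host)
		}
	}
	if len(ids) != 1 {
		return "", fmt.Errorf("want exactly one bigfleet:// URI SAN, got %d", len(ids))
	}
	return ids[0], nil
}

// authorizeSession binds the certificate identity to the claimed cluster ID.
// A real gRPC server would return a PermissionDenied status on mismatch.
func authorizeSession(identity, helloClusterID string) error {
	if identity != helloClusterID {
		return fmt.Errorf("permission denied: cert identity %q != hello cluster_id %q", identity, helloClusterID)
	}
	return nil
}

func main() {
	mtls, err := resolveTLS("tls.crt", "tls.key", "ca.crt")
	fmt.Println("mtls:", mtls, "err:", err)

	id, err := identityFromSANs([]string{"bigfleet://cluster-a"})
	fmt.Println("identity:", id, "err:", err)
	fmt.Println("impersonation attempt:", authorizeSession(id, "cluster-b"))
}
```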
Scale-test methodology & SLOs
| ADR | Title | Status | Decision | Why | Implemented in | Guarded by |
|---|---|---|---|---|---|---|
| 0013 | Demand-to-inventory regimes and SLOs | Accepted (cycle-p99 gate superseded by ADR-0014; three-regime scheme not built) | Promised three regimes with distinct SLOs: steady-state (≤2%, p99 ≤50 ms), burst (≤10%, p99 ≤100 ms, the gate), reprovisioning (≤100%, convergence ≥5,000 bindings/cycle). | Real fleets live in the burst regime; full-fleet reprovisioning is a backlog-drain deserving a throughput contract, not a per-cycle SLO. | spec-only (the named three-regime/convergence-rate scheme has no code; ADR-0014 reframed it) | — |
| 0014 | SLO posture — binding latency is the gate, cycle wall-clock is a tracked metric | Accepted (amended by ADR-0018; built on by ADR-0017) | bindingLatencyP99 (CR creation → Configured) becomes the user-facing gate with per-tier targets; shardCycleDurationP99 ≤ rollupInterval/2 becomes a tracked envelope, not a gate. | No comparable system gates a release on a sub-100 ms rebalance loop; users feel binding latency. | test/scaletest/cmd/scaletest-runner/main.go, test/scaletest/cmd/pod-shim/main.go | (harness wiring; exercised end-to-end by scaletest runs) |
| 0015 | Realistic archetype improvements (multiplicity, bimodal lifetimes, bursts, Same-rack, size skew) | Accepted | Five harness extensions: fingerprint multiplicity (sizeBuckets), bimodal CR lifetimes (meanLifetimeSeconds), concentrated burst actions, Same-rack co-location (sameRack/groupSizeRange), heavy-tailed clusterSizeDistribution. | The M31 single-shape catalog was “less honest than it claims”; conclusions drawn against it may not generalise to production demand. | pkg/scaletest/archetype/archetype.go, pkg/scaletest/archetype/sizing.go, test/scaletest/cmd/load-driver/main.go, cmd/bigfleet/shard.go, test/scaletest/cmd/scaletest-runner/main.go, test/scaletest/profiles/archetypes/realistic.yaml | pkg/scaletest/archetype/archetype_test.go, pkg/scaletest/archetype/sizing_test.go, pkg/scaletest/archetype/realistic_mix_test.go, test/scaletest/cmd/load-driver/main_test.go |
| 0017 | Per-CR binding latency is the user-facing metric; fingerprint fan-out is its own thing | Accepted (builds on 0014; renamed/scoped by 0018) | Add a per-Pod bigfleet_scaletest_pod_bind_latency_seconds histogram (in pod-shim) as the gate source; recast the legacy histogram as a fan-out diagnostic. Addenda: stop falling back to the legacy histogram; make Pod-mode the default. | The legacy histogram measured per-(cluster,fingerprint) fan-out, ramped to the top bucket on a 50-cluster run, and was a gameable gate. | test/scaletest/cmd/pod-shim/main.go, test/scaletest/cmd/scaletest-runner/main.go, test/scaletest/cmd/load-driver/main.go | (exercised through scaletest harness runs) |
| 0018 | “binding latency” in the harness is internal-only; the user-facing number lives elsewhere | Accepted (amends 0014; preserves 0017’s gate) | Rename bindingLatencyP99Seconds → internalBindingLatencyP99Seconds; reframe 0014’s tier targets as internal-only floors (fake provider returns instantly); real-provider validation moves to conformance / out-of-tree scaletests / production canaries. | The in-process fake contributes zero latency, so the harness metric measures only BigFleet’s internal contribution; calling it “what users feel” overstated it. | test/scaletest/cmd/scaletest-runner/main.go, test/scaletest/cmd/pod-shim/main.go, test/scaletest/profiles/uber-5k.yaml, test/scaletest/profiles/500k.yaml, test/scaletest/profiles/dev-50.yaml | (harness-config change; not unit-tested) |
| 0020 | Internal binding-latency SLO must respect the rollup interval | Accepted | Set the harness internal-binding-latency SLO to 15 s (~10 s rollup + ~5 s headroom) rather than lowering the operator’s 10 s rollupInterval; cloud profiles carry an explicit override. | The 10 s rollupInterval is a hard p99 floor a 5 s SLO can never clear; 10 s rollup is the right production posture. | test/scaletest/cmd/scaletest-runner/main.go, pkg/operator/operator.go, test/scaletest/profiles/5k.yaml, test/scaletest/profiles/uber-50k.yaml, test/scaletest/profiles/uber-1m.yaml, test/scaletest/profiles/uber-5m.yaml | pkg/operator/operator_test.go |
| 0023 | Real kube-scheduler in the harness, retire pod-shim’s binding role | Accepted | Replace pod-shim’s binder with a real kube-scheduler (MostAllocated to preserve ADR-0022 density) per kwok apiserver; keep only UpcomingNode→fake-Node as a node-creator binary; gate behind harness.scheduler. Harness-only. | Pod-shim’s custom binder (102 s p99) had become the dominant variable in the published numbers — measuring the harness, not BigFleet. | test/scaletest/cmd/node-creator/main.go, test/scaletest/image/entrypoint-apiserver.sh, test/scaletest/image/entrypoint-workload.sh, test/scaletest/chart/values.yaml, test/scaletest/chart/templates/kwok-clusters.yaml, test/scaletest/profiles/dev-50.yaml, test/scaletest/profiles/dev-500.yaml, test/scaletest/profiles/uber-500k.yaml | (harness infra; validated via kind/cloud runs) |
| 0025 | The load-driver anchors sameRack groups — a gang-scheduler stand-in | Accepted | The load-driver force-binds one anchor pod per sameRack group to break the self-referential podAffinity bootstrap deadlock; kube-scheduler places the rest. | Lets sameRack profiles clear the ramp gate while keeping ADR-0024’s real-podAffinity path — gang bootstrapping is genuinely above the autoscaler. | test/scaletest/cmd/load-driver/main.go | test/scaletest/cmd/load-driver/main_test.go |
| 0026 | The scaletest harness must model the Speculative tier | Accepted | seedFakeInventory seeds a Speculative quota pool (--seed-speculative N, default non-zero) alongside Idle/Configured; slots minted as OnDemand with non-zero price + small interruption probability so effective_cost is meaningful and Phase 1 prefers Idle then Speculative. | The harness only ever had a fixed Idle pool, so unmet demand became permanent shortfall — leaving BigFleet’s entire elastic-procurement half as dead code, mis-measuring ceilings. | cmd/bigfleet/shard.go, pkg/provider/fake/fake.go, test/scaletest/cmd/scaletest-runner/main.go, test/scaletest/cmd/scaletest-runner/preflight.go, pkg/scaletest/preflight/preflight.go | test/scaletest/cmd/scaletest-runner/render_test.go, test/scaletest/cmd/scaletest-runner/preflight_test.go, test/conformance/selftest_test.go, pkg/provider/fake/fake_test.go |
| 0028 | Cycle-p99 SLO is regime-parametric | Accepted | (See Decision engine & cost above — graded on per-Need Phase 1 p99 + cardinality-scaled envelopes; OCC-deferral superseded by 0029.) | Phase 1 wall-clock scales with Need cardinality; the absolute bar grades the workload, not BigFleet. | spec-only | pkg/decision/phase1_uber5k_bench_test.go, pkg/decision/phase1_realistic_bench_test.go, pkg/decision/phase1_takecolocated_bench_test.go |
| 0032 | Realistic archetype catalog — production-calibrated distribution | Accepted | Replace the six-archetype catalog with ten Pod-count-weighted archetypes (70% tiny-stateless long tail … 1% gpu/critical), fold sidecar overhead into per-Pod shape, add allowPartial + spreadConstraintProb/spreadConstraint (~42% of Needs). | The prior catalog was miscalibrated (missing modal small Pod, no spread, single-priority, oversized gangs), so every uber-* number benchmarked a non-representative workload. | pkg/scaletest/archetype/archetype.go, test/scaletest/profiles/archetypes/realistic.yaml, test/scaletest/cmd/load-driver/main.go | pkg/scaletest/archetype/realistic_mix_test.go, pkg/scaletest/archetype/archetype_test.go, pkg/scaletest/archetype/sizing_test.go, test/scaletest/cmd/load-driver/main_test.go |
| 0033 | Phase 1 supply-credit must respect bind readiness | Rejected (superseded by ADR-0035) | Proposed OC1: a Configured machine credits supply only after UpcomingNode reaches Ready, via Machine.BindReady + a NodeBindReady stream message. Rejected — the bind plateau was a kube-scheduler ramp property; the fix moved to the harness. | The triggering plateau was a kube-scheduler property under high label-cardinality that only manifests at ramp, and ramp is not an SLO. | spec-only (no code shipped) | — |
| 0034 | Scaletest is bring-your-own-substrate | Accepted | Split each *-Nk.yaml into a substrate-agnostic test definition (scale/catalog/seed/loadProfile) + a separately-named example substrate (example-fat-host, example-mid-host, example-kind-laptop); runner derives geometry/cost/feasibility from profile × substrate; drop provider-named profiles. | Profiles conflated “what test” with “where to run”, leaking substrate names into filenames and forcing N×M file growth. | test/scaletest/cmd/scaletest-runner/main.go, test/scaletest/substrates/example-fat-host.yaml, test/scaletest/substrates/example-mid-host.yaml, test/scaletest/substrates/example-kind-laptop.yaml, test/scaletest/profiles/5k.yaml | test/scaletest/cmd/scaletest-runner/merge_test.go, test/scaletest/cmd/scaletest-runner/substrate_test.go, test/scaletest/cmd/scaletest-runner/byo_integration_test.go |
| 0035 | Scaletest SLOs are measured at steady state under churn, not at ramp | Accepted (supersedes ADR-0033 + M22 ramp-gating; amended 2026-06-14) | Gate pass/fail on steady-state per-CR binding-latency / cycle / rollup SLOs during a churn soak, inventory pre-seeded + Pods pre-bound at install; ramp becomes observational. 2026-06-14 amendment: reclaim baseline at soakStart+settleSeconds, bounded maxReclaimActionsDuringSoak gate (accepting the ADR-0021 async floor). Harness-only. | Ramp throughput is dominated by downstream kube-scheduler behaviour and is not the SLO; conflating ramp with the SLO produced a multi-week rabbit hole. | test/scaletest/cmd/scaletest-runner/main.go, test/scaletest/cmd/load-driver/main.go, test/scaletest/profiles/dev-50.yaml, test/scaletest/profiles/5k.yaml | (runner gate verified by read; no dedicated *_test.go) |
| 0037 | Drop synthetic team/app label axes from the catalog | Accepted | The catalog’s node-affinity dimensions must mirror real production (instance-type, zone, hardware only); synthetic ownership axes (team, app) removed from realistic.yaml. The labelAxes mechanism is retained for a future real axis. | Routing synthetic fingerprint cardinality through Pod nodeAffinity made kube-scheduler reject 98.6% of placements (bind plateaued at 9.5%); team/app are ownership labels, not node-affinity dimensions. | test/scaletest/profiles/archetypes/realistic.yaml, pkg/scaletest/archetype/archetype.go | pkg/scaletest/archetype/archetype_test.go |
| 0038 | Scaletest workloads are controller-managed objects, not bare Pods | Accepted | Load-driver creates Deployments (stateless) / StatefulSets (stateful), one per archetype fingerprint per cluster; the kwok apiserver runs the deployment/replicaset/statefulset controllers so evicted Pods recreate. No BigFleet change. | Bare Pods don’t survive eviction, so every Phase 3 reclaim permanently destroyed demand (CR cascade-GC’d) → a self-sustaining Bootstrap+Reclaim cascade. | test/scaletest/cmd/load-driver/main.go, test/scaletest/image/entrypoint-apiserver.sh, test/scaletest/image/Dockerfile | test/scaletest/cmd/load-driver/main_test.go |
| 0043 | Harness-observed triggers get a demand-realism check before mechanism ships | Accepted | Any ADR motivated by harness-observed evidence must contain a “Demand realism” section (what demand triggers it, would production emit it, if not fix the harness and re-measure first) before designing mechanism. A gate, not a formality; incident/paper-triggered ADRs exempt. | The single largest pool of unforced engine complexity (the ADR-0042 parking layer) was built against a demand shape one catalog archetype fabricated. | spec-only — codified as a working-discipline rule; first applied in docs/adr/0044-machine-count-aware-seed-sizing.md | — |
| 0044 | Seed machine pools are sized by machine demand, not workload weight | Accepted (follows from ADR-0043; harness-scope) | Seed machine shares derive from pod demand: machineShare ∝ podShare/podsPerMachine, podsPerMachine = density for core-resource archetypes / 1 when any bucket requests an extended resource; gang archetypes get a per-zone floor of max(GroupSizeRange). | Weight-proportional pools underweight whole-machine archetypes’ supply by ~density×, so GPU gangs were short 120–238 machines/zone every cycle on a fleet with ample aggregate capacity. | pkg/scaletest/archetype/sizing.go, pkg/scaletest/archetype/archetype.go, test/scaletest/cmd/load-driver/main.go, test/scaletest/profiles/archetypes/realistic.yaml | pkg/scaletest/archetype/sizing_test.go, pkg/scaletest/archetype/archetype_test.go, pkg/scaletest/archetype/realistic_mix_test.go, test/scaletest/cmd/load-driver/main_test.go |
| 0050 | The realism catalog is calibrated to a realistic machine fleet, via per-archetype packing density | Accepted (M78 first step; harness-scope) | Calibrate realistic.yaml to a realistic machine fleet (~15% GPU), back-solving weights as machineShare × podsPerNode / E[replicas]; replace M66.2’s “GPU density = 1” with a per-archetype PodsPerNode (cpu/mem = 100, GPU inference = 8, GPU training = 1). | For a whole-machine GPU workload pod-share IS machine-share, so a realistic ~7% GPU pod mix implies an unrealistic ~90% GPU machine fleet — failing ADR-0043’s realism test. | pkg/scaletest/archetype/archetype.go, pkg/scaletest/archetype/sizing.go, test/scaletest/profiles/archetypes/realistic.yaml, test/scaletest/cmd/scaletest-runner/main.go, pkg/scaletest/preflight/preflight.go | pkg/scaletest/archetype/sizing_test.go, pkg/scaletest/archetype/realistic_mix_test.go |
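The ADR-0044 sizing rule in the table above (machineShare ∝ podShare/podsPerMachine) can be illustrated with a small sketch. This is not the code in pkg/scaletest/archetype/sizing.go: the `archetype` struct, the `machineShares` function, and the two-entry catalog are hypothetical, and the densities (100 for a core-resource archetype, 1 for a whole-machine GPU archetype) are illustrative values borrowed from the ADR-0050 row.

```go
package main

import "fmt"

// archetype is a hypothetical, simplified stand-in for a catalog entry.
type archetype struct {
	name           string
	podShare       float64 // fraction of all Pods in the mix
	podsPerMachine float64 // packing density; 1 for whole-machine archetypes
}

// machineShares applies machineShare ∝ podShare/podsPerMachine and
// normalises the result so the shares sum to 1.
func machineShares(catalog []archetype) map[string]float64 {
	raw := make(map[string]float64, len(catalog))
	total := 0.0
	for _, a := range catalog {
		w := a.podShare / a.podsPerMachine
		raw[a.name] = w
		total += w
	}
	shares := make(map[string]float64, len(catalog))
	if total == 0 {
		return shares // empty catalog or zero demand: nothing to size
	}
	for name, w := range raw {
		shares[name] = w / total
	}
	return shares
}

func main() {
	// Assumed mix: 93% densely packed stateless Pods, 7% whole-machine GPU Pods.
	catalog := []archetype{
		{name: "stateless", podShare: 0.93, podsPerMachine: 100},
		{name: "gpu-training", podShare: 0.07, podsPerMachine: 1},
	}
	shares := machineShares(catalog)
	for _, a := range catalog {
		fmt.Printf("%-12s pods %.0f%%  machines %.1f%%\n",
			a.name, a.podShare*100, shares[a.name]*100)
	}
}
```

With these assumed inputs the GPU archetype holds 7% of Pods but about 88% of machines. That is the effect the ADR-0050 row describes: a realistic GPU pod mix implies an unrealistic GPU machine fleet unless the catalog weights are back-solved from the machine fleet.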
Actuation safety
| ADR | Title | Status | Decision | Why | Implemented in | Guarded by |
|---|---|---|---|---|---|---|
| 0046 | Actuation safety rails — reclaim blast-radius cap, empty-roll-up quarantine, kill switch (+ Addendum: shadow mode, ingest validation, audit log) | Accepted (M70) | Three rails at the actuation/ingest boundary (pkg/decision untouched): (1) per-cycle per-cluster reclaim cap max(1, ⌊fraction×C⌋), default 0.05, only Reclaim capped (Phase 2 exempt); (2) empty-roll-up quarantine (<10% retained, held until 3 consistent drops); (3) --actuation-paused kill switch. Addendum: --dry-run, machine.Invariant cost-bounds at ingest (reject-loudly), --audit-log JSONL. | Nothing bounded the damage of a wrong decision — a zero-demand roll-up could drain a fleet in one cycle; the rails bound actuation volume (not allocation, so §16’s priority-only throttle is intact). | pkg/shard/safety.go, pkg/shard/shard.go, pkg/shard/session.go, pkg/shard/execute.go, pkg/shard/reconcile.go, pkg/machine/machine.go, pkg/metrics/metrics.go, cmd/bigfleet/shard.go, cmd/bigfleet/all_in_one.go, deploy/helm/bigfleet/values.yaml, deploy/helm/bigfleet/templates/shard-statefulset.yaml | pkg/shard/safety_test.go, pkg/machine/machine_test.go, pkg/shard/restart_test.go |
| 0049 | Idle→Speculative release — per-CapacityType idle holds inside Phase 3 | Accepted (M73) | Implements paper §8’s release half: Phase 3 emits Delete for an Idle machine iff not in the claimed-set AND its per-CapacityType hold (DefaultReleasePolicy: bare-metal/reserved = forever, on-demand = 10m, spot = 1m) has expired, from an in-memory idle-since stamp; executeDelete walks Idle→Deleting→Speculative. No per-cycle release cap — the hold window is the only rail. | Releasing an Idle machine has zero blast radius (Idle ⇒ unbound, counts for nothing under ADR-0045) and the re-buy loop can’t close (worst case one Create per machine per hold), so the hold window alone bounds churn. | pkg/decision/release.go, pkg/decision/phase3_reclaim.go, pkg/decision/action.go, pkg/inventory/inventory.go, pkg/shard/execute.go, pkg/shard/shard.go, pkg/metrics/metrics.go | pkg/decision/phase3_test.go, pkg/shard/execute_delete_test.go, pkg/inventory/inventory_test.go, sim/m73_release_test.go, test/conformance/conformance_test.go |
Process / meta
| ADR | Title | Status | Decision | Why | Implemented in | Guarded by |
|---|---|---|---|---|---|---|
| 0001 | Record architecture decisions | Accepted | Record significant (hard-to-reverse) decisions as sequentially-numbered immutable Markdown ADRs in docs/adr/; changing direction means a new superseding ADR, not editing an accepted one. | A discoverable, reviewable audit trail of why each path was chosen, recoverable without spelunking commit history. | docs/adr/, docs/adr/index.md (process ADR; upheld by the project convention that every ADR adds a row to docs/adr/index.md, not by a test) | — |
Footer
- README.md: the internals index; prose deep-dives per subsystem (this page is their ADR→code companion).
- domain-attribution.md: the full walkthrough of the ADR-0040→0051 attribution arc (companion deep-dive for the Capacity model & attribution group).
- ../adr/ and ../adr/index.md: the ADRs themselves and the canonical status table (this page defers to the index on status).