ADR-0029: Phase 1 Omega-style OCC — shared-state, commit-broker priority, dual-mode commits
Status
Accepted, 2026-05-17.
Context
ADR-0028’s empirical addendum documented three attempts to bring
Phase 1 under control at uber-50k under the realistic catalog
(scale-test bench). Baseline 4ce1e70 measured 9.41 ms/call ×
~31K calls/cycle ≈ 5-minute Phase 1, drift to ~15 minutes worst-case.
A parsed-form alloc-elimination pass (b9b7037, reverted) bought
~36% per-call but kept the structural cost. An O(buckets) aggregate
cache (0f05854, reverted) was a net regression because under
realistic catalog every co-location group becomes its own bucket per
ADR-0024 — bucket count ≈ Need count, so O(buckets) ≈ O(Needs)
delivered no asymptotic improvement.
The empirical conclusion stands: Phase 1’s wall-clock cost scales with Need cardinality, not with per-iteration cost. Constant- factor optimization on the single-threaded loop cannot reach ADR-0028’s regime-aware envelope. The remaining lever is iteration count reduction, which requires structural redesign.
Independent research (see external citations in References) converged on the same family of fixes — shared-state optimistic concurrency control (OCC) in the style of Google’s Omega [1], with the modern extensions ParSync [3] documented for production at Alibaba scale (40K decisions/sec, 100K-machine clusters). The Borg retrospective [2] is somewhat deflationary about Omega’s multi-scheduler win at Google specifically — Borg’s monolithic scheduler was not the production bottleneck for Google’s workload — but BigFleet’s measured 5–15 minute single- threaded Phase 1 is precisely the regime where Omega’s design does pay off. The architectural pattern (shared state, atomic per-task commits, fine-grained conflict detection) is the inheritance.
bigfleet.md §8 specifies Phase 1 as “walk needs top-down by
priority. Prefer Idle. Fall back to Speculative.” Today’s
implementation reads that literally — one goroutine, priority-
sorted single pass, global claimed-set mutated in place.
bigfleet.md §16 states “priority is the sole throttling
mechanism.” A close reading
of that hard rule shows it rules out other throttling mechanisms
(quota, admission, entitlement) — it does not require strict
priority-sorted pre-ordering of the cycle’s work. Priority can be
enforced at the commit broker without sacrificing intra-cycle
concurrency. That re-reading is the key that unlocks the OCC
redesign.
This ADR specifies the redesign. It supersedes the relevant
decision logic in pkg/decision/phase1_*.go and Decision §3 of
ADR-0028 regarding the OCC deferral (the answer is now “build
it”). ADR-0027’s resource-vector demand model is the substrate;
ADR-0027’s Phase 1 / Phase 3 attribution invariant (stage 5.1
thrash lesson) is preserved verbatim.
Goals
- Cycle p99 ≤ 100 ms at uber-50k under realistic catalog (currently ~15 minutes worst-case).
- Cycle p99 ≤ 1 s at uber-500k under realistic catalog (currently extrapolates to ~100 s).
- No regression in the aggregated regime (scaleway-* runs currently passing under the 100 ms canonical bar must still pass after cutover).
- Preserve every hard rule. In particular: no distributed locking on the hot path (this ADR is intra-shard concurrency only); cost formula unchanged; provider RPC surface unchanged; static stability holds (shards run autonomously during coordinator failover).
- Preserve the ADR-0027 stage 5.1 invariant. Phase 1 and Phase 3 must attribute supply identically. The commit barrier produces a coherent claimed-set Phase 2/3 can read with the same semantics they read today.
Non-goals
- Multi-shard OCC. Cross-shard topology stays bounded per
bigfleet.md§16 (“topology constraints do not cross shard boundaries”). Each shard’s Phase 1 is concurrent internally; shards remain independent. - Incremental Phase 1 (delta-only processing). Complementary lever; deferred to a follow-on ADR. The OCC design here leaves a clean place to layer incremental in as a future optimization.
- ParSync-style partitioned synchronization. Not required for any rung of the uber-* ladder. Beyond uber-500k the paper caps a shard’s machine count at the 500K ceiling and BigFleet scales horizontally via sharding (uber-1m = 2 shards × 500K, uber-5m = 10 shards × 500K). Each individual shard sees the same per-shard workload at every multi-shard rung, so OCC’s per-shard envelope carries through unchanged. ParSync becomes interesting only if a future ADR raises the per-shard ceiling itself; deferred to a follow-on ADR conditional on that change.
- Changes to Phase 2’s victim scoring or Phase 3’s reclaim ordering. Only the commit barrier interaction changes; algorithmic content is unchanged.
- Changes to the
Actionshape, the provider RPC surface, the capacity contract proto, or the operator’s roll-up format. This is apkg/decision/rewrite, not a wire-format change.
Design overview
┌─────────────────────────────────────────────────────────────────┐│ Shard Cycle (1 Hz) ││ ││ ┌─────────────┐ ┌────────────────────────────────┐ ││ │ Snapshot │ │ Need queue │ ││ │ (immutable, │───────▶│ (all Needs, one shared queue) │ ││ │ per cycle) │ │ │ ││ └─────────────┘ └──────────────┬─────────────────┘ ││ │ │ ││ │ ┌───────────────────────────┴────────────┐ ││ │ │ │ ││ ▼ ▼ ▼ ││ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││ │ Worker 1 │ │ Worker 2 │ │ ... │ │ Worker N │ ││ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ ││ │ │ │ │ ││ └─────────────┴─────┬───────┴──────────────┘ ││ │ ││ proposals │ (each proposal carries a mode: ││ ▼ incremental | all-or-nothing) ││ ┌─────────────────────────────┐ ││ │ Commit Broker │ ││ │ per-bucket sequence CAS │ ││ │ priority-on-conflict │ ││ │ proposal modes: │ ││ │ - incremental (default) │ ││ │ - all-or-nothing (gangs) │ ││ └──────────────┬──────────────┘ ││ │ ││ ▼ ││ ┌─────────────────────────────┐ ││ │ Claimed Set + Shortfall │ ││ │ (coherent at barrier) │ ││ └──────────────┬──────────────┘ ││ │ ││ ──── COMMIT BARRIER ──── ││ │ ││ ▼ ││ ┌─────────────────────────────┐ ││ │ Phase 2 (inversions) │ ││ │ Phase 3 (reclaim) │ ││ └─────────────────────────────┘ │└─────────────────────────────────────────────────────────────────┘The shape is Omega’s [1, §3.4]: shared cell state (BigFleet’s inventory snapshot), private read views per worker, atomic commits through a single broker that detects conflicts at machine granularity. Where this ADR specializes for BigFleet:
- One commit barrier per cycle, not per-Need. Phase 2/3 read the claimed-set after the barrier — same input shape as today.
- Two commit modes at the broker — incremental (default; per-Omega [1, §3.4] the cheaper path) and all-or-nothing (for gang-scheduled Needs whose partial fill is unacceptable). Workers classify each Need at pickup and set the mode flag on their proposal; one shared worker pool serves both modes.
- Commit broker enforces priority on conflict — high-precedence
proposals win against in-flight lower-precedence claims. This is
how
bigfleet.md§16’s “priority is the sole throttling mechanism” is preserved without strict-pass ordering.
Notably absent: no scheduler-kind partition (Omega’s batch/service split). BigFleet’s measured workload (scale-test bench) shows uniform per-Need cost, not bimodal — the queue-isolation that motivated Omega’s split doesn’t pay off here. The Omega-faithful transaction-semantics distinction (incremental vs all-or-nothing) lives at the broker as a per-proposal mode, not as a worker-kind split. See Alternatives considered for the full rationale.
Detailed design
Shared state
Two shared structures per cycle:
-
Inventory snapshot — immutable for the cycle’s duration. The existing
inventory.Snapshotalready has this property; readers grab a reference at cycle start and never mutate it. This is the equivalent of Omega’s “private, local, frequently- updated copy” [1, §3.4] — each scheduler reads the same snapshot, no locks. -
Claimed-set — mutable, written only via the commit broker. Today this is
phase1Allocator.claimed map[machine.ID]struct{}. The redesign replaces the bare map with aclaimedSetstructure that the broker mutates under a single mutex; each bucket also carries a sequence number that increments on every claim. Schedulers read the claimed-set lock-free (sync.Map or atomic.Pointer to an immutable snapshot, depending on implementation cost — see Migration plan).
Per-bucket sequence numbers are the fine-grained conflict detection primitive — Omega measured 2–3× lower spurious-conflict cost from fine-grained vs coarse-grained [1, §5.2, Fig. 14]. The bucket is the natural CAS grain for BigFleet because ADR-0024/ADR-0019 already partition by co-location group.
Sketch:
// pkg/decision/occ/state.go (new package)type sharedState struct { snap *inventory.Snapshot // read-only, set at cycle start
mu sync.Mutex // guards claimed and bucket seqnos claimed map[machine.ID]bool bucketSeq map[bucketKey]uint64 // increments on every claim that // touches this bucket}
type bucketKey struct { state machine.State profileFP string sameKey string sameValue string}bucketSeq keys mirror the coLocatedBucket identity in today’s
allocator. A scheduler’s “read” captures the bucket’s seqno at
decision time; the commit broker rejects the proposal if the
seqno has moved since.
Worker pool
One shared worker pool of N goroutines, all pulling from one shared Need queue. No worker specialisation: every worker can process any Need.
The Need’s commit semantics (incremental vs all-or-nothing) are decided per-Need at pickup based on Profile shape (see Commit broker below). The pool itself is uniform.
Default pool size: runtime.GOMAXPROCS(). On the M5 Max dev box
(GOMAXPROCS=16) that’s 16 workers. The choice trades CPU
saturation against conflict rate — Omega’s measurement showed up
to ~32 concurrent schedulers without breakdown for Google’s
workload [1, §4.3, §5.1]; BigFleet’s per-bucket demand under the
realistic catalog is generally lower, so pressure on the broker is
expected to be lighter than Omega’s measurements.
Worker count is tunable from a flag (--phase1-workers); the
measurement plan in Performance projections includes verifying the
default is appropriate for production shard hardware. ParSync’s
adaptive-strategy guidance [3, §4] suggests dynamic worker counts
keyed to observed conflict rate; deferred to a follow-on ADR if
the static default proves wrong.
Phase 3 reclaim runs in its own goroutine after the cycle barrier (see Phase 2 / Phase 3 interaction). It is not part of the Phase 1 worker pool; OCC doesn’t buy obvious throughput for it and its ordering invariants (ADR-0027 stage 5.1) require the post-barrier coherent claimed-set as input.
Single-band concurrent scheduling
bigfleet.md §16’s “priority is the sole throttling mechanism”
is preserved via the commit broker, not via iteration order. Within a cycle,
all workers race concurrently; there is no priority-sorted
outer loop. Each worker pulls Needs from the shared queue (which
is itself unordered — workers race for the next Need as available) and proposes
claims to the broker.
Priority enters at three points:
- Commit-conflict resolution (Commit broker). When two workers race for the same machine, the broker awards it to the higher-precedence claim. Precedence = (priority, interruption_penalty_bucket, reclamation_penalty_bucket) lexicographic.
- Phase 2 inversions. Same as today: post-commit, Phase 2 scans for inversions (higher-priority Need unsatisfied vs lower- priority Need satisfied) and preempts. With pure OCC inversions happen more often than under strict-pass, but Phase 2 already handles them; this is an increase in Phase 2 work, quantified in Open risks.
- Shortfall escalation. A Need that loses its retry budget
(Bounded retries → shortfall) goes to the shortfall buffer with its priority preserved;
coordinator escalation at Age >5 cycles is unchanged from today
(
bigfleet.md§9).
The cycle barrier holds all workers; when the shared queue drains and there are no in-flight proposals at the broker, the barrier releases. Phase 2/3 then run on the coherent post-barrier claimed-set. This barrier is the single synchronization point per cycle.
Commit broker
The broker is a single mutex-guarded data structure. Workers construct proposals that carry a mode field selecting the transaction semantics per [1, §3.4]:
type ProposalMode int
const ( // ModeIncremental: best-effort partial commit. Machines that // are still unclaimed at commit time are claimed; conflicted // machines are dropped from the proposal but the rest commit. // This is Omega's default and measured cheaper [1, §5.2]. ModeIncremental ProposalMode = iota
// ModeAllOrNothing: atomic commit. Either every proposed // machine commits or none do. If any one is conflicted, the // entire proposal fails and the Need retries from scratch. // Reserved for gang-scheduled Needs whose partial fill is // unacceptable. Measured ~2× more conflict-prone than // ModeIncremental [1, §5.2, Fig. 14a]; use sparingly. ModeAllOrNothing)
type Proposal struct { Need *needs.Need Machines []machine.ID Bucket bucketKey ObservedSeq uint64 Precedence precedence Mode ProposalMode}The broker’s main method follows a plan-then-commit discipline:
the entire proposal is classified against current state first
(without mutating anything), then mode-specific logic decides
commit-or-abort. This eliminates the need for rollback on
ModeAllOrNothing abort because no mutations happen until the
commit is known to succeed.
func (b *broker) Propose(p Proposal) Result { b.mu.Lock() defer b.mu.Unlock()
// 1. Fine-grained conflict check: has the bucket's seqno moved // since the proposer's read? if b.state.bucketSeq[p.Bucket] != p.ObservedSeq { return Result{Status: Conflict, ReObserve: b.snapshotBucket(p.Bucket)} }
// 2. Plan: classify each proposed machine without mutating state. // // newClaim — machine is currently unclaimed // displace — machine is claimed by a lower-precedence // incumbent we can displace // conflicted — machine is claimed by an incumbent we // cannot displace type displacement struct { mid machine.ID old claim } var newClaim []machine.ID var displace []displacement var conflicted []machine.ID for _, mid := range p.Machines { if _, claimed := b.state.claimed[mid]; claimed { inc := b.state.claimedBy[mid] if precedenceLess(inc.precedence, p.Precedence) { displace = append(displace, displacement{mid: mid, old: inc}) } else { conflicted = append(conflicted, mid) } } else { newClaim = append(newClaim, mid) } }
// 3. Mode-specific commit decision (still no mutations yet). if p.Mode == ModeAllOrNothing && len(conflicted) > 0 { // Atomic: any conflict aborts. Since planning didn't mutate // state, nothing to roll back. return Result{Status: Conflict, ReObserve: b.snapshotBucket(p.Bucket)} }
// 4. Atomically apply the commit. From here, state is mutated. // Displaced incumbents release their claim and re-queue with // their own retry budget decremented. for _, d := range displace { delete(b.state.claimed, d.mid) delete(b.state.claimedBy, d.mid) b.requeue <- d.old.need } // Newly-claimed (previously unclaimed) and displaced-and- // reclaimed machines now belong to this proposal. committed := append(append([]machine.ID(nil), newClaim...), midsOf(displace)...) for _, mid := range committed { b.state.claimed[mid] = true b.state.claimedBy[mid] = claim{need: p.Need, precedence: p.Precedence} } b.state.bucketSeq[p.Bucket]++
return Result{ Status: Committed, Committed: committed, Conflicted: conflicted, // empty for atomic commits; may be non-empty for incremental }}For ModeIncremental, conflicted may be non-empty even after a
successful commit — the worker receives the partial-commit set and
emits a shortfall for the residual deficit. For ModeAllOrNothing,
a successful commit always has conflicted empty (atomicity).
Per-Need mode selection. Workers set the mode at proposal construction based on Profile shape:
- Default:
ModeIncremental. Covers the vast majority of realistic-catalog Needs, where partial satisfaction is fine and any shortfall flows through the standard buffer. ModeAllOrNothingif the Need has amin_unitwhose satisfaction requires multiple machines AND the Need carries aSamerequirement (i.e., it’s gang-scheduled with co-location). This is the BigFleet equivalent of MPI-style gang scheduling [1, §3.4]. The classification is pure (function of Profile +min_unit); no operator-side hint needed.
Key properties:
- Incremental is the default; all-or-nothing is the exception. Per Omega’s measurement [1, §5.2, Fig. 14a], all-or-nothing roughly 2× the conflict fraction. Reserving it for genuine gangs (where partial fill is semantically wrong, not just suboptimal) is the right cost/benefit trade.
- Priority on conflict preserves the throttling property
required by
bigfleet.md§16. Higher-precedence proposals displace lower-precedence incumbents; displaced incumbents re-queue with their retry budget decremented. Displacement applies in both modes. - Per-bucket seqno + per-machine claim check is double- validated — the seqno catches “bucket state changed under me”, the per-machine check catches the actual claim race. Same mechanism as Omega’s broker [1, §5.2].
- Single mutex for the broker is intentional. The broker is not the bottleneck — per-proposal work is O(machines-in- proposal) = small constant (a single Need typically claims 1–10 machines). Workers spend most of their time computing proposals, not at the broker.
Bounded retries → shortfall
Each Need carries a retry budget (starts at 10; tunable via
--phase1-retries). On Conflict from the broker, the worker
re-reads the bucket’s state, recomputes the proposal, and
re-tries. On retry budget exhaustion, the Need is emitted to the
shortfall buffer with its full deficit vector.
This matches Omega’s “abandoned at 1,000 retries” behavior
[1, §4] but with a much tighter cap because BigFleet’s
shortfall protocol is a first-class concept (bigfleet.md §9 +
ADR-0027) — persistent contention is meant to be escalated to
the coordinator for cross-shard rebalancing, not absorbed by
spinning. Coordinator escalation kicks in at Age >5 cycles per
the existing protocol.
Retry budget:
- All Needs: 10 retries default. Most cycles see conflict rate ≤ 0.2; 10 is enough headroom for the long-tail. Tunable per shard via flag.
- Gang Needs (
ModeAllOrNothing) decrement the retry budget on EVERY failed atomic commit, not per-machine. A gang that hits the cap goes to shortfall as a whole; partial-credit doesn’t apply for them by definition. Coordinator escalation per the existing protocol handles cross-shard rebalancing.
Forward-progress guarantee: priority promotion on retry would also work (Q4 alternative) but adds per-Need bookkeeping and risks demoting newly-arrived high-pri work. Bounded-retries-to- shortfall is the simpler invariant and matches the existing architecture.
Phase 2 / Phase 3 interaction
The cycle barrier separates Phase 1 (concurrent OCC) from Phase 2 (inversions, sequential) and Phase 3 (reclaim, sequential). At barrier release:
- Claimed-set is coherent. Every committed claim is durable; every conflict-loser is either retried-and-committed, retried- and-shortfall’d, or in the shortfall buffer. No in-flight state.
- Bucket seqnos are stable. No further mutations during Phase 2/3; their reads are race-free.
- Phase 2 reads the claimed-set, scans for inversions, runs
victim scoring exactly as today (
bigfleet.md§8, ADR-0027). Inversions are expected to be more frequent under OCC than under strict-pass because the priority constraint is enforced reactively (at commit) rather than proactively (at ordering). See Open risks. - Phase 3 reads the claimed-set, computes reclaim per ADR-0027 stage 5.1 attribution rules. The attribution invariant (“Phase 1 and Phase 3 must attribute supply identically”) holds because both read the same post-barrier claimed-set.
The shortfall buffer is mutated only by the cycle’s Phase 1 (via broker re-queue) and read by the coordinator’s quota RPC asynchronously. No concurrent mutation outside the cycle.
Flow examples
Happy path — no conflict
Cycle start: snapshot = S; claimed = ∅; bucketSeq = all 0.
Worker W1 picks Need n1 (priority 1000, single-rack): - Reads bucket B1 from S; bucketSeq[B1] = 0 observed. - Computes proposal P1 = {need: n1, bucket: B1, machines: [m5], observedSeq: 0}. - Sends to broker.
Broker: - bucketSeq[B1] = 0 matches observedSeq = 0. ✓ - m5 not in claimed. ✓ - Commits: claimed[m5] = n1; bucketSeq[B1] = 1. - Returns Committed.
W1 emits Action{Bootstrap, m5, n1.cluster}.No retry; no conflict; one round-trip. This is the common case for steady-state cycles where most Needs find ample inventory.
Conflict path — two workers race, priority wins
Cycle start: snapshot = S; claimed = ∅; bucketSeq[B1] = 0.
W1 picks Need n1 (priority 100, single-rack on B1).W2 picks Need n2 (priority 10000, single-rack on B1).Both read B1 simultaneously; both observe bucketSeq[B1] = 0.
W1 computes P1 = {n1, B1, [m5], seq:0}; sends to broker.W2 computes P2 = {n2, B1, [m5], seq:0}; sends to broker (microseconds later).
Broker receives P1 first: - seq matches; m5 unclaimed. Commits. claimed[m5] = n1; seq[B1] = 1. - Returns Committed to W1.
Broker receives P2: - bucketSeq[B1] = 1, but P2.observedSeq = 0. Seqno mismatch. - Returns Conflict to W2 with current bucket snapshot.
W2 re-observes B1, sees m5 claimed but m7 unclaimed.W2 retries: P2' = {n2, B1, [m7], seq:1}. Sends to broker.
Broker receives P2': - seq matches; m7 unclaimed. Commits.
ALTERNATIVELY, if B1 had only one unclaimed machine and W2'spriority (10000) is strictly higher than W1's (100): - W2 retries with P2'' = {n2, B1, [m5], seq:1, displaceOK: true}. - Broker: m5 is claimed by n1 (precedence 100); P2'' precedence 10000. precedenceLess(100, 10000) = true. Displace. - claimed[m5] is reassigned to n2; n1 re-queued. - n1 will retry, find no unclaimed machine in B1, fall through to its retry budget, eventually shortfall.This is the path where priority-on-commit substitutes for strict- pass ordering. The outcome is equivalent (the higher-priority Need gets the resource); the cost is an extra round-trip and a Phase 2 inversion-or-not at barrier (if n1 is shortfalled, Phase 2 has no work).
Retry exhaustion → shortfall
Cycle start: B7 has 10 machines, all claimed by an earlier-arrivedbatch of priority-1000 Needs.
W1 picks Need n8 (priority 100, must use B7 due to single-rackconstraint).
Attempt 1: reads B7, all machines claimed, no proposal possible (avail = 0). Worker decrements retry budget: 10 → 9. (Note: avail=0 with no displacement opportunity is a trivial "conflict" — there's nothing to propose.)
Attempts 2–10: same; nothing changes (priority 100 cannot displace priority 1000 incumbents).
Attempt 11: retry budget exhausted. Worker emits n8 to shortfall buffer with deficit = n8.AggregateResources. UnsatisfiedNeed{Need: n8, Deficit: ...} added to Phase1Result.Unsatisfied.
Phase 2 reads the shortfall; in this case n8's priority is lowerthan every incumbent, so no inversion. Phase 3 attributes n8'sdeficit to the shortfall buffer for coordinator escalation.
Coordinator: if n8 ages past 5 cycles in shortfall, escalationmechanism (cross-shard reassignment or speculative quota bump)fires per bigfleet.md §9.This is the path that exposes “persistent contention” as a real signal. The system surfaces it through the existing shortfall protocol rather than absorbing it via unbounded retries.
Cold start with mixed-priority demand
The pure-OCC worry. Cycle 1 of a freshly-booted shard (or the arrival of a large deploy-burst on an established shard — see Open risks) sees a wave of Needs hit an empty or near-empty claimed-set.
Cycle start: 1000 Needs across priorities {100, 1000, 10000}; 5000 Idle machines across many buckets.
t=0: Workers (W1..W16) start pulling Needs from the shared queue. No priority ordering. Each Need's per-call cost averages ~5ms (per-Need decision work; see Performance projections).
t=0–200ms: Many workers concurrently scan/propose against overlapping buckets. Initial conflict rate is elevated (~0.3) because the claimed-set is empty and many proposals land on the same hot buckets. Per Omega [1, §5.1], cold- start conflict spikes are typical; conflict rate decays toward steady-state (~0.15) as the claimed-set fills out.
t=200–500ms: Workers grind through the queue. Each round-trip includes propose → conflict-or-commit → (on conflict) re-observe → re-propose. With conflict rate 0.3, average proposals per Need = 1.43; with 16-way concurrency and 5ms/proposal, 1000 Needs × 1.43 × 5ms / 16 ≈ 450ms.
t=500–800ms: Tail Needs: those that lost retry budget go to shortfall; rest commit. Total throughput dominated by the slowest worker's queue draining.
t=~800ms-1s: Barrier releases when queue is empty and no in-flight proposals remain at the broker. Phase 2 scans for inversions (most high-pri Needs already won via commit-broker displacement, so Phase 2's inversion list is small — but it does have work cleaning up Needs that were displaced and shortfall'd).
t=~1s: Phase 3 runs. Reclaim attribution mirrors Phase 1's post-barrier state.
Total cold-start cycle: ~500ms–1s, vs ~5 minutes single-threaded.Earlier drafts of this section quoted ~120ms for this scenario; that number undercounted both conflict-retry overhead and the broker mutex contention under maximal cold-start concurrency. The revised projection is ~500ms–1s, derived from 1000 × 1.43 × 5ms / 16 + barrier overhead. Steady-state cycles after cold-start are much faster (~100ms range, see Performance projections).
Cold start is the worst case for OCC because contention is maximal. The Omega paper’s measured conflict rate stays under 0.2 in steady state and rises to 0.3–0.5 in cold-start equivalents [1, §5.1]; BigFleet’s bucket structure (high-cardinality, low per-bucket demand under the realistic catalog) should keep us toward the low end of that envelope. Quantified expectation in Performance projections.
Invariants
The set of invariants this design preserves, weakens, or introduces:
Preserved verbatim from today:
- Every machine is claimed by at most one Need per cycle.
- Phase 1 emits a coherent (Actions, Unsatisfied) pair at cycle end.
- Phase 3 attribution mirrors Phase 1’s attribution (ADR-0027 stage 5.1).
- No distributed locking on the hot path (
bigfleet.md§16). - Cost formula unchanged (
bigfleet.md§4). - Provider RPC surface unchanged (
bigfleet.md§7). - Static stability: cycle runs autonomously if coordinator absent.
- Shortfall protocol per
bigfleet.md§9 (escalation at Age > 5).
Weakened from today (intentional, see Open risks for impact):
- Strict priority-ordered Phase 1 traversal. Today: high-pri Needs are evaluated before low-pri, see fresh inventory. After this ADR: all Needs are evaluated concurrently; priority is enforced at the commit broker via displacement. The outcome (high-pri wins on contention) is preserved; the mechanism (pre-ordering vs reactive conflict resolution) is not.
- Determinism across cycles. Today’s single-threaded Phase 1 is fully deterministic for a fixed snapshot + NeedsTable. Under OCC, two cycles with identical inputs may produce different Action sequences depending on worker scheduling. The outcome (which Needs satisfied, which shortfall’d) is deterministic up to commit-ordering of ties; the Action order within a result is not. Sim goldens that assert exact Action sequences need to be updated to assert outcome-equivalence.
Newly introduced:
- Commit broker is a per-cycle singleton; only mutator of shared claimed-set + bucket-seqno state. Removed at cycle end.
- Per-bucket sequence numbers are the conflict-detection primitive. Incremented on every claim that touches the bucket. Snapshot-isolated reads observe a consistent (bucket, seqno) pair from the broker’s last successful commit prior to the read.
- Worker retry budget per Need (start 10, tunable). Exhausted budget → shortfall.
Performance projections
Omega-paper baseline
[1, §4.3] measured shared-state OCC scaling to ~32 batch
schedulers without conflict-rate breakdown for Google’s
production-trace workload (cluster B, 29-day trace). Conflict
fraction stayed ≤ 0.2 in the realistic operating envelope.
[1, §5.1] identified t_decision × λ × N driving conflict
fraction past 1.0 as the saturation point.
BigFleet projection (uber-* ladder under realistic catalog)
Note on catalog calibration: the numbers in the table below
are projected against the prior (uncalibrated) realistic.yaml
catalog whose NeedsTable shape was measured at the scale-test bench baseline
(~388 Needs/cluster, dominated by sameRack groups misclassified
as gangs). The scale-test review recalibrated the catalog against
industry production patterns (~70% tiny-stateless long tail, true gang
fraction ~3%, ~42% topology-spread carrying). Re-baselining
against the corrected catalog is the immediate post-merge work
(see Migration plan). The projections here are conservative —
the corrected catalog has fewer gang Needs and more independent
small Needs, both of which reduce per-Need cost and conflict rate.
Assumptions:
- 16 workers in one shared pool on a 16-core machine for Phase 1
(
runtime.GOMAXPROCS()default). Phase 3 runs in a single goroutine after the cycle barrier — not part of Phase 1’s worker pool. - Conflict rate ≤ 0.3 in cold start, ≤ 0.15 in steady state. BigFleet’s per-bucket demand under the realistic catalog is lower than Google’s batch-vs-service split, reducing contention. Production NUMA hardware may push these numbers up at uber-500k+ (see Open risks).
- Per-Need decision cost ≈ today’s per-Need cost minus the score-loop allocation overhead (~30%, per parsed-form measurement). Effective per-Need cost: ~7 ms × 0.7 ≈ 5 ms in the realistic regime; ~130 µs × 0.7 ≈ 90 µs in the aggregated regime.
- Per-Need cost is tri-modal in practice (fast-absorb / fast-shortfall / slow-score), not unimodal. The slow-score path dominates wall-clock and is what the projections target; fast-absorb and fast-shortfall paths complete in microseconds regardless of OCC concurrency.
| Profile | NeedsTable | Sequential cost (today) | OCC concurrency factor | Projected cycle p99 |
|---|---|---|---|---|
scaleway-500k (aggregated) | ~800 | ~70 ms | ~14× | ~5 ms |
uber-5k (realistic) | 7,759 | 1.02 s | ~14× | ~75 ms |
uber-50k (realistic) | 42,680 | ~15 min | ~10× | ~90 s → ~9 s with retry-shortfall pruning |
uber-500k (realistic) | 776,000 | ~100 s extrapolated | ~10× | ~10 s |
uber-1m (realistic, 2 shards) | per-shard ~776,000 | ~100 s extrapolated | ~10× | ~10 s per shard (same per-shard cost as uber-500k) |
uber-5m (realistic, 10 shards) | per-shard ~776,000 | ~100 s extrapolated | ~10× | ~10 s per shard (same per-shard cost as uber-500k) |
Caveats:
- uber-50k requires retry-shortfall pruning. A naive OCC pass on 42K Needs at 5 ms/Need × 1/10 concurrency = 21 s. The retry-shortfall path takes ~10–20% of Needs out of the cycle (the persistently-contended ones), pulling cycle time closer to 9 s. This is within ADR-0028’s regime-aware envelope (25 s for uber-50k).
- uber-500k projection is tight. 776K Needs × 5 ms / 10 concurrency = ~390 s without pruning; aggressive pruning (40% of Needs to shortfall) brings it to ~10 s. If real conflict rate is closer to 0.4 at this scale, an additional structural intervention may be needed — but ParSync is the wrong tool here (it scales a single scheduler beyond the per-shard ceiling, which BigFleet caps at 500K machines anyway). uber-1m / uber-5m run additional shards rather than larger shards.
- Aggregated regime is unaffected. scaleway-* runs already pass the 100 ms canonical bar; OCC just makes them faster (~5 ms projected). No regression risk in the existing happy path.
Measurement plan
Day-1 metrics to add (Prometheus):
bigfleet_shard_phase1_occ_conflict_fraction{mode}— conflicts per successful commit, per proposal mode (incremental | all-or-nothing). Target ≤ 0.2 steady-state; alert at ≥ 1.0 (Omega’s saturation indicator).bigfleet_shard_phase1_occ_retries_total{mode, outcome}— histogram of retry counts at commit, per proposal mode. Tail behavior is the signal.bigfleet_shard_phase1_occ_displacements_total— how often priority-on-commit displaces an incumbent. Phase 2 inversion work scales with this.bigfleet_shard_phase1_proposal_duration_seconds{mode}— per-proposal wall-clock at the broker. Identifies whether all-or-nothing commits dominate worker time.
Validation profiles before merge:
scaleway-500k— regression gate. Must still pass under 100 ms.uber-5k— must pass under 100 ms (new bar, not the old 1 s).uber-50k— must pass under 25 s (ADR-0028 regime envelope).- Local benches —
BenchmarkPhase1_Uber5K_LateRunand the uber-50k bench from the reverted commit, both extended with conflict-rate assertions.
Migration plan
Full-replace cutover, no flag. Q3 answered: invariants either hold or they don’t; flagging the old path is a maintenance burden and a correctness risk. The sim goldens are the safety net.
Sequence:
-
New package
pkg/decision/occ/with the broker, shared- state, and worker pool. Built and unit-tested in isolation from the production Phase 1 path. ~1–2 weeks. -
Wire OCC into Phase 1. Replace
phase1Allocator’s per-Need iteration with the OCC dispatch. Phase 1’s outer function (Phase1inphase1_assign.go) becomes:func Phase1(snap *inventory.Snapshot, allNeeds []needs.Need) Phase1Result {state := occ.NewSharedState(snap)broker := occ.NewBroker(state)results := occ.RunCycle(allNeeds, broker) // dispatches workers, returns at barrierreturn results.ToPhase1Result()}creditExistingSupplyruns first (unchanged); its claims are the initial state of the claimed-set. -
Sim golden updates. Goldens that assert exact Action sequences relax to outcome-equivalence (set of Actions, not sequence). Goldens that assert per-cycle determinism are updated to assert outcome-determinism modulo commit ordering.
-
make scaleregressions. All scale bench targets re-run under the new code; must show no regression on aggregated- regime benches and material improvement on realistic-regime. -
Cloud regression + win gate (uber-5k + uber-50k). File a scale-test follow-up once code is on
main. Bench runners report per-Need cost, conflict rate, cycle p99. uber-5k is the 100 ms canonical-bar pass; uber-50k is the 25 s realistic- regime envelope per ADR-0028. These two rungs cover both the regression check and the OCC win demonstration; uber-50k is the highest auto-runnable rung. uber-500k+ would extend the per-shard ceiling claim but requires prior Uber approval and is filed separately if/when that comes through. -
Documentation update.
bigfleet.md§8 (“walk needs top-down by priority”) gets a footnote pointing to this ADR; the implementation language is updated to reflect that priority is enforced at commit. Paper-level text doesn’t need a full rewrite — the design intent (priority wins) is preserved.
Rollback plan: if uber-5k or uber-50k regresses post-cutover, revert the cutover commit. Sim goldens running on the reverted code are the immediate signal. Time-to-detect: one CI run (~10 min). No intermediate dual-path code to remove.
Open risks / known limits
Cold-start churn. Pure OCC at cold start sees maximal contention; conflict rate spikes; Phase 2 has unusual amounts of inversion work. Measured-but-not-disastrous in the cold-start flow example; worth watching in the first uber-50k run. Mitigation if it matters: stagger worker startup (1 worker for 50 ms, then add the rest) so the first batch of commits stabilizes the claimed- set before full concurrency.
Phase 2 work increases. With pure OCC, priority enforcement is reactive (at commit) rather than proactive (at ordering). Some fraction of cycles will see displacement events that Phase 2 then has to process for inversion resolution. Quantification: today’s Phase 2 cost is ~0.008 s/cycle (from scale-test bench measurement); the increase is bounded by the displacement-rate × per-inversion cost. Expected increase: ~10–50× the current rate, i.e. ~0.1–0.5 s/cycle. Still negligible vs cycle p99 budget. If displacement-rate is unexpectedly high, the priority promotion option from Q4 is the fallback.
Multi-shard rungs inherit the per-shard envelope, not a larger
one. uber-1m and uber-5m run additional shards (2 and 10
respectively) rather than larger shards — bigfleet.md §5 / §11
caps a shard’s machine count at the 500K ceiling. Per-shard
NeedsTable at every multi-shard rung stays at uber-500k’s ~776K.
The OCC envelope projected for uber-500k therefore carries through
unchanged across uber-1m and uber-5m. The risks that DO appear at
multi-shard scale are coordinator-side (cross-shard quota
arbitration, shortfall escalation rate) and out of scope for this
ADR.
Determinism loss. Sim goldens that asserted exact Action sequences need to relax to outcome-equivalence. This is a correctness evidence loss, not a correctness property loss — the system still produces correct outcomes deterministically up to commit ordering. Tests that depended on exact Action ordering for debugging are weakened.
Worker oversubscription on small machines. 16 Phase 1 workers
plus 1 Phase 3 worker = 17 goroutines on a 16-core dev box is
mildly oversubscribed. Acceptable for the M5 Max dev box; the
production shard hardware budget is per-shard and tunable. Worker
counts are configurable via the --phase1-workers flag.
PodDisruptionBudget (PDB) gap. Production K8s workloads carry
PDBs (minAvailable: 2, maxUnavailable: 25%) that bound
concurrent disruption. Today’s BigFleet preemption model is
“priority wins, full stop” — it does not consult PDBs anywhere
in the Phase 2 victim-scoring path or the coordinator’s cross-
shard preemption. This ADR’s commit-broker displacement adds
more displacement opportunities (every cycle’s broker can
displace, not just Phase 2) and therefore exacerbates the
existing gap. Not introduced by this ADR but worsened by it.
Surfaced during scale-test review. Fix is a separate follow-on ADR
that designs PDB-respecting preemption uniformly across Phase 1
commit-broker, Phase 2 victim scoring, and coordinator cross-
shard preemption.
Preemption cascades. When a high-priority Need displaces a lower-priority incumbent, the displaced incumbent re-queues and may itself displace another lower-priority incumbent. Under bursty high-priority arrivals, this creates chains of re-queues. The Phase 2 work-increase estimate above (~10–50× current rate) may undercount cascades; a single displacement can trigger N re-queues whose costs compound. Mitigation if observed: displacement-rate metric (already proposed in Measurement plan) flags it; the per-Need retry budget bounds the cascade depth (each re-queue decrements; eventually shortfalls).
Deploy-burst is the recurring cold-start event. The ADR’s
cold-start analysis treated cold-start as a once-per-boot event.
In production (per scale-test review) deploy-bursts — service
rollouts creating 50–500 Pods over 30 seconds, batched by canary
% — happen many times per day per cluster and produce the same
contention shape as a fresh boot (many Needs arriving at the
same buckets simultaneously). The cold-start projection
(~500ms–1s) should be read as “every-deploy-burst worst case,”
not “once at startup.” The churnPerMinute field in profiles
(currently 0.02) understates this; modeling deploy-bursts
explicitly is realistic-catalog follow-on work.
NUMA broker mutex contention at uber-500k+. Single mutex on
the commit broker is fine at uber-50k (~2,600 proposals/sec) but
on multi-socket production shard hardware the cross-NUMA cache-
coherence overhead (~100–500 ns per lock acquisition) becomes
non-trivial at uber-500k+ proposal rates. Monitor via the
bigfleet_shard_phase1_proposal_duration_seconds histogram;
the fallback is per-bucket-shard broker partitioning (rather
than a single broker), which is a localized change.
Realistic-catalog calibration. The Performance projections table was built against the pre-calibration catalog (388 Needs/cluster, sameRack-misclassified-as-gang dominated). The scale-test review recalibrated the catalog to industry production patterns (70% tiny long-tail, ~3% true gangs, ~42% spread-carrying); the re-baseline run is the immediate post-merge work. Expected: the corrected catalog reduces both per-Need cost (more independent small Needs) and conflict rate (fewer gang Needs hitting hot buckets simultaneously). Numbers will tighten favorably, not adversely.
Per-Need cost is tri-modal, not unimodal. Production data suggests three regimes:
- fast-absorb: Need matches existing Configured supply →
creditExistingSupplycovers it without Phase 1 work - fast-shortfall: bucket fully claimed → take returns nil immediately
- slow-score: real work — score loop walks unclaimed machines in the chosen bucket
Performance projections target the slow-score path, which dominates wall-clock under realistic workloads. The other two paths complete in microseconds regardless of OCC concurrency. Worth noting because cycle-time histograms will show the trimodality and shouldn’t be misinterpreted as bimodal cost.
Coordination cost with concurrent Phase 1 calls. Today’s Phase 1 is single-threaded per shard; there is no inter-cycle concurrency. This ADR keeps that property — only intra-cycle is concurrent. If a future ADR wants Phase 1 to run as a continuous loop (rather than discrete cycles), this design needs revisiting.
Future work
-
Incremental Phase 1 (delta-only processing). Track which Needs / inventory changed since the previous cycle; only those participate in the new cycle’s OCC dispatch. Reduces steady- state work toward the actual churn rate. Complementary to OCC; worth its own ADR after measuring steady-state churn on uber-50k+ under OCC.
-
ParSync partitioned synchronization. Layer staleness- bounded partition rotation on top of the per-bucket seqno conflict detection. Only becomes interesting if a future ADR raises the per-shard machine ceiling above 500K — at today’s cap, BigFleet scales horizontally via sharding instead. References: [3, §3].
-
Multi-shard OCC. Cross-shard coordination is currently bounded by the coordinator’s quota RPC. If shortfall escalation becomes a hot path under realistic catalog at scale, a future ADR may revisit cross-shard locking primitives. Out of scope here.
-
Adaptive worker counts. Today’s worker count is static. ParSync’s “adaptive strategy” [3, §4] adjusts per cycle based on observed conflict rate; worth considering once we have measured data on conflict-rate variability across workloads.
-
Scheduler-kind partition (Omega-style easy/picky split). Discussed and rejected in Alternatives considered on uniform- per-Need-cost grounds. If production telemetry surfaces a bimodal cost distribution (e.g., gang Needs become significantly more common and significantly more expensive than the median), splitting into per-kind worker pools — with mode-specific retry budgets and conflict-rate metrics — is the next architectural step. The current single-pool design is forward-compatible: workers already carry a per-Need mode flag, so adding a queue per mode is a localized change.
Alternatives considered
Band-serial / OCC-within-band. Preserves the existing
implementation’s implicit strict-pass invariant most faithfully.
Rejected because the close reading in Context showed strict-pass
is not actually required by bigfleet.md — “priority is the sole
throttling mechanism” (§16) is about how throttling decides, not
when in the cycle. The serial barriers per band cost intra-cycle concurrency
for no contractual gain.
Replica-only concurrency (hash(cluster_id) sharding). Multiple identical workers each handling a hash partition of Needs. Rejected per [1, §5.1] and [3, §3]: this inherits Omega’s sub-linear ~3× scaling ceiling and doesn’t address head-of-line blocking when one partition’s work runs long. The papers explicitly warned against this.
Omega-style scheduler-kind partition (easy / picky). Separate worker pools per Need shape, modelled on Omega’s batch/service split. Rejected on workload grounds: the scale-test bench measured uniform per-Need cost across the realistic catalog, not the bimodal distribution that motivates Omega’s split. Queue isolation between kinds doesn’t reduce conflict rate (kinds share the same inventory) — it only protects against head-of-line blocking that we haven’t measured. The Omega-faithful transaction- semantics distinction (incremental vs all-or-nothing) is preserved as a per-proposal mode at the broker, which captures the asymmetric cost without splitting goroutine pools. Add scheduler kinds back as a follow-on ADR if a bimodal cost distribution emerges in production telemetry.
Continued constant-factor optimization of single-threaded Phase 1. Three attempts already documented in ADR-0028 empirical addendum; all failed because bucket count ≈ Need count under realistic catalog kills constant-factor levers. Empirically rejected.
Drop realistic catalog (tune the workload to the implementation). Rejected in ADR-0028 §Alternatives. The realistic catalog is what makes BigFleet’s testing meaningful for production deployments; tuning it to make our scheduler look fast would be cherry-picking.
Centralized scheduler via etcd transactions (K8s scheduler
analog). Considered briefly. Rejected because (a) K8s
scheduler is centralized and single-instance for a reason — its
workload is per-Pod, not per-aggregate, and the OCC retry rate
is bounded by Pod arrival rate, not by Need × inventory; (b)
BigFleet has no etcd dependency on the hot path
(bigfleet.md §16: “no distributed locking on hot path”); etcd
would violate that
even if the design were otherwise comparable.
References
- Schwarzkopf, Konwinski, Abd-El-Malek, Wilkes. Omega: flexible, scalable schedulers for large compute clusters. EuroSys 2013. https://people.csail.mit.edu/malte/pub/papers/2013-eurosys-omega.pdf
- Burns, Grant, Oppenheimer, Brewer, Wilkes. Borg, Omega, and Kubernetes. ACM Queue 14(1), 2016. https://queue.acm.org/detail.cfm?id=2898444
- Feng, Lu, Bao, Wang, Lin, Wang, Li, Li. Scaling Large Production Clusters with Partitioned Synchronization. USENIX ATC 2021 (best paper). https://www.usenix.org/conference/atc21/presentation/feng-yihui