The NeedsTable: profiles, penalty bucketing, and Same folding

The NeedsTable is the shard’s demand side — pkg/needs/. It holds, per cluster, the aggregated capacity demand the decision engine diffs against inventory each cycle. Two operations matter on the hot path: Replace(clusterID, needs), which swaps one cluster’s entire contribution atomically (every roll-up is the cluster’s complete desired state — never a merge), and Snapshot(), a priority-sorted slice the worker walks top-down. This doc is the why behind the table’s shape: why demand is a constrained aggregate resource request rather than a pod count, why the aggregation key (Profile) buckets penalties to powers of two, why full replacement is the correctness substrate the whole convergence loop rests on, and how sub-machine Same-gangs are folded away so the machine-granular claim ledger stays truthful. It assumes you’ve read docs/architecture.md and the decision-engine doc; it does not re-explain the three phases or the OCC claim ledger.

What a Need is, and what it is not

A Need (needs.go:387) is one row: one cluster’s aggregate demand for one Profile. Its payload is two resource vectors and an identity:

AggregateResources — the vector sum of the per-replica requests of every CapacityRequest that aggregated into this Need. The total CPU/memory/GPU/ephemeral the cluster needs satisfied within this constraint set.
MinUnit — the largest single atomic schedulable unit among them; the indivisibility floor every provisioned machine must individually be able to host. The only residue of “pod shape” that survives aggregation.
Profile — the aggregation identity (below). Pure key, no payload.
Group, AcquisitionParked, ArrivalUnixNanos — co-location identity, shard-local parking state, and the age clock; covered later.

What is conspicuously absent is a count. There is no “N pods of shape X” anywhere on this row, because machine count is BigFleet’s output, never the cluster’s input. This is the load-bearing decision of ADR-0027, and it is worth understanding the two-step history that produced it.

ADR-0022 first corrected a narrower drift: the implementation had treated Need.Count as a machine count, emitting one Bootstrap per pod (phase1_assign.go did deficit := n.Count - fromSupply and emitted deficit machines). The papers say one CR per pod, roll-up aggregates, and BigFleet “diffs aggregates against provisioned inventory” — so Count had to mean pod count, and Phase 1 had to compute machine count from a resource-space diff (Profile.Resources × Count against Σ Machine.Allocatable). That fixed “many pods, one Profile.”

ADR-0027 then removed count entirely. The (Profile, Count) row structure forced one row per distinct pod-shape, and per-shape rows recreate exactly the per-fingerprint supply partition that breaks under shared In-eligibility: kube-scheduler places any pod whose requirements match a node, so one machine provisioned for fingerprint X also absorbs Y, Z, W sharing an instance-type-family In list — but the supply accounting credited each fingerprint as if its machines were dedicated, phantom-multiplying supply, masking real deficits (shortfalls=0 while pods were genuinely stuck), and breaking the convergence loop. The fix was structural: demand becomes a constrained aggregate resource request — a resource envelope (AggregateResources) plus the constraints on machines that satisfy it (Profile.requirements) plus the indivisibility floor (MinUnit). Phase 1 sums Σ Machine.Allocatable over matching machines once per machine, no per-fingerprint partition, and the deficit is real again. The per-replica pod shape leaves the wire entirely; it survives only as MinUnit.

This is also what makes the three-layer separation clean (ADR-0027): kube-scheduler speaks pods→nodes, the operator translates workloads→constrained aggregate demand, BigFleet turns demand→cost-optimal procurement. BigFleet never sees a pod. The operator absorbs the translation (pkg/operator/rollup.go:151: one CR = one pod’s worth, both its AggregateResources and its MinUnit; needs.Aggregate then sums the former and maxes the latter), so the BigFleet contract is durable across workload primitives — Pods today, Jobs/VMs/the Workload API next.

Profile: the aggregation key

Profile (needs.go:239) is pure identity — the bundle of fields that make two CapacityRequests collapse into one Need. Per ADR-0027 it carries no resource shape; demand lives in the Need’s vectors, not the key. The identity is:

(requirements, spread, priority, interruption-penalty-bucket, reclamation-penalty-bucket)

NewProfile (needs.go:250) canonicalises on construction — requirement values are sorted, requirements are sorted by (key, operator), spread is sorted by topology key — then computes a string fingerprint once and caches it (computeFingerprint, needs.go:324). Two semantically-equal Profiles therefore produce byte-identical fingerprints (needs_test.go:35), and map-keyed aggregation never re-walks the slices. RequirementsRO() (needs.go:309) exists because the defensive-copy in Requirements() showed up in the profile: at ~450K Phase 3 candidates × one alloc/call the GC pressure visibly walked the cycle, so hot match paths take the no-copy accessor under a no-mutation contract.

Note both penalty axes are in the key, and both are distinct penalties (paper §16): interruption_penalty is the cost of interrupting the workload (it feeds effective_cost and victim scoring); reclamation_penalty is the operational value tied to the specific machine–workload pairing (it feeds the idle tiebreak, victim scoring, and Phase 3 release). They are not the same field, not derivable from priority, and there is no operational_value field. Both are load-bearing in the key (ADR-0027): because a CapacityNeed is homogeneous in both buckets, every machine provisioned for it inherits an unambiguous stamped penalty pair for downstream phases to read. Drop either axis and the stamp becomes arbitrary.

Penalty bucketing: powers of two, plus a Pinned sentinel

Raw penalties are workload-specific dollar amounts. If they participated in the aggregation key raw, every workload would defeat aggregation — the roll-up would explode to one Need per distinct penalty. So penalties are bucketed before they reach the Profile (plan §0.1). PenaltyBucket (needs.go:30) mirrors the proto enum PenaltyBucket (api/proto/bigfleet/v1alpha1/capacity.proto:136) one-for-one, with matching numeric ordering so range checks behave identically on both sides of the wire. The ladder:

PenaltyBucketZero — $0.
PenaltyBucketHalfDollar — (0, $0.50].
PenaltyBucket1 … PenaltyBucket8388608 — powers of two from $1 up to $8,388,608 (≈ $8.4M); each bucket’s upper bound is 2^(bucket − PenaltyBucket1).
PenaltyBucketPinned — anything > $8,388,608: treated as effectively infinite / non-interruptible.

BucketForDollars(dollars) (needs.go:159) returns the smallest bucket whose upper bound is ≥ the raw value — $0.51 → PenaltyBucket1, $1.01 → PenaltyBucket2, 8000 → PenaltyBucket8192, 20M → PenaltyBucketPinned (needs_test.go:11). The operator calls it on each CR’s declared penalties at roll-up time (rollup.go:318), so the raw dollars never reach the Profile and CRs are aggregated at bucket boundaries: two workloads whose penalties land in the same bucket collapse into one Need even though their exact dollar figures differ.

The inverse, UpperBoundDollars() (needs.go:133), turns a bucket back into a number for cost math, returning the bucket’s upper bound — a deliberate over-estimate. PenaltyBucketPinned returns +Inf so any comparison against it picks the alternative. The conservative direction is intentional: overestimating the cost of interruption biases the engine toward more-expensive but safer capacity. Label() (needs.go:67) gives a stable Prometheus-label string so FinOps penalty-bucket breakdowns stay readable in Grafana without leaking the uint8 (M25).

The bucketing also feeds the CRD surface: CapacityRequests are aggregated at bucket boundaries and the bucket is published on the CRD (plan §0.1), so the coordinator’s per-bucket bookkeeping operates on ~27 discrete classes per penalty axis, not a continuum of raw dollars.

Aggregation and the resource-vector arithmetic

Aggregate(in) (needs.go:404) is the shared roll-up primitive. It groups by (cluster, profile-fingerprint, group): AggregateResources sum per group, MinUnit takes the per-group max, earliest non-zero ArrivalUnixNanos wins (so age math stays accurate). CRs from the same workload (same Group) collapse; CRs from different workloads stay separate even when their fingerprints match. The operator folds raw per-CR observations into wire-level Needs with it (rollup.go:179); tests use it directly (needs_test.go:86).

The vector ops live in resources.go and exist because demand is a constrained aggregate resource request (ADR-0027) — the hot path has to add, max, sub, and compare resource vectors. AddResources sums per dimension; MaxResources folds atomic units into MinUnit; SubResources is the saturating a − b (negative dimensions clamp to zero) that Phase 1 uses to turn demand-minus-supply into a deficit; Covers and IsZero answer the “is this Need satisfied” question; ScaleResources turns a per-replica shape × replica count into an aggregate. All results are canonical (sorted by name) so they round-trip through Need construction stably; a dimension absent from a vector reads back as zero; quantity strings that don’t parse degrade to zero (the operator canonicalises at CR-aggregation time, so malformed values aren’t expected on the hot path). These are thin wrappers over k8s.io/apimachinery/pkg/api/resource.Quantity, which is why the demand side speaks Kubernetes quantity strings end-to-end.

Full replacement: why it is the correctness substrate

Replace(cluster, n) (needs.go:449) swaps a cluster’s entire contribution under the table lock. An empty or nil slice deletes the cluster’s key — withdrawal is just an empty roll-up. There is no merge, no diff, no partial update path. This is not an optimisation; it is the property the entire convergence loop depends on.

Each ClusterCapacityNeeds message is the cluster’s complete desired state (paper §6). Because of that, BigFleet’s internal supply accounting can never silently override the cluster’s ground truth: if pods are stuck, the next full-replacement roll-up still lists them, and the only way to make them disappear is to actually satisfy them. The ADR-0027 bug is the cautionary tale of what happens when internal math is allowed to mask the roll-up — shortfalls=0 while pods were genuinely stuck, because phantom supply credit zeroed the deficit and the convergence loop broke. Full replacement is what guarantees the deficit, once the supply math is correct, re-asserts itself every cycle until it is genuinely closed. It is also what makes withdrawal trivially correct: a workload that scales to zero simply stops appearing, and the table forgets it on the next replace.

The shard’s ApplyRollup (shard.go:429) is the single ingest point that calls Replace, and when Config.EmptyRollupGuard is set it wraps the call with a quarantine guard (ADR-0046): a full-replacement roll-up that would erase most of a cluster’s previously accepted demand is held for a few consecutive reports before being applied — the previous demand stays active until the drop is confirmed — so a single corrupt or truncated roll-up can’t trigger a fleet-wide reclaim storm. The guard never merges; it only delays applying a suspicious full replacement. Full-replacement semantics are preserved; only the timing of a destructive replacement is gated. (A held roll-up still clears the ADR-0036 first-rollup Phase 3 gate — the operator did report.)

Snapshot: priority-sorted, arrival-tiebroken, cached

Snapshot() (needs.go:473) flattens byCluster into one slice sorted by priority descending, then arrival ascending (older first), then cluster ID (needs_test.go:172). Priority-descending is the paper’s §8 walk order — the engine assigns the most important demand first; arrival-ascending within a priority is the fairness tiebreak (a Need that has waited longer wins); cluster ID is the final determinism tiebreak so the order is reproducible across runs (needs_test.go:257). Priority is the sole throttle — there is no quota or admission gate ahead of this sort.

The snapshot is cached and invalidated by a dirty flag: a clean read returns a copy of the cache without re-sorting; the first read after any Replace rebuilds. The rebuild sorts an []uint32 permutation and gathers, not the []Need structs themselves — because the cloud profile of uber/scaleway-50k caught this function burning ~25% of all shard CPU (ADR-0019 / M44.4): a reflection-based sort.SliceStable over 50K Need structs (~128 B each, with embedded slices) drove typedmemmove to dominate. Sorting indices moves 4 B per swap instead of ~128 B; BenchmarkTableSnapshot_50K (snapshot_bench_test.go:19) is the regression guard. Every returned slice is an independent copy the caller may mutate freely (needs_test.go:209); Replace/Snapshot/Stats are all lock-guarded and race-clean (needs_test.go:234).

Same: cross-machine topology, and the sub-machine fold

The Same operator (OperatorSame, needs.go:189) is co-location. The operator appends it to a Profile when a CR declares co-location (rollup.go:163, withSameRequirement), projecting the source pod’s required podAffinity onto the term’s topology key. It is protobuf-only — the CRD uses the standard operators (In/NotIn/Exists/DoesNotExist), and the operator translates the co-location signal to Same during roll-up (a BigFleet protocol rule). Crucially, a Same request that cannot be satisfied within a shard becomes a shortfall; it is never resolved cross-shard (paper §16). Topology constraints do not cross shard boundaries.

Same interacts with the rest of the table through two ADRs:

ADR-0040 — unified domain attribution. A Same requirement means “served by one topology domain or not at all.” The bug it fixed was that acquisition was strict about this (bucket candidates by the Same key’s value, take the single best bucket) while the supply-crediting sites — Phase 1’s pre-pass and Phase 3’s claim walk — used bare MatchProfile, which is per-machine and vacuous for Same (pkg/decision/match.go: “group-wide co-location is enforced by the autoscaler when picking nodes — not at this match step”). So crediting counted supply across domains while acquisition demanded it within one. The two phases never even disagreed about the same machines: Phase 1 saw scattered per-domain supply as unsatisfiable and kept bootstrapping toward gangs it could never finish; Phase 3 saw the same Needs satisfied cross-rack and reclaimed the scatter — a self-sustaining Bootstrap≈Reclaim equilibrium. The fix makes every supply-crediting site domain-aware, and (the Addendum) chooses the Same domain once per Need per cycle, jointly over creditable and acquirable supply, confining both credit and acquisition to it so Phase 1 can’t assemble a cross-domain group that Phase 3 then tears apart. This is what state.SameDomainFor(n) and ChooseSameBucket enforce on the decision side (occ/cycle.go:316); the NeedsTable’s contribution is simply that the Same requirement is part of the Profile fingerprint, so each co-location group is its own Need with its own domain coherence.

ADR-0041 — sub-machine Same-Needs fold into atomic aggregates. Making each co-location group its own Need (ADR-0024) and giving every pod a CR (ADR-0039) produced ~2,400 gang Needs at uber-5k — and at density 10–100 a gang of 3–8 pods is a fraction of one machine. The machine-granular claim ledger is exclusive (it claims whole machines), so each sub-machine gang up-rounded to a whole exclusive machine: ~2,400 gangs demanding ~2,400 machines where ~540 exist, while kube-scheduler happily packs many gangs per machine. The signature was bind 97.5% but 78% of gangs reported unsatisfiable, Configured inflated +48%. The insight: a gang whose entire aggregate fits on one machine does not need Same at all — any single machine with room hosts it, so co-residency is automatic. Same is for the genuinely cross-machine case: a gang too large for any machine, whose machines must share a domain.

So the shard runs decision.NormalizeDemand(snap, demand) (normalize.go:61) once per cycle, before the phases, on the same snapshot all three will consume. A Same-carrying Need is foldable iff at least one matching machine in the snapshot (the cluster’s Configured/Configuring, or shard-wide Idle/Speculative) has allocatable covering the Need’s whole AggregateResources. Foldable Needs of the same (cluster, stripped-Profile fingerprint, aggregate value) fold into one plain Need: the Same requirement is stripped, AggregateResources summed, and MinUnit becomes one gang’s aggregate — the Fleet-Scale Kubernetes §7 atomicity floor, so the vector math counts only machines that can host a whole gang. Needs that fit no machine keep their per-gang Same Need. The fold:

restores the claim ledger’s exclusivity assumption — folded aggregates consume many whole machines, surviving Same-Needs genuinely consume whole machines, no broker change;
collapses cardinality from O(co-location groups) to O(fingerprints × gang sizes), deflating cycle cost and most of the roll-up-size tension against the paper’s §11 fixed-size claim;
is conservative under fragmentation (counts only whole-gang-fitting machines, so BigFleet may slightly over-provision but never overstates feasibility) and snapshot-dependent (recomputed per cycle; if the largest matching machines vanish, the class reverts to per-gang Same-Needs). Foldability is memoised per class so the pool walks happen once per class, never once per gang (normalize.go:159).

NormalizeDemand has a fast path: a cycle with no Same-carrying Need returns the input untouched, no copy (normalize.go:62).

Group, AcquisitionParked, and arrival

Three Need fields exist for the co-location and aging machinery without being part of the Profile fingerprint:

Group (needs.go:393) — an opaque per-Need co-location bucket. Two Needs sharing (Cluster, fingerprint) but differing in Group stay separate in Aggregate, so each preserves its own Same co-location requirement. It crosses the wire as CapacityNeed.group (ADR-0042 Addendum), populated by the operator from CR co-location terms, giving the Phase 1 allocator independent per-gang identity and the shard stable per-gang bookkeeping. Empty Group means “no co-location group” and aggregates with other empty-Group Needs sharing the fingerprint — the common, ~2KB-roll-up case.
AcquisitionParked (needs.go:394) — shard-local, never on the wire (ADR-0042 Addendum). The shard stamps it on Same-Needs whose class has been Phase 1-unsatisfiable for the parking threshold; every supply site then honours it identically (domain choice goes creditable-only, acquisition is skipped in both Phase 1 and Phase 3) so the parked Need keeps its concentrated partial assembly and stops driving Bootstrap/Reclaim churn.
ArrivalUnixNanos — the age clock the Snapshot sort tiebreaks on and the shortfall buffer ages. There is no standalone shortfall package: the buffer and aging live in pkg/shard, the deficit is derived in pkg/decision, and the unsatisfiable Same residual lands in that aged buffer (paper §16/§9).