ADR-0024: Co-location via podAffinity — the `CoLocation` CR field, roll-up aggregates
Status: Accepted
Date: 2026-05-14
Context
The papers are explicit that the roll-up is small and bounded:
- `fleet-scale-kubernetes.md` §6.1: “One CR per pod. Roll-up aggregates.”
- `fleet-scale-kubernetes.md` §11: “Roll-up message ~2KB regardless of fleet size.”
- `bigfleet.md` §3.1 / §7: roll-ups are full-replacement; the operator translates CRDs to protobuf, “adding `Same` requirements where co-location is needed” (§8).
In practice the roll-up was O(unschedulable-pods), not ~2KB. The devpod-5k validation on Uber infrastructure (bigfleet-uber issues #3/#4) showed the shard’s NeedsTable frozen at ~49K Needs while 250K+ CapacityRequest CRs were outstanding — Phase 1 correctly emitting almost nothing because the demand never arrived.
Two compounding causes:
- The shard’s gRPC server inherited the library’s 4 MiB `MaxRecvMsgSize` default. A full-replacement roll-up that crossed 4 MiB made the shard’s `Recv` return `ResourceExhausted` and tore the `Shard.Session` stream down; the operator reconnected, re-sent the same oversized roll-up, and failed again: a silent reconnect loop. `OperatorRollupDuration` times the build, not the send, so the failure was invisible. Fixed in commit `e186631` (256 MiB ceiling on all servers and clients via `pkg/grpcutil`), a safety net so that a large roll-up degrades into a slow cycle, never a broken session.
- The roll-up never aggregated. `pkg/operator/rollup.go`’s `coLocationGroup()` read `cr.OwnerReferences[0].UID`, and the unschedulable-pod-controller (UPC) sets that ownerRef to the Pod itself (correctly: it is the CR’s GC owner). So every CR landed in a unique co-location group, `needs.Aggregate` keyed on `(cluster, fingerprint, group)` merged nothing, and N unschedulable pods produced N `Count=1` Needs. A 50K-replica Deployment would emit 50K separate Needs instead of one with `count=50000`. This is the cause that matters: the gRPC limit bump alone just moves the ceiling.
The coLocationGroup doc-comment claimed “conventionally the first ownerRef is the workload” — but the UPC’s actual behaviour makes that false. The co-location-by-owner mechanism was effectively dead: every pod was its own group, so Same was never meaningfully emitted, and aggregation never happened.
Decision
Derive co-location from the pod’s podAffinity, carried on the CR as a structured CoLocation field, translated to Same by the operator at roll-up.
podAffinity is the native Kubernetes way a workload declares “schedule me with my peers”. Using it means zero user-facing change — no BigFleet-specific annotation a workload owner must add to adopt the autoscaler. It is exactly the “co-location signal” the paper’s §8 refers to.
Concretely:
- New CRD field: `CapacityRequestSpec.CoLocation *CoLocationTerm`. Symmetric to the existing `TopologySpread` field (spread is the dual of co-location). `CoLocationTerm` is `{LabelSelector *metav1.LabelSelector, TopologyKey string}`: the autoscaler-relevant projection of one required `podAffinity` term. Optional; absent means “no co-location”.

  The CRD continues to use only standard node-selector operators (`In`/`NotIn`/`Exists`/`DoesNotExist`). `CoLocation` is structured intent, not the `Same` operator; `Same` stays protobuf-only, emitted by the operator at roll-up. This keeps the hard rule intact while making the contract explicit, so non-UPC CR sources (Kueue, custom controllers; the UPC is optional per `fleet-scale-kubernetes.md` §12) can express co-location too.
- UPC translates `podAffinity` → `CoLocation`. `buildCRForPod` reads `pod.Spec.Affinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution[0]` and maps its `LabelSelector` + `TopologyKey` onto `Spec.CoLocation`. No required `podAffinity` → `CoLocation` nil. The CR’s ownerRef stays the Pod (GC contract unchanged).
- Operator derives the aggregation group and the `Same` topology key from `CoLocation`. `coLocationGroup(cr)` returns a canonical serialization of `Spec.CoLocation` (sorted label selector + topology key), or `""` when `CoLocation` is nil. When non-empty, the operator appends a `Same` requirement keyed on the term’s own `TopologyKey`.

  This retires the operator-global `Config.CoLocationKey`: the topology granularity is a per-workload property of the affinity term, not an operator-wide constant.
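One way the canonical group key and the `Same` requirement could be derived is sketched below. It is an assumption-laden illustration, not the code in `pkg/operator/rollup.go`: `labelSelector` is a simplified stand-in for `metav1.LabelSelector` (match labels only, no match expressions), `requirement` stands in for the protobuf `NodeSelectorRequirement`, the `withSame` helper and the `k=v,...|topologyKey` encoding are invented here, and the function takes the term directly instead of the CR.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelSelector is a simplified stand-in for metav1.LabelSelector
// (MatchLabels only; MatchExpressions are left out to keep the sketch short).
type labelSelector struct {
	MatchLabels map[string]string
}

// coLocationTerm mirrors the {LabelSelector, TopologyKey} shape from the ADR.
type coLocationTerm struct {
	LabelSelector *labelSelector
	TopologyKey   string
}

// requirement is a hypothetical stand-in for the protobuf requirement type.
type requirement struct {
	Key      string
	Operator string
}

// coLocationGroup returns a deterministic key for a term, or "" for nil.
// The "k=v,k=v|topologyKey" encoding is an assumption made for this sketch.
func coLocationGroup(t *coLocationTerm) string {
	if t == nil {
		return ""
	}
	var pairs []string
	if t.LabelSelector != nil {
		for k, v := range t.LabelSelector.MatchLabels {
			pairs = append(pairs, k+"="+v)
		}
	}
	// Map iteration order is unspecified in Go, so sort for a stable key.
	sort.Strings(pairs)
	return strings.Join(pairs, ",") + "|" + t.TopologyKey
}

// withSame appends a Same requirement keyed on the term's own topology key.
// A nil term leaves the requirements untouched.
func withSame(reqs []requirement, t *coLocationTerm) []requirement {
	if t == nil {
		return reqs
	}
	return append(reqs, requirement{Key: t.TopologyKey, Operator: "Same"})
}

func main() {
	term := &coLocationTerm{
		LabelSelector: &labelSelector{
			MatchLabels: map[string]string{"tier": "gpu", "app": "trainer"},
		},
		TopologyKey: "topology.kubernetes.io/zone",
	}
	fmt.Println(coLocationGroup(term)) // app=trainer,tier=gpu|topology.kubernetes.io/zone
	fmt.Printf("%q\n", coLocationGroup(nil))
	fmt.Println(withSame(nil, term))
}
```

Sorting matters because two CRs from the same workload must serialize to the same string regardless of map iteration order; otherwise they would land in different groups and fail to aggregate.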
Resulting behaviour
- Pods with no `podAffinity` (the overwhelming common case: plain Deployments, Jobs, bare pods) → empty group → CRs aggregate purely by profile fingerprint → ~2KB roll-up, as the paper requires.
- Pods of one co-located workload (same `podAffinity` term) → same group → aggregate into one Need and carry a `Same` on the term’s topology key.
- Two independent workloads with the same profile shape but different affinity selectors → different groups → stay as separate Needs, each co-located onto its own domain. They are never merged onto one rack.
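The three outcomes listed above can be checked with a toy roll-up. This is a sketch under stated assumptions: the `cr` and `need` types, the `rollUp` function, and the group strings are hypothetical, and the cluster dimension of the real aggregation key is omitted because every CR here belongs to one cluster.

```go
package main

import (
	"fmt"
	"sort"
)

// cr is a hypothetical, flattened view of a CapacityRequest: its profile
// fingerprint and its co-location group ("" when CoLocation is nil).
type cr struct {
	Fingerprint string
	Group       string
}

// need is a hypothetical aggregated Need. Same records whether the operator
// would attach a Same requirement to it.
type need struct {
	Fingerprint string
	Group       string
	Count       int
	Same        bool
}

// rollUp aggregates CRs by (fingerprint, group) and returns the Needs in a
// deterministic order.
func rollUp(crs []cr) []need {
	type key struct{ fp, group string }
	counts := make(map[key]int)
	for _, c := range crs {
		counts[key{c.Fingerprint, c.Group}]++
	}
	needs := make([]need, 0, len(counts))
	for k, n := range counts {
		needs = append(needs, need{
			Fingerprint: k.fp,
			Group:       k.group,
			Count:       n,
			Same:        k.group != "", // Same only where co-location is declared
		})
	}
	sort.Slice(needs, func(i, j int) bool {
		if needs[i].Fingerprint != needs[j].Fingerprint {
			return needs[i].Fingerprint < needs[j].Fingerprint
		}
		return needs[i].Group < needs[j].Group
	})
	return needs
}

func main() {
	crs := []cr{
		{"4cpu-8gi", ""}, {"4cpu-8gi", ""}, {"4cpu-8gi", ""}, // plain pods
		{"4cpu-8gi", "app=a|rack"}, {"4cpu-8gi", "app=a|rack"}, // workload A
		{"4cpu-8gi", "app=b|rack"}, {"4cpu-8gi", "app=b|rack"}, // workload B
	}
	for _, n := range rollUp(crs) {
		fmt.Printf("group=%q count=%d same=%v\n", n.Group, n.Count, n.Same)
	}
}
```

Seven CRs with one profile shape produce three Needs: one for the plain pods with no `Same`, and one per co-located workload, each carrying `Same`.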
Scope
- Required `podAffinity` only. `preferredDuringScheduling…` (soft affinity) is advisory and out of scope.
- First required term only, matching the existing `requirementsFromPod` treatment of node affinity (“multiple terms / OR semantics are out of scope for v1”).
- `podAntiAffinity` is out of scope. Anti-affinity is the dual of co-location, conceptually closer to `TopologySpread`, and is not addressed here.
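The scope rules above amount to a short projection from a pod's affinity to an optional co-location term. The sketch below uses local stand-in types that loosely follow the shape of the Kubernetes `Affinity`, `PodAffinity`, and `PodAffinityTerm` structs, with shortened field names and match labels only. `coLocationFor` is a hypothetical name, not the real `buildCRForPod`, and `example.com/rack` is a made-up topology label.

```go
package main

import "fmt"

// podAffinityTerm is a stand-in for a pod affinity term (match labels only).
type podAffinityTerm struct {
	MatchLabels map[string]string
	TopologyKey string
}

// weightedTerm is a stand-in for a preferred (soft) affinity term.
type weightedTerm struct {
	Weight int32
	Term   podAffinityTerm
}

// podAffinity holds required (hard) and preferred (soft) terms.
type podAffinity struct {
	Required  []podAffinityTerm
	Preferred []weightedTerm
}

// affinity is a stand-in for the pod's affinity block.
type affinity struct {
	PodAffinity     *podAffinity
	PodAntiAffinity *podAffinity
}

// coLocationTerm is the projection carried on the CR.
type coLocationTerm struct {
	MatchLabels map[string]string
	TopologyKey string
}

// coLocationFor applies the scope rules: only required podAffinity counts,
// only its first term is used, and anti-affinity is never consulted.
func coLocationFor(a *affinity) *coLocationTerm {
	if a == nil || a.PodAffinity == nil || len(a.PodAffinity.Required) == 0 {
		return nil
	}
	first := a.PodAffinity.Required[0]
	return &coLocationTerm{MatchLabels: first.MatchLabels, TopologyKey: first.TopologyKey}
}

func main() {
	required := &affinity{PodAffinity: &podAffinity{Required: []podAffinityTerm{
		{MatchLabels: map[string]string{"app": "trainer"}, TopologyKey: "example.com/rack"},
		// Ignored in this sketch: only the first required term is projected.
		{MatchLabels: map[string]string{"app": "cache"}, TopologyKey: "topology.kubernetes.io/zone"},
	}}}
	softOnly := &affinity{PodAffinity: &podAffinity{Preferred: []weightedTerm{
		{Weight: 50, Term: podAffinityTerm{TopologyKey: "example.com/rack"}},
	}}}
	antiOnly := &affinity{PodAntiAffinity: &podAffinity{Required: []podAffinityTerm{
		{TopologyKey: "example.com/rack"},
	}}}

	fmt.Println(coLocationFor(required).TopologyKey)
	fmt.Println(coLocationFor(softOnly) == nil, coLocationFor(antiOnly) == nil, coLocationFor(nil) == nil)
}
```

Returning nil for every out-of-scope case keeps those pods in the empty group, so they aggregate by profile fingerprint alone.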
Consequences
What we gain
- The roll-up aggregates, and is ~2KB regardless of fleet size — the paper’s stated property, now actually true.
- Zero user-facing change. Workloads that already use `podAffinity` get co-location for free; workloads that don’t are unaffected.
- The co-location contract is explicit on the CRD, usable by any CR producer, not just the UPC.
- `Same` is emitted exactly where co-location is declared, not as a side effect of having a controller and not via a magic annotation. A plain 50K-replica Deployment is no longer at risk of being forced onto one topology domain.
What we lose / cost
- A CRD schema regeneration (`make generate`) and a deepcopy update.
- The operator-global `CoLocationKey` config retires: removed from the operator `Config`, the `cmd/operator` flags, the scaletest chart, and the profiles that set it.
- The scaletest harness’s `sameRack` archetype switches from a synthetic `ScaletestWorkload` ownerRef on the pod to a real `podAffinity` term, which is a more faithful model of how production workloads express co-location anyway.
What stays the same
- BigFleet’s shard / coordinator / decision engine: unchanged. The shard receives already-aggregated Needs with `Same` baked into profile requirements, exactly as before.
- The wire format: unchanged. `Need.Group` remains in-memory operator state; `Same` continues to travel as a `NodeSelectorRequirement`.
- The CR’s GC contract: the CR is still owned by its Pod.
Alternatives considered
- A `bigfleet.lucy.sh/co-location-group` annotation users set on pods. Rejected: it forces workload owners to change their manifests to adopt BigFleet, which is backwards from the paper’s “operator translates existing CRD signals” model. `podAffinity` is the signal users already write.
- Co-location group = the pod’s controlling workload UID (`GetControllerOf`). Rejected: it would stamp `Same` on every controller-managed workload, over-constraining placement for workloads that never asked for co-location; and decoupling the aggregation key from the `Same` trigger to avoid that just reproduces this ADR’s design.
- Aggregate by profile fingerprint only, drop co-location grouping. Rejected as a destination: it guarantees ~2KB but silently removes a paper §8 feature and the harness profiles that exercise `Same`.
- The gRPC limit bump alone (`e186631`). Necessary but insufficient: it only moves the ceiling. Kept as the companion safety net.
References
- `fleet-scale-kubernetes.md` §6.1, §7, §8, §11, §12: one CR per pod, roll-up aggregates, ~2KB, `Same` translation, the UPC is optional.
- `bigfleet.md` §3.1, §8: full-replacement roll-ups, co-location.
- ADR-0022 (`Need.Count` is Pod count, BigFleet diffs aggregates): the aggregation model this ADR makes actually hold.
- Commit `e186631`: the 256 MiB gRPC message ceiling (`pkg/grpcutil`).
- `bigfleet-uber` issues #3 and #4 (private): the empirical data (NeedsTable frozen at ~49K against 250K+ CRs).