BigFleet

Many Kubernetes clusters. One fleet. One decision engine.

Scale testing supported by Uber

What it is

BigFleet is the piece of software the cluster autoscaler stops being when you outgrow one cluster. It runs once for an entire Kubernetes fleet. Each cluster reports the capacity it needs in a small standard contract — three CRDs and one protobuf message. BigFleet decides which physical or virtual machines should serve which cluster, and provisions, reclaims, and rebalances them across the fleet.

Inside each cluster, kube-scheduler still places pods. BigFleet doesn’t touch that. It’s the tier above: it makes sure the right machines exist to be scheduled on, anywhere they’re needed.

BigFleet sits one tier above per-cluster autoscalers: each Kubernetes cluster runs an operator that reports capacity needs to BigFleet, which provisions from a fleet-wide pool of bare metal, cloud, and spot machines.

What changes for you

  • Capacity is a fleet-wide pool, not a per-cluster allocation. A machine idle in one cluster serves an unschedulable workload in another (both bound to the same shard) — CPU, memory, or a scarce GPU rack alike — so the hardware goes to the team that needs it, not the team that owns the cluster.
  • Costs are comparable across capacity classes. Bare metal, reserved, on-demand, and spot are scored apples-to-apples by price + interruption_probability × interruption_penalty. The autoscaler picks the lowest-cost option — by compute and interruption risk — among the machines that satisfy the request’s requirements and interruption tolerance.
  • Clusters become disposable. No more “GPU cluster” / “batch cluster” snowflakes; every cluster looks identical to BigFleet, and to your platform team’s runbooks.
  • The autoscaler is not on the critical path. Static stability is a hard rule: clusters keep running with BigFleet entirely down. Running pods, kubelet, kube-scheduler are unaffected. Only new provisioning pauses.
  • Integration is small. An operator pod runs alongside each cluster. Three CRDs (CapacityRequest, UpcomingNode, AvailableCapacity) and one protobuf message. Nothing touches kube-scheduler or kubelet.
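The scoring rule in the list above (price + interruption_probability × interruption_penalty) can be illustrated with a minimal sketch. This is not BigFleet's code: the `Offer` type, its field names, the GPU-count requirement, and all prices and probabilities are assumptions made up for the example. Only the formula and the "filter by requirements and interruption tolerance, then take the lowest score" shape come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Offer:
    """One candidate machine. Hypothetical fields, not BigFleet's schema."""
    name: str
    price: float                     # cost per hour, any one consistent currency
    interruption_probability: float  # chance of interruption over that period, 0..1
    gpus: int = 0


def effective_cost(offer: Offer, interruption_penalty: float) -> float:
    # The rule quoted in the text: price + interruption_probability * interruption_penalty.
    return offer.price + offer.interruption_probability * interruption_penalty


def pick_offer(
    offers: Sequence[Offer],
    min_gpus: int,
    max_interruption_probability: float,
    interruption_penalty: float,
) -> Optional[Offer]:
    # Keep only offers that meet the requirement and the interruption tolerance,
    # then choose the one with the lowest effective cost (None if nothing fits).
    eligible = [
        o for o in offers
        if o.gpus >= min_gpus
        and o.interruption_probability <= max_interruption_probability
    ]
    return min(
        eligible,
        key=lambda o: effective_cost(o, interruption_penalty),
        default=None,
    )


# Invented example data: three capacity classes offering the same 8-GPU shape.
OFFERS = [
    Offer("bare-metal", price=2.40, interruption_probability=0.0, gpus=8),
    Offer("on-demand", price=3.00, interruption_probability=0.0, gpus=8),
    Offer("spot", price=1.00, interruption_probability=0.10, gpus=8),
]
```

With these made-up numbers, a workload that assigns a low penalty to interruption lands on spot, while a high penalty or a tight interruption tolerance pushes the same request onto bare metal. That is the sense in which the classes are compared on one scale.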

Proven at scale — and you can check the receipts

Designed for fleets of up to 100 million machines across thousands of clusters, horizontally sharded. The most recent passing benchmark, uber-50k, ran a ~50,000-machine fleet carrying ~5,000,000 pods across 200 clusters on 40 hosts in 5 regions, served by a single shard. It held every capacity-delivery hop BigFleet owns inside SLO, with zero unmet demand, under a real, default, uncapped kube-scheduler (shard cycle p99 4.08 s under a 5 s bar, roll-up p99 0.76 s, configure-phase p99 1.21 s).

You don’t have to take the numbers on faith. Every run on the scale-test results page ships a full Prometheus TSDB you can load straight into Grafana and inspect for yourself — the gate figures on the page are read from each run’s committed summary, generated from the data, not hand-entered. Receipts, not just a scorecard.

Providers for every substrate

BigFleet decides which machines should exist; capacity providers actually create, configure, and reclaim them, across an ecosystem of 12 conformance-certified providers spanning hyperscalers, regional clouds, and bare metal.

Each is a container you deploy alongside BigFleet with a Helm chart and scoped credentials; BigFleet dials out to it. Providers are out of tree by design — Kubernetes spent years undoing in-tree providers, and we don’t repeat that, so a provider’s release cadence is decoupled from BigFleet’s. Writing one for a new substrate means implementing a small backend on top of the shared providerkit library (which gets fencing, idempotency, and the machine contract right once) and passing a 93-behavior conformance suite.

Browse them, or write your own, on the Providers page; the canonical registry lives at bigfleet-providers.lucy.sh.

How you run it

A cluster joins the fleet by running the operator chart — outbound-only, no inbound listener, no per-cluster autoscaler tuning:

helm install bigfleet-operator oci://ghcr.io/intunderflow/charts/bigfleet-operator \
--version 0.1.0 \
--namespace bigfleet-system --create-namespace \
--set clusterID=cluster-prod-eu-1 \
--set shardAddress=bigfleet-shard.bigfleet-system.svc:7780

From there it’s standard platform work: the operator rolls up the cluster’s needs, BigFleet provisions, and you watch a handful of signals —

# Any shard reporting demand it can't satisfy?
bigfleet_shard_shortfalls > 0
# Are shards keeping up? (cycle p99)
histogram_quantile(0.99, sum by (le) (rate(bigfleet_shard_cycle_duration_seconds_bucket[5m])))
# Any cluster's session to its shard flapping?
sum by (cluster) (rate(bigfleet_operator_session_reconnects_total[5m])) > 0
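To read the cycle-p99 signal correctly it helps to know that a quantile computed from histogram buckets is an estimate, interpolated linearly inside the bucket where the target rank falls. The sketch below is a simplified Python rendering of that interpolation for classic Prometheus histograms, not Prometheus's own code. The bucket bounds and counts are invented, and the real query applies `rate()` to the buckets first, which does not change the interpolation.

```python
import math
from typing import Sequence, Tuple

# (upper_bound_seconds, cumulative_count), sorted by bound, ending with +Inf.
Buckets = Sequence[Tuple[float, float]]


def histogram_quantile(q: float, buckets: Buckets) -> float:
    """Estimate the q-quantile from cumulative buckets (simplified sketch)."""
    if not 0.0 < q <= 1.0:
        raise ValueError("this sketch only handles 0 < q <= 1")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus a +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate

    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, cumulative = buckets[index]

    if math.isinf(upper):
        # Rank falls beyond the last finite bound: report that bound.
        return buckets[-2][0]

    # The lowest bucket is assumed to start at 0.
    lower, below = buckets[index - 1] if index > 0 else (0.0, 0.0)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Invented cycle-duration buckets for one shard.
EXAMPLE_BUCKETS = [
    (1.0, 600.0),
    (2.5, 900.0),
    (5.0, 996.0),
    (10.0, 1000.0),
    (math.inf, 1000.0),
]
```

Because the estimate can only be as fine as the bucket boundaries, a p99 that sits close to the 5 s bar is worth checking against the bucket layout before treating it as a pass or a breach.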

Day-to-day, most days are quiet — static stability means an outage in BigFleet itself is felt as the absence of new provisioning, not a fleet-wide incident. The full operating story (install, runbook, incident modes, FinOps queries, on-call triage) is in the operator guide, with sizing in the scaling guide, the contract in the SLOs, and release-gating in the scale-test runbook.

Compared to the cluster autoscalers

BigFleet replaces the cluster-level capacity autoscaler: Cluster Autoscaler or Karpenter. Both consume the same input signal (PodScheduled=False, reason=Unschedulable) and produce the same kind of output (a new node). Running them alongside BigFleet would have two deciders racing for the same trigger; one would always be undoing what the other just did.

The difference is where the decision happens. Cluster Autoscaler and Karpenter run inside each cluster and provision per-cluster, in isolation. BigFleet runs once across the fleet and makes the same kind of decision with full cross-cluster context — so a GPU rack idle in cluster A can be reclaimed and re-provisioned to serve a training job’s unschedulable pods in cluster B: the same physical hardware, repurposed in place, with no new capacity to procure (A and B must be served by the same shard).
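The cross-cluster case and its same-shard constraint can be sketched as a small matching function. Everything here is hypothetical (the `IdleMachine` and `Demand` types, their fields, and the first-fit selection); it illustrates the constraint described above and is not BigFleet's decision engine.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class IdleMachine:
    """An idle machine as a shard might see it. Hypothetical fields."""
    machine_id: str
    cluster: str
    shard: str
    gpus: int


@dataclass(frozen=True)
class Demand:
    """Unschedulable GPU demand reported by one cluster. Hypothetical fields."""
    cluster: str
    shard: str
    gpus: int


def find_reclaimable(demand: Demand, idle: Sequence[IdleMachine]) -> List[IdleMachine]:
    """Pick idle machines from other clusters on the same shard, first fit.

    Returns an empty list when the shard's idle capacity cannot cover the
    demand, in which case new capacity would have to be provisioned instead.
    """
    chosen: List[IdleMachine] = []
    covered = 0
    for machine in idle:
        if machine.shard != demand.shard:
            continue  # a shard can only move machines it serves
        if machine.cluster == demand.cluster or machine.gpus <= 0:
            continue  # assumption: only look at GPU machines in other clusters
        chosen.append(machine)
        covered += machine.gpus
        if covered >= demand.gpus:
            return chosen
    return []


# Invented fleet state: two idle GPU machines in cluster-a, one in cluster-c.
IDLE = [
    IdleMachine("a-gpu-1", cluster="cluster-a", shard="shard-1", gpus=8),
    IdleMachine("a-gpu-2", cluster="cluster-a", shard="shard-1", gpus=8),
    IdleMachine("c-gpu-1", cluster="cluster-c", shard="shard-2", gpus=8),
]
```

A per-cluster autoscaler in cluster-b has no view of `IDLE` at all. The point of the fleet tier is that this list exists in one place for every cluster the shard serves.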

What BigFleet does not replace: kube-scheduler. Pod placement stays cluster-local, the same as it is today.

If your fleet is one cluster, you don’t need BigFleet. If it’s a hundred, you start to.

Why fleet-level

The hyperscalers built fleet-level capacity control planes in-house: Google has Borg, Meta has Twine. They did so because per-cluster capacity management collapses at fleet scale.

Most organisations now run tens to thousands of Kubernetes clusters and manage capacity per-cluster. The result is well documented: Datadog’s State of Cloud Costs report puts average enterprise CPU utilisation around 18 %, with overprovisioning factors of 2× to 5× and waste estimates between $50 000 and $500 000 per cluster per year. A team’s GPUs sitting idle in cluster A can’t help that team’s training job in cluster B. Multi-node training, gang scheduling, and priority-based preemption — all the things AI/ML workloads actually need — require fleet-wide capacity decisions. Per-cluster autoscalers can’t make them.

Status

v1 feature-complete. Designed and implemented by Lucy Sweet, as the reference implementation of two papers — the operating model and the architecture. Coverage: race-detector unit tests, deterministic simulator with golden traces, steady-state soak under churn, multi-cluster end-to-end on kind, a conformance suite with a 12-provider ecosystem, and scale-test results on a real multi-host cloud fleet — each published run shipping a full Grafana-loadable Prometheus receipt. Real provider implementations live in separate repositories by design.


Uber’s support is limited to providing compute for scale-test runs. BigFleet is otherwise independent of Uber and is not sponsored, endorsed, or maintained by Uber. All trademarks belong to their respective owners.