BigFleet architecture
This is the synthesis. The full design is in the BigFleet paper (vendored at papers/bigfleet.md); read this first if you want a tour, the paper if you want the rationale.
The shape
BigFleet is two tiers and three component types:
```
┌─────────────────────────────────────────────────────────────┐
│ COORDINATOR (Tier 1)                                        │
│ - Raft-replicated (3 replicas, single region)               │
│ - Owns: shard membership, cluster→shard map,                │
│   topology-domain→shard assignments, quotas, providers      │
│ - Does not make provisioning decisions                      │
└──────────────────────────┬──────────────────────────────────┘
                           │ unary RPCs (rebalance instructions
                           │ ride on shard-pulled ReportShard)
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ SHARD        │   │ SHARD        │   │ SHARD        │
│ (Tier 2)     │   │              │   │              │   ...
│ - Hot path   │   │              │   │              │
│ - Inventory  │   │              │   │              │
│ - Decision   │   │              │   │              │
│   engine     │   │              │   │              │
└──┬─────┬─────┘   └──┬─────┬─────┘   └──┬─────┬─────┘
   │     │            │     │            │     │
   │     ▼            │     ▼            │     ▼
   │  PROVIDER        │  PROVIDER        │  PROVIDER
   │  (out-of-tree)   │  (out-of-tree)   │  (out-of-tree)
   │                  │                  │
   ▼                  ▼                  ▼
OPERATOR           OPERATOR           OPERATOR
(per cluster)      (per cluster)      (per cluster)
   │                  │                  │
   ▼                  ▼                  ▼
[ Kubernetes ]     [ Kubernetes ]     [ Kubernetes ]
```

Coordinator (Tier 1)
Three replicas, hashicorp/raft over BoltDB, single region. Holds the slow-changing fleet state every shard needs but no shard owns:
- Shard membership — which shards exist, with what addresses.
- Cluster→shard binding — clusters are bound to a shard on first contact and never move (no cluster-lifecycle API).
- Topology-domain→shard assignment — at scale, machine IDs aren’t tracked here; topology domains (rack, zone, AZ) are. ~100K entries instead of 100M.
- Quotas — cluster-level entitlement counters, allocated to shards on demand.
- Provider registry — addresses of registered `CapacityProvider` services.
The coordinator is not in the hot path. Shards run autonomously; the coordinator only intervenes for cross-shard rebalance and quota allocation.
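For orientation, a minimal sketch of what that replicated state could look like as Go types. Everything below is illustrative; the `FleetState` name and the field shapes are assumptions, not the actual `pkg/coordinator` code.

```go
// Illustrative sketch only. The real types live in pkg/coordinator and are
// mutated through Raft log entries; names and shapes here are assumptions.
package coordinator

type ShardID string
type ClusterID string

// FleetState is the slow-changing, Raft-replicated state described above.
type FleetState struct {
	Shards    map[ShardID]ShardInfo    // shard membership: which shards exist, with what addresses
	Clusters  map[ClusterID]ShardID    // cluster→shard binding, set on first contact, never moved
	Domains   map[string]ShardID       // topology-domain→shard (rack/zone/AZ): ~100K entries, not 100M machines
	Quotas    map[ClusterID]QuotaState // cluster-level entitlement counters
	Providers map[string]string        // provider registry: name → CapacityProvider address
}

type ShardInfo struct {
	Addr string // address the shard is reached at
}

type QuotaState struct {
	Entitled  int64 // the cluster's total entitlement
	Allocated int64 // portion already allocated to shards on demand
}
```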
See ADR-0002 for why single-region.
Shard (Tier 2)
The shard is BigFleet’s heart. One process, one decision engine, one inventory snapshot. Per-shard scale ceiling is ~500K machines and ~5K cluster sessions; deployments grow by adding shards, not by growing one shard.
Each shard holds:
- Inventory — every machine the shard owns, in one of seven states (see `pkg/machine`). In-memory, refilled from the provider’s `List` on startup and reconciled every cycle.
- NeedsTable — every cluster’s last-known full demand, priority-sorted. Each rollup is a full replacement, never partial.
- Sessions — one bidi gRPC stream per managed cluster. Operator-initiated; the shard never opens an outbound connection to a cluster.
- Decision engine — three-phase synchronous loop. Runs every cycle (default 1 s).
The hot path never depends on the coordinator. This is enforced by `pkg/shard/no_coordinator_dep_test.go`, which parses `pkg/shard`’s import graph and fails CI if `pkg/coordinator` shows up.
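A guard like this is small enough to sketch in full. The version below assumes the placeholder module path `example.com/bigfleet` and walks the transitive import graph with `golang.org/x/tools/go/packages`; the real test is the file named above and may differ in detail.

```go
package shard_test

import (
	"strings"
	"testing"

	"golang.org/x/tools/go/packages"
)

// TestNoCoordinatorDependency fails if anything under pkg/shard transitively
// imports pkg/coordinator. Sketch only: the module path is a placeholder.
func TestNoCoordinatorDependency(t *testing.T) {
	cfg := &packages.Config{
		Mode: packages.NeedName | packages.NeedImports | packages.NeedDeps,
	}
	pkgs, err := packages.Load(cfg, "example.com/bigfleet/pkg/shard/...")
	if err != nil {
		t.Fatal(err)
	}
	seen := map[string]bool{}
	var walk func(*packages.Package)
	walk = func(p *packages.Package) {
		if seen[p.PkgPath] {
			return
		}
		seen[p.PkgPath] = true
		if strings.Contains(p.PkgPath, "/pkg/coordinator") {
			t.Fatalf("hot path must not depend on the coordinator: %s", p.PkgPath)
		}
		for _, imp := range p.Imports {
			walk(imp)
		}
	}
	for _, p := range pkgs {
		walk(p)
	}
}
```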
Operator
One per managed Kubernetes cluster. Outbound-only — no inbound listener, no public API surface. Dials the shard, holds one bidi `Shard.Session` stream, and multiplexes:
- Outbound: `ClusterCapacityNeeds` rollups (every 10 s, full replacement), `BootstrapBlobResponse` to shard pulls, `ReclaimAck` to drain instructions.
- Inbound: `BootstrapRequest`, `ReclaimInstruction`, `NodeStateUpdate`, `AvailableCapacityUpdate`.
The operator reads `CapacityRequest` CRs from the cluster, aggregates them into a NeedsTable, and emits the rollup. It also optionally watches Pods (via the separate bigfleet-unschedulable-pod-controller) and writes `AvailableCapacity` / `UpcomingNode` CRs back so users can `kubectl describe` them.
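A sketch of that aggregation step, with assumed type and field names (the actual CR schema and proto are not shown here):

```go
package operator

import "sort"

// CapacityRequest mirrors the CR's relevant fields. Names are assumptions.
type CapacityRequest struct {
	Profile  string // machine profile the workload needs
	Count    int
	Priority int
}

// Need is one NeedsTable entry in the rollup.
type Need struct {
	Profile  string
	Count    int
	Priority int
}

// BuildRollup collapses all live CapacityRequest CRs into the cluster's
// complete desired state. The result is always a full replacement: on
// receipt, the shard discards its previous NeedsTable entry for the cluster.
func BuildRollup(reqs []CapacityRequest) []Need {
	type key struct {
		profile  string
		priority int
	}
	agg := map[key]int{}
	for _, r := range reqs {
		agg[key{r.Profile, r.Priority}] += r.Count
	}
	needs := make([]Need, 0, len(agg))
	for k, n := range agg {
		needs = append(needs, Need{Profile: k.profile, Count: n, Priority: k.priority})
	}
	// Priority-sorted, highest first, matching the shard's Phase 1 walk.
	sort.Slice(needs, func(i, j int) bool { return needs[i].Priority > needs[j].Priority })
	return needs
}
```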
Provider
One process per backend (AWS, GCP, bare-metal, MAAS, …). Lives in a separate repo. Implements six gRPC RPCs: `Create`, `Configure`, `Drain`, `Delete`, `Get`, `List`. No `Watch` — reconciliation is `List` + `Get`.
Why out-of-tree: Kubernetes spent years undoing in-tree CCM and CSI providers; we don’t repeat that. The repo ships only the proto contract, a conformance suite, and a test-fixture fake.
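In Go terms, the contract is roughly the interface below. The six RPC names come from this section; the message shapes and `Machine` fields are assumptions, since the authoritative definition is the proto contract in the provider repo.

```go
package provider

import "context"

// Machine is a provider-side machine record. Fields are assumptions.
type Machine struct {
	ID      string
	Profile string
	State   string // one of the seven machine states (see pkg/machine)
}

// CapacityProvider is the six-RPC contract. There is deliberately no Watch:
// shards reconcile by polling List and Get.
type CapacityProvider interface {
	// Create provisions count new machines of a profile
	// (Speculative → Creating → Idle).
	Create(ctx context.Context, profile string, count int) ([]Machine, error)
	// Configure attaches an Idle machine to a cluster
	// (Idle → Configuring → Configured).
	Configure(ctx context.Context, machineID, cluster string) error
	// Drain gracefully evicts work so the machine can be released.
	Drain(ctx context.Context, machineID string) error
	// Delete returns the machine to the backend.
	Delete(ctx context.Context, machineID string) error
	// Get fetches one machine's current state.
	Get(ctx context.Context, machineID string) (Machine, error)
	// List enumerates machines. sinceRevision is opaque bytes; nil means a
	// full listing, and the returned revision feeds the next incremental call.
	List(ctx context.Context, sinceRevision []byte) ([]Machine, []byte, error)
}
```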
The decision engine: Phase 1 / 2 / 3
Every shard cycle (default 1 s) runs three phases sequentially. Order matters and is fixed:
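Schematically, the cycle is three calls in a fixed order. A skeleton sketch; all names are illustrative, not the actual `pkg/shard` API:

```go
// runCycle executes one decision-engine cycle. The order is fixed:
// assign, then preempt inversions, then reclaim excess.
func (s *Shard) runCycle(ctx context.Context) {
	snap := s.inventory.Snapshot() // one consistent view for all three phases

	shortfalls := s.assign(ctx, snap)          // Phase 1: match needs to inventory
	s.preemptInversions(ctx, snap, shortfalls) // Phase 2: evict lower-priority victims
	s.reclaimExcess(ctx, snap)                 // Phase 3: drain machines nobody needs
}
```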
Phase 1 — Assign
Walk the NeedsTable in priority order (highest first). For each `Need(cluster, profile, count)`:
- Find idle inventory matching the profile. Pick the cheapest by effective cost.
- If insufficient idle, ask the provider to provision more (Speculative → Creating → Idle, then Idle → Configuring → Configured).
- If still insufficient, record a shortfall.
Effective cost is fixed:

```
effective_cost = price + (interruption_probability × interruption_penalty)
```

`price` is per-hour. `interruption_probability` is provider-declared only — no cluster-side override. `interruption_penalty` is the cluster’s stated cost of having this workload interrupted, in dollars.
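A sketch of the cost arithmetic and the idle pick, with an illustrative inventory type. Per the penalty definitions later in this section, `reclamation_penalty` is the idle tiebreak; the tiebreak direction (lower penalty picked first) is an assumption here, as are all field names.

```go
// Machine is an illustrative inventory record (distinct from the provider
// sketch above); every field name here is an assumption.
type Machine struct {
	ID                      string
	Profile                 string
	PricePerHour            float64 // $/hour
	InterruptionProbability float64 // provider-declared; no cluster-side override
	InterruptionPenalty     float64 // $: stated interruption cost of the workload on it
	ReclamationPenalty      float64 // $: machine-specific operational value
}

// effectiveCost implements the fixed formula above. penaltyUSD is the
// requesting cluster's stated interruption penalty for this workload.
func effectiveCost(pricePerHour, interruptionProbability, penaltyUSD float64) float64 {
	return pricePerHour + interruptionProbability*penaltyUSD
}

// cheapestIdle returns the idle machine with the lowest effective cost for
// a profile, breaking ties on reclamation penalty. nil means a shortfall
// (Phase 1 then asks the provider to provision, or records the gap).
func cheapestIdle(idle []*Machine, profile string, penaltyUSD float64) *Machine {
	var best *Machine
	var bestCost float64
	for _, m := range idle {
		if m.Profile != profile {
			continue
		}
		c := effectiveCost(m.PricePerHour, m.InterruptionProbability, penaltyUSD)
		switch {
		case best == nil, c < bestCost:
			best, bestCost = m, c
		case c == bestCost && m.ReclamationPenalty < best.ReclamationPenalty:
			best = m
		}
	}
	return best
}
```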
Phase 2 — Preempt inversions
A “priority inversion” is a configured-but-low-priority machine occupying inventory that a higher-priority shortfall could use. Phase 2 walks shortfalls and looks for victims.
A victim’s score is:

```
victim_score = victim's interruption_penalty
             + reclamation_penalty
             + drain_grace_remaining × per-hour-cost
```

Lowest score wins. Two penalties, deliberately distinct:
- `interruption_penalty` — cost of interrupting the workload. Used in `effective_cost` and victim scoring.
- `reclamation_penalty` — operational value tied to the specific machine (long-running stateful work, in-flight training, etc.). Used in idle tiebreak, victim scoring, and Phase 3 release.
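In code form, reusing the illustrative `Machine` type from the Phase 1 sketch (the grace term is taken in hours so all three terms are dollars):

```go
// victimScore implements the formula above. InterruptionPenalty here is the
// penalty of the workload currently on the victim machine; drain grace is
// expressed in hours so the product with $/hour is in dollars.
func victimScore(victim *Machine, drainGraceRemainingHours float64) float64 {
	return victim.InterruptionPenalty +
		victim.ReclamationPenalty +
		drainGraceRemainingHours*victim.PricePerHour
}
```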
Phase 3 — Reclaim excess
Walk configured machines whose cluster no longer needs them. Drain to Idle. The release order is governed by `reclamation_penalty` (cheapest first).
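The ordering is a plain sort on that one field (a sketch, again reusing the illustrative `Machine` type):

```go
import "sort"

// releaseOrder sorts excess Configured machines cheapest-to-reclaim first,
// per the Phase 3 rule above; the cheapest are drained back to Idle first.
func releaseOrder(excess []*Machine) []*Machine {
	sort.Slice(excess, func(i, j int) bool {
		return excess[i].ReclamationPenalty < excess[j].ReclamationPenalty
	})
	return excess
}
```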
Why phases, not a single optimiser
A single LP-style optimiser could in principle replace all three phases. We don’t do that, for two reasons given in the paper: (a) deterministic phase ordering makes the engine debuggable; (b) victim selection’s “drain grace” interaction with effective-cost arithmetic doesn’t fit a single objective without knobs. See the BigFleet paper §8 (or papers/bigfleet.md).
Static stability
The load-bearing safety property: clusters keep running with BigFleet entirely down.
Implications:
- Configured machines stay configured if BigFleet is unreachable.
- Phase 3 reclaims only run when a cluster’s NeedsTable actually drops below the configured count.
- The shard’s hot path has no dependency on the coordinator. (Programmatic guard: `pkg/shard/no_coordinator_dep_test.go`.)
- The data plane (shards) operates autonomously during coordinator failover.
- Operators reconnect with exponential backoff; they do not panic or alter cluster state during a disconnect.
This rules out a class of designs that would otherwise be tempting (e.g., putting cluster→shard mapping resolution on the per-rollup hot path, or making Phase 3 depend on quota recheck).
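The operator’s reconnect behavior from the list above reduces to a capped, jittered exponential backoff. A minimal sketch; the `dial` callback (one full `Shard.Session` attempt) and the backoff cap are placeholders:

```go
import (
	"context"
	"math/rand"
	"time"
)

// reconnectLoop keeps one Shard.Session stream alive. While disconnected the
// operator changes nothing in the cluster: machines stay Configured and the
// cluster runs on what it already has (static stability).
func reconnectLoop(ctx context.Context, dial func(context.Context) error) {
	backoff := time.Second
	const maxBackoff = 2 * time.Minute // assumed cap
	for ctx.Err() == nil {
		if err := dial(ctx); err == nil {
			backoff = time.Second // clean session end: reset and redial
			continue
		}
		// Full jitter keeps a fleet of operators from redialing in lockstep.
		select {
		case <-time.After(time.Duration(rand.Int63n(int64(backoff)))):
		case <-ctx.Done():
			return
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```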
Wire formats and protocol invariants
- Roll-ups are full replacement. Every `ClusterCapacityNeeds` is the cluster’s complete desired state. Never partial.
- The `Same` topology operator is protobuf-only. CRDs use `In`, `NotIn`, `Exists`, `DoesNotExist`. The operator translates co-location signals to `Same` during rollup.
- Topology constraints do not cross shard boundaries. A `Same`-rack request that can’t be satisfied within a shard becomes a shortfall, never resolved cross-shard.
- Stream coalescing uses `supersedes_key` — an explicit field on coalescing message types so reconnect ordering is safe.
- Provider `List` is reconcilable, optionally incremental. `since_revision` is opaque bytes; incremental support is conformance-gated above a documented threshold.
- Penalty buckets are powers of 2 ($0.50 to $8.4M) so cross-cluster aggregation has stable boundaries (see the sketch after this list).
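$0.50 doubled 24 times is $8,388,608, which matches the stated $8.4M top end, so the range implies 25 buckets. A sketch of the rounding; the exact bucket count and the round-up behavior are inferred from the stated range, not from the source:

```go
// penaltyBucket rounds a dollar penalty up to its power-of-2 bucket.
// Buckets run 0.5·2^0 = $0.50 through 0.5·2^24 = $8,388,608 (≈ $8.4M),
// clamped at both ends, so every cluster aggregates on identical boundaries.
func penaltyBucket(usd float64) float64 {
	const lo, hi = 0.5, 0.5 * (1 << 24)
	b := lo
	for b < usd && b < hi {
		b *= 2
	}
	return b
}
```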
Failure modes
| Failure | Effect | Recovery |
|---|---|---|
| Shard crash | Cluster operators reconnect; in-flight transitions resume on List reconcile | Auto |
| Coordinator quorum loss | No cross-shard rebalance, no new quota allocations; existing decisions continue | Restore quorum |
| Provider unreachable | Affected machines are marked `Failed` with `last_error`; remaining demand satisfied from the rest of inventory | Provider returns; manual cleanup of `Failed` machines |
| Operator → shard partition | Cluster runs on last-known configured machines; rollups queue locally | Reconnect; backoff |
| Region-wide BigFleet down | All clusters keep running on existing inventory | Restore BigFleet |
The two M10 fault-injection scenarios under `../sim/scenario/provider_failure.go` exercise the provider-unreachable row.
What BigFleet deliberately is not
- Not a scheduler. Does not place pods. Does not simulate kube-scheduler.
- Not a cluster-lifecycle manager. Clusters are bound to shards on first contact; there is no register/deregister API.
- Not a quota / admission / entitlement engine. Priority is the sole throttling mechanism.
- Not a cloud-cost optimiser. The cost formula is fixed and not pluggable.
- Not in-tree for providers. See `provider-author-guide.md`.