# BigFleet operator guide
For the human running BigFleet — installing it, monitoring it, responding when something is wrong.
## Architecture in one paragraph
BigFleet is two tiers. The coordinator (Tier 1) owns Raft-replicated fleet state — shard membership, cluster→shard map, topology-domain→shard assignments, quota allocations, provider registry. The shards (Tier 2) own machines and make provisioning decisions on their hot path. Per-cluster operators dial a shard over a long-lived bidirectional gRPC stream. The coordinator does not make provisioning decisions; shards do. Static stability is the load-bearing safety property: clusters keep running with BigFleet entirely down. (See the BigFleet paper for the full architecture, also vendored at docs/papers/bigfleet.md.)
## Components
| Component | Where it runs | What it owns |
|---|---|---|
| Coordinator | Standalone (3 replicas, Raft) — typically a dedicated namespace | Membership, cluster→shard, domain→shard, quotas, provider registry |
| Shard | Standalone (~200 replicas at 100M-node scale; start with 1) | Per-shard machine inventory, decision engine, operator session terminations |
| Operator | One per Kubernetes cluster you want BigFleet to manage | CapacityRequest informer + roll-up; bootstrap blob renderer; reclaim handler |
| bigfleet-unschedulable-pod-controller | Optional, per cluster | Watches Pods → creates CapacityRequests for unschedulable ones |
| Provider | One process per real backend (cloud / bare metal). Lives in separate repos. | The actual machine lifecycle |
## Install
The coordinator and shards live in your management cluster (or a dedicated cluster). Each managed cluster runs its own operator.
```shell
# 1. Install the BigFleet CRDs into every cluster you want managed.
kubectl apply -f https://github.com/intUnderflow/bigfleet/raw/main/api/crd/bigfleet.lucy.sh_capacityrequests.yaml
kubectl apply -f https://github.com/intUnderflow/bigfleet/raw/main/api/crd/bigfleet.lucy.sh_upcomingnodes.yaml
kubectl apply -f https://github.com/intUnderflow/bigfleet/raw/main/api/crd/bigfleet.lucy.sh_availablecapacities.yaml
```

```shell
# 2. Deploy the coordinator + shard control plane on the management cluster.
helm install bigfleet ./deploy/helm/bigfleet \
  --namespace bigfleet-system --create-namespace \
  --set coordinator.replicas=3 \
  --set coordinator.bootstrap=true  # only for the first install
# After the first install, helm upgrade with coordinator.bootstrap=false.
```

```shell
# 3. On each managed cluster, install the per-cluster operator.
helm install bigfleet-operator ./deploy/helm/bigfleet-operator \
  --namespace bigfleet-system --create-namespace \
  --set clusterID=cluster-prod-eu \
  --set shardAddress=bigfleet-shard.bigfleet-system:7780
```

```shell
# 4. (Optional) Install the unschedulable-pod controller on clusters
#    where you want BigFleet to react to Pod scheduling failures.
helm install bigfleet-unschedulable-pod-controller \
  ./deploy/helm/bigfleet-unschedulable-pod-controller \
  --namespace bigfleet-system
```

```shell
# 5. Register your providers with the coordinator (one CLI per provider).
#    Provider authoring is documented in docs/provider-author-guide.md.
```

## Day-2 observability
Each binary exposes Prometheus metrics on a configurable port:
| Binary | Default port | Path |
|---|---|---|
| Coordinator | :8790 | /metrics |
| Shard | :8780 | /metrics |
| Operator | :8770 | /metrics |
| bigfleet-unschedulable-pod-controller | :8080 | /metrics |
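As a sketch, the four endpoints above can be scraped with a static Prometheus config. The Service names and the `bigfleet-system` namespace below are assumptions (they mirror the Helm release names used in the install section); adjust to however your deployment exposes the pods:

```yaml
scrape_configs:
  - job_name: bigfleet-coordinator
    metrics_path: /metrics
    static_configs:
      - targets: ["bigfleet-coordinator.bigfleet-system:8790"]
  - job_name: bigfleet-shard
    metrics_path: /metrics
    static_configs:
      - targets: ["bigfleet-shard.bigfleet-system:8780"]
  - job_name: bigfleet-operator
    metrics_path: /metrics
    static_configs:
      - targets: ["bigfleet-operator.bigfleet-system:8770"]
  - job_name: bigfleet-unschedulable-pod-controller
    metrics_path: /metrics
    static_configs:
      - targets: ["bigfleet-unschedulable-pod-controller.bigfleet-system:8080"]
```

In a real cluster you would more likely use `kubernetes_sd_configs` with pod or endpoint discovery; the static form is shown only to make the ports concrete.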
### Key metrics

#### Health

- `bigfleet_coordinator_raft_term` — increases on leader elections. Rapidly increasing = network partition or stepdown loop.
- `bigfleet_coordinator_apply_total{outcome=...}` — Apply outcomes. Spike in `error` or `fsm_error` = state-machine issue.
- `bigfleet_shard_cycle_duration_seconds` — histogram. p95 should stay below ~50 ms; p99 below 100 ms at 5K-machine inventory.
- `bigfleet_operator_session_reconnects_total` — should be near zero in steady state. Bursts = shard unhealthy or network blip.
#### Throughput

- `bigfleet_shard_actions_total{kind=...}` — Bootstrap / Provision / Reclaim / Preempt counts. Sustained high Preempt = priority-inversion churn; investigate the workloads’ priorities.
- `bigfleet_operator_acknowledged_total` — CRs transitioning Pending → Acknowledged. Should track the rate of unschedulable-pod arrivals.
- `bigfleet_shard_inventory_machines{state=...}` — current machine counts by state. Stable in steady state; spikes through transitional states (Creating / Configuring / Draining / Deleting) during scale events.
#### Pressure

- `bigfleet_shard_shortfalls` — unresolved demand the shard couldn’t satisfy locally. Persistent non-zero = under-provisioned fleet or over-aggressive workload priorities.
- `bigfleet_coordinator_pending_instructions{shard=...}` — coordinator-issued instructions awaiting ack. Should drain to zero between rebalance cycles.
### Suggested dashboard layout

One panel group per component:
- Coordinator: Raft term (single-stat), Apply rate by outcome (timeseries), pending instructions per shard (timeseries).
- Shard: cycle duration p50/p95/p99 (heatmap), action rate by kind (timeseries), inventory by state (stacked timeseries), shortfalls (single-stat with alert).
- Operator (per cluster): rollup duration p50/p95 (heatmap), CR acknowledgement rate (timeseries), session reconnects (single-stat with alert).
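The quantile panels above can be precomputed as Prometheus recording rules. This is an illustrative fragment, not a shipped rule file: the group name, evaluation interval, and rule names are arbitrary, and it assumes the histogram exposes the standard `_bucket` series:

```yaml
groups:
  - name: bigfleet-dashboard
    interval: 30s
    rules:
      # p95 / p99 shard cycle duration, aggregated across shard replicas.
      - record: bigfleet:shard_cycle_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (le) (rate(bigfleet_shard_cycle_duration_seconds_bucket[5m])))
      - record: bigfleet:shard_cycle_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (le) (rate(bigfleet_shard_cycle_duration_seconds_bucket[5m])))
      # Action rate by kind for the throughput timeseries panel.
      - record: bigfleet:shard_action_rate:by_kind
        expr: sum by (kind) (rate(bigfleet_shard_actions_total[5m]))
```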
## Runbook
| Alert | What’s happening | What to do |
|---|---|---|
| Shard `up == 0` | The shard process is down | Restart. Existing pods/CRs in managed clusters are unaffected (static stability). New provisioning pauses until the shard returns. |
| Coordinator leader stepdown | Network blip or replica restart. New leader within ~1s normally. | Investigate if it loops. Check bigfleet_coordinator_raft_term rate. |
| Sustained `bigfleet_shard_shortfalls` > 0 | Demand exceeds available capacity in the shard’s slice | (a) Check the donor-shard summaries: is there free capacity elsewhere? Cross-shard rebalance should be active. (b) If the fleet is saturated everywhere, procurement needs to add capacity. |
| `bigfleet_operator_session_reconnects_total` rising | Shard ↔ operator network unhealthy | Check shard health; check kubelet/CNI on the cluster running the operator. Operators reconnect with backoff automatically. |
| `bigfleet_shard_cycle_duration_seconds` p99 > 1s | Hot path overloaded | (a) Is the inventory huge? Check `bigfleet_shard_inventory_machines`. (b) Provider RPCs slow? Check the provider’s own metrics. (c) Consider sharding more aggressively. |
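The runbook rows map naturally onto Prometheus alerting rules. A hedged sketch — the thresholds and `for` durations are starting points rather than tuned values, and the `job` label assumes the scrape jobs are named after the binaries:

```yaml
groups:
  - name: bigfleet-runbook
    rules:
      - alert: BigFleetShardDown
        expr: up{job="bigfleet-shard"} == 0
        for: 5m
      - alert: BigFleetShardShortfalls
        # Persistent, not momentary: static stability makes a brief shortfall benign.
        expr: bigfleet_shard_shortfalls > 0
        for: 15m
      - alert: BigFleetOperatorReconnectBurst
        expr: rate(bigfleet_operator_session_reconnects_total[10m]) > 0.1
        for: 10m
      - alert: BigFleetShardCycleSlow
        expr: histogram_quantile(0.99, sum by (le) (rate(bigfleet_shard_cycle_duration_seconds_bucket[5m]))) > 1
        for: 10m
```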
## Security
- Operator → shard: mTLS recommended. Per-cluster client certs scoped to the cluster’s claimed `cluster_id`. Production deployments should add a server-side check that the cert’s CN matches the `Hello.cluster_id`.
- Shard → coordinator: mTLS recommended. The coordinator gRPC service is internal; a network-policy fence inside the management cluster is also worthwhile.
- Shard → provider: depends on the provider. Cloud providers typically use cloud-IAM for the underlying API; the BigFleet ↔ provider gRPC channel is process-local to the shard host or cell, so a Unix socket or localhost-only TCP is reasonable.
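One way to build the network-policy fence mentioned above. The pod labels (`app: bigfleet-coordinator`, `app: bigfleet-shard`) are assumptions about how the Helm charts label pods; substitute whatever your deployment actually uses. The policy restricts coordinator ingress to shard pods and Raft peers:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: bigfleet-coordinator-fence
  namespace: bigfleet-system
spec:
  podSelector:
    matchLabels:
      app: bigfleet-coordinator   # assumed pod label
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: bigfleet-shard          # shard → coordinator gRPC
        - podSelector:
            matchLabels:
              app: bigfleet-coordinator    # Raft peer traffic between replicas
```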
## Upgrades

### CRD upgrades
v1alpha1 is in flux until v1beta1. Use kubectl apply for in-place CRD upgrades; the operator informers re-list on schema bumps. Never delete and recreate a CRD with existing CapacityRequest objects (you’ll lose state).
### Coordinator upgrades
Helm upgrade is rolling — one replica at a time. The Raft cluster tolerates one replica down out of three; leader election handles stepdown automatically. Always upgrade with coordinator.bootstrap=false after the first install.
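Concretely, a routine coordinator upgrade reuses the chart path and namespace from the install section (keep your other install-time values; only `coordinator.bootstrap` must be false):

```shell
helm upgrade bigfleet ./deploy/helm/bigfleet \
  --namespace bigfleet-system \
  --set coordinator.replicas=3 \
  --set coordinator.bootstrap=false  # never re-bootstrap an existing Raft cluster
```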
### Shard upgrades
Rolling. Each shard’s existing assignments are persisted by the coordinator (cluster → shard, domain → shard); a fresh shard process re-reads them on startup and re-establishes provider connections.
### Operator upgrades
Rolling. The Shard.Session stream is reconnect-safe; in-flight bootstrap requests are reissued by the shard on the new connection.
## Cross-references

- Architecture: BigFleet paper (vendored at `docs/papers/bigfleet.md`)
- Operating model: Fleet-Scale Kubernetes paper (vendored at `docs/papers/fleet-scale-kubernetes.md`)
- Implementation plan: `docs/plan.md`
- Provider authoring: `docs/provider-author-guide.md`
- Scaling sizing: `docs/scaling-guide.md`