Wire protocols and CRDs in depth

BigFleet’s interfaces are four .proto files and three CRDs, but the contracts that matter are not in the field lists. They are in the semantics the field lists encode: full-replacement roll-ups, two structurally identical but semantically distinct penalties, Same as a wire-only operator, and opaque cursors and coalescing keys whose whole purpose is to make reconnect ordering safe. This page opens api/proto/bigfleet/v1alpha1/*.proto, api/crd/*.yaml, and the one place all three representations meet (pkg/conv), and explains the invariants each consumer relies on and the failures each one prevents. It complements the surface summary in ../api-reference.md, the demand-ingest mechanics in needs-table.md (bucketing arithmetic, Same-folding, profile fingerprints), and the provider RPC mechanics in provider-protocol.md; it does not repeat them. Read ../architecture.md for the two-tier shape first.

The three representations and the one boundary

A single demand travels through three forms: a CRD (CapacityRequest, what a user writes), a proto (ClusterCapacityNeeds, what crosses the wire), and a domain type (needs.Need, what the engine walks). The CRD→proto translation lives operator-side in pkg/operator/rollup.go; the proto↔domain translation is pkg/conv. Keeping these distinct is deliberate (pkg/conv/conv.go:1-11): the shard hot path uses small Go structs without proto-runtime overhead, so conversion happens once at the gRPC boundary — incoming messages translated on the way in, outgoing actions on the way out — and the rest of the codebase never sees oneof wrappers or enum stringers. The proto can evolve without touching the engine, and the engine can be tuned without re-marshalling.

The boundary is also a validation gate, and this is the load-bearing part. pkg/conv is the demand-ingest mirror of the provider-ingest gate shard.validateProviderMachine (pkg/conv/conv.go:24-36, M68b). A roll-up carrying an out-of-range penalty bucket or an unparseable resource quantity is rejected whole — NeedsFromRollup returns an error, the caller logs and counts it, and the cluster keeps its last-known-good demand. The asymmetry this fixed was silent corruption: an unknown bucket enum would alias into a real domain bucket (the domain type is a uint8, so an out-of-range int32 from a newer or buggy operator wraps into a legitimate bucket and mis-prices the workload in victim scoring — penaltyBucketFromProto, pkg/conv/conv.go:187-197), and a malformed quantity would degrade to zero inside the pkg/needs vector arithmetic, which deliberately tolerates malformed strings on the hot path where values are already canonical (resourceQtysFromProto, pkg/conv/conv.go:111-123). Both are fine after the trust boundary; neither is acceptable at it. The trust boundary is where they get caught.
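The aliasing hazard described above is easy to reproduce in isolation. The following is a minimal sketch, not the code in pkg/conv: the names Bucket, maxBucket, and bucketFromWire are hypothetical, and the upper bound of 26 is an assumed value chosen only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Bucket stands in for the domain penalty bucket; the uint8 width follows the text.
type Bucket uint8

// maxBucket is a hypothetical upper bound on valid wire values.
const maxBucket int32 = 26

var errBucketRange = errors.New("penalty bucket out of range")

// bucketFromWire checks the range first and only then narrows to uint8.
func bucketFromWire(v int32) (Bucket, error) {
	if v < 0 || v > maxBucket {
		return 0, fmt.Errorf("%w: %d", errBucketRange, v)
	}
	return Bucket(v), nil
}

func main() {
	wire := int32(257)
	// An unchecked narrowing cast keeps only the low 8 bits, so 257 aliases bucket 1.
	fmt.Println("unchecked cast:", Bucket(wire))
	_, err := bucketFromWire(wire)
	fmt.Println("checked:", err)
}
```

The point of the sketch is the ordering: the comparison happens on the wide wire type, so the narrowing conversion only ever sees values that are already legal.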

`capacity.proto` — the unit of demand

`ClusterCapacityNeeds` is a full replacement, always

ClusterCapacityNeeds (api/proto/bigfleet/v1alpha1/capacity.proto:20-31) is what an operator sends its shard each cycle. The file comment states the invariant the entire engine depends on: every send up the Shard.Session stream is a full replacement of the cluster’s desired state — the receiver treats the message as the cluster’s complete current need set, and withdrawal is implicit (a CR no longer present in the list is a CR the cluster no longer wants). There are no deltas, no add/remove ops, no sequence-numbered patches. This is paper §3.1, and it is what makes the protocol robust to message loss: a dropped roll-up costs one cycle of staleness, never a permanently-diverged view, because the next roll-up overwrites everything.

This invariant is enforced end to end, not just asserted. Operator-side, the pending-rollup slot is a single-slot atomic.Pointer: enqueuing a fresh roll-up atomically replaces and drops any older one still queued behind a slow stream (pkg/operator/stream.go:76-80, pkg/operator/rollup.go:64-72). Coalesce-by-replace is correct precisely because roll-ups are full replacement — there is never information in the older message that the newer one lacks. Shard-side, the corresponding needs.Replace(cluster, …) swaps the cluster’s entire contribution in the NeedsTable. The cluster_id (capacity.proto:23) is assigned by the operator and never changes for the life of the cluster — it is the replacement key, and it is the same identity the coordinator binds permanently to a shard on first contact (no registration RPC exists to change it).
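Coalesce-by-replace on a single-slot atomic.Pointer can be sketched as follows. This is an illustration under assumptions, not the code in pkg/operator/stream.go: the Rollup and slot types and the Enqueue and Take methods are hypothetical names, and the generic atomic.Pointer type requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Rollup stands in for a full-replacement ClusterCapacityNeeds message.
type Rollup struct {
	Cycle int
	Needs []string
}

// slot holds at most one pending roll-up.
type slot struct {
	pending atomic.Pointer[Rollup]
}

// Enqueue replaces whatever is pending and reports whether an older roll-up was dropped.
func (s *slot) Enqueue(r *Rollup) (dropped bool) {
	return s.pending.Swap(r) != nil
}

// Take removes and returns the pending roll-up, or nil when the slot is empty.
func (s *slot) Take() *Rollup {
	return s.pending.Swap(nil)
}

func main() {
	var s slot
	s.Enqueue(&Rollup{Cycle: 1, Needs: []string{"a", "b"}})
	dropped := s.Enqueue(&Rollup{Cycle: 2, Needs: []string{"b"}})
	sent := s.Take()
	fmt.Println("dropped older:", dropped, "sent cycle:", sent.Cycle, "needs:", sent.Needs)
}
```

Because the second roll-up is a complete statement of demand, discarding the first loses nothing: need "a" is withdrawn simply by being absent from the message that is actually sent.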

`CapacityNeed` — a constrained aggregate resource request, not a pod count

CapacityNeed (capacity.proto:39-93) is the aggregated unit of demand, and its shape is ADR-0027’s central decision: demand is a resource vector, machine count is the autoscaler’s output, never the cluster’s input. The aggregation key is (requirements, priority, spread, interruption_penalty_bucket, reclamation_penalty_bucket) plus group; CRs whose fields all match collapse into one CapacityNeed. Two fields carry the resource demand:

aggregate_resources (capacity.proto:44-54) — the vector sum of the per-replica requests of every CR in this need. The autoscaler diffs this against the sum of Allocatable over matching machines, in resource-vector space (ADR-0022). It never reconstructs a per-pod count.
min_unit (capacity.proto:77-82) — the largest atomic schedulable unit, the indivisibility floor: a resource vector every provisioned machine must individually host (an 8-GPU pod needs 8 GPU on one machine). Aggregation sums the former and maxes the latter (pkg/operator/rollup.go:166-174).
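The sum-and-max rule can be sketched with plain maps. This is an illustration only: Vec, addInto, maxInto, and aggregate are hypothetical names, quantities are simplified to integers, and taking the maximum component by component is an assumption about how "maxes" is applied to a vector.

```go
package main

import "fmt"

// Vec is a resource vector keyed by resource name, in integer units.
type Vec map[string]int64

// addInto accumulates src into dst component by component.
func addInto(dst, src Vec) {
	for k, v := range src {
		dst[k] += v
	}
}

// maxInto raises each component of dst to at least the matching component of src.
func maxInto(dst, src Vec) {
	for k, v := range src {
		if v > dst[k] {
			dst[k] = v
		}
	}
}

// aggregate folds per-replica requests into a summed vector and an indivisibility floor.
func aggregate(requests []Vec) (sum, unit Vec) {
	sum, unit = Vec{}, Vec{}
	for _, r := range requests {
		addInto(sum, r)
		maxInto(unit, r)
	}
	return sum, unit
}

func main() {
	sum, unit := aggregate([]Vec{
		{"cpu": 4000, "gpu": 8},
		{"cpu": 2000, "gpu": 1},
	})
	fmt.Println("aggregate:", sum["cpu"], sum["gpu"])
	fmt.Println("min_unit: ", unit["cpu"], unit["gpu"])
}
```

With these two requests the aggregate is 6000 CPU units and 9 GPUs, while the floor stays at 8 GPUs: total demand grows with every replica, but the largest single unit is what any one machine must be able to host.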

Field 4 (count, the old post-aggregation pod count) is reserved (capacity.proto:60-63) — a tombstone, not a gap, so the number can never be re-used with different meaning. The same tombstone appears on Shortfall.count (coordinator.proto:224): shortfall demand is the deficit resource vector, not a machine count.

The group field (capacity.proto:84-92, ADR-0042 Addendum) is opaque co-location identity, set only on Same-carrying needs — one stable value per gang, for the gang’s lifetime — so the engine’s per-gang diagnostics and per-need bookkeeping have something to key on. The operator already emits one CapacityNeed per gang (gangs never merge in aggregation, ADR-0024); group names them. The contract is strict: the autoscaler must not derive semantics from the value beyond equality. It is a name, not a payload.

`Same` is protobuf-only — and why the CRD can’t express it

NodeSelectorRequirement (capacity.proto:98-114) mirrors core/v1.NodeSelectorRequirement plus one operator the Kubernetes type does not have: OPERATOR_SAME (capacity.proto:106-110). Same means all machines provisioned for this need must share a value for this key (all 64 nodes on one rack). It is valid only in the protobuf wire format and is never surfaced on the CRD. The CapacityRequest CRD accepts standard operators only — In, NotIn, Exists, DoesNotExist (api/crd/bigfleet.lucy.sh_capacityrequests.yaml:161-194); the schema and Go docstring both say so explicitly.

The reason is that Same is not something a user declares — it is something the operator derives. A pod expresses co-location through podAffinity, and the CRD captures the autoscaler-relevant projection of that as a separate coLocation field (capacityrequests.yaml:61-128, ADR-0024), structured as {topologyKey, labelSelector} — the dual of topologySpread. At roll-up, the operator canonicalises that term into a group key and, when present, appends a Same requirement on the term’s topologyKey (coLocationGroup and withSameRequirement, pkg/operator/rollup.go:199-267). CRs with an equal co-location term aggregate into one need and co-locate onto one domain; CRs with different terms stay separate even when their profiles are identical, so independent workloads never get folded onto the same rack. Putting Same on the CRD would invite users to write it by hand and get the gang semantics subtly wrong; deriving it from podAffinity keeps the one source of truth pod-side.

A hard companion invariant, enforced nowhere in the proto but everywhere in the engine: Same constraints do not cross shard boundaries. A Same-rack request unsatisfiable within a shard becomes a Shortfall (coordinator.proto:215-233), never a cross-shard resolution (../architecture.md “Wire formats”).

`PenaltyBucket` — coarsening workload-specific dollars

PenaltyBucket (capacity.proto:136-165) coarsens raw dollar penalties to a power-of-2 log scale so that workloads with slightly different penalty values still aggregate into a single need, and the roll-up message stays bounded even when penalties are workload-specific. Boundaries run ZERO, HALF_DOLLAR, then powers of 2 from $1 to $8,388,608 (≈$8.4M), then PINNED for anything larger (treated as effectively non-interruptible). Selection rounds the raw value up to the nearest bucket.
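Round-up selection over these boundaries can be sketched as below. This is not BucketForDollars from pkg/needs: the function name, the numeric bucket values, and the handling of negative inputs are all assumptions made for the example.

```go
package main

import "fmt"

// Hypothetical bucket numbering, chosen for this sketch only.
const (
	bucketZero   = 0  // $0 (negative inputs are also mapped here, an assumption)
	bucketHalf   = 1  // up to $0.50
	bucketOne    = 2  // up to $1; bucketOne+k covers up to $2^k
	bucketTop    = 25 // up to $8,388,608 ($2^23)
	bucketPinned = 26 // anything larger
)

// bucketForDollars rounds a raw dollar penalty up to the nearest bucket boundary.
func bucketForDollars(d float64) int {
	if d <= 0 {
		return bucketZero
	}
	if d <= 0.5 {
		return bucketHalf
	}
	bound := 1.0
	for b := bucketOne; b <= bucketTop; b++ {
		if d <= bound {
			return b
		}
		bound *= 2
	}
	return bucketPinned
}

func main() {
	for _, d := range []float64{0, 0.5, 0.75, 3, 8388608, 1e7} {
		fmt.Printf("$%v -> bucket %d\n", d, bucketForDollars(d))
	}
}
```

Under this numbering a $3 penalty lands in the $4 bucket and a $0.75 penalty in the $1 bucket, which is the sense in which nearby raw values collapse into one aggregation key.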

Two distinct penalties ride here, and conflating them is the most common way to ship a wrong implementation:

interruption_penalty_bucket (capacity.proto:74) — the cost of interrupting the workload. Feeds the fixed cost formula effective_cost = price + interruption_probability × interruption_penalty and victim scoring.
reclamation_penalty_bucket (capacity.proto:75) — the operational value tied to this specific machine (burned-in GPUs, warmed caches). Feeds idle tiebreak, victim scoring, and Phase 3 release.

They are not the same thing, not derivable from priority, and not derivable from each other; the CRD documents both separately (capacityrequests.yaml:129-160). There is no operational_value field — that is a hallucination; the machine-tied value is reclamation_penalty. The bucketing arithmetic itself (BucketForDollars, UpperBoundDollars) lives in pkg/needs/needs.go:127-177 and is detailed in needs-table.md; the only wire fact that matters here is that the operator is the canonical place where raw dollars become buckets (profileFromCapacityRequest, pkg/operator/rollup.go:311-324), using AsApproximateFloat64 rather than AsInt64 (M68b) so a fractional 500m-dollar penalty doesn’t flatten to the $0 bucket and erase the sub-dollar boundary the $0.50 bucket exists for. By the time a bucket crosses the wire it has already been chosen; the shard re-validates the range (penaltyBucketFromProto) but never re-buckets.
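The fixed cost formula quoted for interruption_penalty_bucket is simple enough to work through by hand. In the sketch below the function name and the sample prices are hypothetical, the units are left as plain dollars, and which dollar value the engine substitutes for a bucket is not shown here.

```go
package main

import "fmt"

// effectiveCost applies price + interruption_probability × interruption_penalty.
func effectiveCost(price, interruptionProbability, interruptionPenalty float64) float64 {
	return price + interruptionProbability*interruptionPenalty
}

func main() {
	// A cheap but interruptible offer against a pricier stable one, same $64 penalty.
	interruptible := effectiveCost(2.00, 0.05, 64)
	stable := effectiveCost(4.00, 0, 64)
	fmt.Printf("interruptible: %.2f, stable: %.2f\n", interruptible, stable)
}
```

With a 5% interruption probability the $64 penalty adds $3.20, so the nominally cheaper offer ends up costing more than the stable one. Reclamation penalty does not appear in this formula at all, which is one concrete reason the two penalties cannot be merged.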

The proto enum and the pkg/needs domain enum are deliberately numeric-aligned, which is why conv casts directly (needToCapacityNeed, pkg/operator/rollup.go:367-368; penaltyBucketFromProto, pkg/conv/conv.go:187-197) — but that alignment is exactly why the range check is mandatory at ingest, because a direct cast of an out-of-range value would otherwise alias silently.

`shard.proto` — one stream, all traffic

The operator dials out; the shard never listens inbound on a cluster

shard.proto defines a single RPC: Shard.Session(stream OperatorMessage) returns (stream ShardMessage) (api/proto/bigfleet/v1alpha1/shard.proto:22-24). This one operator-initiated bidirectional stream carries every piece of cluster↔shard traffic — roll-ups, bootstrap-blob request/response, reclaim instructions and acks, node-state and available-capacity updates (shard.proto:1-14). The operator is outbound-only: it dials the shard, holds one long-lived stream, and never opens an inbound listener. This is a hard rule, not a convenience — it means a cluster needs no public API surface, no ingress, no inbound firewall hole, and BigFleet can reach a cluster only through a connection the cluster itself opened. There is no inbound GenerateBootstrap RPC on the operator; bootstrap is pulled over this stream (see below).

The two message envelopes are oneofs. OperatorMessage (shard.proto:27-34) carries Hello, ClusterCapacityNeeds, BootstrapBlobResponse, ReclaimAck. ShardMessage (shard.proto:37-45) carries Acknowledgement, BootstrapRequest, ReclaimInstruction, NodeStateUpdate, AvailableCapacityUpdate. The direction of each is the whole protocol: the shard pulls bootstrap blobs and pushes reclaim instructions and state updates; the operator answers and acks.

`Hello` first, and identity binding

The first frame must be Hello (shard.proto:49-56; enforced at pkg/shard/session.go:37-45 — a non-Hello first frame is InvalidArgument). Hello carries cluster_id and protocol_version; its field 2 (capabilities) is reserved — capabilities were never negotiated and the field was removed in M66.1. On reconnect the operator sends a fresh Hello and the shard re-acks it (pkg/shard/session.go:121-122); re-Hello mid-stream is rare but legal.

Hello.cluster_id is otherwise a free-text impersonation vector — a forged value could receive another cluster’s reclaim instructions or zero its capacity with a forged full-replacement roll-up. On an mTLS transport the shard binds it: the client certificate’s URI SAN must assert the cluster being claimed, or the session is rejected PermissionDenied (pkg/shard/session.go:47-63, ADR-0048). Plaintext transports skip the check — identity is only as strong as the transport, and that posture is documented in the ADR. Identity binding is detailed in fencing-and-identity.md.
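The two admission checks, Hello first and then identity binding, can be sketched as a small state machine. This is not the code in pkg/shard/session.go: the Frame and session types are hypothetical, plain errors stand in for the gRPC status codes, and the set of cluster IDs asserted by the certificate is passed in as a map, with nil standing for a plaintext transport.

```go
package main

import (
	"errors"
	"fmt"
)

// Frame is a simplified inbound operator message.
type Frame struct {
	Kind      string // "hello", "needs", ...
	ClusterID string // only meaningful on hello
}

var (
	errFirstFrame = errors.New("first frame must be Hello")        // InvalidArgument in the text
	errIdentity   = errors.New("cluster_id not asserted by cert")  // PermissionDenied in the text
)

type session struct {
	greeted   bool
	clusterID string
}

// accept validates one frame. certClusters is nil on a plaintext transport,
// in which case the identity check is skipped.
func (s *session) accept(f Frame, certClusters map[string]bool) error {
	if f.Kind == "hello" {
		if certClusters != nil && !certClusters[f.ClusterID] {
			return errIdentity
		}
		s.greeted, s.clusterID = true, f.ClusterID
		return nil
	}
	if !s.greeted {
		return errFirstFrame
	}
	return nil
}

func main() {
	var s session
	cert := map[string]bool{"cluster-a": true}
	fmt.Println(s.accept(Frame{Kind: "needs"}, cert))
	fmt.Println(s.accept(Frame{Kind: "hello", ClusterID: "cluster-b"}, cert))
	fmt.Println(s.accept(Frame{Kind: "hello", ClusterID: "cluster-a"}, cert))
	fmt.Println(s.accept(Frame{Kind: "needs"}, cert))
}
```

A real session would close the stream on either error; the sketch keeps going only so that all four cases can be shown on one session value.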

`Acknowledgement` echoes term and epoch — for humans, not control flow

Acknowledgement (shard.proto:58-69) echoes the request kind plus coordinator_term and shard_epoch at ack time. The docstring is explicit that operators do not act on the term/epoch — they are for incident review, to correlate which coordinator term and shard epoch processed a message. Putting them on the ack and not making operators react to them keeps the operator free of any hot-path dependency on coordinator state, consistent with static stability.

Bootstrap is pulled, not pushed

BootstrapRequest (shard.proto:73-84) is the shard pulling a kubelet bootstrap blob from the operator; the operator answers with BootstrapBlobResponse (shard.proto:86-102) echoing the request_id. The shard sends the request and blocks on the matching response (requestBootstrap, pkg/shard/session.go:315-329); the operator renders the blob via its BootstrapRenderer and replies (pkg/operator/bootstrap.go:9-14). The blob is provider-opaque (user_data) and forwarded as-is into CapacityProvider.Configure (shard.proto:90-91). Two contract details matter:

ttl_seconds (shard.proto:93-95) bounds the credential lifetime — the shard must apply the blob before it expires or pull a fresh one. The wire carries the deadline so the shard never applies a stale token.
A non-empty error (shard.proto:97-101) — e.g., the requested kubelet version is outside the apiserver skew window — is treated by the shard as an unsatisfiable requirement, which becomes a shortfall. An unsatisfiable bootstrap is a capacity problem, surfaced the same way a stockout is, not a transport error retried forever.
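The two contract details combine into a three-way decision on the shard side. The sketch below is illustrative only: the Blob and Action types and the decide function are hypothetical, timestamps are simplified to Unix seconds, and counting the TTL from the moment the response was received is an assumption.

```go
package main

import "fmt"

// Blob is a simplified BootstrapBlobResponse.
type Blob struct {
	UserData   []byte
	TTLSeconds int64
	Error      string
}

// Action is what the shard does with a received blob.
type Action string

const (
	Apply     Action = "apply"
	Refetch   Action = "refetch"
	Shortfall Action = "shortfall"
)

// decide picks an action for a blob received at receivedUnix when the clock reads nowUnix.
func decide(b Blob, receivedUnix, nowUnix int64) Action {
	if b.Error != "" {
		return Shortfall // unsatisfiable requirement, not a transport retry
	}
	if nowUnix >= receivedUnix+b.TTLSeconds {
		return Refetch // credential expired before it could be applied
	}
	return Apply
}

func main() {
	fresh := Blob{UserData: []byte("token"), TTLSeconds: 300}
	fmt.Println(decide(fresh, 1000, 1100))
	fmt.Println(decide(fresh, 1000, 1300))
	fmt.Println(decide(Blob{Error: "kubelet version outside skew window"}, 1000, 1001))
}
```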

Reclaim instructions and the priority-scaled grace

ReclaimInstruction (shard.proto:107-120) tells the operator to drain specific Kubernetes node names with a supplied grace_period_seconds; the operator passes that grace to the kubelet’s graceful node shutdown, honouring PDBs up to the bound. The grace scales with the priority gap between preemptor and victim (10s / 30s / 2m / 10m, shard.proto:104-106); preemptor_priority (shard.proto:117-119) is observability-only — operators may surface it in events but must not branch on it. ReclaimAck (shard.proto:122-128) returns nodes_started, which equals the instruction count unless some nodes already vanished locally — so the shard learns the operator began draining even when its node-set raced a local deletion.

`NodeStateUpdate`, `AvailableCapacityUpdate`, and `supersedes_key` coalescing

The two shard→operator update types are the coalescing message types, and they carry an explicit supersedes_key (shard.proto:130-211). The mechanism: the sender’s outbox is permitted to drop an older queued frame when a newer one with the same key arrives, and the operator applies frames in arrival order. This is the design that makes reconnect ordering safe (the §0.1 stream-coalescing decision): because identity is an explicit field rather than implied by position or timing, a reconnect that replays the latest frame per key converges without any ordering subtlety — newer always wins, and “newer” is unambiguous.

NodeStateUpdate (shard.proto:133-171) drives UpcomingNode CR phase transitions. Its key is conventionally "node:<machine_id>" (set shard-side at pkg/shard/session.go:398-408). The machine_id is stable across speculative→idle→configured (shard.proto:138), so it is the natural coalescing identity for a machine’s lifecycle. The frame also carries labels, resources, and taints (shard.proto:168-170) — the node’s shape, copied into UpcomingNode.Spec so observers see what is coming before it joins; these are present-but-empty while the shard hasn’t yet bound a host. last_error is populated only when state == FAILED.
AvailableCapacityUpdate (shard.proto:185-211) is an advisory, eventually-consistent hint, keyed "available:<profile-fingerprint>". It carries cost_per_hour and a Confidence enum — real capacity is racy, so this is a hint, never a reservation.
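A keyed outbox that lets a newer frame supersede an older queued one can be sketched as follows. This is a single-goroutine illustration, not the shard's outbox: the Frame and Outbox types are hypothetical, and keeping the superseding frame in the older frame's queue position is an assumption.

```go
package main

import "fmt"

// Frame is a simplified coalescing message with an explicit identity.
type Frame struct {
	Key     string // supersedes_key, e.g. "node:<machine_id>"
	Payload string
}

// Outbox keeps at most one queued frame per key. Not safe for concurrent use.
type Outbox struct {
	order  []string         // keys in first-enqueue order
	latest map[string]Frame // newest frame per key
}

func NewOutbox() *Outbox {
	return &Outbox{latest: map[string]Frame{}}
}

// Push queues f, replacing any queued frame that carries the same key.
func (o *Outbox) Push(f Frame) {
	if _, queued := o.latest[f.Key]; !queued {
		o.order = append(o.order, f.Key)
	}
	o.latest[f.Key] = f
}

// Drain returns the queued frames and empties the outbox.
func (o *Outbox) Drain() []Frame {
	out := make([]Frame, 0, len(o.order))
	for _, k := range o.order {
		out = append(out, o.latest[k])
	}
	o.order = nil
	o.latest = map[string]Frame{}
	return out
}

func main() {
	o := NewOutbox()
	o.Push(Frame{Key: "node:m1", Payload: "IDLE"})
	o.Push(Frame{Key: "available:p1", Payload: "LOW"})
	o.Push(Frame{Key: "node:m1", Payload: "CONFIGURED"})
	for _, f := range o.Drain() {
		fmt.Println(f.Key, f.Payload)
	}
}
```

Only two frames leave the outbox, and the one for node:m1 is the newest. A reconnect that replays this drained set therefore converges to the latest state per key regardless of how many intermediate frames were produced.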

Note the layering: coalescing happens at the shard’s outbox, and the operator-side handler explicitly does not re-check the key — it applies in arrival order and tolerates a stale frame racing a fresh one, because a late write is harmless and the next frame corrects it (pkg/operator/upcoming.go:42-45). The operator separately coalesces inbound NodeStateUpdates per machine to collapse rapid Idle→Configuring→Configured bursts into one apiserver write of the terminal state (coalesceNodeStateUpdate, pkg/operator/stream.go:99-161, M44.4) — that is a write-amplification optimisation, distinct from the wire-level supersedes_key semantics, though both rely on machine_id as identity.

The non-coalescing operator→shard responses (BootstrapBlobResponse, ReclaimAck) go through a bounded outbox that drops-newest under load with a metric (pkg/operator/stream.go:81-86, 193-207): they are RPC responses tied to a shard request, so if the queue is full the shard has already timed out and will re-issue — a queued response would deliver too late to matter. Drop-newest is correct precisely because these are not full-replacement and not coalescing; each is a one-shot reply.
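Drop-newest on a bounded queue is the natural behaviour of a non-blocking send on a buffered channel. The sketch below assumes that shape; the Outbox type here is hypothetical, the counter stands in for the metric, and atomic.Int64 requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Outbox is a bounded queue of one-shot replies.
type Outbox struct {
	ch      chan string
	dropped atomic.Int64 // stand-in for a drop metric
}

func NewOutbox(capacity int) *Outbox {
	return &Outbox{ch: make(chan string, capacity)}
}

// TrySend queues reply if there is room; otherwise it drops the new reply and counts it.
func (o *Outbox) TrySend(reply string) bool {
	select {
	case o.ch <- reply:
		return true
	default:
		o.dropped.Add(1)
		return false
	}
}

func main() {
	o := NewOutbox(2)
	for _, r := range []string{"ack-1", "ack-2", "ack-3"} {
		fmt.Println(r, "queued:", o.TrySend(r))
	}
	fmt.Println("dropped:", o.dropped.Load())
}
```

The replies already queued are left untouched and the late arrival is the one discarded, which matches the reasoning above: the shard re-issues the request it timed out on, so nothing has to be remembered on the operator side.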

`provider.proto` — the contract, at the wire level

The mechanics of the six RPCs, List+Get reconciliation, the dial-out client, and the test fake are in provider-protocol.md. Here we cover only the wire-level semantics and round-trip invariants.

Six RPCs, no Watch. Create, Configure, Drain, Delete, Get, List (api/proto/bigfleet/v1alpha1/provider.proto:47-54). The four lifecycle RPCs are asynchronous — each returns a TransitionAck immediately and the actual transition is observed via subsequent List/Get — and idempotent on (machine_id, target_state) via operation_id reuse (provider.proto:5-7, 285-297). There is deliberately no Watch: reconciliation is List + Get (provider.proto:9-11). Adding a Watch would put a streaming dependency on the provider and re-introduce the staleness-vs-liveness problems List+Get sidesteps.

MachineState is the eight-state machine on the wire. MachineState (provider.proto:59-69) enumerates UNSPECIFIED plus the three stable (Speculative, Idle, Configured), four transitional (Creating, Configuring, Draining, Deleting), and one FAILED. The comments on each value record the host/cluster tuple invariants (SPECULATIVE ⇒ host=nil, cluster=0; CONFIGURED ⇒ host=set, cluster=set). The legal transitions and the provider RPC that drives each edge are in machine-lifecycle.md. conv round-trips every value including UNSPECIFIED, and stateFromProto errors on anything outside the enum (pkg/conv/conv.go:285-307) — an unknown state is rejected, not coerced.

since_revision is an opaque cursor (ADR-0004). ListFilter.since_revision (provider.proto:201-216) is opaque bytes returned by a prior MachineList.revision (provider.proto:189-197). When set, a supporting provider returns only machines whose state changed since the cursor; providers below the conformance threshold ignore it and return full state (provider.proto:211-215). Incremental List is optional and conformance-gated above a documented threshold — the cursor’s opacity is what lets each provider choose its own encoding without a wire change. ListFilter fields 2 and 3 (zone, instance_type) are reserved — unused filters removed in M66.1.
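One way a provider might implement an incremental List behind an opaque cursor is sketched below. Everything here is an assumption for illustration: the Provider and Machine types, the Touch helper, and the choice of an 8-byte big-endian revision counter as the cursor encoding. The caller only ever stores and returns the bytes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Machine is a simplified inventory record.
type Machine struct {
	ID  string
	Rev uint64 // revision at which this machine last changed
}

// Provider keeps a monotonically increasing revision counter.
type Provider struct {
	rev      uint64
	machines []Machine
}

// Touch records a state change for id at a new revision.
func (p *Provider) Touch(id string) {
	p.rev++
	for i := range p.machines {
		if p.machines[i].ID == id {
			p.machines[i].Rev = p.rev
			return
		}
	}
	p.machines = append(p.machines, Machine{ID: id, Rev: p.rev})
}

// List returns machines changed after the cursor, plus a cursor for the next call.
// An empty or unrecognised cursor yields full state.
func (p *Provider) List(since []byte) ([]Machine, []byte) {
	var floor uint64
	if len(since) == 8 {
		floor = binary.BigEndian.Uint64(since)
	}
	var out []Machine
	for _, m := range p.machines {
		if m.Rev > floor {
			out = append(out, m)
		}
	}
	next := make([]byte, 8)
	binary.BigEndian.PutUint64(next, p.rev)
	return out, next
}

func main() {
	var p Provider
	p.Touch("m1")
	p.Touch("m2")
	all, cursor := p.List(nil)
	p.Touch("m2")
	delta, _ := p.List(cursor)
	fmt.Println("full:", len(all), "incremental:", len(delta), delta[0].ID)
}
```

A provider that does not support incremental listing can ignore the argument entirely and still satisfy the same signature, which is what the conformance gating relies on.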

Fencing rides the mutating RPCs only (paper §11, M71). Every mutating RPC — Create/Configure/Drain/Delete — carries (shard_id, shard_epoch, sequence_number) (provider.proto:226-229, 244-247, 266-269, 278-283). Providers track a per-shard_id high-water mark compared lexicographically and reject a non-strictly-newer token with FAILED_PRECONDITION, which is reserved on this service for fencing so callers can alert on zombie-shard incidents mechanically (provider.proto:13-42). Retries are re-stamped with a fresh sequence_number so a transport retry is never mistaken for a replay; idempotency is keyed on (machine_id, target_state), never on the token. Get/List carry no token — reads don’t fence, because a zombie shard reading state harms nothing; only mutations actuate machines and money. DeleteRequest (provider.proto:272-283) exists as a distinct message from Get’s MachineRef purely so Delete can carry the token a read must not — and field 1 is the same string at the same number, so it is wire-compatible with the MachineRef it replaced.
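The provider-side high-water-mark check can be sketched as follows. The Token and Fence types are hypothetical, a plain error stands in for FAILED_PRECONDITION, and admitting the first token seen for an unknown shard_id is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Token is the (shard_epoch, sequence_number) part of a fencing token.
type Token struct {
	Epoch uint64
	Seq   uint64
}

// newer reports whether t is strictly greater than o in (Epoch, Seq) lexicographic order.
func (t Token) newer(o Token) bool {
	if t.Epoch != o.Epoch {
		return t.Epoch > o.Epoch
	}
	return t.Seq > o.Seq
}

// errFenced stands in for the FAILED_PRECONDITION status reserved for fencing.
var errFenced = errors.New("fenced: token is not strictly newer")

// Fence tracks a high-water mark per shard_id. Not safe for concurrent use.
type Fence struct {
	high map[string]Token
}

// Admit accepts t only if it is strictly newer than the last admitted token for shardID.
func (f *Fence) Admit(shardID string, t Token) error {
	if prev, seen := f.high[shardID]; seen && !t.newer(prev) {
		return errFenced
	}
	if f.high == nil {
		f.high = map[string]Token{}
	}
	f.high[shardID] = t
	return nil
}

func main() {
	var f Fence
	fmt.Println(f.Admit("shard-1", Token{Epoch: 2, Seq: 5}))  // first token
	fmt.Println(f.Admit("shard-1", Token{Epoch: 2, Seq: 5}))  // replay
	fmt.Println(f.Admit("shard-1", Token{Epoch: 1, Seq: 99})) // zombie from an older epoch
	fmt.Println(f.Admit("shard-1", Token{Epoch: 3, Seq: 0}))  // new epoch, sequence restarts
}
```

The replayed token is rejected even though nothing about it is older, which is why a transport retry has to be re-stamped with a fresh sequence_number before it is sent again.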

The round-trip invariants that make shard restart lose nothing (M72). A shard rebuilds its inventory from List+Get after a restart, so the wire Machine (provider.proto:97-180) must carry enough to reconstruct a Configured machine’s full protection state — otherwise the shard rejects every provider-reported Configured record at ingest. Two fields exist for exactly this:

cluster (provider.proto:153-164) — the binding, recorded from ConfigureRequest.cluster_id. The provider legitimately knows it (paper §7: Configure carries the cluster), so it is provider-domain state. Populated for CONFIGURING/CONFIGURED/DRAINING, cleared when a Drain completes back to IDLE. Without it the wire could not round-trip a Configured machine and a restarted shard zeroed preemption protection.
shard_metadata (provider.proto:166-179) — opaque shard-side metadata stored from ConfigureRequest.shard_metadata and echoed verbatim on every snapshot. The contract is STORE AND ECHO, NEVER INTERPRET: keys are BigFleet-internal assignment attribution the shard must recover after a restart, deliberately not first-class fields so no provider reads meaning into them. Providers must preserve unknown keys byte-for-byte and must clear the whole map together with cluster when a Drain completes — it is per-assignment state, not per-machine state, and a stale echo would let a restarted shard resurrect a dead workload’s attribution onto a new assignment.

conv carries both verbatim and never manufactures either from the engine’s own Assigned* fields — decoding the well-known metadata keys is the shard’s job at reconcile ingest, not the wire’s (MachineFromProto/MachineToProto, pkg/conv/conv.go:215-223, 260-282). This is the only persistent store the data plane has; it is what makes shard restart lose nothing.
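From the provider's side, the store-and-echo contract amounts to three operations. The sketch below is a hypothetical in-memory provider, not any real implementation: the type and method names are invented, and modelling metadata values as strings is an assumption about the map's value type.

```go
package main

import "fmt"

// assignment is the per-assignment state recorded at Configure time.
type assignment struct {
	cluster string
	meta    map[string]string
}

// Provider stores assignments keyed by machine_id. Not safe for concurrent use.
type Provider struct {
	assigned map[string]assignment
}

// copyMap returns an independent copy so callers cannot mutate stored state.
func copyMap(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

// Configure stores the binding and the metadata without looking inside either.
func (p *Provider) Configure(machineID, clusterID string, meta map[string]string) {
	if p.assigned == nil {
		p.assigned = map[string]assignment{}
	}
	p.assigned[machineID] = assignment{cluster: clusterID, meta: copyMap(meta)}
}

// Snapshot echoes what was stored, unknown keys included.
func (p *Provider) Snapshot(machineID string) (string, map[string]string) {
	a, ok := p.assigned[machineID]
	if !ok {
		return "", nil
	}
	return a.cluster, copyMap(a.meta)
}

// DrainComplete clears cluster and metadata together when the machine returns to IDLE.
func (p *Provider) DrainComplete(machineID string) {
	delete(p.assigned, machineID)
}

func main() {
	var p Provider
	p.Configure("m1", "cluster-a", map[string]string{"some-internal-key": "opaque-value"})
	cluster, meta := p.Snapshot("m1")
	fmt.Println(cluster, meta)
	p.DrainComplete("m1")
	cluster, meta = p.Snapshot("m1")
	fmt.Println(cluster == "", meta == nil)
}
```

Nothing in the provider ever reads a key by name. Clearing both fields in one step is what prevents a stale echo from attaching an old workload's attribution to the machine's next assignment.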

Resources vs Allocatable (ADR-0022). Machine.resources (provider.proto:122-131) is the per-replica request shape the machine is bound to satisfy; allocatable (provider.proto:142-151) is what the hardware actually provides; density is floor(Allocatable / Resources). On the wire allocatable is emitted only when set — when unset, the consumer falls back to treating it as equal to resources (the pre-ADR-0022 1-CR-per-machine math), and conv never manufactures a redundant value on read or write (pkg/conv/conv.go:236-243, 275-281). interruption_probability (provider.proto:115-118) is provider-declared only — forecast for speculative, observed for real — with no cluster-side override; that is the second hard rule the cost formula depends on, alongside the formula itself being fixed.
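The density rule and the unset-allocatable fallback can be sketched in a few lines. The density function below is hypothetical, quantities are simplified to integers, and reading floor(Allocatable / Resources) on vectors as the minimum of the per-resource quotients is an assumption.

```go
package main

import "fmt"

// density returns how many replicas of the given shape fit on one machine.
// An empty allocatable falls back to resources, which yields a density of 1.
func density(allocatable, resources map[string]int64) int64 {
	if len(allocatable) == 0 {
		allocatable = resources
	}
	best := int64(-1)
	for name, req := range resources {
		if req <= 0 {
			continue // a zero request places no limit
		}
		n := allocatable[name] / req // integer division floors for non-negative values
		if best < 0 || n < best {
			best = n
		}
	}
	if best < 0 {
		return 0 // no positive request to divide by
	}
	return best
}

func main() {
	shape := map[string]int64{"cpu": 4000, "memory": 32}
	hardware := map[string]int64{"cpu": 64000, "memory": 256}
	fmt.Println(density(hardware, shape)) // limited by memory
	fmt.Println(density(nil, shape))      // allocatable unset on the wire
}
```

The machine in the example could host 16 replicas by CPU but only 8 by memory, so the density is 8. With allocatable unset the same call returns 1, which is the 1-CR-per-machine behaviour the fallback preserves.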

`coordinator.proto` — the slow tier, pulled not pushed

The coordinator surface is detailed in coordinator-raft.md; here is what is load-bearing about its wire shape. The coordinator does not make provisioning decisions — it owns shard membership, the cluster→shard and topology-domain→shard maps, quota allocations, and the provider registry, and issues rebalancing instructions (api/proto/bigfleet/v1alpha1/coordinator.proto:1-6).

The data-plane RPC is one unary call: ReportShard(ShardReport) returns (ReportAck) (coordinator.proto:24). Shards pull — they report periodically (~30s), and the ReportAck piggybacks any pending CoordinatorInstructions plus the coordinator’s current term (coordinator.proto:235-249). The shard acks instructions on its next ReportShard via instruction_acks (coordinator.proto:172-189, 314-328). This deliberately keeps the v1 wire surface to one request/response RPC rather than a streaming Instructions RPC — and, more importantly, it is why the hot path has no inbound coordinator dependency: the coordinator cannot push to a shard; it can only answer a poll the shard chose to make. A coordinator outage stalls rebalancing and quota allocation, never the shard’s own decision loop. pkg/shard must not import pkg/coordinator, and the report client lives in pkg/shard/coordclient.

ShardReport (coordinator.proto:149-189) is stateless — the coordinator builds its fleet picture from the latest report per shard, discarding out-of-order delivery by cycle. It carries a coarse ShardSummary (bounded — sized to fit 200 peers in cache-resident state, coordinator.proto:191-210), the top-N Shortfall rows (bounded to 100, coordinator.proto:215-233), and shard_address for forward-compatible self-registration: an unknown shard_id causes the coordinator to Raft-Apply AddShard with that address and accept the report as a heartbeat (coordinator.proto:181-188). This is the closest thing to “registration” BigFleet has, and notably it is implicit and idempotent — there is no register/deregister API; a shard exists the moment it reports.

Every CoordinatorInstruction (coordinator.proto:254-271) is fenced by (coordinator_term, sequence_number); a shard rejects any instruction whose term is below its high-water mark (OUTCOME_REJECTED_STALE, coordinator.proto:317-323) — zombie-leader protection, the term-based analogue of the provider epoch fence. Instruction payloads are AssignDomain/UnassignDomain (domain granularity, not machine — ~100K entries not 100M, the §0.1 decision), ReassignSpeculative (pure quota bookkeeping, no real machines), CrossShardDrain, and TransferOwnership. Cross-shard machine reassignment is largely post-v1; the messages exist so the wire need not break when it lands. The admin/membership RPCs (AssignDomain, ListShards, ListQuotas, JoinRaftCluster, SnapshotSave, …) are leader-only and return FailedPrecondition on followers (coordinator.proto:26-61).

The CRDs — `bigfleet.lucy.sh/v1alpha1`

Three CRDs, all v1alpha1. The Go types are in pkg/apis/bigfleet/v1alpha1/; the YAML in api/crd/. The split of responsibility is clean: CapacityRequest is the only writable surface for users; AvailableCapacity and UpcomingNode are operator-written read-backs that exist so kubectl shows BigFleet acting.

`CapacityRequest` — the write path, one CR per pod

CapacityRequest (api/crd/bigfleet.lucy.sh_capacityrequests.yaml) is namespaced, declares a single pod’s resource need, and is the inverse of CapacityNeed: a user (or the optional unschedulable-pod controller) writes it, the operator reads and aggregates it. Withdrawal is implicit — deleting the pod garbage-collects the CR via ownerReferences, so the next roll-up simply omits it (capacityrequests.yaml:54-59), which dovetails with full-replacement roll-up semantics: the operator never has to emit an explicit “remove” because the absence is the removal.

The spec carries requirements (standard operators only, not Same; capacityrequests.yaml:161-194), resources (the per-replica request), priority (int32, higher wins), interruptionPenalty/reclamationPenalty (dollars as int-or-string, bucketed by the operator; the CRD documents the fixed cost formula and the bucketing scheme inline, capacityrequests.yaml:129-160), topologySpread, and coLocation (the podAffinity projection that becomes Same). The status is a deliberately minimal two-phase lifecycle: Pending → Acknowledged, a single one-way transition written by the operator the first time a CR appears in a roll-up (capacityrequests.yaml:231-251, paper §6). The operator writes it via an idempotent JSON merge-patch on the status subresource: one apiserver call per CR, no resource-version precondition (markAcknowledged, pkg/operator/rollup.go:413-438).

A subtlety worth knowing: the engine recognises four outcomes (Pending / Acknowledged / Shortfall / Released, per ../api-reference.md), but the CRD status enum is only {Pending, Acknowledged} — the richer phase vocabulary is surfaced through other read-backs and metrics, not stamped back onto every CR. Acknowledged means “the shard accepted the roll-up and either has the inventory or is provisioning it”, not “your pods are scheduled” — satisfied-but-stuck is the cluster’s problem, never BigFleet’s (ADR-0045).

`AvailableCapacity` — the racy hint

AvailableCapacity (api/crd/bigfleet.lucy.sh_availablecapacities.yaml) is cluster-scoped, one per profile fingerprint, written from AvailableCapacityUpdate frames. The CRD docstring states the contract bluntly: real capacity is inherently racy; consumers must treat AvailableCapacity as a hint, not a reservation (availablecapacities.yaml:34-36, 60-70). It carries availability (the Confidence enum: None/Low/Medium/High), availableCount, cost (per-hour USD), and the matchable requirements/resources. status is intentionally empty in v1alpha1 — the spec carries everything (availablecapacities.yaml:132-136). The operator keys the CR name off the update’s supersedes_key (pkg/operator/upcoming.go:224-231), so the coalescing identity on the wire becomes the CR identity in the cluster — one stable object per profile that newer frames overwrite.

`UpcomingNode` — the placeholder, driven by `NodeStateUpdate`

UpcomingNode (api/crd/bigfleet.lucy.sh_upcomingnodes.yaml) is cluster-scoped, one per machine the shard is bringing up, so kubectl describe pod can show users BigFleet is acting on their unschedulable pod. Its spec (labels, resources, taints) is copied from the NodeStateUpdate frame’s node-shape fields, set once and stable because machine_id is stable (upcomingnodes.yaml:56-110). Its status phase is the projection of MachineState, and the mapping is the one place a divergence is worth flagging:

| `MachineState` (wire) | `UpcomingNode` phase | source |
| --- | --- | --- |
| `SPECULATIVE`, `CREATING` | `Provisioning` | `pkg/operator/upcoming.go:401-403` |
| `IDLE` | `Launched` | `upcoming.go:405-406` |
| `CONFIGURING` | `Registered` | `upcoming.go:407-408` |
| `CONFIGURED` | `Ready` | `upcoming.go:409-410` |
| `DRAINING` | `Draining` | `upcoming.go:411-412` |
| `FAILED` | `Failed` | `upcoming.go:413-414` |

Divergence to know: the UpcomingNode phase vocabulary (Provisioning/Launched/Registered/Ready/Draining/Drained/Failed, upcomingnodes.yaml:166-174) is a Kubernetes-node-lifecycle projection, not a 1:1 rename of MachineState. CONFIGURING maps to Registered and CONFIGURED to Ready — the phase names describe what a cluster observer sees (a node registering, then becoming Ready), not the internal machine state. DELETING is not in the upcomingNodePhase switch at all and falls through the default to Provisioning (pkg/operator/upcoming.go:401-416). Drained is not produced by the pure MachineState→phase function either: a fully-drained machine returns to IDLE, which maps to Launched — so the handler infers Drained from the transition, promoting Launched to Drained only when the existing CR was in Draining (pkg/operator/upcoming.go:138-143), then deletes the CR at that terminus rather than leaving a stale-phase record (Drop AA, upcoming.go:145-164). The reclaim path also stamps Drained directly on per-node drain completion (pkg/operator/reclaim.go:145-161). The handler retries on apiserver Conflict (pkg/operator/upcoming.go:56-72, Drop S) because the shard does not spontaneously re-emit NodeStateUpdates — a dropped status write would otherwise strand the CR on a stale phase until some unrelated event refreshed it.
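The mapping and the Drained inference can be sketched as two small functions. This is an illustration, not the code in pkg/operator/upcoming.go: states and phases are modelled as plain strings, and the names phaseFor and nextPhase are hypothetical.

```go
package main

import "fmt"

// phaseFor is the pure MachineState to phase projection.
func phaseFor(state string) string {
	switch state {
	case "SPECULATIVE", "CREATING":
		return "Provisioning"
	case "IDLE":
		return "Launched"
	case "CONFIGURING":
		return "Registered"
	case "CONFIGURED":
		return "Ready"
	case "DRAINING":
		return "Draining"
	case "FAILED":
		return "Failed"
	default:
		// DELETING and anything unlisted fall through to here.
		return "Provisioning"
	}
}

// nextPhase adds the transition-based inference: Launched after Draining means Drained.
func nextPhase(existingPhase, state string) string {
	p := phaseFor(state)
	if p == "Launched" && existingPhase == "Draining" {
		return "Drained"
	}
	return p
}

func main() {
	fmt.Println(nextPhase("", "IDLE"))            // a machine on its way up
	fmt.Println(nextPhase("Draining", "IDLE"))    // a machine that just finished draining
	fmt.Println(nextPhase("Draining", "DELETING")) // not in the switch
}
```

The same wire state, IDLE, yields two different phases depending on what the CR said before, which is why Drained cannot come out of a function of MachineState alone.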

The status also carries nodeRef (set once the kubelet registers and the Node object exists), providerID (echoed for debugging, e.g. aws:///us-east-1a/i-…), provisioningStartTime, and lastError (upcomingnodes.yaml:111-185). The shard never reads these back — they are pure observability, written outbound only.

Versioning

Everything is v1alpha1 until v1 is cut. The compatibility bar: any field added under v1alpha1 after v1 is additive-only; breaking changes ship as v1alpha2, never as a silent rename (../api-reference.md “Versioning”). The reserved tombstones throughout the protos (CapacityNeed.count, Shortfall.count, the Hello capabilities field, the M66.1-removed ListFilter/Machine/AvailableCapacityUpdate fields) are how that bar is held on the wire: a removed field’s number is never re-used, so an old reader and a new writer never disagree about what a tag means. Generated Go bindings live in pkg/proto/bigfleet/v1alpha1/ and the CRD Go types in pkg/apis/bigfleet/v1alpha1/; both are generated — make generate is the single entry point, and hand-editing generated files is a working-discipline violation.
