
CONCEPT Cited by 1 source

Heterogeneous fleet config skew

Heterogeneous fleet config skew is the failure mode where different hosts in the same fleet accumulate different configuration defaults over time — different package versions, different config-file defaults, different kernel parameters, different compile-time feature flags — and the skew surfaces only when the fleet is asked to perform a cross-host operation (migration, mirroring, failover, replication).

The canonical wiki phrasing

From Fly.io 2024-07-30:

Except: two different workers, for cursed reasons, might be running different versions of cryptsetup ... There are (or were) two different versions of cryptsetup on our network, and they default to different LUKS2 header sizes — 4 MiB and 16 MiB. Implying two different plaintext volume sizes.

Concretely:

  • Worker A runs cryptsetup version X; new Volumes get a 4 MiB LUKS2 header.
  • Worker B runs cryptsetup version Y; new Volumes get a 16 MiB header.
  • A migration from A to B requires the target to create a dm-clone device with the source's plaintext size. If the target worker just reads its own cryptsetup defaults, the clone is wrong-sized and the migration breaks.
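The size mismatch is plain arithmetic. A minimal sketch, assuming the plaintext size is simply the backing device size minus the LUKS2 header size, and using a hypothetical 1 GiB backing device (neither the device size nor the helper name comes from the source):

```python
MIB = 1024 * 1024


def plaintext_size(device_bytes: int, header_bytes: int) -> int:
    """Bytes left for plaintext once the LUKS2 header is subtracted.

    Assumption: the header occupies the start of the device and
    everything after it is usable payload.
    """
    if header_bytes >= device_bytes:
        raise ValueError("header does not fit on the device")
    return device_bytes - header_bytes


device = 1024 * MIB  # hypothetical 1 GiB backing device
size_a = plaintext_size(device, 4 * MIB)   # worker A's default header
size_b = plaintext_size(device, 16 * MIB)  # worker B's default header

print(size_a // MIB, size_b // MIB, (size_a - size_b) // MIB)  # 1020 1008 12
```

Under these assumptions the two workers disagree by 12 MiB about the plaintext size of an identically sized device, which is enough to make a clone device sized from the target's defaults wrong.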

Fix: Fly.io extended flyd's migration FSM with an RPC that carries the source's LUKS2 header metadata to the target worker, so the target creates a clone device of the correct plaintext size. "Not something we expected to have to build, but, whatever." This is the canonical instance of patterns/fsm-rpc-for-config-metadata-transfer.
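The shape of that fix can be sketched as follows. This is not flyd's actual RPC: the payload type, its field names, and the sizing helper are all hypothetical, and the sketch assumes the plaintext size is the device size minus the header size. The point it illustrates is that the target never consults its own default.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class SourceVolumeMetadata:
    """Hypothetical RPC payload: what the source worker reports about a Volume."""
    device_bytes: int        # size of the encrypted backing device
    luks2_header_bytes: int  # header size used when the Volume was created


# The target's own default; deliberately unused when sizing the clone.
LOCAL_DEFAULT_HEADER_BYTES = 16 * MIB


def clone_plaintext_bytes(meta: SourceVolumeMetadata) -> int:
    """Size the clone device from the source's metadata, never from local defaults."""
    if not 0 < meta.luks2_header_bytes < meta.device_bytes:
        raise ValueError("implausible header size in source metadata")
    return meta.device_bytes - meta.luks2_header_bytes


# Source worker created the Volume with a 4 MiB header on a 1 GiB device.
meta = SourceVolumeMetadata(device_bytes=1024 * MIB, luks2_header_bytes=4 * MIB)
print(clone_plaintext_bytes(meta) // MIB)  # 1020
```

Had the target used LOCAL_DEFAULT_HEADER_BYTES instead, it would have computed 1008 MiB for the same Volume.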

Why it's common

  • Fleet age is not uniform. Workers are built and patched over months or years; version-pinning strategies vary per component.
  • Tool defaults drift upstream. cryptsetup's default LUKS2 header size changed between versions while the on-disk format stayed compatible, so the skew is not a bug but a default drift.
  • Environment-dependent defaults. Some tools pick defaults based on kernel version, disk size, or build-time flags, so two workers with the same package version might still diverge.
  • Operators don't notice until a cross-host operation surfaces it.

Defence patterns

  • Force-record the config per resource, not per host. Store the authoritative parameters on the resource (the Volume's LUKS2 header, the image's content-hash, the cluster's version-pin) so any consumer can reconstruct the creation-time shape.
  • Carry config metadata in cross-host protocols. Examples: Fly's FSM RPC; the image: field of each container in a Kubernetes Pod spec (carries the image-pull identity across nodes); etcd's versioned schema.
  • Regular fleet homogenization scans. Detect skew before it surfaces in a migration.
  • Harden cross-host operations to read the source's parameters, not the target's defaults.
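A homogenization scan can be as simple as grouping hosts by the value each reports for a setting and flagging any setting with more than one group. A minimal sketch, assuming an inventory has already been collected as a host-to-facts mapping; the host names, the "kernel" entry, and the function name are illustrative, and X and Y stand in for the two cryptsetup versions as above:

```python
from collections import defaultdict


def find_skew(inventory: dict[str, dict[str, str]]) -> dict[str, dict[str, list[str]]]:
    """Return settings that take more than one value across hosts.

    inventory maps host -> {setting: value}. A host that does not report
    a setting is grouped under the value "<missing>".
    """
    settings = {key for facts in inventory.values() for key in facts}
    skewed = {}
    for setting in sorted(settings):
        by_value = defaultdict(list)
        for host in sorted(inventory):
            by_value[inventory[host].get(setting, "<missing>")].append(host)
        if len(by_value) > 1:  # more than one distinct value means skew
            skewed[setting] = dict(by_value)
    return skewed


inventory = {
    "worker-a": {"cryptsetup": "X", "kernel": "K1"},
    "worker-b": {"cryptsetup": "Y", "kernel": "K1"},
}
print(find_skew(inventory))
# {'cryptsetup': {'X': ['worker-a'], 'Y': ['worker-b']}}
```

Comparing package versions alone would miss environment-dependent defaults, so a scan like this is more useful when the inventory records the effective default (for example the header size a host would actually use) rather than only the version string.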

Seen in

Last updated · 200 distilled / 1,178 read