PATTERN Cited by 1 source

State validation with auto-reapply and reboot

Intent

When applied configuration may be wiped by an out-of-band event (e.g. firmware upgrade), keep the live state converging on the declared state by (1) validating after every change, (2) re-applying if drift is detected, (3) triggering a reboot to make the re-applied state effective.

A reconciliation loop adapted to substrates where the effective state lives behind a reboot boundary.

Context

Firmware configuration is not guaranteed to persist across firmware upgrades: "Configuration settings are often reset following a UEFI firmware upgrade" (Cloudflare 2026-06-01 core boot-time post). Any fleet automation that has declaratively set something at the firmware layer (boot order, secure boot config, hardware tunables) cannot trust that the setting will survive subsequent firmware-update operations. The upgrade is often the very operation the automation is performing, so a one-shot apply is not enough.

Mechanism

"To address these edge cases, we implemented a state validation step. The firmware automation now validates the configuration post-change: if it detects that settings have been modified, it re-applies the config and triggers a reboot."

Three structural pieces:

  1. Post-change validation step. After any firmware operation (upgrade, write, reset), the automation reads the current state of the relevant variables and compares against the declared state. The comparison is value-level (or hex-level — see patterns/hex-comparison-flag-for-ipxe-config-check for the iPXE-specific variant).
  2. Conditional re-apply. If validation reveals drift, the automation re-applies the declared configuration. This is idempotent: re-applying when no drift exists is a no-op.
  3. Reboot trigger. Firmware-level configuration usually only takes effect on the next boot. The automation explicitly triggers a reboot rather than waiting for a future natural one — otherwise the visible state diverges from the applied state until something else cycles the server.
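
The three pieces above can be sketched in Python as a single function. This is a minimal illustration, not Cloudflare's implementation: the names validate_and_converge, find_drift and Outcome are hypothetical, and the firmware read, write and reboot operations are passed in as plain callables so the control flow can be exercised against an in-memory stand-in for NVRAM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Settings = Dict[str, str]
Drift = Dict[str, Tuple[str, Optional[str]]]  # name -> (declared, live)


@dataclass
class Outcome:
    drift: Drift
    reapplied: bool
    reboot_requested: bool


def find_drift(declared: Settings, live: Settings) -> Drift:
    """Value-level comparison: every declared setting whose live value differs."""
    return {
        name: (want, live.get(name))
        for name, want in declared.items()
        if live.get(name) != want
    }


def validate_and_converge(
    declared: Settings,
    read_live: Callable[[], Settings],   # hypothetical: reads firmware variables
    apply: Callable[[Settings], None],   # hypothetical: writes firmware variables
    reboot: Callable[[], None],          # hypothetical: requests a reboot
) -> Outcome:
    # 1. Post-change validation: compare live state against declared state.
    drift = find_drift(declared, read_live())
    if not drift:
        # No drift: nothing is written and no reboot is requested.
        return Outcome({}, False, False)
    # 2. Conditional re-apply: write back only the settings that drifted.
    apply({name: want for name, (want, _) in drift.items()})
    # 3. Reboot trigger: the re-applied values only take effect on the next boot.
    reboot()
    return Outcome(drift, True, True)


# In-memory stand-in for NVRAM after an upgrade has reset the boot order.
nvram = {"boot_order": "disk,pxe", "secure_boot": "enabled"}
declared = {"boot_order": "pxe,disk", "secure_boot": "enabled"}
print(validate_and_converge(declared, lambda: dict(nvram), nvram.update, lambda: None))
```

Calling the function a second time against the same state finds no drift and returns without writing or rebooting, which is the idempotence property described in step 2.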

Cloudflare's iPXE-script form of the validation step (verbatim):

# construct path to read the update variable
set buffer-var-guid 91468514-75bc-4bb5-8f33-91efff9e9b1f
set var-upd-path efivar/CfHIIVarUpd-${buffer-var-guid}

# Run the config change command
imgexec <signed CF UEFI configuration App> set ${uefi-setting}=${uefi-value}

# Compare the update variable with the expected value if it has changed.
# If it has changed, set the local variable to reboot the system
iseq ${uefi-same-hex} ${${var-upd-path}} || set has-changed ${uefi-diff-hex}

Operational trade-off (Cloudflare 2026-06-01)

"Although the first boot may take slightly longer, this change drastically reduces the time required for all future start-ups from about 20 minutes to less than a minute per subsequent boot."

The validation+reapply+reboot loop adds wall-clock time to the first post-upgrade boot. Cloudflare deems this acceptable because the cost is amortised: once the declared boot order takes effect, each subsequent boot takes under a minute instead of about 20 minutes, which pays back the validation cost across the fleet.
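
The amortisation argument can be made concrete with a small break-even calculation. The 20-minute and 1-minute defaults are rounded from the quoted figures ("about 20 minutes", "less than a minute"); the first-boot overhead is not stated in the source, so the values passed in below are purely hypothetical, as is the function name break_even_boots.

```python
import math


def break_even_boots(
    first_boot_overhead_min: float,
    old_boot_min: float = 20.0,  # "about 20 minutes" per the quote
    new_boot_min: float = 1.0,   # upper bound for "less than a minute"
) -> int:
    """Number of subsequent boots needed to recover the one-off overhead."""
    saving_per_boot = old_boot_min - new_boot_min
    if saving_per_boot <= 0:
        raise ValueError("new boot time must be shorter than the old one")
    return math.ceil(first_boot_overhead_min / saving_per_boot)


# Hypothetical overheads: even a 10-minute penalty is repaid by one later boot.
print(break_even_boots(10))
print(break_even_boots(40))
```

Under these assumptions the overhead would have to exceed roughly 19 minutes before more than one subsequent boot is needed to recover it.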

Why "reboot" is part of the pattern (and not just "apply")

Most application-layer reconciliation loops apply config and the change is live immediately. At the firmware layer:

  • The variable change lands in NVRAM, but the effective boot path is determined at the next boot.
  • A long-running server can have its declared boot order set correctly but still boot incorrectly the next time it cycles for an unrelated reason — unless the apply step is paired with an explicit reboot to validate.

The reboot trigger turns the loop into apply → validate → reboot → next-boot-uses-correct-config, closing the loop on the firmware substrate.

When to use

  • Configuration substrate where settings can be wiped by an out-of-band event (firmware upgrade, factory reset, NVRAM realloc).
  • Effective state lives behind a reboot or restart boundary (firmware, kernel, system services with cold-load config).
  • Operating at fleet scale where manual re-apply per machine is impractical.

When not to use

  • Configuration substrate that is read continuously and enforced in real time (containerised config, sidecar-based policies) — the change can be detected and re-applied without a reboot.
  • One-off administrative work where a human will be present to notice drift.

Risks

  • Repeated reboot loop if the apply step keeps failing (e.g. the OEM's immutable setting blocks the write); the validation succeeds in detecting drift, but the reapply doesn't fix it. Add a bounded retry count + alarm.
  • Stale expected value — if the declared state is out-of-date for a platform variant, every server will be forced to reboot unnecessarily. Tie the declared state to source-controlled configuration.
  • Reboot-storm during a fleet-wide firmware upgrade if every server triggers its reapply+reboot at once. Roll out via a controlled cadence.
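
Two of the mitigations above, the bounded retry count with an alarm and the controlled rollout cadence, can be sketched as follows. This is an illustrative sketch under stated assumptions: next_action and batches are hypothetical names, the limit of 3 attempts is arbitrary, and the attempt counter is assumed to be stored somewhere that survives the reboot (for example in the orchestration system rather than on the server being rebooted).

```python
from typing import Iterator, List, Sequence


def next_action(drift_detected: bool, attempts_so_far: int, max_attempts: int = 3) -> str:
    """Decide what to do after validation, given reapply attempts already made."""
    if not drift_detected:
        return "done"
    if attempts_so_far >= max_attempts:
        # Reapply keeps failing (e.g. an immutable setting): stop rebooting, raise an alarm.
        return "alarm"
    return "reapply_and_reboot"


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split the fleet into fixed-size groups so reboots are staggered."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


# Hypothetical fleet of five hosts, upgraded two at a time.
for group in batches(["h1", "h2", "h3", "h4", "h5"], 2):
    print(group)
```

Keeping the decision in a pure function makes the retry budget easy to test independently of any firmware tooling; the caller is responsible for incrementing the counter before each reboot.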