Skip to content

CLOUDFLARE 2026-06-01 Tier 1

Read original ↗

Cloudflare — How we reduced core unit boot time from hours to minutes

Summary

Cloudflare engineering post (2026-06-01) by the OpenBMC team narrating a fleet-wide regression in core-server boot time — following a routine firmware update, Gen12 core servers (the Gen12 generation, "nearly 2,000 units", running Cloudflare's centralised control plane / billing / analytics on bare metal) began taking four hours to come back online instead of minutes, stretching one-day rollouts into multi-day slogs and forcing engineers to babysit upgrades that should have run unattended. Root cause: "an over-eager linear search through every available network boot interface." During the network boot stage of the UEFI / iPXE sequence, the server attempts each available boot interface in a fixed order — IPv4 HTTPS → IPv4 iPXE → IPv6 HTTPS — and waits for each to time out before falling through to the next; each failed attempt burns "roughly five minutes" and the IPv6 HTTPS interface that actually works in this fleet sits at the end of the list, so every boot wastes ~20 minutes of timeout and a firmware upgrade (which requires multiple sequential reboots, one per component) compounds those penalties into nearly four hours of idle waiting per server. The fix is the declare-boot- interface-order-upfront pattern: rather than letting the server discover the right interface, the boot automation declares it upfront in the pre-boot PXE stage for each hardware/use-case combination, eliminating the linear-search timeout amplification entirely. Three ancillary obstacles required real engineering. (1) Legacy + persistence: older UEFI versions don't support boot ordering at all, and "configuration settings are often reset following a UEFI firmware upgrade" — so the automation now runs a [[patterns/state-validation-with-auto- reapply-and-reboot|state-validation step]] that detects drift post-change, re-applies the config, and triggers a reboot. (2) Vendor-locked immutable setting: an OEM Force Priority Httpv4 Httpv6 Pxev4 Pxev6 token blocked boot-order changes entirely, plus the EFI_IFR_REF3 "Network Boot Interface" data structure was lazy-loaded — "not instantiated until it is explicitly accessed via a GUI callback" so it was invisible to programmatic scans. Required a new BIOS version from the vendor and a debug session, with vendors enabling specific tokens within the fixed "Boot Order Module" that forces discovery during the boot sequence without requiring manual GUI interaction. (3) NIC vendor string drift: different NIC vendors emit different boot-interface strings (e.g. UEFI: HTTPS IPv4 Ethernet Network Adapter XXX-XXX-Y for OCP 3.0 P1 vs UEFI: HTTPS IPv4 Network Adapter - 50:00:E6:8F:4F:32 P1), causing exact-match config to mismatch — canonicalised at [[concepts/vendor-string-mismatch-on-fleet- config|vendor-string mismatch]] and resolved with the wildcard match .*HTTP.*IPv4.*P1 feature added to Cloudflare's CfHIIConfig_App tool. Cloudflare is "currently working with [their] UEFI vendors to standardize the network interface strings" (drop MAC + product details, keep protocol + transfer type + port + slot index — read product details from the NIC's embedded vital product data when needed) so wildcards can eventually be retired. (4) iPXE reads variables as HEX, which broke direct equality checks against expected string values, so the team added a uefi-same-hex boolean flag that lets the automation run a single set command instead of show followed by set ( hex-comparison flag). Headline operational outcomes: firmware-upgrade automation 4 hours → 3 minutes; subsequent single boot ~20 minutes → less than a minute. The endpoint is "a more dynamic system" where "a single BIOS firmware image serves all SKUs, configuration updates deploy at scale through our existing release pipeline, and the entire workflow operates from iPXE."

Key takeaways

  1. The bug was not a firmware regression — it was the absence of a declared boot order. When the post-update reports came in (servers stuck pre-OS), the team's first hypothesis was that the firmware update introduced a hang. Serial console showed the opposite: POST completed normally, hardware initialisation was healthy, but "instead of quickly reaching the network boot stage and pulling down an OS image, the server sat waiting. And waiting." The symptom was a hang; the cause was a sequence of timeouts from probing interfaces that would never respond.
  2. Linear search through boot interfaces is the canonical timeout-amplification failure mode. Verbatim: "the system was attempting an IPv4 HTTPS network boot, timing out after several minutes, then trying IPv4 iPXE, timing out again, then repeating both — all before finally reaching the IPv6 HTTPS boot interface that would actually succeed. Every failed network boot attempt burned roughly five minutes waiting for a timeout response. With four attempts stacking up before the correct interface was reached, a single boot cycle wasted around twenty minutes."
  3. Firmware-upgrade automation compounds per-boot waste into per-upgrade hours. "For a routine reboot, that's painful. For firmware upgrade automation, which requires multiple sequential reboots, one per component, those twenty-minute penalties compounded into nearly four hours of idle waiting per server." The bug existed before the firmware update — the update merely shifted which interface succeeded, exposing the timeout-amplification pattern that was structurally present all along.
  4. Two structural constraints make boot-order declaration non-trivial: legacy UEFI versions and config persistence. "Boot ordering is not supported on older UEFI versions" — so any solution must coexist with un-orderable hardware. "Configuration settings are often reset following a UEFI firmware upgrade" — so the automation cannot assume its declared order survives the very upgrade it's trying to accelerate. Both lead to the same operational primitive: validate-then-reapply-then-reboot.
  5. State validation with auto-reapply is the durable solution to firmware config persistence. "To address these edge cases, we implemented a state validation step. The firmware automation now validates the configuration post-change: if it detects that settings have been modified, it re-applies the config and triggers a reboot." The first boot after an upgrade may take slightly longer (because of the validation
  6. reapply + reboot loop), "but this change drastically reduces the time required for all future start-ups from about 20 minutes to less than a minute per subsequent boot."
  7. Vendor coordination is on the critical path for fleet automation. Two vendor obstacles blocked the fix until coordination resolved them. First, the EFI_IFR_REF3 structure was lazy-loaded by design ("standard industry practice to accelerate BIOS boot times") which made the network-boot-interface invisible to programmatic scans "because the structure hadn't been 'loaded' yet." Second, an immutable OEM setting (Force Priority Httpv4 Httpv6 Pxev4 Pxev6) prevented changing the boot order at all. Both required "a new BIOS version from our vendor and a debug session," with the OEM enabling tokens within the "Boot Order Module" that force discovery without GUI interaction.
  8. NIC vendor string drift is a fleet-wide configuration hazard, mitigated by wildcards as a stop-gap and string standardisation as the durable fix. Verbatim examples show the magnitude: UEFI: HTTPS IPv4 Ethernet Network Adapter XXX-XXX-Y for OCP 3.0 P1 vs UEFI: HTTPS IPv4 Network Adapter - 50:00:E6:8F:4F:32 P1 — same boot interface, irreconcilably different strings. The wildcard workaround is .*HTTP.*IPv4.*P1 (added to Cloudflare's CfHIIConfig_App tool); the durable fix is a standardisation effort with UEFI vendors to "only make use of the relevant information (e.g. protocol, transfer type, port number, and physical slot index) and drop the product details like the MAC address. The product details, if needed, can be read from the embedded vital product detail information of the network interface card."
  9. A small primitive — a boolean flag — replaces a two-step show-then-set with a single set, when the read path is broken. "Since iPXE reads this variable as HEX, it was reading the string output as hex. To check if the network boot setting was modified and to reduce boot time (so we don't have to print the variables before setting them), we implemented a boolean flag, uefi-same-hex, to indicate whether a configuration changed. This enabled us to run a single set command instead of first running show to compare, and then set if the configuration was not in the desired state." Reduces both wall-clock and round-trip count.
  10. The endpoint is "a more dynamic system" with one BIOS image for all SKUs. Verbatim: "a single BIOS firmware image serves all SKUs, configuration updates deploy at scale through our existing release pipeline, and the entire workflow operates from iPXE." Configuration becomes part of the release pipeline, not an out-of-band per-server knob — a generalisation of configuration-as-code applied to firmware/BIOS at fleet scale.

Architecture and numbers

Datum Value
Affected fleet Gen12 core servers
Fleet size nearly 2,000 units
Firmware-upgrade automation time, before nearly 4 hours
Firmware-upgrade automation time, after 3 minutes
Subsequent single-boot time, before about 20 minutes
Subsequent single-boot time, after less than 1 minute
Per-failed-interface timeout ~5 minutes
Failed interfaces probed before success 4 (IPv4 HTTPS / IPv4 iPXE / IPv6 HTTPS / IPv6 iPXE — the actually-succeeding interface, IPv6 HTTPS, is at the end of the search)
Per-boot wasted time on linear search ~20 minutes
Wildcard config match string .*HTTP.*IPv4.*P1
Vendor lockout setting Force Priority Httpv4 Httpv6 Pxev4 Pxev6
Lazy-loaded data structure EFI_IFR_REF3
iPXE flag for HEX-read short-circuit uefi-same-hex
Cloudflare-side tool CfHIIConfig_App
Iterations needed for fix new BIOS from vendor + debug session

Boot sequence (verbatim)

"Our boot automation flow is in three broad stages: firmware initialization, pre-boot, and kernel startup. After power on, the UEFI firmware does some hardware and peripheral initialization followed by the PXE pre-boot environment. The pre-boot sets up the network card and executes a small program called bootloader, which kickstarts the kernel. It's in this PXE stage that various network interfaces are probed for the right one. On first boot, firmware upgrades are included in our boot automation workflow."

The three-stage decomposition is what makes the fix possible: boot order declaration happens early in the pre-boot PXE stage so the linear search through interfaces never starts.

Systems extracted

  • systems/uefi — Unified Extensible Firmware Interface, the modern firmware standard that initialises hardware and hands off to the OS. The substrate the post operates on.
  • systems/ipxe — Open-source network boot firmware supporting modern protocols (HTTP/HTTPS) — the "programmable workflow" layer that makes boot automation feasible (Cloudflare's "automation strategies that ultimately solved the problem").
  • systems/cloudflare-gen12-server — Cloudflare Gen12 fleet of core (centralised data centre) bare-metal servers, ~2,000 units; canonical wiki face for the post.

Concepts extracted

  • concepts/network-boot-interface — the abstraction that lets a server boot its OS over the network rather than local storage; primary instances are PXE and UEFI HTTPS boot.
  • concepts/linear-search-timeout-amplification — the failure mode where probing a sequence of options with per-option timeout produces wall-clock cost proportional to the position of the first-succeeding option × the timeout. The post's canonical instance: 4 × 5 min = 20 min per boot.
  • concepts/firmware-config-persistence-loss — UEFI config settings are "often reset following a UEFI firmware upgrade"; any automation that depends on persisted UEFI config must validate and re-apply post-upgrade.
  • concepts/vendor-string-mismatch-on-fleet-config — same logical interface emits different identifying strings depending on NIC vendor; exact-match config breaks; need wildcards (stop-gap) or string standardisation (durable).
  • concepts/lazy-loaded-bios-data-structure — BIOS data structures (e.g. EFI_IFR_REF3) deferred until accessed via a GUI callback to accelerate cold-path BIOS boot; renders the data invisible to programmatic scans.

Patterns extracted

  • patterns/declare-boot-interface-order-upfront — eliminate guesswork in the network boot stage by declaring the correct interface upfront in the pre-boot PXE stage per hardware/use- case, so the firmware never falls through the timeout gauntlet. Cloudflare's headline pattern.
  • patterns/state-validation-with-auto-reapply-and-reboot — after any config change (and after firmware upgrades that may reset the config), validate the new state programmatically; if it doesn't match expectations, re-apply and trigger a reboot.
  • patterns/wildcard-config-match-for-vendor-string-drift — use a regex-style wildcard against vendor-specific configuration strings when exact match is unreliable; pair with a string-standardisation roadmap so the wildcard can retire.
  • patterns/hex-comparison-flag-for-ipxe-config-check — small primitive: a boolean flag exposed to iPXE indicating whether a configuration matches the desired hex representation, used to short-circuit a show + compare + conditional set sequence into a single set.

Caveats

  • Tier-1 Cloudflare post on infrastructure / firmware internals — passes scope decisively (production incident with operational numbers, fleet-scale firmware automation, multi-vendor coordination, named subsystems).
  • No latency p-values — the only quantitative numbers are the headline before/after totals (4h → 3min, ~20min → <1min) and the per-timeout 5-min figure. No per-step latency histograms, no distribution of which servers experienced which interface ordering.
  • Specific OEM not named — the post identifies the immutable setting verbatim (Force Priority Httpv4 Httpv6 Pxev4 Pxev6) and the data structure (EFI_IFR_REF3) but does not name the BIOS vendor or NIC vendors. The wildcard example pair is sufficient to identify the failure mode but not the vendors.
  • No DRAM/memory training time — POST is described as completing normally; the 4-hour figure is dominated by network-boot-interface probing, not by hardware initialisation.
  • CfHIIConfig_App is referenced but not open-sourced or deeply documented — the post says "we had to implement an additional feature to the CfHIIConfig_App tool, allowing it to set the config without having the full string" but the tool itself is not described beyond the wildcard-matching feature it gained.
  • The string standardisation effort is in-flight"We are currently working with our UEFI vendors to standardize the network interface strings" — outcome and timeline not disclosed; wildcards remain the operational primitive in the meantime.
  • No regression-test or chaos discipline disclosed — the post does not say whether boot-time regressions of this shape now have alarm coverage, or whether boot-order declarations are validated continuously against the fleet's actual interfaces.
  • The fix's interaction with secure boot / measured boot is not addressed — declaring a specific boot interface upfront could in principle interact with attestation chains (any change to BIOS config produces a different PCR); Cloudflare doesn't discuss this.
  • First-boot for new nodes still pays the cost — the post notes new nodes face "the full timeout gauntlet on their very first boot" before the declared order takes effect; the validate-then-reapply-then-reboot pattern means the first boot may take slightly longer than baseline, traded for sub-minute subsequent boots.
  • Edge fleet not affected — explicitly limited to Cloudflare's core (centralised control-plane / billing / analytics data centres), not the globally-distributed edge POPs. The blog distinguishes the two upfront.
  • No comparison vs alternative approaches — declarative boot order vs in-band BIOS update vs just-disabling- unused-interfaces vs always-PXE-then-iPXE-only — only the chosen path is described. The choice of "declare order per hardware/use-case" over "disable all but the right one" is not explicitly justified.
  • No OEM/BIOS firmware version numbers"new BIOS version from our vendor" is the only altitude; specific versions or vendor identities not disclosed.

Source

Last updated · 542 distilled / 1,571 read