PATTERN
Self-update with independent fallback cron¶
Problem¶
A system X that is responsible for its own updates has a classic chicken-and-egg failure mode: if the running version of X has a bug that prevents X from triggering itself, no new version of X can ever be deployed through the normal path — because the normal path is X.
This shape is everywhere in fleet-management infrastructure:
- Chef Summoner on every node triggers Chef runs, and Chef runs update Chef Summoner. A broken Summoner can't fix itself.
- kubelet on every node triggers pod updates, including updates to kubelet itself.
- puppet-agent, salt-minion, and similar config-management agents update their own binaries.
- Self-updating package managers update themselves via the same path they update packages.
- Deployment orchestrators that orchestrate their own deployment.
A normal-path push of "fixed X" requires X itself to work, but X is broken. Without a second path, the only recovery is manual ops on every node — not viable at fleet scale.
Solution¶
Pair the self-updating system with an independent fallback trigger that can invoke the core action of X without depending on X's running code. The fallback is usually a time-based cron with a looser SLA than the primary signal-based trigger, and its code path must not depend on X's correctness.
When X has been silent for longer than the fallback's SLA, the fallback fires X's core action directly. That action applies whatever updates are pending — including updates to X itself, which means a bad-X deploy is self-healing within the fallback window.
The canonical wiki instance is Slack's 2025-10-23 Chef phase-2 design (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption): a fallback cron baked into every AMI at build time that runs independently of Chef Summoner, reads Summoner's local state, and triggers chef-client directly if Summoner hasn't run it in the last 12 hours.
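The check itself can stay very small. Here is a minimal sketch of such a fallback checker; the state-file path, JSON schema, and field name are assumptions for illustration, not Slack's actual format:

```python
# Hypothetical fallback checker, invoked by cron. It reads the
# primary's state file but never calls the primary's code.
import json
import subprocess
import time

STATE_FILE = "/var/lib/summoner/state.json"  # assumed location
MAX_AGE_SECONDS = 12 * 60 * 60               # the 12-hour SLA

def should_trigger(last_run_epoch: float, now: float,
                   max_age: float = MAX_AGE_SECONDS) -> bool:
    """Fire only on absence: the primary has been silent too long."""
    return (now - last_run_epoch) > max_age

def read_last_run(path: str = STATE_FILE) -> float:
    # A missing or unreadable state file counts as "never ran":
    # the fallback must not depend on the primary's correctness.
    try:
        with open(path) as f:
            return float(json.load(f)["last_chef_run_epoch"])
    except (OSError, ValueError, KeyError):
        return 0.0

def main() -> None:
    # A cron job would invoke main(); it is not called at import time.
    if should_trigger(read_last_run(), time.time()):
        # Invoke the core action directly, bypassing Summoner's
        # signal-reading and dedup logic entirely.
        subprocess.run(["chef-client"], check=True)
```

Note that a corrupt or absent state file triggers a run rather than suppressing one; the fallback fails toward action.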
The three required properties¶
1. Independence from the self-updating primary¶
The fallback's code path must share no failure modes with the primary. Test: if the primary's code is rm-ed from the system, can the fallback still fire? For Slack:
- Fallback cron lives in the AMI at build time — doesn't get installed by Chef.
- Fallback reads Summoner's state file — doesn't require Summoner to be running.
- Fallback invokes chef-client directly — doesn't go through Summoner's signal-reading or dedup logic.
2. Triggered by absence-of-primary, not by schedule¶
The fallback isn't just a periodic cron that always fires — it fires when the primary has been absent for too long. For Slack, the cron checks "the local state Chef Summoner stores (e.g., last run time and artifact version) and ensures Chef has been run at least once every 12 hours. If the cron job detects that Chef Summoner has failed to run Chef in that timeframe, it will trigger a Chef run directly."
This matters because:
- A fallback that always fires re-introduces the cron load the signal-driven primary was designed to eliminate.
- A fallback that never fires is broken (not a real fallback).
- The absence-triggered model quietly waits when the primary is healthy.
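In practice the absence check is installed as a frequent cron whose action is usually a no-op. A sketch of what such a crontab entry might look like (path and schedule are illustrative, not Slack's actual configuration):

```
# /etc/cron.d/chef-fallback — illustrative entry.
# The check runs hourly but only triggers chef-client when the
# primary has been silent past the 12-hour SLA.
0 * * * * root /usr/local/bin/chef-fallback-check
```

The cron's schedule (hourly here) and the fallback SLA (12 hours) are deliberately different knobs: the first controls detection latency, the second controls when absence counts as failure.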
3. Looser SLA than the primary¶
The fallback's SLA must be:
- Short enough to matter (i.e., the system can't stay broken long enough to violate higher-level compliance).
- Long enough to observe-and-fix the primary before the fallback fires routinely (otherwise the fallback becomes the primary).
Slack's 12-hour SLA lines up with their Chef compliance window (Chef must run at least once every 12 hours on every node) — the fallback serves as both the Summoner-breakage safety net and the compliance-window guarantee.
How this enables recovery¶
The key insight from Slack: the fallback doesn't just keep Chef running during a Summoner outage — it gives operators a recovery path for deploying a fixed Summoner.
T=0 Slack deploys bad Summoner version V2.
Summoner V2 is broken; it stops triggering Chef.
T=11h Fallback cron wakes up on each node, checks
Summoner's last-run timestamp; the last run is under
12 hours old, so it does nothing.
T=12h Fallback check sees the 12-hour window exceeded
and triggers a chef-client run directly.
Chef run pulls latest cookbook version, which
includes fixed Summoner V3.
T=12h+ Each node now has Summoner V3.
Signal-driven path resumes normal operation.
Without the fallback, the recovery path at T=12h would be: someone notices the fleet is stuck on V2 → manual ops to push V3 to every node → days of recovery. With the fallback, the fleet heals itself in ~12 hours max.
Where to put the fallback¶
The fallback's "install location" is load-bearing. Options:
- AMI / base image (Slack's choice). Baked in at image build time. Survives any post-boot state. Hardest to accidentally break.
- Package / systemd unit deployed by a different substrate — e.g., a kickstart script or post-boot provisioning that runs independently of the primary system. Medium-strong independence.
- Inside the primary's own config. Weakest independence — if the primary's config is what broke the primary, the fallback is also broken. Not recommended.
Slack's choice (AMI / base image) is the strongest: the fallback physically cannot be broken by any runtime action short of rebaking the AMI.
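To make the independence concrete, baking the fallback into the base image might look like the following build-time provisioning step (file names and paths are assumptions, not Slack's actual build):

```
#!/bin/sh
# Runs during image build (e.g. from a Packer or image-pipeline
# provisioner), before Chef or Summoner ever exist on the node.
set -eu
install -m 0755 chef-fallback-check /usr/local/bin/chef-fallback-check
cat > /etc/cron.d/chef-fallback <<'EOF'
0 * * * * root /usr/local/bin/chef-fallback-check
EOF
```

Because this runs at image build time, nothing Chef or Summoner does after boot can uninstall or reconfigure the fallback.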
Generalisation¶
The pattern applies to any self-updating subsystem:
- Kubernetes kubelet + systemd. Kubelet updates pods (including itself); systemd restarts kubelet if it crashes.
- Ansible-pull + cron. Some ansible-pull deployments use the same technique — cron runs ansible-pull, which applies updates including to the cron itself.
- Self-updating auto-scaler + hard-coded cooldowns. An auto-scaler that can scale itself down to 0 benefits from a hard-coded minimum-instance count enforced outside the auto-scaler's own logic.
- Feature-flag client that controls the feature-flag client. A flag that disables flag-fetching needs an absolute-override enforced outside the flag client.
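The feature-flag case can be sketched in a few lines; the protected flag name and the function interface here are hypothetical, chosen only to illustrate an override enforced outside the flag client:

```python
# Hypothetical sketch: an absolute override enforced outside the
# flag client itself, so no flag payload can disable flag-fetching.
PROTECTED_FLAGS = {"flag_fetching_enabled"}  # assumed flag name

def effective_value(flag: str, client_value: bool) -> bool:
    """Return the flag value the rest of the system should see."""
    if flag in PROTECTED_FLAGS:
        # Enforced here, not in the client: even a payload that says
        # "off" cannot turn off the fetch loop that delivers payloads.
        return True
    return client_value
```

The same shape applies to the auto-scaler example: the minimum-instance floor lives in code the auto-scaler cannot modify.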
Composes with¶
- concepts/fallback-cron-for-self-update-safety — the underlying concept.
- patterns/signal-triggered-fleet-config-apply — the primary path this pattern protects.
- patterns/watchdog-bounce-on-deadlock — a sibling at a different altitude (watchdog bounces a hung process).
- concepts/always-be-failing-over — PlanetScale's "exercise the failure path routinely" discipline.
Caveats¶
- Exercise the fallback. A fallback that only runs during incidents atrophies. Slack's 12-hour window doubles as a routine-test path because it fires for any node where Summoner hasn't run Chef within the window, incident or not.
- Fallback should alert. If the fallback fires frequently, that's a strong signal the primary is broken — operators should be paged on "fallback invoked", not just on the primary's failure.
- Fallback must be simpler than the primary. A fallback as complex as the primary has the same failure-mode probability. Slack's "run chef-client directly" is much simpler than Summoner's signal-reading pipeline.
- The fallback creates a two-path invariant. Both paths must leave the node in the same state. If Summoner applies cookbook V3 differently from a direct chef-client run, nodes on different paths diverge.
- Fallback interval and primary reliability interact. Shorter fallback intervals mean a tighter SLA but more fallback invocations if the primary is flaky. Longer intervals mean less noise but longer recovery windows.
Seen in¶
- sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption — canonical: fallback cron baked into every Slack AMI, triggering direct chef-client runs if Summoner hasn't run Chef in the last 12 hours; explicitly framed as the recovery path for broken-Summoner deploys.