
CONCEPT Cited by 1 source

Fallback cron for self-update safety

Definition

Fallback cron for self-update safety is the reliability discipline holding that when a system X is responsible for its own updates (X applies its own new versions), X must not be the sole mechanism for applying those updates. If X's running code has a bug that prevents X from triggering itself, no new X can ever be deployed.

The mitigation: pair X with a second, independent trigger whose code path does not depend on X's correctness. The canonical shape is a time-based fallback (cron) running alongside the primary event-based trigger, with a looser SLA (enough to catch X's breakage, not so frequent it defeats X's efficiency).

The canonical wiki instance is Slack's Chef Summoner design (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption): Chef Summoner on every node is responsible for triggering Chef runs, and Chef runs are responsible for keeping Chef Summoner itself up to date. A broken Summoner deployment cannot be fixed via the normal path. Slack's mitigation is a fallback cron baked into every AMI that independently checks Summoner's state and runs Chef directly if Summoner hasn't done so in the last 12 hours.

The self-update paradox (verbatim from Slack)

"Now that Chef Summoner is the primary mechanism we rely on to trigger Chef runs, it becomes a critical piece of infrastructure. After a node is provisioned, subsequent Chef runs are responsible for keeping Chef Summoner itself up to date with the latest changes. But if we accidentally roll out a broken version of Chef Summoner, it may stop triggering Chef runs altogether — making it impossible to roll out a fixed version using our normal deployment flow."

The paradox has the shape:

┌─────────────────────────────────────────────────────┐
│                                                     │
│  Chef Summoner  ──triggers──▶  Chef Run             │
│       ▲                              │              │
│       │                              │              │
│       └─── updates (applies) ────────┘              │
│                                                     │
└─────────────────────────────────────────────────────┘

If the "triggers" arrow breaks, the "updates" arrow can't fire, so the "triggers" arrow can never be fixed.

The three required properties of the fallback

  1. Independence. The fallback code path must not share failure modes with X. A fallback that uses the same signal-reading code as X won't help when the signal-reading code is broken. Slack's fallback cron is "baked in on every node" at AMI build time — it doesn't load from Chef, doesn't watch the signal bus, doesn't consume any of Summoner's abstractions.
  2. Trigger-on-absence-of-primary. The fallback's actuation condition is "has the primary worked recently?" rather than "run always." Slack's fallback "checks the local state Chef Summoner stores (e.g., last run time and artifact version) and ensures Chef has been run at least once every 12 hours". If Summoner has been healthy, the fallback does nothing; if Summoner has been silent for 12 hours, the fallback fires.

  3. Looser SLA than the primary. The fallback is a safety net, not the main path. A fallback that runs every minute would defeat the efficiency benefit of the signal-driven primary. Slack's 12-hour window is deliberately longer than the signal path's normal sub-hour latency.
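The three properties above can be sketched together. This is an illustrative Python version of the fallback's decision logic, not Slack's implementation (which is not published); the state-file path, JSON field name, and `chef-client` invocation are assumptions. Note that the check reads only a plain file, keeping the fallback independent of the primary's code paths:

```python
import json
import subprocess
import time

# Hypothetical location of the local state Chef Summoner stores
# (path and field name are assumptions for illustration).
STATE_FILE = "/var/lib/chef-summoner/state.json"
MAX_AGE_SECONDS = 12 * 60 * 60  # the 12-hour fallback SLA

def chef_overdue(state_file=STATE_FILE, now=None, max_age=MAX_AGE_SECONDS):
    """Return True if the primary has not run Chef within max_age.

    Reads only a plain JSON file -- no Summoner code, no signal bus --
    so this path does not share the primary's failure modes.
    """
    now = time.time() if now is None else now
    try:
        with open(state_file) as f:
            last_run = json.load(f)["last_run_time"]
    except (OSError, ValueError, KeyError):
        # Missing or unreadable state counts as "primary has not worked".
        return True
    return (now - last_run) > max_age

def fallback_tick():
    """Entry point the cron job would invoke periodically."""
    if chef_overdue():
        # Trigger Chef directly, bypassing Summoner entirely.
        subprocess.run(["chef-client"], check=True)
```

Because `chef_overdue` usually returns False, the fallback is a no-op on healthy nodes and only fires on absence of the primary.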

Why "looser SLA" matters

If the fallback fires too aggressively:

  • It re-introduces the cron-load-vs-compliance trade-off the signal model was designed to avoid.
  • It masks primary failures — operators don't notice X is broken because Y keeps things running.
  • It creates coordination issues between X and Y (both might run simultaneously, wasting work).

The right SLA is: "long enough that operators notice X is broken and fix it before Y has to repeatedly paper over the failure." Slack's 12 hours lines up with their Chef compliance window (nodes must run Chef at least once every 12 hours for compliance) so the fallback serves two purposes: safety net for Summoner + compliance-floor enforcement.

The "recovery path" framing

The fallback isn't just there to keep Chef running during a Summoner outage — it's specifically there to give operators a recovery path to deploy a fixed Summoner:

"If the cron job detects that Chef Summoner has failed to run Chef in that timeframe, it will trigger a Chef run directly. This gives us a recovery path to push a working version of Chef Summoner back out."

The load-bearing property: within 12 hours of a bad Summoner deploy, every node runs Chef via the fallback path; that Chef run applies the fixed Summoner. The system heals itself via the fallback even though the primary is broken.

Generalisation

This pattern applies anywhere a self-updating system exists:

  • Configuration-management agents (puppet-agent, chef-client, cfengine, salt-minion) — all have the same self-update paradox.
  • Container orchestrator agents (kubelet, nomad-client).
  • Serviced-by-itself platforms — any platform where the platform is deployed via itself.
  • Auto-updating package managers (apt, yum with self-update enabled).
  • Feature-flag clients that control whether the feature-flag client itself is active.

The general shape:

  • Primary path: efficient, signal-driven, fast.
  • Fallback path: independent, time-driven, slow.
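The contract between the two paths can be made concrete: the primary records a heartbeat after every successful run, and the heartbeat file is the fallback's only dependency on the primary. A generic sketch (all names and paths here are illustrative, not from the source):

```python
import os
import time

HEARTBEAT = "/var/run/updater/last_success"  # illustrative path

def record_success(path=HEARTBEAT, now=None):
    """Primary path: after each successful signal-driven run,
    write a timestamp the fallback can check."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(str(time.time() if now is None else now))

def fallback_should_fire(path=HEARTBEAT, max_age=12 * 3600, now=None):
    """Fallback path: fire only on absence of recent primary success."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            last = float(f.read())
    except (OSError, ValueError):
        return True  # no heartbeat at all: treat the primary as broken
    return (now - last) > max_age
```

The deliberate asymmetry is visible in the parameters: the primary runs whenever a signal arrives, while the fallback's `max_age` is set much looser than the primary's normal latency.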

Caveats

  • The fallback must be tested — ideally in production, ideally regularly. A fallback path that only fires during incidents is a fallback path that atrophies. Slack's 2025-10-07 Deploy Safety retrospective emphasises the discipline that mitigations must be validated, not assumed.
  • The fallback can itself break. If the fallback cron is configured via Chef, a broken Chef deploy can break the fallback. The strongest version of the pattern has the fallback baked into the AMI at build time, not configured post-boot — Slack's choice.
  • The fallback-primary boundary can blur. If operators rely on the fallback routinely, it becomes the primary. Organisational discipline matters: the fallback is for emergencies, not for daily operations.

Composes with

Seen in
