PATTERN Cited by 1 source
Automated job lifecycle promotion¶
Definition¶
The automated job lifecycle promotion pattern is a control loop for managing multi-phase migrations across many jobs in parallel. Every job continuously emits its current phase and its progress against that phase's promotion criteria to a telemetry substrate; an external tool reads these signals and automatically promotes or demotes the job between phases, depending on whether the criteria are met or no longer met.
The pattern is what makes multi-phase migrations operationally tractable when the job count is in the tens of thousands: at that scale, manually gating every job at every phase transition is impractical.
"Since we established a clear migration job lifecycle and job promotion criteria, the system continuously sent job status signals to Scuba, including data related to the lifecycle promotion criteria and the job's current stage in the migration lifecycle. We built external migration tools that continuously monitored signals from each job and automatically promoted or demoted jobs between stages of the migration lifecycle, depending on whether they met (or no longer met) the migration criteria. We also built system-level and job-level dashboards so engineers could quickly track the overall migration progress as well as monitor and debug individual jobs." — Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale
Three load-bearing properties¶
- Continuous signal emission. Each job emits its phase and its progress against the criteria on every cycle, not just at phase boundaries or on operator request. The telemetry stream is the canonical source of truth for migration state.
- Bidirectional promotion + demotion. A job can be promoted when it meets the next phase's criteria; a job can be demoted when it stops meeting its current phase's criteria (e.g. transient regression after a successful promotion). One-way promotion gating would trap jobs in broken states until manual intervention.
- Two-axis operator surface. System-level dashboard shows the aggregate (how many jobs are in each phase, how many are stuck, how many regressed); job-level dashboard shows the per-job state for debugging individual stuck jobs. The two together let operators triage without per-job manual tracking.
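The three properties above can be sketched as a single reconciliation step. This is a minimal illustration, not Meta's implementation: the `JobSignal` schema, the `meets_current` / `meets_next` flags, and the choice to move exactly one phase at a time are all assumptions made for the example. Only the phase names come from the lifecycle described in the source.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum


class Phase(IntEnum):
    # Phase names follow the shadow -> reverse shadow -> cleanup lifecycle.
    SHADOW = 0
    REVERSE_SHADOW = 1
    CLEANUP = 2


@dataclass(frozen=True)
class JobSignal:
    """One telemetry record emitted by a job on every cycle (hypothetical schema)."""
    job_id: str
    phase: Phase
    meets_current: bool  # job still satisfies the criteria of its current phase
    meets_next: bool     # job satisfies the criteria to enter the next phase


def next_phase(signal: JobSignal) -> Phase:
    """Decide which phase a job should be in, given its latest signal."""
    if not signal.meets_current and signal.phase > Phase.SHADOW:
        return Phase(signal.phase - 1)  # demote one phase (assumed step size)
    if signal.meets_current and signal.meets_next and signal.phase < Phase.CLEANUP:
        return Phase(signal.phase + 1)  # promote one phase
    return signal.phase                 # hold: nothing to change, or no phase to move to


def reconcile(signals: list[JobSignal]) -> dict[str, Phase]:
    """Apply the decision to every job's latest signal (job-level view)."""
    return {s.job_id: next_phase(s) for s in signals}


def phase_counts(assignments: dict[str, Phase]) -> Counter:
    """Aggregate per-job phases into the counts a system-level dashboard would show."""
    return Counter(phase.name for phase in assignments.values())
```

In this sketch the per-job dictionary returned by `reconcile` stands in for the job-level view and `phase_counts` for the system-level view; a real tool would read signals from the telemetry substrate and write phase changes back to the jobs.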
Why this pattern scales beyond manual migration¶
Manual migration costs O(jobs × phases) engineer-hours. Automated lifecycle promotion costs a one-time automation investment plus O(stuck jobs) engineer-hours:
- The investment in lifecycle definition, criteria, telemetry, dashboards, and promotion logic is paid once and amortized across every job.
- After the investment, only stuck jobs require operator time.
- The stuck-jobs proportion depends on the quality of the underlying systems + the rigor of the criteria; it can be made small.
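The cost comparison can be made concrete with a small calculation. Every number below is hypothetical and chosen only to show the shape of the arithmetic; none of them comes from the source, and the function and parameter names are illustrative.

```python
def manual_hours(jobs: int, phases: int, hours_per_transition: float) -> float:
    """Manual gating: every job needs operator time at every phase."""
    return jobs * phases * hours_per_transition


def automated_hours(jobs: int, stuck_fraction: float,
                    hours_per_stuck_job: float, investment_hours: float) -> float:
    """One-time automation investment plus triage time for stuck jobs only."""
    return investment_hours + jobs * stuck_fraction * hours_per_stuck_job


# Hypothetical inputs, chosen only to make the arithmetic concrete.
JOBS = 20_000
manual = manual_hours(JOBS, phases=3, hours_per_transition=0.5)
automated = automated_hours(JOBS, stuck_fraction=0.02,
                            hours_per_stuck_job=4.0, investment_hours=5_000.0)
print(f"manual: {manual:,.0f} h, automated: {automated:,.0f} h")
```

The manual term grows with the job count at every phase, while the automated term grows only with the stuck fraction, which is why reducing that fraction matters more than reducing the up-front investment once the job count is large.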
Composes with¶
- patterns/shadow-then-reverse-shadow-migration — the three-phase lifecycle this control-loop drives.
- patterns/known-issue-exclusion-batch-selection — when a systemic issue is found, batch-level exclusion and this pattern's auto-demotion together keep the same issue from surfacing repeatedly in the signal stream.
- patterns/data-quality-analysis-tool-with-edge-case-logging — the periodic-tool primitive that turns raw mismatch logs into the criteria signals the control-loop reads.
Distinguishing from related shapes¶
- vs CI/CD progressive delivery: progressive delivery moves one binary through deployment phases (dev → staging → prod); automated job lifecycle promotion moves many independent jobs, each at its own pace and in parallel, through migration phases (shadow → reverse shadow → cleanup).
- vs canary auto-rollback: canary auto-rollback is a single binary's phase 0 → 1 → 2 with rollback on regression; this pattern is the same shape generalised across many independent units at once with a richer demotion primitive (not just rollback to phase 0).
- vs manual gating: the defining feature of this pattern is that it eliminates per-job operator time for promotion and demotion decisions; operator time is reserved for stuck-job triage.
When to use¶
- Tens of thousands of jobs to migrate (or any count at which manually gating each job would cost more engineer-hours than is reasonable).
- Phase transitions are evaluable from criteria signals — you can express "is this job ready to advance?" as a function over emitted telemetry.
- A telemetry substrate is already operational — Scuba in Meta's case; Prometheus + Grafana, Datadog, or similar elsewhere.
- Demotion is acceptable — i.e. the migration tolerates jobs occasionally regressing back a phase rather than getting stuck.
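The requirement that phase transitions be evaluable from criteria signals amounts to being able to write a readiness predicate over emitted telemetry. The sketch below assumes two criteria, checksum agreement and landing latency, mirroring the data-quality and latency concepts this page links to. The `Sample` fields, the window size, and the latency threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One cycle of criteria telemetry for a job (hypothetical fields)."""
    checksum_match: bool      # old and new systems produced matching checksums
    landing_latency_s: float  # observed landing latency, in seconds


def ready_to_advance(window: list[Sample], min_samples: int,
                     max_latency_s: float) -> bool:
    """True only if the most recent `min_samples` cycles all pass both criteria."""
    if len(window) < min_samples:
        return False  # not enough evidence yet
    recent = window[-min_samples:]
    return all(
        s.checksum_match and s.landing_latency_s <= max_latency_s
        for s in recent
    )


# Usage with made-up values: three clean cycles, then one checksum mismatch.
history = [Sample(True, 30.0), Sample(True, 45.0), Sample(True, 20.0)]
print(ready_to_advance(history, min_samples=3, max_latency_s=60.0))
print(ready_to_advance(history + [Sample(False, 10.0)], min_samples=3, max_latency_s=60.0))
```

If a transition cannot be written as a function of this kind, for example because it depends on a human sign-off, the pattern does not apply to that transition.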
When NOT to use¶
- Small migration (a few jobs) where the investment doesn't pay back.
- Phase transitions require manual judgment that can't be reduced to telemetry-evaluable criteria (e.g. UX or compliance signoff).
- Demotion is operationally unsafe — e.g. customer-facing state changes can't be cheaply reverted.
Seen in¶
- sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale — Meta's data-ingestion-system migration; canonical wiki instance driving tens of thousands of CDC jobs through the three-phase shadow → reverse shadow → cleanup lifecycle.
Related¶
- concepts/migration-job-lifecycle — the state machine this drives
- concepts/data-quality-checksum-comparison — criterion #1 mechanism
- concepts/landing-latency — criterion #2 SLI
- patterns/shadow-then-reverse-shadow-migration — the lifecycle this drives
- patterns/known-issue-exclusion-batch-selection — composes for noise reduction
- systems/meta-data-ingestion-system — canonical wiki instance
- systems/scuba-meta — the telemetry substrate
- companies/meta — company hub