CONCEPT Cited by 2 sources
Feedback control loop for rollouts¶
Definition¶
A feedback control loop for rollouts is the change-management discipline of watching an observable signal as a phased change progresses, and pausing or aborting the rollout when the signal moves out of expected range — before fleet-wide exposure.
The name comes from control theory: a feedback loop measures system output, compares it to desired output, and drives input adjustments based on the error. Applied to rollouts: the measured signal is "user-facing issues"; the desired state is "no change in signal"; the control action is "stop the rollout."
The canonical verbatim statement (Source: sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage):
"As operations are issued, such as Redpanda or cloud infrastructure upgrades, we try to close our feedback control loops by watching Redpanda metrics as the phased rollout progresses and stopping when user-facing issues are detected."
The three required components¶
- Phased rollout substrate. The change moves through cohorts — cells, regions, customer tiers, canary → beta → production — with pause points between phases.
- Observable signal(s). Metrics the operator watches during each phase. Must be:
  - Fast (seconds to minutes, not hours).
  - Sensitive (detects customer impact before the customer notices).
  - Specific (low false-positive rate — otherwise operators ignore or disable the signal).
  - Aligned with customer experience (don't optimise for a proxy that decouples from harm).
- Pause/abort mechanism. The rollout tooling has a well-tested halt button. Halting is load-bearing: if pausing requires emergency ops rather than one click, engineers won't pause until confirmed damage.
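A minimal sketch of how the three components compose: phase cohorts, a watched signal checked over a soak window, and a one-call halt. All names are hypothetical; real tooling would poll real metrics with real thresholds:

```python
# Illustrative cohort sequence; actual phase sizing is not disclosed in the source.
PHASES = ["canary", "beta", "prod-1pct", "prod-10pct", "prod-50pct", "prod-100pct"]

def run_rollout(deploy_to_phase, signal_healthy, halt, soak_checks=3):
    """Advance through cohorts, pausing at each phase until the watched
    signal has stayed in range for `soak_checks` consecutive reads."""
    for phase in PHASES:
        deploy_to_phase(phase)
        for _ in range(soak_checks):
            if not signal_healthy(phase):   # observable signal out of range
                halt(phase)                 # the well-tested halt button
                return phase                # report where the rollout stopped
    return None                             # completed fleet-wide
```

Note that `halt` is a single call: the sketch encodes the load-bearing property that pausing must be one click, not an emergency operation.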
Why the named mitigation list matters¶
The Redpanda retrospective enumerates feedback control loops as one of six canonical mitigations for the butterfly effect in complex systems. Non-linearity bounds the value of pre-change analysis; the only reliable signal of a change's safety is measurement during the change itself.
The other five mitigations in the same list are not alternatives — they compose:
- Phasing change rollouts is the substrate for feedback control loops.
- Shedding load and applying backpressure are the control actions the feedback loop might trigger.
- Randomizing retries prevents retry storms from creating the load pattern that caused the original failure.
- Defining incident response is the human-scale fallback when the automated control loop isn't sufficient.
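Of these, randomizing retries is the most concretely codifiable. A common shape is "full jitter" exponential backoff — a sketch of the general technique, not of Redpanda's implementation:

```python
import random

def backoff_delays(attempts, base=0.1, cap=30.0, rng=random.random):
    """Exponential backoff with full jitter: each delay is drawn uniformly
    from [0, min(cap, base * 2**attempt)], so clients that failed at the
    same instant de-correlate instead of retrying in lockstep (a retry
    storm). `base` and `cap` are illustrative defaults, in seconds."""
    return [rng() * min(cap, base * (2 ** attempt)) for attempt in range(attempts)]
```

Without the jitter term, every client that saw the original failure retries on the same schedule, recreating the load spike that caused the failure.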
Relationship to canonical wiki patterns¶
- patterns/cohort-percentage-rollout — the 1% / 10% / 50% / 100% rollout discipline; each percentage is a pause point.
- patterns/staged-rollout — the general framing of phased deployments.
- patterns/progressive-cluster-rollout — the streaming-broker-specific framing for cluster upgrades.
- patterns/ab-test-rollout — feedback-loop-driven rollouts where the signal is an experiment metric.
- concepts/progressive-delivery-per-database — the PlanetScale framing: the rollout unit is the customer database.
- concepts/chaos-engineering — the complementary discipline: deliberately induce faults to validate that the feedback-control-loop alarms fire.
Instances across the wiki corpus¶
- Redpanda — broker upgrades + cloud-infrastructure upgrades are feedback-loop-gated (Source: sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage).
- PlanetScale extreme fault tolerance — "always be failing over" on every customer database every week is the feedback loop for the failover code path; see concepts/always-be-failing-over.
- Netflix Simian Army — Chaos Monkey is the feedback loop for instance-loss tolerance; see systems/netflix-simian-army.
- Cloudflare progressive deployment — edge software rolls through colocation tiers with feedback gating.
- AWS CodeDeploy / AppMesh / etc. — traffic-shift based rollout tooling with CloudWatch alarm-based auto-rollback.
Load-bearing assumption: fast, specific signals¶
The pattern fails silently if:
- Signal is too slow. A 15-minute aggregation window means the rollout reaches the next phase before the signal arrives from the previous phase.
- Signal is noisy. False positives train operators to override the halt; false negatives let real impact through.
- Signal is customer-decoupled. Pre-aggregate error rate across all customers can be fine while one customer segment is 100% broken.
This is why observability-substrate quality is a prerequisite for rollout-gating, not separable from it.
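The customer-decoupled failure mode in particular is cheap to guard against: evaluate the signal per cohort rather than only in aggregate. A sketch, with hypothetical cohort keys and threshold:

```python
def signal_in_range(error_rates_by_cohort, threshold=0.01):
    """Gate on the worst cohort as well as the fleet-wide mean: an
    aggregate error rate can look healthy while one customer segment
    is fully broken."""
    aggregate = sum(error_rates_by_cohort.values()) / len(error_rates_by_cohort)
    worst = max(error_rates_by_cohort.values())
    return aggregate <= threshold and worst <= threshold
```

With 100 cells and one of them 90% broken, the fleet-wide mean is 0.009 — under a 1% threshold — so an aggregate-only gate would let the rollout proceed; the per-cohort check halts it.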
Caveats¶
- The sentence canonicalises intent, not implementation. The Redpanda post does not disclose the specific metrics watched, phase sizing, pause thresholds, or rollback procedure.
- Control theory names the shape; the engineering is the details. Control theory gives the structural framing (input, output, error, actuator); the application engineering has to choose each component.
- Not all changes can be feedback-loop gated. A breaking protocol change, a security patch that must be deployed fleet-wide immediately, or an idempotent config change that affects all workers simultaneously may not fit the model.
- Feedback control loops are complementary to pre-change review, not a replacement. They catch what review missed; review catches what measurement can't see.
- The control loop's own software can have bugs. The industry outages at CrowdStrike (2024) and Cloudflare (2025) exposed how a bug in the rollout-gating path itself can defeat the discipline.
Seen in¶
- sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change — Organisational-altitude instance. Slack's 18-month Deploy Safety Program (mid-2023 → Jan 2025) installed the feedback-control-loop discipline across the company's deployment systems, a load-bearing architectural phase-change: "Once automatic rollbacks were introduced we observed dramatic improvement in results." It composes into a full organisational substrate — metrics-based deploys with automatic rollback on the Webapp backend, generalised to the Webapp frontend and infra, with a roadmap extending to EC2 / Terraform via centralised deployment orchestration. The 10-minute automated MTTR, 20-minute manual MTTR, and detect-before-10%-of-fleet North Stars (patterns/automated-detect-remediate-within-10-minutes, concepts/pre-10-percent-fleet-detection-goal) are the three quantitative axes of Slack's feedback-control-loop definition. Result: a 90% reduction in customer-impact hours from change-triggered incidents, from the peak (Feb–Apr 2024) to Jan 2025.
- sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage — canonical statement connecting control theory to phased rollouts; names the five other mitigations in the six-item reliability-practice list as complements.
Related¶
- concepts/butterfly-effect-in-complex-systems
- concepts/progressive-delivery-per-database
- concepts/chaos-engineering
- concepts/blast-radius
- concepts/pre-10-percent-fleet-detection-goal
- concepts/customer-impact-hours-metric
- concepts/trailing-metric-patience
- patterns/cohort-percentage-rollout
- patterns/staged-rollout
- patterns/progressive-cluster-rollout
- patterns/ab-test-rollout
- patterns/automatic-provider-failover
- patterns/automated-detect-remediate-within-10-minutes
- patterns/centralised-deployment-orchestration-across-systems
- systems/slack-deploy-safety-program