CONCEPT Cited by 4 sources

Circular dependency (deployment context)

A circular dependency in the deployment context is the failure mode where the act of deploying a fix for a service depends — directly or indirectly — on the very service the fix is trying to restore. If the service is down, the fix can't ship; if the fix can't ship, the service stays down; mean time to recovery expands unboundedly.

The canonical instance is dogfooded SaaS: GitHub deploys itself on github.com, Amazon sells on AWS, Cloudflare operates Cloudflare via Cloudflare Workers, Google depends on Google's auth. Dogfooding is a quality-and-speed win in steady state and a reliability anti-feature during incidents. GitHub frames the baseline mitigations as "maintaining a mirror of our code for fixing forward and built assets for rolling back" — but circular dependencies can still re-enter through deploy scripts themselves. (Source: sources/2026-04-16-github-ebpf-deployment-safety)

Three classes (GitHub taxonomy)

The GitHub Engineering post catalogues three classes, ordered by detectability:

  1. Direct dependency. The deploy script itself pulls an artifact from the impaired service. Example: a MySQL deploy script runs curl https://github.com/foo/bar/releases/.... Detectable by static source review of deploy scripts. Easiest to catch but "many dependencies aren't identified until an incident occurs" because review is manual and team-by-team.

  2. Hidden dependency. A tool already installed on the host contacts the impaired service at runtime (e.g. auto-update check on startup). The deploy script calls the tool, the tool calls out, the call hangs or fails, and the script inherits the failure. Detectable only by runtime audit — what does this tool actually contact? — not by reading the deploy script. Tool-inventory audits + network observation find these; source review does not.

  3. Transient dependency. The deploy script calls an internal service (e.g. a migrations service, a secrets service) which, within its own code path, pulls something from the impaired service. The failure propagates back. Detectable only by walking the full call graph including internal services. Adding a transient dependency is especially easy because the deploy script's own source code looks clean — the violation lives one hop away.
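The difference between a direct dependency and the two indirect classes can be illustrated with a small call-graph walk. This is a minimal sketch, not GitHub's tooling: the graph contents, the names `CALLS` and `path_to`, and every node name other than github.com are hypothetical, and it assumes someone has already recorded which component contacts which.

```python
from collections import deque

# Hypothetical call graph: each key contacts the entries in its list.
CALLS = {
    "mysql-deploy.sh": ["pkg-tool", "migrations-service"],
    "pkg-tool": ["github.com"],              # e.g. an auto-update check on startup
    "migrations-service": ["artifact-store"],
    "artifact-store": [],
}

def path_to(calls, start, target):
    """Return the shortest chain of calls from start to target, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for nxt in calls.get(path[-1], []):
            if nxt == target:
                return path + [nxt]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

chain = path_to(CALLS, "mysql-deploy.sh", "github.com")
if chain is None:
    print("no path to the impaired service")
elif len(chain) == 2:
    print("direct dependency:", " -> ".join(chain))
else:
    print("indirect dependency via", chain[1], ":", " -> ".join(chain))
```

A one-hop chain corresponds to class 1 and is visible in the script source. Anything longer passes through a tool or an internal service, which is why the walk is only as good as the graph it is given.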

Why dogfooded architectures re-learn this repeatedly

  • Every team adds new scripts + new tools continuously. Today's clean deploy script may pick up a transient dependency tomorrow when a downstream service's team adopts a new library.
  • Tool updates are opaque. A pinned tool on the host silently picks up a new auto-update-check call in its next patch release.
  • Review is structurally incomplete. Static source review only catches class 1; audits at tool-update time only catch class 2; dependency-graph walks only catch class 3 if the graph is maintained.
  • Failure is rare in steady state: dogfooded internal services are reachable 99.9x% of the time, so a circular dependency stays latent for most of the year and surfaces only under the worst possible conditions (a real incident).

Relying on tribal knowledge to catch these dependencies doesn't scale. GitHub instead reaches for a structural fix: enforcing the invariant "deploy scripts cannot talk back to the impaired service."

Sibling shape: observability-substrate circular dependency

The same structural pattern surfaces at the observability-stack altitude. Airbnb's 2026-05-05 post names the failure mode verbatim:

"What happens when your observability stack is dependent on the same systems that are failing? In that moment, the dashboards go dark, alerts stop firing, and the tools meant to guide recovery become part of the outage." (Source: sources/2026-05-05-airbnb-monitoring-reliably-at-scale)

Three concrete circular dependencies Airbnb enumerates:

  1. Compute. Observability stack runs on the same Kubernetes clusters as the product services it monitors. Cluster incident → both down simultaneously. Remedy: patterns/dedicated-observability-kubernetes-clusters (dedicated clusters administered by the platform team; see concepts/dedicated-but-managed-infrastructure).

  2. Networking. Metrics flow through the same Istio service mesh as business traffic. Verbatim: "metrics for the data plane would depend on that same data plane to be delivered." Remedy: patterns/custom-l7-proxy-for-telemetry-over-service-mesh (purpose-built Envoy ingress tier, independent of the shared mesh). Motivating asymmetry: concepts/observability-traffic-volume-asymmetry.

  3. Meta-monitoring regress. If the monitoring layer that watches the main stack also fails, who watches it? Without care, "spinning up yet another monitoring stack would just lead to an infinite regress." Remedy: patterns/heartbeat-absence-as-alert-trigger — a dead-man's switch that exits to an external control plane (Airbnb uses AWS SNS + CloudWatch) terminates the regress because the watchdog runs on infrastructure distinct from the observability stack.

The three Airbnb instances generalise to a design bar stated in the post: "treat monitoring as a production system whose availability must exceed that of what it observes." The underlying discipline is the same as the GitHub deployment-context instance above: identify the substrate the recovery / observability path shares with the monitored workload, and sever the sharing.
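The heartbeat-absence remedy from item 3 above can be sketched as a dead-man's switch that alerts when expected heartbeats stop arriving. This is an in-process model only, under stated assumptions: the class name `DeadMansSwitch`, the 60-second timeout, and the injected clock are all hypothetical, and it does not represent Airbnb's SNS and CloudWatch setup. In practice the check would have to run on infrastructure separate from the stack it watches.

```python
import time

class DeadMansSwitch:
    """Alerts on the absence of heartbeats rather than on an error signal."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # injectable so the logic can be tested
        self.last_beat = clock()

    def heartbeat(self):
        """Called by the monitored stack while it is healthy."""
        self.last_beat = self.clock()

    def should_alert(self):
        """True once no heartbeat has arrived for longer than timeout_s."""
        return self.clock() - self.last_beat > self.timeout_s

# Simulated timeline using a fake clock (seconds).
now = [0.0]
switch = DeadMansSwitch(timeout_s=60, clock=lambda: now[0])

now[0] = 30.0
switch.heartbeat()                    # stack is alive at t=30
now[0] = 80.0
quiet_for_50s = switch.should_alert() # 50 s of silence, under the timeout
now[0] = 100.0
quiet_for_70s = switch.should_alert() # 70 s of silence, over the timeout
```

The design point is that silence itself is the trigger: the watchdog needs nothing from the failing stack in order to fire, which is what lets it terminate the regress.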

Sibling shape: forecast-context circular dependency

The same structural shape — output depends on input, input depends on output — reappears in the predictive-autoscaling control-loop context. MongoDB's 2026-04-07 retrospective names it "circular dependency" explicitly:

"We can't just train a model based on recent fluctuations of CPU, because that would create a circular dependency: if we predict a CPU spike and scale accordingly, we eliminate the spike, invalidating the forecast." (Source: sources/2026-04-07-mongodb-predictive-auto-scaling-an-experiment)

Same structural hazard, different layer of the stack:

| | Deployment context | Forecast context |
|---|---|---|
| Trigger | "Fix the service" | "Predict the metric" |
| Dependency | Fix depends on service being reachable | Metric depends on control action |
| Symptom | Can't ship during incident → MTTR expands unboundedly | Forecast accuracy degrades under use |
| Remedy | Mirror / independent infra + cGroup-scoped isolation | Forecast exogenous inputs (concepts/customer-driven-metrics), then model the endogenous response |

Both are named "circular dependency" in their respective sources. The forecast-context variant is documented in detail on the concepts/self-invalidating-forecast page; this page retains its focus on the deployment-context shape.
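A toy model may help show why the forecast-context loop invalidates itself. It assumes, purely for illustration, that CPU utilisation equals demand divided by capacity. The demand series, the 0.5 utilisation target, and all names are hypothetical and are not taken from the MongoDB retrospective.

```python
# Hypothetical exogenous (customer-driven) demand, in arbitrary units.
demand = [40, 40, 90, 90, 40]
TARGET_UTILISATION = 0.5   # assumed target the autoscaler aims for

def utilisation(demand, capacity):
    """Toy model: observed CPU utilisation is demand / capacity."""
    return [d / c for d, c in zip(demand, capacity)]

# Without scaling, capacity is fixed and the demand spike shows up in CPU.
fixed_capacity = [100] * len(demand)
cpu_unscaled = utilisation(demand, fixed_capacity)

# With perfect predictive scaling, capacity tracks demand and CPU goes flat.
scaled_capacity = [d / TARGET_UTILISATION for d in demand]
cpu_scaled = utilisation(demand, scaled_capacity)

print("CPU without scaling:", cpu_unscaled)
print("CPU with scaling:   ", cpu_scaled)
```

Once scaling acts on the forecast, the CPU history no longer contains the spike, so a model trained on it has nothing left to learn from. The demand series is unchanged by the control action, which is the motivation for forecasting exogenous inputs instead.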

Structural fix: deploy-script-scoped egress filter

GitHub's post introduces eBPF + cGroups as the enforcement primitive:

  • Place deploy scripts in a dedicated cGroup.
  • Attach eBPF programs to that cGroup only.
  • Block outbound traffic to the dogfooded service from that cGroup only — the rest of the host (which may be serving customer traffic) is unaffected.
  • Crucially, this fails the deploy script visibly and immediately rather than silently hanging — surfacing the circular dependency as a deterministic pre-production test rather than a post-incident lesson.
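The policy described by the bullets above can be modelled in a few lines. This sketch covers the decision logic only: real enforcement happens in the kernel through eBPF programs attached to the cGroup, and how a destination hostname is learned is out of scope here. The cGroup path, the `Connection` type, and the function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Connection:
    cgroup: str      # cgroup of the process opening the connection
    dest_host: str   # destination hostname (assumed to be known)

DEPLOY_CGROUP = "/deploy-scripts"   # hypothetical cgroup for deploy scripts
BLOCKED_HOSTS = {"github.com"}      # the dogfooded service

def is_blocked_host(host):
    """Match the blocked domain itself and any subdomain of it."""
    return any(host == b or host.endswith("." + b) for b in BLOCKED_HOSTS)

def allow(conn):
    """Deny only when a deploy-cgroup process targets a blocked host."""
    if conn.cgroup != DEPLOY_CGROUP:
        return True                  # rest of the host is unaffected
    return not is_blocked_host(conn.dest_host)

for conn in [
    Connection("/deploy-scripts", "github.com"),
    Connection("/deploy-scripts", "example.org"),
    Connection("/web-frontend", "github.com"),
]:
    verdict = "ALLOW" if allow(conn) else "DENY"
    print(verdict, conn.cgroup, "->", conn.dest_host)
```

Scoping the check to one cGroup is what keeps the filter safe to run on hosts that also serve customer traffic: the same destination is denied for the deploy script and allowed for everything else.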

See patterns/cgroup-scoped-egress-firewall and patterns/dns-proxy-for-hostname-filtering for the specific shape, and concepts/egress-sni-filtering for the adjacent hostname-filtering primitive in AWS Network Firewall (different tradeoffs: middlebox vs on-host, SNI vs DNS, VPC-wide vs cGroup-scoped).

Related

  • concepts/blast-radius — scope-of-damage concept; a circular dependency widens blast radius by coupling recovery to the same scope as the outage.
  • concepts/control-plane-data-plane-separation — the structural separation that prevents control-path outages from taking out data-path traffic; circular-dependency failure modes often violate this across a dogfooding boundary.
  • concepts/grey-failure — sibling pathology (a service partially working, cascading failures); circular deps are structurally deterministic where grey failure is probabilistic.

Seen in

  • sources/2026-05-05-airbnb-monitoring-reliably-at-scale — canonical observability-substrate instance on the wiki. Three Airbnb circular dependencies identified (compute, networking, meta-monitoring) and the corresponding remedies (dedicated clusters + custom L7 Envoy tier + dead-man's switch on AWS). Canonicalises the design bar: "treat monitoring as a production system whose availability must exceed that of what it observes."
  • sources/2026-04-16-github-ebpf-deployment-safety — the three-class taxonomy is from this post; GitHub's eBPF cgroup-scoped firewall is the enforcement primitive.

Sibling shape: bootstrapping circular dependency (cold-start)

A fourth structural variant — distinct from deployment, observability, and forecast contexts — is the bootstrapping circular dependency that manifests during full-region cold-start recovery. Meta's 2026-06-03 post names it explicitly:

The Twine orchestrator has a set of control plane services — Scheduler, Allocator, Broker, Zelos — "without which we cannot run or start any other services in the region." During regular operations the risk is low; during full-region bootstrap "the risk and impact are far higher." "It's a true chicken and egg problem." (Source: sources/2026-06-03-meta-lights-out-systems-on-validating-instant-power-loss-readiness)

The bootstrapping variant is the most severe: it occurs when the entire control plane is dead (not degraded, dead), so no workarounds using the partially-healthy system are possible. Meta's resolution is a belt-and-braces approach: CI/CD detection (Belljar) plus runtime jump-start capability (Twrko).
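One way to picture the detection half is a start-order check that fails when services need each other in a loop. The sketch below uses Python's standard `graphlib` module. It is not a description of how Belljar works, and the service names and "needs" edges are invented for illustration.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical edges: each service maps to the services that must already
# be running before it can start.
needs = {
    "control-a": {"control-b"},
    "control-b": {"control-c"},
    "control-c": {"control-a"},   # closes the loop
    "app": {"control-a"},
}

def bootstrap_order(needs):
    """Return (start_order, None) if one exists, else (None, cycle)."""
    try:
        return list(TopologicalSorter(needs).static_order()), None
    except CycleError as err:
        # args[1] holds the detected cycle; first and last node are the same.
        return None, err.args[1]

order, cycle = bootstrap_order(needs)
if cycle is not None:
    print("cannot cold-start, cycle found:", " -> ".join(cycle))
else:
    print("start order:", order)
```

A check like this could run in CI whenever a dependency edge is added, so the cycle is found before a cold start rather than during one. It cannot help once the region is already down, which is where a runtime jump-start capability comes in.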

Last updated · 542 distilled / 1,571 read