
PATTERN Cited by 1 source

Autonomous distributed mitigation

Autonomous distributed mitigation is the architectural posture of running both threat detection and threat mitigation on every node of a distributed fleet, with each node acting independently and no human in the loop, rather than funneling traffic or decisions through a central scrubbing, analytics, or review tier.

Contrast two shapes:

  • Centralised scrubbing: traffic is diverted to a dedicated scrubbing facility that runs the detection and mitigation logic; ingress capacity is the bottleneck; human operators review alerts; failure of the scrubbing tier = failure of mitigation.
  • Autonomous distributed mitigation: every edge node runs the full detection + mitigation loop; packets are dropped at the node that received them; no central tier to target; no human review needed for the common case.

The canonical wiki instance is Cloudflare's 7.3 Tbps DDoS writeup: "Our systems successfully blocked this record-breaking 7.3 Tbps DDoS attack fully autonomously without requiring any human intervention, without triggering any alerts, and without causing any incidents" — across 477 data centres / 293 locations.

Structural recipe

  1. Anycast as the delivery fabric — ensures traffic lands where capacity is, not where a scrubber is. An attacker can't beat your POP density by being geographically diverse.
  2. Every service on every node — not a specialised DDoS-scrubbing fleet; the same servers that serve legitimate traffic also run the DDoS engine. Capacity is shared; failure modes are shared.
  3. Kernel-data-plane drop — mitigation action executes at XDP/eBPF (or equivalent) line-rate. The cost of running detection on every packet on every node is tolerable only if mitigation is cheap.
  4. Heuristic + streaming detection — pattern generation runs in user space (systems/dosd) with full streaming algorithms; the kernel filter is compiled from the user-space decision (see patterns/two-stage-evaluation).
  5. Peer gossip for shared intelligence — top fingerprints propagate via gossip/multicast so every node benefits from every other node's samples (patterns/gossip-fingerprint-propagation). This avoids turning into N independent detectors re-solving the same problem.
  6. Auto-expiry — mitigation rules time out when hits decay; no operator cleanup.
  7. Customer surface as managed rulesets — customers tune sensitivity; they do not author the kernel programs. The system is an abstraction, not a toolkit.
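
Steps 3, 4, and 6 can be sketched as a single per-node loop. This is an illustrative Python model with invented names (`NodeMitigator`, a toy two-field fingerprint) standing in for the real dosd/XDP split, not an actual interface:

```python
class NodeMitigator:
    """One edge node: streaming detection in 'user space',
    a dict of compiled rules standing in for the XDP drop path."""

    def __init__(self, threshold=100, ttl=60.0):
        self.samples = {}         # fingerprint -> streaming count (step 4)
        self.rules = {}           # fingerprint -> (expires_at, hits) (step 3)
        self.threshold = threshold  # promotion threshold, illustrative units
        self.ttl = ttl              # auto-expiry window (step 6)

    def fingerprint(self, pkt):
        # Toy fingerprint: protocol + destination port. Real systems
        # search a far richer permutation space of header fields.
        return (pkt["proto"], pkt["dport"])

    def handle(self, pkt, now):
        fp = self.fingerprint(pkt)
        # "Kernel" fast path: drop cheaply if a compiled rule matches.
        rule = self.rules.get(fp)
        if rule and rule[0] > now:
            self.rules[fp] = (now + self.ttl, rule[1] + 1)  # refresh on hit
            return "drop"
        # "User space" slow path: streaming detection, then compile.
        self.samples[fp] = self.samples.get(fp, 0) + 1
        if self.samples[fp] >= self.threshold:
            self.rules[fp] = (now + self.ttl, 0)  # install compiled rule
        return "pass"

    def expire(self, now):
        # Step 6: rules whose hits stopped refreshing simply time out;
        # no operator cleanup.
        self.rules = {fp: r for fp, r in self.rules.items() if r[0] > now}
```

Refreshing `expires_at` on every hit means a rule lives exactly as long as the attack does, which is the auto-expiry behaviour step 6 describes.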

Why the defender wins

  • No central target. There is nothing to DDoS: every server is a scrubbing service, so you'd have to flood the entire fleet's capacity — which is, by construction, what the fleet is sized for.
  • No human-operator bottleneck. A 45-second attack doesn't wait for an on-call rotation. At scale, alerting humans per attack would make the SOC useless within a day (477 POPs × many attacks/day = infinite pager noise). Fully autonomous response is the only scalable posture.
  • Observability-free common case. "No alerts were triggered, no incidents caused" — the attack is a non-event. Humans look at retrospective aggregates (threat reports, customer dashboards) but not at per-attack paging.
  • Blast radius per node. A misfire (overly broad fingerprint) affects at most one node's traffic until gossip converges or the rule expires; not the entire fleet.
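
The pager-noise arithmetic above can be made concrete; the per-POP attack rate here is an assumption for illustration, not a published figure:

```python
# Even a modest assumed rate of 20 mitigated attacks per POP per day
# would swamp any human on-call rotation if each attack paged someone.
POPS = 477
ATTACKS_PER_POP_PER_DAY = 20              # assumed, not a published figure
pages_per_day = POPS * ATTACKS_PER_POP_PER_DAY
assert pages_per_day == 9540              # far beyond what a SOC can triage
```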

Where it works

  • DDoS mitigation at CDN scale — the Cloudflare instance. Same shape available on any anycast CDN with edge code-deployability (Fastly Compute@Edge, AWS Shield on CloudFront, etc., though public internals vary).
  • Bot detection at edge — same topology, different detection heuristics.
  • Rate-limiting / WAF — every edge POP enforces locally, central only stores the rule set.
  • Plausible-but-less-documented: anomaly detection on CDN metrics / logs where the action is "tag the response" rather than "drop the packet".

Where it doesn't work

  • Stateful attacks that need cross-POP correlation. A low-and-slow attack that spreads probes across many POPs, staying below each per-POP threshold, requires a central correlator. Autonomous per-node mitigation misses these; gossip helps but doesn't close the gap.
  • Application-layer attacks needing full request context. L7 bot detection often needs user-agent / behaviour history that doesn't fit in XDP programs; moves to a user-space proxy (systems/pingora) that is still per-POP but heavier-weight.
  • Environments without anycast. If the ingress topology concentrates traffic (single-region cloud deployments), the pattern's anycast premise is gone; centralised scrubbing is the natural alternative.
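
The cross-POP blind spot in the first bullet reduces to simple arithmetic; the numbers here are invented for illustration:

```python
# A distributed low-and-slow attack stays under every per-POP threshold
# while being obvious in aggregate — exactly what per-node detection misses.
POPS = 300
PER_POP_THRESHOLD = 100                   # hits/interval, illustrative

per_pop_rate = 9000 / POPS                # 30 per POP: invisible locally
fleet_total = per_pop_rate * POPS         # 9,000 fleet-wide: obvious centrally

assert per_pop_rate < PER_POP_THRESHOLD   # every node stays quiet
assert fleet_total > PER_POP_THRESHOLD    # a central correlator would fire
```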

Constraints / risks

  • Fingerprint-compile latency is the attack-onset-to-drop budget. If compile+deploy takes seconds, the first seconds of an attack may reach origin. Cloudflare doesn't publish this number; "no incidents caused" implies it's well under the attack's 45-second duration.
  • False positives at the edge are hard to audit. Packets dropped in XDP are gone; no user-space log, by design (cost reasons). Retrospective analysis relies on sampled counters / fingerprint hit rates, not on captured packets.
  • Gossip convergence = fleet consistency window. Fingerprints derived at one POP lag at peer POPs. An attack moving faster than gossip convergence effectively gets re-discovered at each POP.
  • Heuristic escape hatch. A sufficiently-randomised attacker could starve the permutation search — mitigation efficacy is not a guarantee, just a high-90%s heuristic floor.
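
The gossip-convergence window can be sized with a toy push-gossip model. Cloudflare doesn't publish its propagation mechanics; this assumes each informed node pushes to a few random peers per round:

```python
import random

def gossip_rounds(n_nodes=477, fanout=3, seed=0):
    """Rounds for one fingerprint to reach every node when each informed
    node pushes to `fanout` random peers per round. Illustrative model,
    not the real propagation protocol."""
    rng = random.Random(seed)
    informed = {0}                        # the POP that derived the fingerprint
    rounds = 0
    while len(informed) < n_nodes:
        for _ in range(len(informed) * fanout):
            informed.add(rng.randrange(n_nodes))  # peers may already know it
        rounds += 1
    return rounds
```

Epidemic-style push gossip converges in roughly O(log N) rounds, so with a per-second gossip interval a 45-second attack comfortably outlives the consistency window; an attack shorter than a handful of rounds would be re-discovered independently at laggard POPs, as the bullet above notes.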
