PATTERN Cited by 1 source

Fleet-wide retroactive threat hunt

Fleet-wide retroactive threat hunt is the pattern of responding to any critical vulnerability disclosure by searching historical fleet-wide logs backwards in time for signs that the vulnerability was exploited before it was publicly known. The pattern runs in parallel with live-traffic detection and runtime mitigation; it's the forensic-evidence pillar of the assume-compromise posture. The canonical wiki articulation comes from the 2026-05-07 Copy Fail response post, in which Cloudflare declared a security incident on 2026-04-30 03:14 UTC, hours after public disclosure, and searched 48 hours of fleet-wide kernel logs for the Copy Fail exploit's distinctive signature.

Structure

The pattern has five investigative pillars, searched in parallel within the retroactive window:

  1. Kernel / application logs. The exploit's distinctive trace in kernel logs is searched across the fleet for a chosen window (Copy Fail case: 48 hours). Centralised logging is the substrate — without fleet-wide log aggregation, this pillar is infeasible at scale.
  2. Access logs. Who connected to potentially-affected systems, when, what commands they ran. Reconstructs the interactive activity timeline for the window, independent of automated exploit signatures.
  3. Binary integrity. Cryptographic hashes of system binaries compared against known-good package manifests. For page-cache-poisoning exploits such as Copy Fail, this check is complementary: on-disk hashes remain clean because the tainting is in-memory only, so binary integrity is a hygiene check rather than the primary signal.
  4. Persistence audit. Look for common post-exploitation persistence (cron jobs, systemd units, shell profile modifications, loaded kernel modules, unusual setuid binaries, scheduled tasks).
  5. Network connection audit. Unusual egress, lateral-movement indicators, unexpected inbound connections, data-exfiltration signatures.

All five pillars must come back clean before asserting "no compromise". Any single pillar raising a signal triggers full incident-response escalation.
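
The all-clean rule above can be sketched as a small aggregation step. This is a minimal illustration, not the source's tooling: the `Pillar`, `PillarResult`, and `verdict` names are hypothetical, and the sketch assumes each pillar reports a list of findings where an empty list means clean.

```python
from dataclasses import dataclass, field
from enum import Enum


class Pillar(Enum):
    KERNEL_APP_LOGS = 1
    ACCESS_LOGS = 2
    BINARY_INTEGRITY = 3
    PERSISTENCE = 4
    NETWORK = 5


@dataclass
class PillarResult:
    pillar: Pillar
    # Hypothetical convention: an empty list means the pillar came back clean.
    findings: list[str] = field(default_factory=list)


def verdict(results: list[PillarResult]) -> str:
    """Combine pillar results into a single outcome.

    - "escalate":   at least one pillar raised a signal.
    - "incomplete": no signal so far, but not every pillar has reported.
    - "clean":      all five pillars reported and none raised a signal.
    """
    if any(r.findings for r in results):
        return "escalate"
    reported = {r.pillar for r in results}
    if set(Pillar) - reported:
        return "incomplete"
    return "clean"
```

A finding takes priority over a missing pillar, so escalation does not wait for the remaining searches to finish. Treating an unreported pillar as "incomplete" rather than "clean" keeps a partial investigation from being read as proof of no compromise.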

When it fits

  • A critical vulnerability has just been publicly disclosed (or is about to be).
  • The vulnerability could plausibly have been exploited by well-resourced adversaries before public disclosure.
  • The fleet has centralised fleet-wide logging with retention covering at least the retroactive window.
  • The exploit has a distinctive signature in at least one of the five pillars (most do: kernel logs, access patterns, process-chain anomalies).
  • The team has dedicated security engineers to run the investigation within hours.

When it doesn't fit

  • No centralised logging. Without fleet-wide log aggregation, retroactive search is infeasible at scale.
  • Short log retention. If logs have rotated out before the search starts, the retroactive window is zero.
  • Silent exploits with no log trace. Exploits that leave no kernel or application-log signature are invisible to pillar 1; the investigation must rely on pillars 2–5, which do not depend on the exploit's own signature.
  • Time-bounded incident. If the incident is actively ongoing and mitigation cycles are more urgent than historical forensics, the retroactive hunt can wait a few hours until the containment workstream stabilises.
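
The retention constraint can be checked before the hunt starts. The sketch below is an illustration under stated assumptions: `searchable_window` is a hypothetical helper, and the 24-hour and 7-day retention values in the usage note are made up for the example, not taken from the source.

```python
from datetime import datetime, timedelta, timezone


def searchable_window(hunt_start: datetime,
                      window: timedelta,
                      retention: timedelta):
    """Return (start, end, fully_covered) for the retained part of the window.

    The desired window is [hunt_start - window, hunt_start]. Logs older than
    hunt_start - retention are assumed to have rotated out, so the searchable
    start is the later of the two bounds.
    """
    wanted_start = hunt_start - window
    oldest_retained = hunt_start - retention
    start = max(wanted_start, oldest_retained)
    return start, hunt_start, start == wanted_start


if __name__ == "__main__":
    hunt_start = datetime(2026, 4, 30, 3, 14, tzinfo=timezone.utc)
    for retention in (timedelta(hours=24), timedelta(days=7)):
        start, end, ok = searchable_window(hunt_start, timedelta(hours=48), retention)
        print(retention, start.isoformat(), end.isoformat(), ok)
```

With the hypothetical 24-hour retention, only half of a 48-hour window is still searchable and `fully_covered` is false; with 7 days of retention the whole window is available.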

Structural properties

  • Runs in parallel with other incident-response workstreams. The parallel workstreams are blast-radius mapping, detection coverage validation, runtime mitigation engineering, patched-kernel rollout, and the retroactive threat hunt. Each is a distinct team / workstream.
  • Window length is a parameter. Cloudflare chose 48 hours for Copy Fail; longer windows cover more potential pre-disclosure exploitation but require proportional log retention and search cost.
  • Fleet-wide aggregation is load-bearing. Per-host log search doesn't scale to 100,000+ servers. The pillars rely on centralised logging infrastructure (e.g. Splunk / Elastic / ClickHouse / big-data platforms).
  • Complementary to live detection, not substitutive. Live behavioural detection runs continuously and catches new exploitation as it happens. The retroactive hunt catches pre-detection or pre-disclosure exploitation that slipped through. Both are needed.
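
To make the window-length trade-off concrete, the sketch below computes how much pre-disclosure time a given window covers, using the two Copy Fail timestamps from this page. It assumes the window is anchored at the incident declaration time, which the source does not state explicitly; `pre_disclosure_coverage` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Timestamps as given on this page.
DISCLOSED = datetime(2026, 4, 29, 16, 0, tzinfo=timezone.utc)
DECLARED = datetime(2026, 4, 30, 3, 14, tzinfo=timezone.utc)


def pre_disclosure_coverage(window: timedelta,
                            hunt_start: datetime = DECLARED,
                            disclosed: datetime = DISCLOSED) -> timedelta:
    """Time before public disclosure that falls inside the retroactive window.

    Assumption: the window ends at hunt_start and extends `window` backwards.
    Returns zero if the window starts after disclosure.
    """
    window_start = hunt_start - window
    return max(disclosed - window_start, timedelta(0))


if __name__ == "__main__":
    for hours in (24, 48, 72):
        print(hours, pre_disclosure_coverage(timedelta(hours=hours)))
```

Under that anchoring assumption, the 48-hour window covers 36 hours 46 minutes before disclosure, because 11 hours 14 minutes of the window fall after it. A 24-hour window would cover 12 hours 46 minutes, and a 72-hour window 60 hours 46 minutes.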

Canonical instance: Copy Fail 48-hour hunt (2026-04-30 03:14 UTC)

  • Trigger: Copy Fail disclosed 2026-04-29 16:00 UTC.
  • Declaration: Security incident declared 2026-04-30 03:14 UTC. Threat-hunt workstream begins.
  • Window: 48 hours retroactive, covering the period before public disclosure. The threat-model concern is that well-resourced adversaries may have known about the vulnerability pre-disclosure.
  • Pillar 1 — Kernel logs: The exploit leaves a distinctive trace in kernel logs when it runs. Searched fleet-wide centralised logs across the 48-hour window.
  • Pillar 2 — Access logs: Pulled for affected systems; reconstructed who connected, when, what commands they ran.
  • Pillar 3 — Binary integrity: System binaries (e.g. /usr/bin/su, the canonical Copy Fail target) validated against known-good package manifests.
  • Pillar 4 — Persistence audit: Looked for common post-exploitation persistence mechanisms.
  • Pillar 5 — Network audit: Audited network connections for anything unusual.
  • Result: "Everything was clean." Canonical wiki instance of a clean result on all five pillars.
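
The pillar 3 check can be sketched as a hash comparison against a manifest. This is a minimal illustration, not Cloudflare's tooling: `integrity_mismatches` and the manifest shape (path mapped to expected SHA-256 hex digest) are assumptions, and a real deployment would take the expected hashes from the package manager's own records.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def integrity_mismatches(manifest: dict[str, str]) -> list[str]:
    """Return paths that are missing or whose hash differs from the manifest.

    `manifest` is a hypothetical mapping of absolute path -> expected
    SHA-256 hex digest, e.g. {"/usr/bin/su": "<known-good digest>"}.
    """
    mismatches = []
    for path_str, expected in manifest.items():
        path = Path(path_str)
        if not path.is_file() or sha256_of(path) != expected.lower():
            mismatches.append(path_str)
    return mismatches
```

An empty return value corresponds to a clean pillar 3. As the Structure section notes, for page-cache-poisoning exploits this is a hygiene check rather than the primary signal.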

Failure modes

  • Window too short. An adversary who exploited the vulnerability before the start of the retroactive window escapes the search. Extending the window increases search cost and requires log retention to match.
  • Log retention gap. If logs rotate out before the search starts, pillars 1 and 2 have nothing to search.
  • Signature too broad. Searching for too-generic indicators ("any AF_ALG socket open") surfaces huge false-positive counts and drowns signal in noise. Prefer tight signatures from the specific exploit disclosure.
  • Signature too narrow. A signature tied to one exploit variant may miss adapted variants. Live behavioural detection, the sibling pattern, provides the complementary variant-agnostic coverage.
  • Centralised logging bottleneck. At fleet scale, retroactive search itself can stress the logging infrastructure. Schedule with awareness of the logging system's capacity.
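
The signature-breadth trade-off can be illustrated with a toy search. Everything in this sketch is synthetic: the log lines are placeholders rather than real kernel output, and the three patterns are hypothetical and unrelated to the actual Copy Fail signature, which this page does not reproduce.

```python
import re
from collections import Counter

# Synthetic placeholder lines: NOT real kernel output, NOT the Copy Fail trace.
LOG_LINES = [
    ("host-001", "generic-subsystem-open pid=101"),
    ("host-001", "generic-subsystem-open pid=102"),
    ("host-002", "generic-subsystem-open pid=340"),
    ("host-002", "generic-subsystem-open pid=341 exploit-trace-v1"),
    ("host-003", "generic-subsystem-open pid=77 exploit-trace-v2"),
]

# Hypothetical signatures of decreasing breadth.
BROAD = re.compile(r"generic-subsystem-open")    # matches routine activity too
FAMILY = re.compile(r"exploit-trace-v\d+\b")     # matches any variant of the trace
NARROW = re.compile(r"exploit-trace-v1\b")       # matches one variant only


def hits_per_host(pattern: re.Pattern, lines) -> Counter:
    """Count matching log lines per host for one signature."""
    return Counter(host for host, message in lines if pattern.search(message))


if __name__ == "__main__":
    for name, pattern in (("broad", BROAD), ("family", FAMILY), ("narrow", NARROW)):
        print(name, dict(hits_per_host(pattern, LOG_LINES)))
```

On this toy data the broad pattern flags every line on every host, the narrow pattern finds only host-002 and misses the adapted variant on host-003, and the family pattern finds both without the routine noise.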

Sibling patterns

  • Assume-compromise posture: the standing attitude the threat hunt executes on. The pattern is the posture's operational realisation.
  • Behavioral detection: the live-traffic counterpart. Runs continuously; catches new exploitation. The retroactive hunt catches what happened before live detection fired.
  • Log-based forensic analysis: the general incident-response primitive. The retroactive threat hunt is a specific structured application tied to assume-compromise posture + critical vulnerability disclosure.

Seen in

  • 2026-05-07 — Cloudflare Copy Fail response. Canonical wiki first-class page. Security incident declared 2026-04-30 03:14 UTC, hours after Copy Fail disclosure. 48-hour window across fleet-wide kernel logs searched for the exploit's distinctive trace; access logs reconstructed; binary integrity validated; persistence mechanisms audited; network connections audited. Result: "Everything was clean." Load-bearing framing: "Our security team operates on a simple principle for critical vulnerabilities: assume compromise until you can prove otherwise." (Source: sources/2026-05-07-cloudflare-copy-fail-linux-vulnerability-response)