PATTERN Cited by 1 source

Visibility before enforcement rollout

Visibility before enforcement rollout is a two-gate deployment discipline for rolling out any runtime enforcement mechanism (a firewall rule, an LSM hook denial, a seccomp filter, an access control) that depends on an allow-list of legitimate callers. The first gate ships measurement only: deploy observability that emits per-caller usage metrics across the fleet. The second gate, behind a separate deployment flag, ships the enforcement program, now parameterised by the empirically validated allow-list. The canonical articulation on this wiki comes from the 2026-05-07 Copy Fail response post, in which Cloudflare used the pattern to roll out a bpf-lsm program denying AF_ALG socket binds without breaking internal services that legitimately use the kernel crypto API.

Structure

  Phase 1: Visibility
  ┌──────────────────────────────────┐
  │  Deploy measurement tool         │
  │  (eBPF exporter, audit log,      │
  │   monitoring mode rule, etc.)    │
  │                                  │
  │  Emit per-caller metrics         │
  │  Aggregate across fleet          │
  │  Validate allow-list empirically │
  └──────────────────────────────────┘
                 │  separate deployment gate
  Phase 2: Enforcement
  ┌──────────────────────────────────┐
  │  Deploy enforcement program      │
  │  parameterised by validated      │
  │  allow-list                      │
  │                                  │
  │  Default-deny for non-allow-     │
  │  listed; allow for allow-listed  │
  └──────────────────────────────────┘

Each phase rolls back independently. Phase 1 has almost no failure mode beyond the overhead of collecting metrics. Phase 2's failure mode is bounded by the allow-list validated in Phase 1: only a legitimate caller that the measurement missed can be broken by enforcement.
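The two gates can be sketched as two independent flags consulted on every call. This is a minimal Python model under stated assumptions, not the mechanism from the source: the names RolloutFlags, handle_call, visibility_enabled and enforcement_enabled are hypothetical, and a real deployment would put the measurement and the denial in separate tools.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutFlags:
    """Two independent gates (hypothetical flag names)."""
    visibility_enabled: bool = False   # Phase 1: measurement only
    enforcement_enabled: bool = False  # Phase 2: default-deny


def handle_call(
    caller: str,
    flags: RolloutFlags,
    allow_list: frozenset[str],
    counters: dict[str, int],
) -> bool:
    """Return True if the call is permitted.

    Measurement never changes the outcome; only the enforcement
    gate can turn a call into a denial.
    """
    if flags.visibility_enabled:
        counters[caller] = counters.get(caller, 0) + 1
    if flags.enforcement_enabled and caller not in allow_list:
        return False
    return True
```

With only visibility_enabled set, every caller is still permitted and merely counted. Turning enforcement_enabled off again rolls back Phase 2 without touching the measurement.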

When it fits

  • Rolling out an enforcement mechanism (denial, block, drop) against a subsystem (kernel API, network port, syscall, filesystem path, auth endpoint) with a finite known set of legitimate callers.
  • You have a plausible allow-list but aren't certain it's exhaustive.
  • The cost of breaking a legitimate-but-unknown caller is high (internal outage, customer impact) and the cost of running measurement first is low.
  • An observability primitive exists that can measure usage per caller without changing behaviour (eBPF, audit logs, reverse-proxy logs, OS security events).
  • Scale is too large for manual per-caller verification — fleet-wide aggregation is required.

When it doesn't fit

  • Active exploitation is ongoing. If a vulnerability is being actively exploited, enforcement has to ship first; measurement-first is a luxury only available when the attacker hasn't already found the door.
  • Zero known legitimate users. If the subsystem is expected to have no legitimate callers, there's no allow-list to validate; enforcement can ship directly with an empty allow-list (but monitoring-mode is still a wise safety net).
  • Measurement itself is expensive or risky. If the observability primitive has non-trivial kernel or process overhead, the measurement phase isn't free and the cost-benefit shifts.
  • Workload is short-tailed. If the fleet runs only a handful of well-understood services, manual verification may be cheaper than building measurement tooling.

Structural properties

  • Two independent deployment gates. Phase 1 and Phase 2 use distinct config keys / feature flags / deployment mechanisms. Rolling out Phase 2 is a separate approval.
  • Phase 1 produces empirical evidence for Phase 2's allow-list. The measurement data is the validation artifact — not a design document, not an architect's intuition, but per-caller usage metrics aggregated across the fleet.
  • Phase 1 runs silently. No enforcement behaviour changes during Phase 1; legitimate callers see no difference. The only delta is metrics-endpoint load.
  • Phase 2 can iterate on the allow-list. If a new legitimate caller is discovered after Phase 2 ships (e.g. via Phase 1 metrics continuing to fire after enforcement), the allow-list can be updated without redeploying the enforcement mechanism.
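The way Phase 1 metrics become the validation artifact for the allow-list can be sketched as a fleet-wide aggregation followed by a set comparison. This is an illustrative Python sketch: aggregate_fleet, compare_to_allow_list and the shape of the per-host samples (a mapping from caller to count) are assumptions, not part of the source.

```python
from collections import Counter
from typing import Iterable, Mapping


def aggregate_fleet(per_host: Iterable[Mapping[str, int]]) -> Counter:
    """Sum per-caller usage counts reported by every host."""
    total: Counter = Counter()
    for host_counts in per_host:
        total.update(host_counts)
    return total


def compare_to_allow_list(observed: Counter, candidate: set) -> tuple:
    """Split the evidence into callers the candidate list misses
    and entries that the fleet never exercised."""
    unexpected = {c: n for c, n in observed.items() if c not in candidate}
    unused = candidate - set(observed)
    return unexpected, unused
```

An empty unexpected result is the evidence that gates Phase 2. A non-empty one names callers that are either missing from the allow-list or should not exist. Entries in unused are candidates for removal, or a hint that the measurement window was too short.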

Canonical instance: Copy Fail bpf-lsm rollout (2026-04-30)

From the Cloudflare 2026-05-07 post, verbatim:

"So the bpf-lsm rollout was deliberately staged in two steps:

    1. Get visibility first. Push the ebpf-exporter config gated by salt. Confirm at the metric layer that the known service is effectively the only thing creating AF_ALG sockets.
    2. Then enforce. Push the bpf-lsm program behind a separate enforcement gate."

  • Phase 1 (2026-04-30 afternoon). An ebpf_exporter config was pushed via salt; it hooks the socket() syscall and emits a per-binary AF_ALG socket-creation metric. The metric was aggregated across hundreds of thousands of servers within hours and confirmed that the one known internal service was the sole legitimate AF_ALG user.
  • Phase 2 (2026-04-30 evening). The bpf-lsm program was pushed behind a separate enforcement gate. It denies socket_bind for AF_ALG unless the calling binary is on the allow-list. End-to-end verification on a previously vulnerable test node confirmed the exploit no longer works (patterns/bpf-lsm-allowlist-hook-denial).
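The Phase 2 decision described above can be modelled in a few lines. This is a plain Python model of the logic only, not the bpf-lsm program from the post: decide_socket_bind and the example paths are hypothetical, and the real program runs in the kernel. The constant 38 is the Linux address-family number for AF_ALG.

```python
import errno

AF_ALG = 38   # Linux address family for the kernel crypto API
AF_INET = 2   # shown only to illustrate a family the hook ignores


def decide_socket_bind(family: int, binary_path: str, allow_list: frozenset) -> int:
    """Model of the hook's return value: 0 allows, -EPERM denies.

    Only AF_ALG binds are subject to the allow-list; every other
    address family passes through untouched.
    """
    if family != AF_ALG:
        return 0
    if binary_path in allow_list:
        return 0
    return -errno.EPERM
```

Keeping the check scoped to a single address family is what bounds the blast radius: a mistake in the allow-list can only break AF_ALG users, which Phase 1 has already enumerated.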

Failure modes

  • Phase 1 data is incomplete. Short measurement windows can miss low-frequency legitimate callers; run measurement for long enough to catch the natural periodicity of the workload (batch jobs, weekly processes).
  • Allow-list ages. New legitimate callers added over time may not be on the allow-list. Keep the measurement running even after enforcement lands so new callers surface as anomalies, and have a fast path to update the allow-list.
  • Allow-list too broad. If the allow-list is keyed on binary path alone, an attacker-controlled process that manages to run from an allow-listed path bypasses the denial. Combine the path with additional validation (signature, cgroup, UID) for high-value cases.
  • Phase 1 / Phase 2 drift. If the measurement tool emits metrics that can't be easily correlated with the enforcement tool's allow-list semantics, the empirical validation is weakened. Prefer tools that share identity primitives (binary path, UID, cgroup).
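One way to address both the too-broad allow-list and Phase 1 / Phase 2 drift is to key measurement and enforcement on the same composite identity. The sketch below is an assumption-laden illustration: CallerIdentity, its fields and is_allowed are hypothetical names, and how each field is obtained depends on the tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallerIdentity:
    """Identity shared by the measurement and enforcement tools."""
    binary_path: str
    uid: int
    cgroup: str


def is_allowed(caller: CallerIdentity, allow_list: frozenset) -> bool:
    """Exact match on every field; a matching path alone is not enough."""
    return caller in allow_list
```

Because the dataclass is frozen, instances are hashable and can be used both as metric labels in Phase 1 and as allow-list entries in Phase 2. A process that reuses an allow-listed path under a different UID or cgroup does not match.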

Sibling patterns

  • Logging mode → enforcement mode — same discipline for WAF / firewall rules. Rule ships first in log-only mode; after log review validates no false positives, flips to enforce.
  • Data-driven allow-list monitoring mode — same discipline applied at the allow-list-building altitude.
  • patterns/staged-rollout — general staged deployment; visibility-before-enforcement is a specific case with explicit measurement semantics.
  • Expand-migrate-contract — schema-migration analogue. Expand the schema first (allow both old and new), migrate callers, contract the schema once callers are off the old shape. Same spirit (safe intermediate state).

Seen in

  • 2026-05-07 — Cloudflare Copy Fail response. Canonical wiki first-class page. Two-gate bpf-lsm rollout: ebpf_exporter config via salt in Phase 1 confirmed the allow-list empirically, then bpf-lsm program pushed behind a separate enforcement gate in Phase 2. "Before enabling enforcement, we verified that our known internal service was the sole legitimate AF_ALG user to avoid accidental outages." (Source: sources/2026-05-07-cloudflare-copy-fail-linux-vulnerability-response)