PATTERN Cited by 1 source

Visibility before enforcement rollout

Visibility before enforcement rollout is a two-gate deployment discipline for rolling out any runtime enforcement mechanism (a firewall rule, an LSM hook denial, a seccomp filter, an access control) that depends on an allow-list of legitimate callers. The first gate ships measurement only: deploy observability that emits per-caller usage metrics across the fleet. The second gate, behind a separate deployment flag, ships the enforcement program, now parameterised by the empirically validated allow-list. The canonical wiki articulation comes from the 2026-05-07 Copy Fail response post, in which Cloudflare used the pattern to roll out a bpf-lsm program denying AF_ALG socket binds without breaking internal services that legitimately use the kernel crypto API.

Structure

  Phase 1: Visibility
  ┌──────────────────────────────────┐
  │  Deploy measurement tool         │
  │  (eBPF exporter, audit log,      │
  │   monitoring mode rule, etc.)    │
  │                                  │
  │  Emit per-caller metrics         │
  │  Aggregate across fleet          │
  │  Validate allow-list empirically │
  └──────────────────────────────────┘
                 │
                 │  separate deployment gate
                 ▼
  Phase 2: Enforcement
  ┌──────────────────────────────────┐
  │  Deploy enforcement program      │
  │  parameterised by validated      │
  │  allow-list                      │
  │                                  │
  │  Default-deny for non-allow-     │
  │  listed; allow for allow-listed  │
  └──────────────────────────────────┘

Each phase rolls back independently. Phase 1 changes no behaviour, so its only real cost is metrics overhead. Phase 2's failure mode is bounded by the allow-list validated in Phase 1: only a legitimate caller that the measurement missed can break.
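
As a minimal sketch of the two-gate structure, assume a single config object with two independent flags. The names `RolloutConfig` and `handle_call`, the flag names, and the binary path are hypothetical and not taken from the source; the point is only that measurement and enforcement are switched separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutConfig:
    """Two independent gates (hypothetical names)."""
    visibility_enabled: bool = False    # Phase 1: measurement only
    enforcement_enabled: bool = False   # Phase 2: deny non-allow-listed callers
    allow_list: frozenset = frozenset()


def handle_call(cfg: RolloutConfig, caller: str, metrics: dict) -> bool:
    """Return True if the call is permitted under the given config."""
    if cfg.visibility_enabled:
        # Phase 1 never changes behaviour; it only counts per-caller usage.
        metrics[caller] = metrics.get(caller, 0) + 1
    if cfg.enforcement_enabled and caller not in cfg.allow_list:
        return False  # Phase 2: default-deny for callers not on the allow-list
    return True


# Phase 1 only: everything is still permitted, usage is counted.
phase1 = RolloutConfig(visibility_enabled=True)

# Phase 2 added behind its own flag, parameterised by the validated allow-list.
phase2 = RolloutConfig(
    visibility_enabled=True,
    enforcement_enabled=True,
    allow_list=frozenset({"/usr/bin/known-service"}),  # hypothetical path
)
```

In this sketch, rolling back Phase 2 means flipping only `enforcement_enabled`; measurement keeps running and keeps producing evidence.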

When it fits

You are rolling out an enforcement mechanism (denial, block, drop) against a subsystem (kernel API, network port, syscall, filesystem path, auth endpoint) with a finite, known set of legitimate callers.
You have a plausible allow-list but aren't certain it's exhaustive.
The cost of breaking a legitimate-but-unknown caller is high (internal outage, customer impact) and the cost of running measurement first is low.
An observability primitive exists that can measure usage per caller without changing behaviour (eBPF, audit logs, reverse-proxy logs, OS security events).
Scale is too large for manual per-caller verification — fleet-wide aggregation is required.

When it doesn't fit

Active exploitation is ongoing. If a vulnerability is being actively exploited, enforcement has to ship first; measurement-first is a luxury only available when the attacker hasn't already found the door.
Zero known legitimate users. If the subsystem is expected to have no legitimate callers, there's no allow-list to validate; enforcement can ship directly with an empty allow-list (but monitoring-mode is still a wise safety net).
Measurement itself is expensive or risky. If the observability primitive has non-trivial kernel or process overhead, the measurement phase isn't free and the cost-benefit shifts.
Workload is short-tailed. If the fleet runs only a handful of well-understood services, manual verification may be cheaper than building measurement tooling.

Structural properties

Two independent deployment gates. Phase 1 and Phase 2 use distinct config keys / feature flags / deployment mechanisms. Rolling out Phase 2 is a separate approval.
Phase 1 produces empirical evidence for Phase 2's allow-list. The measurement data is the validation artifact — not a design document, not an architect's intuition, but per-caller usage metrics aggregated across the fleet.
Phase 1 runs silently. No enforcement behaviour changes during Phase 1; legitimate callers see no difference. The only delta is metrics-endpoint load.
Phase 2 can iterate on the allow-list. If a new legitimate caller is discovered after Phase 2 lands (e.g. because Phase 1 metrics keep firing after enforcement), the allow-list can be updated without redeploying the enforcement mechanism.
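
The evidence step described above can be sketched as a fleet-wide aggregation followed by a diff against the proposed allow-list. This is an illustration under assumptions: each host is assumed to report a simple mapping of caller to count, and the function names and paths are hypothetical rather than part of any real exporter.

```python
from collections import Counter


def aggregate(per_host_counts):
    """Sum the per-caller counters reported by each host into one fleet view."""
    total = Counter()
    for counts in per_host_counts:
        total.update(counts)  # Counter.update adds counts rather than replacing
    return total


def unexpected_callers(fleet_counts, allow_list):
    """Return callers observed in the fleet that the proposed allow-list misses."""
    return {
        caller: count
        for caller, count in fleet_counts.items()
        if count > 0 and caller not in allow_list
    }


# Hypothetical reports from two hosts.
reports = [
    {"/usr/bin/known-service": 3},
    {"/usr/bin/known-service": 1, "/opt/batch/job": 2},
]
fleet = aggregate(reports)
missing = unexpected_callers(fleet, {"/usr/bin/known-service"})
```

An empty result is the evidence that the allow-list is complete for the measured window; a non-empty result names the callers to investigate before the enforcement gate opens.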

Canonical instance: Copy Fail bpf-lsm rollout (2026-04-30)

From the Cloudflare 2026-05-07 post, verbatim:

"So the bpf-lsm rollout was deliberately staged in two steps:

Get visibility first. Push the ebpf-exporter config gated by salt. Confirm at the metric layer that the known service is effectively the only thing creating AF_ALG sockets.

Then enforce. Push the bpf-lsm program behind a separate enforcement gate."

Phase 1 (2026-04-30 afternoon). An ebpf_exporter config, pushed via salt, hooks the socket() syscall and emits a per-binary AF_ALG socket-creation metric. The metric was aggregated across hundreds of thousands of servers within hours and confirmed that the one known internal service was the sole legitimate AF_ALG user.
Phase 2 (2026-04-30 evening). The bpf-lsm program was pushed behind a separate enforcement gate. It denies socket_bind for AF_ALG unless the calling binary is on the allow-list. End-to-end verification on a previously vulnerable test node confirmed that the exploit no longer works (patterns/bpf-lsm-allowlist-hook-denial).
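
The Phase 2 decision can be modelled in a few lines. The real enforcement is an eBPF program attached to an LSM hook in the kernel, which the source does not reproduce; what follows is only a Python model of the decision logic, following the LSM convention of returning 0 to allow and a negative errno to deny. The function name, the allow-list contents, and the paths are hypothetical.

```python
import errno

AF_ALG = 38  # Linux address family number for the kernel crypto API

# Hypothetical allow-list; in the pattern this comes from Phase 1 evidence.
ALLOW_LIST = {"/usr/bin/known-service"}


def socket_bind_decision(family: int, binary_path: str) -> int:
    """Model of the hook: 0 allows the bind, a negative errno denies it."""
    if family != AF_ALG:
        return 0  # other address families are out of scope and untouched
    if binary_path in ALLOW_LIST:
        return 0  # validated legitimate caller
    return -errno.EPERM  # default-deny for every other AF_ALG bind
```

Scoping the check to one address family keeps the blast radius small: a mistake in the allow-list can only affect AF_ALG users, never ordinary network sockets.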

Failure modes

Phase 1 data is incomplete. Short measurement windows can miss low-frequency legitimate callers; run measurement for long enough to catch the natural periodicity of the workload (batch jobs, weekly processes).
Allow-list ages. New legitimate callers added over time may not be on the allow-list. Keep the measurement running even after enforcement lands so new callers surface as anomalies, and have a fast path to update the allow-list.
Allow-list too broad. If the allow-list is keyed on binary path alone, an attacker-controlled process that manages to run from an allow-listed path bypasses the denial. Combine the path with additional validation (signature, cgroup, UID) for high-value cases.
Phase 1 / Phase 2 drift. If the measurement tool emits metrics that can't be easily correlated with the enforcement tool's allow-list semantics, the empirical validation is weakened. Prefer tools that share identity primitives (binary path, UID, cgroup).
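
Two of the mitigations above lend themselves to a short sketch: checking that the measurement window spans at least one full cycle of every known periodic workload, and keying the allow-list on a composite identity instead of the path alone. The `Caller` tuple, the function names, and the sample values are hypothetical; the sketch also assumes that measurement and enforcement can both observe the same three identity fields.

```python
from datetime import timedelta
from typing import NamedTuple


class Caller(NamedTuple):
    """Identity shared by measurement and enforcement (hypothetical shape)."""
    binary_path: str
    uid: int
    cgroup: str


def window_is_long_enough(window, periods):
    """True if the measurement window covers one full cycle of every period."""
    return all(window >= period for period in periods)


def is_allowed(caller, allow_list):
    """Match on the whole identity tuple, not just the binary path."""
    return caller in allow_list


known = Caller("/usr/bin/known-service", 990, "/system.slice/known.service")
allow_list = {known}

# Same path, but a different UID and cgroup: no longer matches.
impostor = Caller("/usr/bin/known-service", 1000, "/user.slice")
```

Using one identity type for both phases is what keeps Phase 1 evidence directly comparable with the Phase 2 allow-list.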

Sibling patterns

Logging mode → enforcement mode — same discipline for WAF / firewall rules. Rule ships first in log-only mode; after log review validates no false positives, flips to enforce.
Data-driven allow-list monitoring mode — same discipline applied at the allow-list-building altitude.
patterns/staged-rollout — general staged deployment; visibility-before-enforcement is a specific case with explicit measurement semantics.
Expand-migrate-contract — schema-migration analogue. Expand the schema first (allow both old and new), migrate callers, contract the schema once callers are off the old shape. Same spirit (safe intermediate state).

Seen in

2026-05-07 — Cloudflare Copy Fail response. Canonical wiki first-class page. Two-gate bpf-lsm rollout: ebpf_exporter config via salt in Phase 1 confirmed the allow-list empirically, then bpf-lsm program pushed behind a separate enforcement gate in Phase 2. "Before enabling enforcement, we verified that our known internal service was the sole legitimate AF_ALG user to avoid accidental outages." (Source: sources/2026-05-07-cloudflare-copy-fail-linux-vulnerability-response)