PATTERN Cited by 1 source
Visibility before enforcement rollout¶
Visibility before enforcement rollout is a two-gate deployment discipline for rolling out any runtime enforcement mechanism (a firewall rule, an LSM hook denial, a seccomp filter, an access control) that depends on an allow-list of legitimate callers. The first gate ships measurement only: deploy observability that emits per-caller usage metrics across the fleet. The second gate, behind a separate deployment flag, ships the enforcement program, now parameterised by the empirically validated allow-list. The canonical articulation on this wiki comes from the 2026-05-07 Copy Fail response post, in which Cloudflare used the pattern to roll out a bpf-lsm program denying AF_ALG socket binds without breaking internal services that legitimately use the kernel crypto API.
Structure¶
Phase 1: Visibility
┌──────────────────────────────────┐
│ Deploy measurement tool          │
│ (eBPF exporter, audit log,       │
│ monitoring mode rule, etc.)      │
│                                  │
│ Emit per-caller metrics          │
│ Aggregate across fleet           │
│ Validate allow-list empirically  │
└──────────────────────────────────┘
                 │
                 │ separate deployment gate
                 ▼
Phase 2: Enforcement
┌──────────────────────────────────┐
│ Deploy enforcement program       │
│ parameterised by validated       │
│ allow-list                       │
│                                  │
│ Default-deny for non-allow-      │
│ listed; allow for allow-listed   │
└──────────────────────────────────┘
Each phase rolls back independently. Phase 1 changes no behaviour, so its only operational risk is the overhead of collecting metrics. Phase 2's failure mode is bounded by the allow-list validated in Phase 1.
When it fits¶
- Rolling out an enforcement mechanism (denial, block, drop) against a subsystem (kernel API, network port, syscall, filesystem path, auth endpoint) with a finite known set of legitimate callers.
- You have a plausible allow-list but aren't certain it's exhaustive.
- The cost of breaking a legitimate-but-unknown caller is high (internal outage, customer impact) and the cost of running measurement first is low.
- An observability primitive exists that can measure usage per caller without changing behaviour (eBPF, audit logs, reverse-proxy logs, OS security events).
- Scale is too large for manual per-caller verification — fleet-wide aggregation is required.
When it doesn't fit¶
- Active exploitation is ongoing. If a vulnerability is being actively exploited, enforcement has to ship first; measurement-first is a luxury only available when the attacker hasn't already found the door.
- Zero known legitimate users. If the subsystem is expected to have no legitimate callers, there's no allow-list to validate; enforcement can ship directly with an empty allow-list (but monitoring-mode is still a wise safety net).
- Measurement itself is expensive or risky. If the observability primitive itself has non-trivial kernel or process overhead, the measurement phase isn't free and the cost-benefit shifts.
- Workload is short-tailed. If the fleet runs only a handful of well-understood services, manual verification may be cheaper than building measurement tooling.
Structural properties¶
- Two independent deployment gates. Phase 1 and Phase 2 use distinct config keys / feature flags / deployment mechanisms. Rolling out Phase 2 is a separate approval.
- Phase 1 produces empirical evidence for Phase 2's allow-list. The measurement data is the validation artifact — not a design document, not an architect's intuition, but per-caller usage metrics aggregated across the fleet.
- Phase 1 runs silently. No enforcement behaviour changes during Phase 1; legitimate callers see no difference. The only delta is metrics-endpoint load.
- Phase 2 can iterate on the allow-list. If post-Phase-2 a new legitimate caller is discovered (e.g. via Phase 1 metrics continuing to fire after enforcement), the allow-list can be updated without redeploying the enforcement mechanism.
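The properties above can be illustrated with a minimal sketch. It is not taken from the source: the `Gates` and `Hook` classes, the flag names, and the binary paths are hypothetical, and a real deployment would express the two gates in its own config system. The sketch assumes a single hook that always counts callers once the visibility gate is open and only denies once the separate enforcement gate is open.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Gates:
    visibility: bool = False    # Phase 1 flag (hypothetical name)
    enforcement: bool = False   # Phase 2 flag, approved and flipped separately


@dataclass
class Hook:
    gates: Gates
    allow_list: set = field(default_factory=set)
    metrics: Counter = field(default_factory=Counter)

    def on_call(self, caller: str) -> bool:
        """Return True if the call may proceed."""
        if self.gates.visibility:
            # Measurement only: counting never changes the outcome.
            self.metrics[caller] += 1
        if self.gates.enforcement and caller not in self.allow_list:
            # Default-deny applies only once the second gate is open.
            return False
        return True


# Hypothetical paths, used only for illustration.
hook = Hook(Gates(visibility=True), allow_list={"/usr/bin/known-service"})
phase1 = hook.on_call("/usr/bin/other")        # Phase 1: counted, still allowed
hook.gates.enforcement = True                  # second gate opened separately
phase2 = hook.on_call("/usr/bin/other")        # Phase 2: counted, denied
hook.allow_list.add("/usr/bin/other")          # allow-list updated in place
phase2_after_update = hook.on_call("/usr/bin/other")
```

Because the allow-list is data rather than code in this sketch, adding a newly discovered caller does not require redeploying the hook, and the metrics keep accumulating after enforcement is switched on.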
Canonical instance: Copy Fail bpf-lsm rollout (2026-04-30)¶
From the Cloudflare 2026-05-07 post, verbatim:
"So the bpf-lsm rollout was deliberately staged in two steps:
- Get visibility first. Push the ebpf-exporter config gated by salt. Confirm at the metric layer that the known service is effectively the only thing creating AF_ALG sockets.
- Then enforce. Push the bpf-lsm program behind a separate enforcement gate."
- Phase 1 (2026-04-30 afternoon). ebpf_exporter config pushed via salt, hooks the socket() syscall, emits a per-binary AF_ALG socket-creation metric. Aggregated across hundreds of thousands of servers within hours. Confirmed the one known internal service was the sole legitimate AF_ALG user.
- Phase 2 (2026-04-30 evening). bpf-lsm program pushed behind a separate enforcement gate. Denies socket_bind for AF_ALG unless the calling binary is on the allow-list. End-to-end verification on a previously-vulnerable test node confirmed the exploit no longer works (patterns/bpf-lsm-allowlist-hook-denial).
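The Phase 1 validation step amounts to aggregating per-binary counts across hosts and checking that nothing outside the proposed allow-list appears. The following is a rough sketch of that check, not Cloudflare's tooling: the `(host, binary, count)` sample shape, the `unexpected_callers` function, and the paths are all assumptions made for illustration.

```python
from collections import defaultdict


def unexpected_callers(samples, allow_list):
    """Summarise callers seen in the fleet that are not on the allow-list.

    samples: iterable of (host, binary, count) tuples (assumed shape).
    Returns {binary: (total_count, number_of_hosts)} for unlisted binaries.
    """
    totals = defaultdict(int)
    hosts = defaultdict(set)
    for host, binary, count in samples:
        if count > 0:
            totals[binary] += count
            hosts[binary].add(host)
    return {
        binary: (totals[binary], len(hosts[binary]))
        for binary in totals
        if binary not in allow_list
    }


# Hypothetical fleet samples, for illustration only.
samples = [
    ("host-a", "/usr/bin/known-service", 120),
    ("host-b", "/usr/bin/known-service", 95),
    ("host-b", "/usr/bin/batch-job", 2),
    ("host-c", "/usr/bin/batch-job", 1),
]
report = unexpected_callers(samples, {"/usr/bin/known-service"})
```

An empty report supports the proposed allow-list for the window that was measured. A non-empty one, as in this example, names the callers to investigate before the enforcement gate is opened.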
Failure modes¶
- Phase 1 data is incomplete. Short measurement windows can miss low-frequency legitimate callers; run measurement for long enough to catch the natural periodicity of the workload (batch jobs, weekly processes).
- Allow-list ages. New legitimate callers added over time may not be on the allow-list. Keep the measurement running even after enforcement lands so new callers surface as anomalies, and have a fast path to update the allow-list.
- Allow-list too broad. If the allow-list matches on binary path alone, an attacker-controlled process that can present an allow-listed path bypasses the denial. Combine with additional validation (signature, cgroup, UID) for high-value cases.
- Phase 1 / Phase 2 drift. If the measurement tool emits metrics that can't be easily correlated with the enforcement tool's allow-list semantics, the empirical validation is weakened. Prefer tools that share identity primitives (binary path, UID, cgroup).
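The point about identity primitives can be made concrete with a small sketch. It assumes, hypothetically, that both the measurement tool and the enforcement tool describe a caller by the same `(path, uid, cgroup)` tuple. The `Caller` type, the helper functions, and the example values are illustrative and not part of the source.

```python
from typing import NamedTuple


class Caller(NamedTuple):
    path: str     # binary path
    uid: int      # user ID the process runs as
    cgroup: str   # cgroup the process belongs to


def allowed_by_path(caller, allowed_paths):
    """Path-only matching: weak if an attacker can present the same path."""
    return caller.path in allowed_paths


def allowed_by_identity(caller, allowed_identities):
    """Match on the full tuple that Phase 1 also recorded."""
    return caller in allowed_identities


# Hypothetical identities, for illustration only.
legit = Caller("/usr/bin/known-service", 990, "/system.slice/known-service.service")
impostor = Caller("/usr/bin/known-service", 1000, "/user.slice/user-1000.slice")

path_only = allowed_by_path(impostor, {legit.path})
full_identity = allowed_by_identity(impostor, {legit})
```

In this sketch the impostor passes the path-only check and fails the tuple check. Recording the same tuple in Phase 1 metrics keeps the measured data directly comparable to what Phase 2 enforces.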
Sibling patterns¶
- Logging mode → enforcement mode — same discipline for WAF / firewall rules. Rule ships first in log-only mode; after log review validates no false positives, flips to enforce.
- Data-driven allow-list monitoring mode — same discipline applied at the allow-list-building altitude.
- patterns/staged-rollout — general staged deployment; visibility-before-enforcement is a specific case with explicit measurement semantics.
- Expand-migrate-contract — schema-migration analogue. Expand the schema first (allow both old and new), migrate callers, contract the schema once callers are off the old shape. Same spirit (safe intermediate state).
Seen in¶
- 2026-05-07: Cloudflare Copy Fail response. Canonical source for this page. Two-gate bpf-lsm rollout: ebpf_exporter config via salt in Phase 1 confirmed the allow-list empirically, then the bpf-lsm program was pushed behind a separate enforcement gate in Phase 2. "Before enabling enforcement, we verified that our known internal service was the sole legitimate AF_ALG user to avoid accidental outages." (Source: sources/2026-05-07-cloudflare-copy-fail-linux-vulnerability-response)