SYSTEM Cited by 8 sources
eBPF¶
eBPF (extended Berkeley Packet Filter) is a Linux kernel
subsystem that lets user-space load small, verifier-gated programs
that run in kernel context in response to system events (syscalls,
network packets, kprobes, uprobes, tracepoints, scheduler hooks, LSM
hooks, XDP, etc.). It replaces kernel-module development for most
observability, networking, and security use cases: no custom-compiled
kernel, no stability risk from a runaway module, safer than ptrace
or the Linux Audit framework for equivalent visibility.
Execution model¶
- Programs are attached to kernel hooks (kprobes, uprobes, tracepoints, XDP, TC classifiers, socket filters, cgroup attachments, LSM hooks, raw tracepoints).
- The verifier statically proves termination, memory-safety, and bounded complexity before loading — this is the load-bearing safety mechanism, and the primary source of cross-kernel variability.
- Programs can read/write eBPF maps — typed key/value data structures (hash, LRU hash, array, per-CPU array, ring buffer, task/inode storage, …) shared with user-space.
- Output to user-space is typically via ring buffer maps (or the older perf buffer); the consumer mmaps the ring and reads events.
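The producer/consumer contract of a ring-buffer map can be modelled in user space. The sketch below is only a toy model, not the kernel API: `RingBufferModel`, `submit`, and `consume` are hypothetical names. It mirrors one real property: in the kernel, `bpf_ringbuf_reserve` returns NULL when the buffer has no room, so the producer has to count or drop the event itself.

```python
from collections import deque


class RingBufferModel:
    """User-space toy model of a bounded event ring (not the kernel API)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.records = deque()
        self.dropped = 0

    def submit(self, record):
        """Producer side: refuse the record when it does not fit."""
        if self.used + len(record) > self.capacity:
            self.dropped += 1  # models a failed reservation
            return False
        self.records.append(record)
        self.used += len(record)
        return True

    def consume(self):
        """Consumer side: drain everything and release the space."""
        drained = list(self.records)
        self.records.clear()
        self.used = 0
        return drained


ring = RingBufferModel(capacity_bytes=16)
results = [ring.submit(b"event-%d" % i) for i in range(3)]  # 7 bytes each
first_batch = ring.consume()
```

Two 7-byte records fit in 16 bytes and the third is dropped; after the consumer drains the ring, the producer can write again.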
Why production platforms adopt it¶
- Process + container context at syscall granularity: hooks can read task_struct, mount namespace, cgroup, etc., giving event-level attribution that inotify/fanotify lack.
- Performance. In-kernel execution avoids context switches to user space for the common case (filter / drop / count); generally lower overhead than ptrace or audit.
- Safety. Verifier + JIT sandbox replaces the correctness + ABI-stability risks of custom kernel modules.
- Unified visibility. Process + FS + network + LSM through one mechanism, with no need to combine inotify + Netlink + tracepoints separately.
- Namespace / cgroup / container consistency: works uniformly across containerised workloads.
- BPF LSM enables mandatory-access-control enforcement, not just observation.
- Programmable data plane. User-space is the control plane that compiles rules into maps; eBPF programs + maps are the data plane (concepts/control-plane-data-plane-separation).
Portability: CO-RE¶
eBPF's biggest operational improvement has been
Compile Once – Run Everywhere: BTF-metadata-driven field-offset
patching at load time, letting one compiled program run across
kernels with different task_struct / sk_buff / etc. layouts.
When CO-RE isn't available (older kernels without BTF, some
distros), a fallback chain of runtime offset-guessing and
hard-coded offsets can carry support back to ~kernel 4.14.
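The fallback chain can be sketched as a strategy selector. This is a minimal illustration under stated assumptions: `resolve_offset`, the `guess` callback, and the `HARDCODED_OFFSETS` table (including its placeholder number) are hypothetical. The only real detail is that kernels built with BTF expose their type information at `/sys/kernel/btf/vmlinux`.

```python
import os

BTF_PATH = "/sys/kernel/btf/vmlinux"  # present when the kernel ships BTF

# Hypothetical last-resort table: (release prefix, field) -> byte offset.
# The number is a placeholder, not a real task_struct offset.
HARDCODED_OFFSETS = {("4.14.", "task_struct.pid"): 1234}


def resolve_offset(field, kernel_release, btf_exists=os.path.exists, guess=None):
    """Return (strategy, offset) from the first strategy that applies."""
    if btf_exists(BTF_PATH):
        return ("co-re", None)  # the loader patches offsets at load time
    if guess is not None:
        guessed = guess(field)  # runtime offset-guessing, if provided
        if guessed is not None:
            return ("runtime-guess", guessed)
    for (prefix, name), offset in HARDCODED_OFFSETS.items():
        if name == field and kernel_release.startswith(prefix):
            return ("hard-coded", offset)
    raise LookupError(f"no offset source for {field} on {kernel_release}")
```

Passing `btf_exists` as a parameter keeps the selector testable on hosts with and without BTF.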
XDP as DDoS data plane¶
A different production shape from the Datadog / GitHub stories elsewhere on this page: Cloudflare uses XDP (eXpress Data Path) + eBPF as the per-server kernel drop plane for volumetric-DDoS mitigation for Magic Transit customers. The loop:
- XDP/eBPF program on the NIC's rx path samples packets into an eBPF map; samples are read by a user-space daemon.
- User-space dosd (denial-of-service daemon) analyses samples for packet-header commonalities / anomalies, enumerates fingerprint permutations, uses a data-streaming algorithm to pick the most-selective match.
- The winning fingerprint is compiled to an eBPF program and attached at XDP to drop matching packets before they enter the kernel stack — line-rate drop on modern NICs, at a per-packet cost close to the kernel's free-list write.
- Top fingerprints are gossiped/multicast across servers within a data centre and globally.
Scale: the 2025-06-20 writeup describes the 7.3 Tbps / 4.8 Bpps attack being autonomously dropped across 477 data centres / 293 locations via this pipeline. First wiki instance of XDP as a DDoS data plane (the Datadog / GitHub instances sit at syscall / cGroup hooks, not XDP).
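The fingerprint-selection step of the loop can be illustrated as follows. This is a sketch, not dosd's algorithm: the source describes a data-streaming algorithm, while this version uses exact counting over a small sample for clarity. The field names, the `min_share` threshold, and the reading of "most selective" as "most header fields while still covering enough samples" are all assumptions.

```python
from collections import Counter
from itertools import combinations

FIELDS = ("proto", "src_port", "dst_port", "ttl")  # hypothetical header fields


def pick_fingerprint(samples, min_share=0.5):
    """Return the most specific field/value combination seen in at least
    min_share of the samples, or None if nothing qualifies."""
    counts = Counter()
    for pkt in samples:
        for size in range(1, len(FIELDS) + 1):
            for combo in combinations(FIELDS, size):
                counts[tuple((f, pkt[f]) for f in combo)] += 1
    threshold = min_share * len(samples)
    candidates = [fp for fp, n in counts.items() if n >= threshold]
    if not candidates:
        return None
    # Prefer more fields (more selective), then more matching samples.
    return max(candidates, key=lambda fp: (len(fp), counts[fp]))


samples = (
    [{"proto": 17, "src_port": 123, "dst_port": p, "ttl": 54} for p in range(8)]
    + [{"proto": 6, "src_port": 40000 + i, "dst_port": 443, "ttl": 64} for i in range(2)]
)
fingerprint = pick_fingerprint(samples)
```

Here the eight flood-like samples share protocol, source port, and TTL but vary in destination port, so the selected fingerprint pins the three stable fields and ignores the varying one.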
Operating eBPF at scale is nontrivial¶
"Despite numerous claims that eBPF is safe, secure, and comes with negligible performance impact, the reality — especially at scale — is nuanced." — Datadog Workload Protection team, 5 years in (Source: sources/2026-01-07-datadog-hardening-ebpf-for-runtime-security)
Six operational pitfall classes the Datadog post documents:
- Kernel-version portability — verifier evolution, helper availability, hook-point naming, inlining inconsistencies, dead-code elimination, map-operation restrictions. Mitigated by CI matrices across kernels/distros, CO-RE + fallbacks, shared lifecycle library (systems/ebpf-manager), minimum-viable hook-set gating at startup.
- Syscall-hook coverage. Compat syscalls (32-bit on 64-bit), raw_tracepoints vs per-syscall tracepoints, syscall-number interpretation via thread_info, io_uring bypass of traditional syscall paths, new syscalls (e.g. openat2), exotic execution paths (binfmt_misc, call_usermodehelper, cgroup release agents, shebang interpreters).
- Hooks failing to trigger. kretprobe maxactive caps, HW-interrupt preemption of kprobes, kernel-module lifecycle (hooks lost on module unload). Mitigation: prefer exported symbols, prefer entry over return probes, watch module lifecycle events.
- Data-read correctness. Kernel-struct layout drift (CO-RE), user-space memory reads fail silently (page faults disabled in eBPF), TOCTOU between kernel copy and eBPF read, path resolution races, non-linear skb requiring bpf_skb_pull_data.
- eBPF map & cache pitfalls. Hashmap sizing, LRU semantics ("doesn't strictly adhere to traditional LRU"), BPF_F_NO_PREALLOC unbounded footprint + Kubernetes OOM, blocking syscalls pinning map entries, lost / out-of-order events from perf buffers (one reader CPU), cache-key collisions (PID reuse, inode sharing via hard links, mount-namespace resolution).
- Detection-rule correctness. Symlinks vs hard links, interpreter visibility (execve(*.py) running python), syscall args ≠ shell commands (env vars, $PATH resolution).
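The PID-reuse cache-key collision named in the list above has a common general-purpose mitigation: key the cache on the PID plus the process start time. The sketch below illustrates that idea only; it is not presented as Datadog's implementation, and `ProcessKey`, `remember`, and `lookup` are hypothetical names.

```python
from typing import NamedTuple


class ProcessKey(NamedTuple):
    pid: int
    start_time_ns: int  # assumed to be captured when the process starts


cache = {}


def remember(pid, start_time_ns, metadata):
    cache[ProcessKey(pid, start_time_ns)] = metadata


def lookup(pid, start_time_ns):
    return cache.get(ProcessKey(pid, start_time_ns))


remember(4242, 1_000, {"comm": "nginx"})
# The PID is later recycled by an unrelated process with a new start time,
# so the lookup misses instead of returning the old process's metadata.
stale = lookup(4242, 9_000)
fresh = lookup(4242, 1_000)
```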
Constraints¶
- Verifier limits are sharper on older kernels — complex rule evaluation often has to stay in user-space on a large fraction of real fleets. This is why patterns/two-stage-evaluation (cheap kernel → rich user-space) is a recurring shape.
- Ring-buffer throughput can still be outpaced by event production, driving the need for concepts/in-kernel-filtering before emit.
- Map memory is a bounded resource — LRU eviction trades coverage for RAM predictability; non-preallocated maps trade memory predictability for flexibility (bad trade in Kubernetes with enforced cgroup limits).
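The two-stage shape referenced above can be modelled in a few lines. This is a user-space sketch with hypothetical rules and names: `kernel_stage` stands in for the cheap check an eBPF program could make with a single map lookup, and `user_stage` for the richer rule evaluation that stays in user space.

```python
import fnmatch

# Stage 1 model: entries a kernel-side approver map might hold (hypothetical).
APPROVED_BASENAMES = {"passwd", "shadow"}

# Stage 2: richer rules evaluated in user space (hypothetical glob + ops).
RULES = [("/etc/*", {"write", "unlink"})]


def kernel_stage(event):
    """Emit only events whose basename is in the approver set."""
    return event["path"].rsplit("/", 1)[-1] in APPROVED_BASENAMES


def user_stage(event):
    """Full rule match on path glob and operation."""
    return any(
        fnmatch.fnmatch(event["path"], glob) and event["op"] in ops
        for glob, ops in RULES
    )


events = [
    {"path": "/etc/passwd", "op": "write"},
    {"path": "/tmp/build/passwd", "op": "write"},
    {"path": "/var/log/syslog", "op": "write"},
    {"path": "/etc/shadow", "op": "read"},
]
emitted = [e for e in events if kernel_stage(e)]
alerts = [e for e in emitted if user_stage(e)]
```

The first stage over-approximates (it lets `/tmp/build/passwd` through) but never needs glob matching; the second stage removes the false positives.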
Abuse / attack surface¶
eBPF's kernel access + map persistence + hook breadth make it a rootkit-capable mechanism when left unrestricted. Named incidents / PoCs / CVEs:
- ebpfkit (Datadog hackathon → BlackHat 2021 / DEF CON 29) — full eBPF rootkit: process hiding, network scanning, data exfiltration, C2, persistence.
- CVE-2023-2163, CVE-2024-41003 — real eBPF verifier exploits. The verifier is the last line of defence against unprivileged-eBPF kernel exploitation.
- Mitigations since: bpf_probe_write_user blocked in Kernel Lockdown integrity mode (default on most distros now).
- Hardening direction: Microsoft's Hornet LSM proposal for signed eBPF programs analogous to signed kernel modules.
Operational response (Datadog Workload Protection):
- Dedicated bpf event type in the agent capturing program loads, map ops, attachments fleet-wide.
- Per-program helper + map inventory → detection rules flag suspicious shapes (e.g. a network program sharing maps with a file-system program, or use of bpf_override_return).
- Defensive research (BlackHat 2022 "Return to Sender") on protecting eBPF-based detections from malicious disablement.
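A detection rule over a helper + map inventory might look like the sketch below. The inventory rows, program names, and `kind` labels are hypothetical and do not reflect the agent's real schema. The helper names are real kernel helpers, but treating this particular set as suspicious is an assumption made for illustration.

```python
# Hypothetical inventory rows, as an agent might record them at load time.
PROGRAMS = [
    {"name": "tc_egress", "kind": "network",
     "maps": {"flows", "shared"}, "helpers": {"bpf_skb_load_bytes"}},
    {"name": "vfs_open", "kind": "filesystem",
     "maps": {"paths", "shared"}, "helpers": {"bpf_d_path"}},
    {"name": "kp_patch", "kind": "tracing",
     "maps": set(), "helpers": {"bpf_override_return"}},
]
SUSPICIOUS_HELPERS = {"bpf_override_return", "bpf_probe_write_user"}


def flag(programs):
    """Return (program name, reason) pairs for suspicious shapes."""
    findings = []
    for prog in programs:
        for helper in sorted(prog["helpers"] & SUSPICIOUS_HELPERS):
            findings.append((prog["name"], f"uses {helper}"))
    for i, a in enumerate(programs):
        for b in programs[i + 1:]:
            shared = a["maps"] & b["maps"]
            if shared and {a["kind"], b["kind"]} == {"network", "filesystem"}:
                findings.append(
                    (a["name"], f"shares {sorted(shared)} with {b['name']}")
                )
    return findings


findings = flag(PROGRAMS)
```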
Multi-tenancy with other eBPF tools¶
Shared kernel resources (TC priorities + handles, cgroup program
ordering, XDP slots, LSM hook chains) are effectively an
inter-vendor protocol. The 2022 Datadog × systems/cilium
outage — two independently-correct products colliding on TC handle
0:1, one of them cleaning up the other's filters — is the named
case study. The generalised lesson is
patterns/shared-kernel-resource-coordination: safer default
priorities, conservative cleanup that never auto-deletes shared
resources, and explicit vendor coordination.
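The "conservative cleanup" rule can be made concrete with a toy model of TC filters on one interface. Everything here is hypothetical (class name, owner tags, the priority/handle values); real TC filters carry no owner field, so the sketch assumes each agent keeps its own ledger of what it created and checks it before deleting anything.

```python
class TcFilterTable:
    """Toy model of filters on one interface, keyed by (priority, handle)."""

    def __init__(self):
        self.filters = {}  # (priority, handle) -> tag of the installing tool

    def add(self, owner, priority, handle):
        key = (priority, handle)
        if key in self.filters:
            raise FileExistsError(f"{key} already used by {self.filters[key]}")
        self.filters[key] = owner
        return key

    def cleanup(self, owner, created_keys):
        """Remove only filters this owner recorded creating and still holds."""
        removed = []
        for key in created_keys:
            if self.filters.get(key) == owner:
                del self.filters[key]
                removed.append(key)
        return removed


table = TcFilterTable()
table.add("cni", priority=1, handle=1)
mine = [table.add("agent", priority=10, handle=1)]
# Even if the agent's ledger wrongly lists the CNI's slot, it is left alone.
removed = table.cleanup("agent", mine + [(1, 1)])
```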
Performance cost¶
Overhead depends heavily on:
- Hook type: uprobes cost far more than kprobes (2 extra context switches); raw tracepoints are far more efficient than kprobes (see Cloudflare's ebpf_exporter benchmark).
- Map type: BPF_MAP_TYPE_LRU_HASH needs cross-CPU sync (slow); BPF_MAP_TYPE_PERCPU_ARRAY is CPU-local (fast).
- Program complexity.
- Workload shape. raw_syscalls tracepoints notably affect connection-accept rates on edge nodes at Datadog scale.
cGroup-attached programs for process-set-scoped policy¶
A separate family of eBPF program types attaches at the Linux cGroup boundary rather than at host-wide hooks — making per-process-set policy enforceable without container isolation. Load-bearing types:
- BPF_PROG_TYPE_CGROUP_SKB: egress (and ingress) packet filter scoped to a cGroup. Return 0 to drop, 1 to allow. Operates on IPs/ports, not hostnames.
- BPF_PROG_TYPE_CGROUP_SOCK_ADDR: hooks socket connect4/connect6/bind/sendmsg syscalls; can rewrite the destination IP + port before the kernel proceeds. Composes with CGROUP_SKB to build name-aware policy (rewrite DNS traffic to a userspace proxy + enforce IP-level drops based on what the proxy resolved).
- BPF_PROG_TYPE_CGROUP_SOCK, cGroup-scoped LSM hooks: similar granularity for socket-creation and mandatory-access-control checks.
Attached via bpf(BPF_PROG_ATTACH) or cilium/ebpf's
link.AttachCgroup with e.g. AttachCGroupInet4Connect /
AttachCGroupInetEgress. Load-bearing for workloads that need
policy tighter than the host but broader than the individual
process — GitHub's deployment-safety firewall
(Source: sources/2026-04-16-github-ebpf-deployment-safety)
is the canonical wiki instance of this shape (see
patterns/cgroup-scoped-egress-firewall +
patterns/dns-proxy-for-hostname-filtering). Different axis
from Datadog's syscall-hook / TC-classifier attachments.
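How the connect-time rewrite and the packet-level verdict compose can be sketched in user space. This models the logic only, not an eBPF program or GitHub's implementation: the proxy address, the blocked suffix list, and the allow-list policy are assumptions. The 0/1 return values follow the CGROUP_SKB convention described above, and the IPs are documentation-range addresses.

```python
import ipaddress

DROP, ALLOW = 0, 1  # CGROUP_SKB verdicts
DNS_PROXY = ("127.0.0.1", 5353)  # hypothetical local proxy address
allowed_ips = set()  # filled by the proxy as it resolves permitted names


def on_connect(dst_ip, dst_port):
    """SOCK_ADDR model: redirect DNS to the proxy, leave the rest untouched."""
    if dst_port == 53:
        return DNS_PROXY
    return (dst_ip, dst_port)


def proxy_resolve(hostname, answer_ip, blocked_suffixes=("github.com",)):
    """Proxy model: record the answer only when the name is not blocked."""
    if hostname.endswith(blocked_suffixes):
        return None
    allowed_ips.add(answer_ip)
    return answer_ip


def on_egress(dst_ip):
    """CGROUP_SKB model: allow loopback and proxy-approved addresses only."""
    if ipaddress.ip_address(dst_ip).is_loopback or dst_ip in allowed_ips:
        return ALLOW
    return DROP
```

The packet filter never sees hostnames; it only consults the set of addresses that the proxy recorded for permitted names.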
Seen in¶
- sources/2025-11-18-datadog-ebpf-fim-filtering — File Integrity Monitoring hooks file syscalls, pushes events via ring buffer, filters ~94% of a ~10B/min event stream in-kernel using approver + discarder eBPF maps.
- sources/2026-01-07-datadog-hardening-ebpf-for-runtime-security — 5-year operational retrospective on the full pitfall surface: verifier evolution, hook coverage, data-read consistency, cache pitfalls, rule-writing, eBPF as attack surface, multi-tenancy with systems/cilium, performance cost, safe rollout.
- sources/2025-06-20-cloudflare-how-cloudflare-blocked-a-monumental-7-3-tbps-ddos-attack — XDP + eBPF as the per-server kernel drop plane for Magic Transit DDoS mitigation; user-space systems/dosd compiles fingerprints into XDP programs, gossips across POPs; 7.3 Tbps autonomously mitigated across 477 data centres.
- sources/2026-04-16-github-ebpf-deployment-safety: GitHub Engineering: cGroup-attached CGROUP_SKB + CGROUP_SOCK_ADDR programs implement per-deploy-script conditional network filtering to block circular dependencies on github.com without affecting customer traffic on the same host. DNS syscalls redirected to a userspace proxy (hostname-based blocklist); blocked queries attributed back to the calling process via a DNS transaction-ID → PID eBPF map. 6-month rollout.
- sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf: first wiki instance of eBPF at the Linux scheduler layer. Netflix attaches tp_btf/sched_wakeup + tp_btf/sched_switch tracepoints on Titus hosts, derives per-task run-queue latency in-kernel via a PID-keyed BPF_MAP_TYPE_HASH, tags samples with cgroup ID (read via bpf_rcu_read_lock/_unlock kfuncs, an RCU-protected deref of task->cgroups->dfl_cgrp->kn->id), rate-limits per-cgroup per-CPU via a BPF_MAP_TYPE_PERCPU_HASH checked before bpf_ringbuf_reserve (patterns/per-cgroup-rate-limiting-in-ebpf), and ships variable-length records via BPF_MAP_TYPE_RINGBUF to a Go agent that emits Atlas percentile timers and preempt-cause-tagged counters, the patterns/dual-metric-disambiguation shape that distinguishes cross-cgroup noisy neighbors from self CFS-quota throttling (see concepts/cpu-throttling-vs-noisy-neighbor). Explicit eBPF-vs-kernel-module framing: "While implementing this with a kernel module was feasible, we leveraged eBPF for its safety and flexibility." Baseline disclosed: runq.latency p99 ≈ 83.4 µs on an underloaded host. Canonical shape for the patterns/scheduler-tracepoint-based-monitoring pattern.
- sources/2026-05-29-netflix-from-silos-to-service-topology-why-netflix-built-a-real-time-service-map: eBPF as the network-layer substrate of Netflix Service Topology. "We capture network flow records at the kernel level using eBPF technology – information about which services are connecting to which other services over the network. This gives us ground truth about actual network-level communication." The eBPF layer contributes the completeness property to the multi-source topology fusion (every service shows up regardless of instrumentation), with the corresponding limitation that "network-level information lacks application context." This use case sits on top of the FlowExporter / FlowCollector flow-attribution layer canonicalised in the 2025-04-08 post.
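The run-queue-latency derivation in the Netflix entry above reduces to two event handlers and a rate-limit check. The sketch below is a user-space model only: plain dictionaries stand in for the PID-keyed hash map and the per-cgroup per-CPU map, the 1 ms rate limit is a hypothetical value, and the function names are illustrative.

```python
wakeup_ts = {}   # pid -> wakeup timestamp (ns); models the PID-keyed hash map
last_emit = {}   # (cgroup_id, cpu) -> timestamp of the last emitted sample
RATE_LIMIT_NS = 1_000_000  # hypothetical: one sample per ms per (cgroup, cpu)
emitted = []     # models records handed to the ring buffer


def on_sched_wakeup(pid, now_ns):
    """Record when the task became runnable."""
    wakeup_ts[pid] = now_ns


def on_sched_switch(next_pid, cgroup_id, cpu, now_ns):
    """Compute run-queue latency when the task actually gets the CPU."""
    start = wakeup_ts.pop(next_pid, None)
    if start is None:
        return  # no matching wakeup was recorded
    key = (cgroup_id, cpu)
    previous = last_emit.get(key)
    if previous is not None and now_ns - previous < RATE_LIMIT_NS:
        return  # rate-limited before any ring-buffer reservation
    last_emit[key] = now_ns
    emitted.append(
        {"pid": next_pid, "cgroup": cgroup_id, "runq_latency_ns": now_ns - start}
    )


on_sched_wakeup(1, 100)
on_sched_switch(1, cgroup_id=7, cpu=0, now_ns=600)   # emitted, latency 500 ns
on_sched_wakeup(2, 700)
on_sched_switch(2, cgroup_id=7, cpu=0, now_ns=900)   # suppressed by rate limit
```

Checking the rate limit before emitting keeps a busy cgroup from flooding the consumer, at the cost of sampling rather than recording every switch.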
Related¶
- concepts/ebpf-verifier — the safety + variability mechanism
- concepts/in-kernel-filtering — the volume-reduction move
- concepts/linux-cgroup — process-set isolation unit used as attach point for security-policy eBPF programs
- concepts/control-plane-data-plane-separation — rule engine (control) / eBPF maps+programs (data)
- systems/co-re — portability across kernel layouts
- systems/ebpf-manager — Datadog's OSS lifecycle library
- systems/datadog-workload-protection — multi-product consumer
- systems/datadog-workload-protection-fim — FIM subsystem
- systems/cilium — eBPF CNI; multi-tenancy case study
- patterns/approver-discarder-filter — kernel-side compile-time+runtime dual
- patterns/two-stage-evaluation — cheap kernel → rich user-space
- patterns/shared-kernel-resource-coordination — multi-vendor eBPF coexistence
- patterns/cgroup-scoped-egress-firewall: cGroup-attached CGROUP_SKB + CGROUP_SOCK_ADDR for per-process-set network policy
- patterns/dns-proxy-for-hostname-filtering: DNS syscall redirect + userspace proxy + TXID↔PID attribution
Seen in (continued)¶
- sources/2026-04-22-allthingsdistributed-invisible-engineering-behind-lambdas-network — first wiki instance of eBPF as the Lambda data-plane packet-rewrite substrate. Prior wiki framing of eBPF was observability-side (Datadog Workload Protection), deployment-safety side (GitHub), CNI/service-mesh side (Cilium), or DDoS-mitigation side (Cloudflare dosd/Magic Transit). This post is AWS's canonical disclosure of eBPF used for data-plane packet-header rewriting on micro-VM networking: (1) Geneve tunnel VNI rewrite on egress/ingress (patterns/ebpf-header-rewrite-on-egress, concepts/geneve-tunnel-vni) — tunnels pre-created with dummy VNIs, eBPF rewrites to real VNI once function init supplies it, latency drops 150 ms → 200 μs; (2) stateless NAT (concepts/stateless-nat-via-ebpf) — replaces iptables + conntrack dual-stage stateful NAT with eBPF programs that mangle headers from predetermined mappings, 100× setup-latency improvement at 4,000-VM-per-worker density. Names the build-vs-rewrite-kernel decision calculus explicitly: a custom Linux kernel driver was considered and rejected to avoid "maintaining Lambda-specific patches upstream indefinitely" (patterns/upstream-the-fix); eBPF chosen over DPDK on lower-overhead + in-kernel-integration axes. Cites Cilium as the at-scale existence proof that de-risked production adoption — Lambda was "among the first in Lambda to use it in production" with "real questions about whether it would hold up at scale and pass the security reviews." Tenth-ish eBPF-at-AWS instance on the wiki but the first one explicitly at the hypervisor-adjacent data-plane layer. Also pairs with the parallel-batched-attachment optimization disclosed under concepts/rtnl-lock-contention — attaching one eBPF program to N veth devices in a single operation instead of N lock-reacquire calls was a load-bearing part of the boot-time 4,000-slot pre-creation fix.
- sources/2025-04-08-netflix-how-netflix-accurately-attributes-ebpf-flow-logs: canonical eBPF-as-flow-logging-substrate + attribution-primitive instance on the wiki. FlowExporter sidecars attach to TCP tracepoints on every Netflix AWS host and emit ~5M records/sec on socket close. Local workload identity is resolved in-kernel via an eBPF map populated by IPManAgent on Titus hosts (or a Metatron cert on EC2 hosts); a second (local_ipv4, local_port) → workload eBPF map disambiguates NAT64-free IPv6-to-IPv4 shared-IP sockets. Together with the 2024-09-11 noisy-neighbor run-queue-latency monitor, this cements Netflix's pattern of using eBPF + userspace-daemon + eBPF-map for per-socket and per-task workload attribution on Titus, with per-cgroup run-queue-latency and per-workload-TCP-flow-logs as the two canonical outputs. Complements Lambda's data-plane-mangling variant (2026-04-22) and the observability variants (Datadog, Cloudflare): Netflix's use case is TCP-lifecycle observation + local-identity-attribution, not packet mangling or rule-evaluation, and the backend pipeline is a heartbeat-based ownership map rather than an event stream. 40% → 0 Zuul misattribution over 2-week validation window is the load-bearing before/after evidence.
- 2026-05-07: Cloudflare Copy Fail Linux vulnerability response. Canonical wiki first instance of BPF-LSM (Linux Security Module hook) as a runtime CVE-mitigation substrate. Cloudflare's bpf-lsm framework attached an eBPF program to the socket_bind LSM hook that denies AF_ALG binds for every caller except an explicit allow-list of legitimate binaries: surgical mitigation of CVE-2026-31431 (Copy Fail) without unloading the vulnerable algif_aead module (the researchers' recommended modprobe blacklist workaround had failed in staging due to dependency conflicts; see patterns/staging-caught-mitigation-failure). Rollout followed patterns/visibility-before-enforcement-rollout: Phase 1 pushed ebpf_exporter config via salt to hook the socket() syscall and emit per-binary AF_ALG usage metrics across hundreds of thousands of servers; Phase 2 pushed the bpf-lsm enforcement program behind a separate gate once the allow-list was empirically validated. First-class distinct from the Datadog (observability), GitHub (deployment safety), Cloudflare Magic Transit (DDoS), AWS Lambda (data-plane mangling), and Netflix (flow logs + run-queue latency) eBPF instances already on this page: this is the BPF-LSM hook-denial shape at allowlist granularity. Canonical pairing with the LTS-kernel-backport-latency-gap concept: bpf-lsm is the runtime lever that covers the window between mainline fix and LTS backport. (Source: sources/2026-05-07-cloudflare-copy-fail-linux-vulnerability-response)
- sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology: first canonical eBPF-as-fleet-profiling-substrate instance on the wiki. Meta's Strobelight orchestrates 42+ profilers (many eBPF-backed) on every Meta production host. Canonical quote: "Strobelight's profilers are often, but not exclusively, built using eBPF... it's hard to imagine how Strobelight would work without it." Distinct use-case axis from the prior wiki eBPF instances: this is eBPF for production profiling (CPU, memory via jemalloc, off-CPU, request-latency, language-event for Python/Java/Erlang, AI/GPU), completing the triad alongside eBPF-for-security (Datadog Workload Protection), eBPF-for-networking / data-plane (Cloudflare DDoS, Lambda Geneve/NAT, Fly.io Sprites). The specific eBPF features Meta relies on: custom actions at sample time (Strobemeta reads thread-local storage into each sample for request-context tagging), cheap ring-buffer-to-userspace plumbing (raw stacks written to disk off-host, then symbolized via the centralised symbolization service), and user-authored profilers via bpftrace ad-hoc scripts (patterns/ad-hoc-bpftrace-profiler) which drop new-profiler-lead-time from weeks to hours. Economic anchor: the continuous LBR profiler feeds the FDO pipeline → 10-20% fewer servers on Meta's top 200 services, the canonical wiki datum for "eBPF profiling pays for itself at hyperscale".
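The two-phase BPF-LSM rollout in the Copy Fail entry above (observe first, then deny by allow-list) can be modelled as two small functions. This is a user-space sketch, not Cloudflare's program: the allow-list entry and function names are hypothetical. `AF_ALG` is address family 38 on Linux, and a denial is modelled by returning `-EPERM`, the usual way an LSM hook refuses an operation.

```python
import errno
from collections import Counter

AF_ALG = 38  # Linux address family for the kernel crypto API
ALLOWED_BINARIES = {"/usr/sbin/example-crypto-daemon"}  # hypothetical
usage = Counter()  # per-binary AF_ALG usage, as gathered in phase 1


def observe_socket(family, exe_path):
    """Phase 1 model: count AF_ALG socket creation per binary, deny nothing."""
    if family == AF_ALG:
        usage[exe_path] += 1


def bind_verdict(family, exe_path):
    """Phase 2 model: deny AF_ALG binds unless the binary is allow-listed."""
    if family == AF_ALG and exe_path not in ALLOWED_BINARIES:
        return -errno.EPERM
    return 0


observe_socket(AF_ALG, "/usr/sbin/example-crypto-daemon")
observe_socket(AF_ALG, "/tmp/exploit")
```

Keeping observation and enforcement separate lets the allow-list be validated against real usage data before any bind is refused.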