eBPF¶
eBPF (extended Berkeley Packet Filter) is a Linux kernel
subsystem that lets user-space load small, verifier-gated programs
that run in kernel context in response to system events (syscalls,
network packets, kprobes, uprobes, tracepoints, scheduler hooks, LSM
hooks, XDP, etc.). It replaces kernel-module development for most
observability, networking, and security use cases: no custom-compiled
kernel, no stability risk from a runaway module, safer than ptrace
or the Linux Audit framework for equivalent visibility.
Execution model¶
- Programs are attached to kernel hooks (kprobes, uprobes, tracepoints, XDP, TC classifiers, socket filters, cgroup attachments, LSM hooks, raw tracepoints).
- The verifier statically proves termination, memory-safety, and bounded complexity before loading — this is the load-bearing safety mechanism, and the primary source of cross-kernel variability.
- Programs can read/write eBPF maps — typed key/value data structures (hash, LRU hash, array, per-CPU array, ring buffer, task/inode storage, …) shared with user-space.
- Output to user-space is typically via ring buffer maps (or the older perf buffer); the consumer mmap's the ring and reads events.
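The ring-buffer handoff above can be sketched as a toy model. This is not the kernel API (the real kernel-side entry point is bpf_ringbuf_reserve, and capacity is in bytes, not events); it only illustrates the contract: the producer drops when the consumer falls behind, and the consumer drains in order.

```python
from collections import deque

class RingBuffer:
    """Toy model of an eBPF ring buffer map: fixed capacity,
    producer drops events when the consumer falls behind."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = deque()
        self.dropped = 0

    def emit(self, event):          # kernel-side producer
        if len(self.buf) >= self.capacity:
            self.dropped += 1       # reservation would fail here
            return False
        self.buf.append(event)
        return True

    def poll(self, max_events=64):  # user-space consumer
        out = []
        while self.buf and len(out) < max_events:
            out.append(self.buf.popleft())
        return out

rb = RingBuffer(capacity=4)
for i in range(6):
    rb.emit({"seq": i})
events = rb.poll()
print(len(events), rb.dropped)  # 4 delivered, 2 dropped
```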
Why production platforms adopt it¶
- Process + container context at syscall granularity — hooks can read task_struct, mount namespace, cgroup, etc., giving event-level attribution that inotify/fanotify lack.
- Performance. In-kernel execution avoids context switches to user space for the common case (filter / drop / count); generally lower overhead than ptrace or audit.
- Safety. Verifier + JIT sandbox replaces the correctness + ABI-stability risks of custom kernel modules.
- Unified visibility. Process + FS + network + LSM through one mechanism — no need to combine inotify + Netlink + tracepoints separately.
- Namespace / cgroup / container consistency — works uniformly across containerised workloads.
- BPF LSM enables mandatory-access-control enforcement, not just observation.
- Programmable data plane. User-space is the control plane that compiles rules into maps; eBPF programs + maps are the data plane (concepts/control-plane-data-plane-separation).
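A minimal sketch of that control-plane / data-plane split, with invented rule and packet shapes: user space compiles rules into a flat lookup table standing in for an eBPF map, and the per-packet hot path does only an O(1) lookup.

```python
# Control plane (user space): compile human-readable rules
# into a flat lookup table standing in for an eBPF map.
def compile_rules(rules):
    deny = set()
    for r in rules:
        deny.add((r["ip"], r["port"]))
    return deny

# Data plane (kernel-side sketch): the hot path only does a lookup.
def filter_packet(deny_map, pkt):
    key = (pkt["dst_ip"], pkt["dst_port"])
    return "DROP" if key in deny_map else "PASS"

deny = compile_rules([{"ip": "10.0.0.5", "port": 443}])
print(filter_packet(deny, {"dst_ip": "10.0.0.5", "dst_port": 443}))  # DROP
print(filter_packet(deny, {"dst_ip": "10.0.0.6", "dst_port": 443}))  # PASS
```

Rule changes never touch the hot path: the control plane rebuilds the table and swaps it in, which is the map-update idiom the wiki link describes.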
Portability: CO-RE¶
eBPF's biggest operational improvement has been
Compile Once – Run Everywhere: BTF-metadata-driven field-offset
patching at load time, letting one compiled program run across
kernels with different task_struct / sk_buff / etc. layouts.
When CO-RE isn't available (older kernels without BTF, some
distros), a fallback chain of runtime offset-guessing and
hard-coded offsets can carry support back to ~kernel 4.14.
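The fallback chain can be sketched as a single resolver. Everything here is illustrative: the field name is real but the offset numbers are made up, and real implementations do the BTF relocation inside the loader, not in a lookup table.

```python
def resolve_offset(field, btf_info, guessed, hardcoded):
    """Fallback chain for locating a struct field across kernels:
    1. CO-RE: BTF metadata gives the exact offset at load time.
    2. Runtime guessing: probe live structs for a known value.
    3. Hard-coded table keyed by kernel version (last resort)."""
    if btf_info and field in btf_info:
        return btf_info[field], "co-re"
    if field in guessed:
        return guessed[field], "runtime-guess"
    if field in hardcoded:
        return hardcoded[field], "hardcoded"
    raise LookupError(f"cannot locate {field}")

# Kernel without BTF: CO-RE unavailable, fall through to guessing.
off, how = resolve_offset("task_struct.pid",
                          btf_info=None,
                          guessed={"task_struct.pid": 2216},    # invented
                          hardcoded={"task_struct.pid": 2208})  # invented
print(off, how)  # 2216 runtime-guess
```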
XDP as DDoS data plane¶
A different production shape from the Datadog / GitHub stories above: Cloudflare uses XDP (eXpress Data Path) + eBPF as the per-server kernel drop plane for volumetric-DDoS mitigation for Magic Transit customers. The loop:
- XDP/eBPF program on the NIC's rx path samples packets into an eBPF map; samples are read by a user-space daemon.
- User-space dosd (denial-of-service daemon) analyses samples for packet-header commonalities / anomalies, enumerates fingerprint permutations, uses a data-streaming algorithm to pick the most-selective match.
- The winning fingerprint is compiled to an eBPF program and attached at XDP to drop matching packets before they enter the kernel stack — line-rate drop on modern NICs, at a per-packet cost close to the kernel's free-list write.
- Top fingerprints are gossiped/multicast across servers within a data centre and globally.
Scale: the 2025-06-20 writeup describes the 7.3 Tbps / 4.8 Bpps attack being autonomously dropped across 477 data centres / 293 locations via this pipeline. First wiki instance of XDP as a DDoS data plane (the Datadog / GitHub instances sit at syscall / cGroup hooks, not XDP).
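The fingerprint-selection step can be roughly simulated as follows. The header fields, the 90% coverage threshold, and the brute-force subset enumeration are all invented stand-ins for dosd's data-streaming algorithm; the point is the shape: prefer the fingerprint with the most fields (most selective) that still covers the bulk of sampled flood traffic.

```python
from collections import Counter
from itertools import combinations

def best_fingerprint(samples, fields, coverage_threshold=0.9):
    """Enumerate field-subset fingerprints over sampled headers and
    return the most selective one covering most of the flood."""
    for r in range(len(fields), 0, -1):   # more fields first = more selective
        for subset in combinations(fields, r):
            counts = Counter(tuple(s[f] for f in subset) for s in samples)
            value, hits = counts.most_common(1)[0]
            coverage = hits / len(samples)
            if coverage >= coverage_threshold:
                return dict(zip(subset, value)), coverage
    return None, 0.0

# 95 flood packets sharing a signature + 5 legitimate packets.
samples = [{"proto": "udp", "dport": 53, "len": 512}] * 95 \
        + [{"proto": "tcp", "dport": 443, "len": 60}] * 5
fp, cov = best_fingerprint(samples, ["proto", "dport", "len"])
print(fp, cov)  # {'proto': 'udp', 'dport': 53, 'len': 512} 0.95
```

In the real pipeline the winning fingerprint is then compiled into an XDP program that drops matching packets before the kernel stack sees them.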
Operating eBPF at scale is nontrivial¶
"Despite numerous claims that eBPF is safe, secure, and comes with negligible performance impact, the reality — especially at scale — is nuanced." — Datadog Workload Protection team, 5 years in (Source: sources/2026-01-07-datadog-hardening-ebpf-for-runtime-security)
Six operational pitfall classes the Datadog post documents:
- Kernel-version portability — verifier evolution, helper availability, hook-point naming, inlining inconsistencies, dead-code elimination, map-operation restrictions. Mitigated by CI matrices across kernels/distros, CO-RE + fallbacks, shared lifecycle library (systems/ebpf-manager), minimum-viable hook-set gating at startup.
- Syscall-hook coverage. Compat syscalls (32-bit on 64-bit), raw_tracepoints vs per-syscall tracepoints, syscall-number interpretation via thread_info, io_uring bypass of traditional syscall paths, new syscalls (e.g. openat2), exotic execution paths (binfmt_misc, call_usermodehelper, cgroup release agents, shebang interpreters).
- Hooks failing to trigger. kretprobe maxactive caps, HW-interrupt preemption of kprobes, kernel-module lifecycle (hooks lost on module unload). Mitigation: prefer exported symbols, prefer entry over return probes, watch module lifecycle events.
- Data-read correctness. Kernel-struct layout drift (CO-RE), user-space memory reads fail silently (page faults disabled in eBPF), TOCTOU between kernel copy and eBPF read, path resolution races, non-linear skb requiring bpf_skb_pull_data.
- eBPF map & cache pitfalls. Hashmap sizing, LRU semantics ("doesn't strictly adhere to traditional LRU"), BPF_F_NO_PREALLOC unbounded footprint + Kubernetes OOM, blocking syscalls pinning map entries, lost / out-of-order events from perf buffers (per-CPU buffers), cache-key collisions (PID reuse, inode sharing via hard links, mount-namespace resolution).
- Detection-rule correctness. Symlinks vs hard links, interpreter visibility (execve(*.py) running python), syscall args ≠ shell commands (env vars, $PATH resolution).
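The PID-reuse collision from the cache-pitfall bullet can be shown with a toy cache: keying on PID alone serves stale data after the PID is recycled, while keying on (pid, start_time) keeps incarnations distinct. The start_time values are invented.

```python
# Toy process-metadata cache keyed by (pid, start_time) rather than
# PID alone, so a recycled PID never aliases the old process.
class ProcCache:
    def __init__(self):
        self.cache = {}
    def put(self, pid, start_time, info):
        self.cache[(pid, start_time)] = info
    def get(self, pid, start_time):
        return self.cache.get((pid, start_time))

c = ProcCache()
c.put(1234, start_time=100, info="/usr/bin/curl")
# PID 1234 exits and the kernel reuses it for a different binary:
c.put(1234, start_time=250, info="/usr/bin/python3")
print(c.get(1234, 100))   # /usr/bin/curl   (old incarnation, still distinct)
print(c.get(1234, 250))   # /usr/bin/python3
```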
Constraints¶
- Verifier limits are sharper on older kernels — complex rule evaluation often has to stay in user-space on a large fraction of real fleets. This is why patterns/two-stage-evaluation (cheap kernel → rich user-space) is a recurring shape.
- Ring-buffer throughput can still be outpaced by event production, driving the need for concepts/in-kernel-filtering before emit.
- Map memory is a bounded resource — LRU eviction trades coverage for RAM predictability; non-preallocated maps trade memory predictability for flexibility (bad trade in Kubernetes with enforced cgroup limits).
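The two-stage shape can be sketched with hypothetical events and one made-up rule: a cheap prefix check stands in for the verifier-friendly kernel stage, and full rule evaluation runs in user space only on the survivors.

```python
# Stage 1 (kernel-side sketch, cheap): coarse predicate simple enough
# for old verifiers -- here, a watched-path prefix check.
def kernel_stage(event, watched_prefixes):
    return any(event["path"].startswith(p) for p in watched_prefixes)

# Stage 2 (user space, rich): full rule evaluation on survivors only.
def userspace_stage(event, rules):
    return [r["name"] for r in rules if r["match"](event)]

events = [
    {"path": "/etc/passwd", "uid": 0},
    {"path": "/tmp/x", "uid": 1000},      # discarded in-kernel
    {"path": "/etc/shadow", "uid": 1000},
]
rules = [{"name": "nonroot-reads-shadow",
          "match": lambda e: e["path"] == "/etc/shadow" and e["uid"] != 0}]

survivors = [e for e in events if kernel_stage(e, ["/etc/"])]
hits = [userspace_stage(e, rules) for e in survivors]
print(len(survivors), hits)  # 2 [[], ['nonroot-reads-shadow']]
```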
Abuse / attack surface¶
eBPF's kernel access + map persistence + hook breadth make it a rootkit-capable mechanism when left unrestricted. Named incidents / PoCs / CVEs:
- ebpfkit (Datadog hackathon → BlackHat 2021 / DEF CON 29) — full eBPF rootkit: process hiding, network scanning, data exfiltration, C2, persistence.
- CVE-2023-2163, CVE-2024-41003 — real eBPF verifier exploits. The verifier is the last line of defence against unprivileged-eBPF kernel exploitation.
- Mitigations since: bpf_probe_write_user blocked in Kernel Lockdown integrity mode (default on most distros now).
- Hardening direction: Microsoft's Hornet LSM proposal for signed eBPF programs, analogous to signed kernel modules.
Operational response (Datadog Workload Protection):
- Dedicated bpf event type in the agent capturing program loads, map ops, attachments fleet-wide.
- Per-program helper + map inventory → detection rules flag suspicious shapes (e.g. a network program sharing maps with a file-system program, or use of bpf_override_return).
- Defensive research (BlackHat 2022 "Return to Sender") on protecting eBPF-based detections from malicious disablement.
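The "suspicious shapes" check can be sketched over a hypothetical program inventory (program names, domains, and field layout are invented; only the two flag conditions come from the text above).

```python
def flag_suspicious(programs):
    """Flag inventory entries that look rootkit-shaped: use of
    bpf_override_return, or map sharing across program domains."""
    findings = []
    map_owners = {}  # map name -> [(program, domain)]
    for p in programs:
        if "bpf_override_return" in p["helpers"]:
            findings.append((p["name"], "uses bpf_override_return"))
        for m in p["maps"]:
            for other, dom in map_owners.get(m, []):
                if dom != p["domain"]:
                    findings.append((p["name"], f"shares map {m} with {other}"))
            map_owners.setdefault(m, []).append((p["name"], p["domain"]))
    return findings

progs = [
    {"name": "net_filter", "domain": "network", "maps": ["cfg"], "helpers": []},
    {"name": "fs_hook", "domain": "filesystem", "maps": ["cfg"],
     "helpers": ["bpf_override_return"]},
]
for f in flag_suspicious(progs):
    print(f)
```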
Multi-tenancy with other eBPF tools¶
Shared kernel resources (TC priorities + handles, cgroup program
ordering, XDP slots, LSM hook chains) are effectively an
inter-vendor protocol. The 2022 Datadog × systems/cilium
outage — two independently-correct products colliding on TC handle
0:1, one of them cleaning up the other's filters — is the named
case study. The generalised lesson is
patterns/shared-kernel-resource-coordination: safer default
priorities, conservative cleanup that never auto-deletes shared
resources, and explicit vendor coordination.
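The conservative-cleanup rule can be sketched as tag-based ownership: delete only filters carrying your own tag, never a well-known handle you merely expect to be yours. The tag names and filter shape are invented.

```python
OUR_TAG = "vendor-a"  # hypothetical ownership marker

def cleanup(filters):
    """Delete only filters we provably own; keep everything else,
    even on handles we would normally claim (like 0:1)."""
    kept, deleted = [], []
    for f in filters:
        (deleted if f.get("tag") == OUR_TAG else kept).append(f)
    return kept, deleted

filters = [
    {"handle": "0:1", "tag": "vendor-b"},  # another product's filter on 0:1
    {"handle": "0:2", "tag": OUR_TAG},
]
kept, deleted = cleanup(filters)
print([f["handle"] for f in kept], [f["handle"] for f in deleted])
```

An ownership-blind cleanup that removed everything on handle 0:1 would reproduce exactly the Datadog × Cilium collision described above.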
Performance cost¶
Overhead depends heavily on:
- Hook type — uprobes are far costlier than kprobes (2 extra context switches); raw tracepoints are far cheaper than kprobes (see Cloudflare's ebpf_exporter benchmark).
- Map type — BPF_MAP_TYPE_LRU_HASH needs cross-CPU sync (slow); BPF_MAP_TYPE_PERCPU_ARRAY is CPU-local (fast).
- Program complexity.
- Workload shape. raw_syscalls tracepoints notably affect connection-accept rates on edge nodes at Datadog scale.
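The per-CPU map advantage can be shown with a toy counter: writes stay CPU-local on the hot path (no cross-CPU synchronisation, which is what makes PERCPU_ARRAY fast), and user space pays the aggregation cost once at read time.

```python
NCPU = 4  # invented CPU count for the sketch

# PERCPU_ARRAY-style counter: one slot per CPU.
percpu = [0] * NCPU

def incr(cpu):
    percpu[cpu] += 1     # CPU-local write, no lock or cache-line bouncing

def read_total():
    return sum(percpu)   # aggregation paid once, off the hot path

for cpu, n in enumerate([10, 20, 30, 40]):
    for _ in range(n):
        incr(cpu)
print(read_total())  # 100
```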
cGroup-attached programs for process-set-scoped policy¶
A separate family of eBPF program types attaches at the Linux cGroup boundary rather than at host-wide hooks — making per-process-set policy enforceable without container isolation. Load-bearing types:
- BPF_PROG_TYPE_CGROUP_SKB — egress (and ingress) packet filter scoped to a cGroup. Return 0 to drop, 1 to allow. Operates on IPs/ports, not hostnames.
- BPF_PROG_TYPE_CGROUP_SOCK_ADDR — hooks socket connect4/connect6/bind/sendmsg syscalls; can rewrite the destination IP + port before the kernel proceeds. Composes with CGROUP_SKB to build name-aware policy (rewrite DNS traffic to a userspace proxy + enforce IP-level drops based on what the proxy resolved).
- BPF_PROG_TYPE_CGROUP_SOCK, cGroup-scoped LSM hooks — similar granularity for socket-creation and mandatory-access-control checks.
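The CGROUP_SOCK_ADDR + CGROUP_SKB composition can be sketched as two pure functions (proxy address and IPs are invented; real hooks operate on kernel socket-address structs, not tuples):

```python
DNS_PROXY = ("127.0.0.1", 5353)   # hypothetical local proxy address

# CGROUP_SOCK_ADDR-style hook: rewrite DNS connect() destinations
# to the userspace proxy before the kernel proceeds.
def sock_addr_hook(dst):
    ip, port = dst
    return DNS_PROXY if port == 53 else dst

# CGROUP_SKB-style hook: IP-level verdict (0 = drop, 1 = allow),
# keyed off what the proxy resolved into the allow map.
def skb_hook(allow_map, dst_ip):
    return 1 if dst_ip in allow_map else 0

allow = {"140.82.112.3"}          # resolved + approved by the proxy
print(sock_addr_hook(("8.8.8.8", 53)))   # redirected to ('127.0.0.1', 5353)
print(skb_hook(allow, "140.82.112.3"))   # 1 (allow)
print(skb_hook(allow, "203.0.113.9"))    # 0 (drop)
```

The proxy is the bridge between the two hooks: it sees hostnames, applies the name-based policy, and populates the IP-level allow map the SKB program enforces.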
Attached via bpf(BPF_PROG_ATTACH) or cilium/ebpf's
link.AttachCgroup with e.g. AttachCGroupInet4Connect /
AttachCGroupInetEgress. Load-bearing for workloads that need
policy tighter than the host but broader than the individual
process — GitHub's deployment-safety firewall
(Source: sources/2026-04-16-github-ebpf-deployment-safety)
is the canonical wiki instance of this shape (see
patterns/cgroup-scoped-egress-firewall +
patterns/dns-proxy-for-hostname-filtering). Different axis
from Datadog's syscall-hook / TC-classifier attachments.
Seen in¶
- sources/2025-11-18-datadog-ebpf-fim-filtering — File Integrity Monitoring hooks file syscalls, pushes events via ring buffer, filters ~94% of a ~10B/min event stream in-kernel using approver + discarder eBPF maps.
- sources/2026-01-07-datadog-hardening-ebpf-for-runtime-security — 5-year operational retrospective on the full pitfall surface: verifier evolution, hook coverage, data-read consistency, cache pitfalls, rule-writing, eBPF as attack surface, multi-tenancy with systems/cilium, performance cost, safe rollout.
- sources/2025-06-20-cloudflare-how-cloudflare-blocked-a-monumental-7-3-tbps-ddos-attack — XDP + eBPF as the per-server kernel drop plane for Magic Transit DDoS mitigation; user-space systems/dosd compiles fingerprints into XDP programs, gossips across POPs; 7.3 Tbps autonomously mitigated across 477 data centres.
- sources/2026-04-16-github-ebpf-deployment-safety — GitHub Engineering: cGroup-attached CGROUP_SKB + CGROUP_SOCK_ADDR programs implement per-deploy-script conditional network filtering to block circular dependencies on github.com without affecting customer traffic on the same host. DNS syscalls redirected to a userspace proxy (hostname-based blocklist); blocked queries attributed back to the calling process via a DNS transaction-ID → PID eBPF map. 6-month rollout.
Related¶
- concepts/ebpf-verifier — the safety + variability mechanism
- concepts/in-kernel-filtering — the volume-reduction move
- concepts/linux-cgroup — process-set isolation unit used as attach point for security-policy eBPF programs
- concepts/control-plane-data-plane-separation — rule engine (control) / eBPF maps+programs (data)
- systems/co-re — portability across kernel layouts
- systems/ebpf-manager — Datadog's OSS lifecycle library
- systems/datadog-workload-protection — multi-product consumer
- systems/datadog-workload-protection-fim — FIM subsystem
- systems/cilium — eBPF CNI; multi-tenancy case study
- patterns/approver-discarder-filter — kernel-side compile-time+runtime dual
- patterns/two-stage-evaluation — cheap kernel → rich user-space
- patterns/shared-kernel-resource-coordination — multi-vendor eBPF coexistence
- patterns/cgroup-scoped-egress-firewall — cGroup-attached CGROUP_SKB + CGROUP_SOCK_ADDR for per-process-set network policy
- patterns/dns-proxy-for-hostname-filtering — DNS syscall redirect + userspace proxy + TXID↔PID attribution