
eBPF

eBPF (extended Berkeley Packet Filter) is a Linux kernel subsystem that lets user-space load small, verifier-gated programs that run in kernel context in response to system events (syscalls, network packets, kprobes, uprobes, tracepoints, scheduler hooks, LSM hooks, XDP, etc.). It replaces kernel-module development for most observability, networking, and security use cases: no custom-compiled kernel, no stability risk from a runaway module, and it is safer than ptrace or the Linux Audit framework for equivalent visibility.

Execution model

  • Programs are attached to kernel hooks (kprobes, uprobes, tracepoints, XDP, TC classifiers, socket filters, cgroup attachments, LSM hooks, raw tracepoints).
  • The verifier statically proves termination, memory-safety, and bounded complexity before loading — this is the load-bearing safety mechanism, and the primary source of cross-kernel variability.
  • Programs can read/write eBPF maps — typed key/value data structures (hash, LRU hash, array, per-CPU array, ring buffer, task/inode storage, …) shared with user-space.
  • Output to user-space is typically via ring buffer maps (or the older perf buffer); the consumer mmaps the ring and reads events.
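The reserve/commit handoff behind the last bullet can be sketched in miniature. This is a user-space simulation with invented names (`rb_reserve`, `rb_commit`, `rb_consume`) mirroring the shape of `bpf_ringbuf_reserve`/`bpf_ringbuf_submit` — not the kernel ABI; records are fixed-size to keep the wraparound trivial:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RB_SIZE 4096            /* power-of-two backing buffer */

struct ringbuf {
    uint8_t  data[RB_SIZE];
    uint32_t head;              /* producer (kernel side) writes here    */
    uint32_t tail;              /* consumer (user space) reads from here */
};

/* Mirror of bpf_ringbuf_reserve(): fail fast when the consumer lags,
 * so the producer never blocks in kernel context -- the event is lost. */
static void *rb_reserve(struct ringbuf *rb, uint32_t len)
{
    if (RB_SIZE - (rb->head - rb->tail) < len)
        return NULL;            /* buffer full -> drop */
    return &rb->data[rb->head % RB_SIZE];
}

static void rb_commit(struct ringbuf *rb, uint32_t len)
{
    rb->head += len;            /* record becomes visible to the reader */
}

/* Consumer side: the real consumer mmaps the ring and polls it. */
static uint32_t rb_consume(struct ringbuf *rb, void *out, uint32_t len)
{
    if (rb->head == rb->tail)
        return 0;               /* nothing pending */
    memcpy(out, &rb->data[rb->tail % RB_SIZE], len);
    rb->tail += len;
    return len;
}
```

The `NULL` return from `rb_reserve` is the point: when user-space falls behind, events are dropped at the producer, which is exactly the lost-event failure mode the Constraints section returns to.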

Why production platforms adopt it

  • Process + container context at syscall granularity — hooks can read task_struct, mount namespace, cgroup, etc., giving event-level attribution that inotify / fanotify lack.
  • Performance. In-kernel execution avoids context switches to user space for the common case (filter / drop / count); generally lower overhead than ptrace or audit.
  • Safety. Verifier + JIT sandbox replaces the correctness + ABI-stability risks of custom kernel modules.
  • Unified visibility. Process + FS + network + LSM through one mechanism — no need to combine inotify + Netlink + tracepoints separately.
  • Namespace / cgroup / container consistency — works uniformly across containerised workloads.
  • BPF LSM enables mandatory-access-control enforcement, not just observation.
  • Programmable data plane. User-space is the control plane that compiles rules into maps; eBPF programs + maps are the data plane (concepts/control-plane-data-plane-separation).

Portability: CO-RE

eBPF's biggest operational improvement has been Compile Once – Run Everywhere: BTF-metadata-driven field-offset patching at load time, letting one compiled program run across kernels with different task_struct / sk_buff / etc. layouts. When CO-RE isn't available (older kernels without BTF, some distros), a fallback chain of runtime offset-guessing and hard-coded offsets can carry support back to ~kernel 4.14.
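The CO-RE idea in miniature: instead of baking a hard-coded offset for, say, `task_struct->pid` into the compiled program, the loader resolves the offset from the running kernel's BTF at load time. A toy user-space simulation — table contents and the 4-digit offsets are invented, and real BTF is full type metadata, not a flat name→offset table:

```c
#include <stddef.h>
#include <string.h>

/* Toy "BTF": per-kernel table of struct-field offsets. */
struct field_offset { const char *field; size_t off; };

static const struct field_offset kernel_a[] = {   /* one kernel's layout     */
    { "task_struct.pid", 1256 }, { NULL, 0 },
};
static const struct field_offset kernel_b[] = {   /* another kernel's layout */
    { "task_struct.pid", 1304 }, { NULL, 0 },
};

/* "Relocation": resolve the offset against the *running* kernel's table,
 * the way libbpf patches CO-RE relocations at program-load time. */
static long resolve(const struct field_offset *btf, const char *field)
{
    for (; btf->field; btf++)
        if (strcmp(btf->field, field) == 0)
            return (long)btf->off;
    return -1;                  /* field absent on this kernel */
}
```

The `-1` path matters operationally: a field that does not exist on the running kernel has to be detected at load time, which is where the fallback chain of runtime guessing and hard-coded offsets takes over.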

XDP as DDoS data plane

A different production shape from the Datadog / GitHub stories above: Cloudflare uses XDP (eXpress Data Path) + eBPF as the per-server kernel drop plane for volumetric-DDoS mitigation on behalf of Magic Transit customers. The loop:

  1. XDP/eBPF program on the NIC's rx path samples packets into an eBPF map; samples are read by a user-space daemon.
  2. User-space dosd (denial-of-service daemon) analyses samples for packet-header commonalities / anomalies, enumerates fingerprint permutations, uses a data-streaming algorithm to pick the most-selective match.
  3. The winning fingerprint is compiled to an eBPF program and attached at XDP to drop matching packets before they enter the kernel stack — line-rate drop on modern NICs, at a per-packet cost close to the kernel's free-list write.
  4. Top fingerprints are gossiped/multicast across servers within a data centre and globally.
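Step 2's "most-selective match" can be sketched as a scoring pass: among candidate fingerprints that cover the attack samples, prefer the one that matches the least baseline traffic. Everything here is invented for illustration — real dosd fingerprints span many header fields and the selection uses a data-streaming algorithm, not an exhaustive scan:

```c
/* Toy fingerprint: match on (ttl, pkt_len); -1 = wildcard. */
struct pkt { int ttl, len; };
struct fp  { int ttl, len; };

static int fp_match(const struct fp *f, const struct pkt *p)
{
    return (f->ttl < 0 || f->ttl == p->ttl) &&
           (f->len < 0 || f->len == p->len);
}

/* Pick the candidate that covers all attack samples while matching the
 * least legitimate traffic -- "most selective" in miniature. */
static int pick_fp(const struct fp *cands, int ncand,
                   const struct pkt *attack, int na,
                   const struct pkt *base, int nb)
{
    int best = -1, best_collateral = nb + 1;
    for (int i = 0; i < ncand; i++) {
        int cover = 0, collateral = 0;
        for (int j = 0; j < na; j++) cover += fp_match(&cands[i], &attack[j]);
        for (int j = 0; j < nb; j++) collateral += fp_match(&cands[i], &base[j]);
        if (cover == na && collateral < best_collateral) {
            best = i;
            best_collateral = collateral;
        }
    }
    return best;               /* index of winning fingerprint, or -1 */
}
```

The winner is what gets compiled into the XDP drop program in step 3; the collateral term is why the all-wildcard fingerprint — which also covers every attack packet — never wins.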

Scale: the 2025-06-20 writeup describes the 7.3 Tbps / 4.8 Bpps attack being autonomously dropped across 477 data centres / 293 locations via this pipeline. First wiki instance of XDP as a DDoS data plane (the Datadog / GitHub instances sit at syscall / cgroup hooks, not XDP).

Operating eBPF at scale is nontrivial

"Despite numerous claims that eBPF is safe, secure, and comes with negligible performance impact, the reality — especially at scale — is nuanced." — Datadog Workload Protection team, 5 years in (Source: sources/2026-01-07-datadog-hardening-ebpf-for-runtime-security)

Six operational pitfall classes the Datadog post documents:

  1. Kernel-version portability — verifier evolution, helper availability, hook-point naming, inlining inconsistencies, dead-code elimination, map-operation restrictions. Mitigated by CI matrices across kernels/distros, CO-RE + fallbacks, shared lifecycle library (systems/ebpf-manager), minimum-viable hook-set gating at startup.
  2. Syscall-hook coverage. Compat syscalls (32-bit on 64-bit), raw_tracepoints vs per-syscall tracepoints, syscall-number interpretation via thread_info, io_uring bypass of traditional syscall paths, new syscalls (e.g. openat2), exotic execution paths (binfmt_misc, call_usermodehelper, cgroup release agents, shebang interpreters).
  3. Hooks failing to trigger. kretprobe maxactive caps, HW-interrupt preemption of kprobes, kernel-module lifecycle (hooks lost on module unload). Mitigation: prefer exported symbols, prefer entry over return probes, watch module lifecycle events.
  4. Data-read correctness. Kernel-struct layout drift (CO-RE), user-space memory reads fail silently (page faults disabled in eBPF), TOCTOU between kernel copy and eBPF read, path resolution races, non-linear skb requiring bpf_skb_pull_data.
  5. eBPF map & cache pitfalls. Hashmap sizing, LRU semantics ("doesn't strictly adhere to traditional LRU"), BPF_F_NO_PREALLOC unbounded footprint + Kubernetes OOM, blocking syscalls pinning map entries, lost / out-of-order events from perf buffers (one reader CPU), cache-key collisions (PID reuse, inode sharing via hard links, mount-namespace resolution).
  6. Detection-rule correctness. Symlinks vs hard links, interpreter visibility (execve(*.py) running python), syscall args ≠ shell commands (env vars, $PATH resolution).
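The PID-reuse cache-key collision from item 5 is easy to demonstrate: a cache keyed on pid alone hands back stale metadata once the pid is recycled, while a composite key refuses the stale entry. A toy sketch — the `(pid, start_time)` disambiguator and all names are invented for illustration:

```c
#include <stdint.h>
#include <string.h>

#define SLOTS 64

struct key   { uint32_t pid; uint64_t start_time; };
struct entry { struct key k; char comm[16]; int used; };

static struct entry cache[SLOTS];

static void cache_put(struct key k, const char *comm)
{
    struct entry *e = &cache[k.pid % SLOTS];
    e->k = k;
    e->used = 1;
    strncpy(e->comm, comm, sizeof e->comm - 1);
}

static const char *cache_get(struct key k, int check_start_time)
{
    struct entry *e = &cache[k.pid % SLOTS];
    if (!e->used || e->k.pid != k.pid)
        return NULL;
    if (check_start_time && e->k.start_time != k.start_time)
        return NULL;           /* pid reused: refuse the stale entry */
    return e->comm;
}
```

The same shape covers the other collision classes in item 5 — inode sharing via hard links and mount-namespace resolution are both "the key you chose identifies more than one thing" bugs.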

Constraints

  • Verifier limits are sharper on older kernels — complex rule evaluation often has to stay in user-space on a large fraction of real fleets. This is why patterns/two-stage-evaluation (cheap kernel → rich user-space) is a recurring shape.
  • Ring-buffer throughput can still be outpaced by event production, driving the need for concepts/in-kernel-filtering before emit.
  • Map memory is a bounded resource — LRU eviction trades coverage for RAM predictability; non-preallocated maps trade memory predictability for flexibility (bad trade in Kubernetes with enforced cgroup limits).
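The two-stage shape from the first bullet, in miniature: stage 1 is the kind of cheap, bounded predicate an old verifier will accept; stage 2 runs arbitrarily rich rule logic in user-space on only the candidates stage 1 emitted. The rule itself and all names here are invented:

```c
#include <string.h>

/* Stage 1 ("kernel"): cheap, verifier-friendly check -- a bounded
 * prefix comparison on the opened path. Emits a candidate event. */
static int kernel_stage(const char *path)
{
    return strncmp(path, "/etc/", 5) == 0;
}

/* Stage 2 (user-space): expensive logic, run only on candidates --
 * e.g. "writes to sensitive /etc/ files, except a known config manager". */
static int user_stage(const char *path, const char *comm)
{
    if (strcmp(comm, "chef-client") == 0)
        return 0;
    return strstr(path, "shadow") != NULL || strstr(path, "sudoers") != NULL;
}

static int alert(const char *path, const char *comm)
{
    return kernel_stage(path) && user_stage(path, comm);
}
```

The division of labour is the point: the kernel stage exists to shrink event volume (the second bullet's ring-buffer pressure), not to be correct on its own.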

Abuse / attack surface

eBPF's kernel access + map persistence + hook breadth make it a rootkit-capable mechanism when left unrestricted. Named incidents / PoCs / CVEs:

  • ebpfkit (Datadog hackathon → BlackHat 2021 / DEF CON 29) — full eBPF rootkit: process hiding, network scanning, data exfiltration, C2, persistence.
  • CVE-2023-2163, CVE-2024-41003 — real eBPF verifier exploits. The verifier is the last line of defence against unprivileged-eBPF kernel exploitation.
  • Mitigations since: bpf_probe_write_user blocked in Kernel Lockdown integrity mode (default on most distros now).
  • Hardening direction: Microsoft's Hornet LSM proposal for signed eBPF programs analogous to signed kernel modules.

Operational response (Datadog Workload Protection):

  • Dedicated bpf event type in the agent capturing program loads, map ops, attachments fleet-wide.
  • Per-program helper + map inventory → detection rules flag suspicious shapes (e.g. a network program sharing maps with a file-system program, or use of bpf_override_return).
  • Defensive research (BlackHat 2022 "Return to Sender") on protecting eBPF-based detections from malicious disablement.

Multi-tenancy with other eBPF tools

Shared kernel resources (TC priorities + handles, cgroup program ordering, XDP slots, LSM hook chains) are effectively an inter-vendor protocol. The 2022 Datadog × systems/cilium outage — two independently-correct products colliding on TC handle 0:1, one of them cleaning up the other's filters — is the named case study. The generalised lesson is patterns/shared-kernel-resource-coordination: safer default priorities, conservative cleanup that never auto-deletes shared resources, and explicit vendor coordination.

Performance cost

Overhead depends heavily on:

  • Hook type — uprobes cost far more than kprobes (two extra context switches per hit); raw tracepoints are markedly cheaper than kprobes (see Cloudflare's ebpf_exporter benchmark).
  • Map type — BPF_MAP_TYPE_LRU_HASH needs cross-CPU sync (slow); BPF_MAP_TYPE_PERCPU_ARRAY is CPU-local (fast).
  • Program complexity.
  • Workload shape. raw_syscalls tracepoints notably affect connection-accept rates on edge nodes at Datadog scale.
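The per-CPU vs shared map trade-off above reduces to where the synchronisation cost lands. A toy sketch of the PERCPU_ARRAY shape (names invented): the hot path is a plain CPU-local increment with no cross-CPU coordination, and the reader pays for aggregation instead.

```c
#include <stdint.h>

#define NCPU 4

/* PERCPU_ARRAY in miniature: one slot per CPU. */
static uint64_t percpu_count[NCPU];

/* Hot path: CPU-local write, no locks, no cache-line bouncing. */
static void bump(int cpu)
{
    percpu_count[cpu]++;
}

/* Cold path: the reader sums across CPUs -- the aggregation cost is
 * paid once at read time instead of on every event. */
static uint64_t read_count(void)
{
    uint64_t sum = 0;
    for (int c = 0; c < NCPU; c++)
        sum += percpu_count[c];
    return sum;
}
```

LRU_HASH inverts this: every update may touch shared LRU bookkeeping, which is why it shows up on the slow side of the bullet above.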

cgroup-attached programs for process-set-scoped policy

A separate family of eBPF program types attaches at the Linux cgroup boundary rather than at host-wide hooks — making per-process-set policy enforceable without container isolation. Load-bearing types:

  • BPF_PROG_TYPE_CGROUP_SKB — egress (and ingress) packet filter scoped to a cgroup. Return 0 to drop, 1 to allow. Operates on IPs/ports, not hostnames.
  • BPF_PROG_TYPE_CGROUP_SOCK_ADDR — hooks socket connect4 / connect6 / bind / sendmsg syscalls; can rewrite the destination IP + port before the kernel proceeds. Composes with CGROUP_SKB to build name-aware policy (rewrite DNS traffic to a userspace proxy + enforce IP-level drops based on what the proxy resolved).
  • BPF_PROG_TYPE_CGROUP_SOCK, cgroup-scoped LSM hooks — similar granularity for socket-creation and mandatory-access-control checks.

Attached via bpf(BPF_PROG_ATTACH) or cilium/ebpf's link.AttachCgroup with e.g. AttachCGroupInet4Connect / AttachCGroupInetEgress. Load-bearing for workloads that need policy tighter than the host but broader than the individual process — GitHub's deployment-safety firewall (Source: sources/2026-04-16-github-ebpf-deployment-safety) is the canonical wiki instance of this shape (see patterns/cgroup-scoped-egress-firewall + patterns/dns-proxy-for-hostname-filtering). Different axis from Datadog's syscall-hook / TC-classifier attachments.
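The CGROUP_SKB verdict convention (0 = drop, 1 = allow) reduces to a lookup against an allowlist the control plane installed into a map. A user-space sketch of just that decision — not a loadable program; the addresses, ports, and names are invented:

```c
#include <stdint.h>

/* Allowlist the control plane would install into an eBPF map:
 * (daddr, dport) pairs this cgroup's processes may reach. */
struct allow { uint32_t daddr; uint16_t dport; };

static const struct allow allowlist[] = {
    { 0x0A000001, 443 },      /* 10.0.0.1:443  */
    { 0x0A000002, 5432 },     /* 10.0.0.2:5432 */
};

/* Mirrors a CGROUP_SKB egress program's return value: 1 = allow, 0 = drop. */
static int egress_verdict(uint32_t daddr, uint16_t dport)
{
    for (unsigned i = 0; i < sizeof allowlist / sizeof *allowlist; i++)
        if (allowlist[i].daddr == daddr && allowlist[i].dport == dport)
            return 1;
    return 0;
}
```

Because the verdict only ever sees IPs and ports, hostname policy has to come from outside this hook — which is the composition with CGROUP_SOCK_ADDR + a DNS proxy described above.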
