SYSTEM Cited by 8 sources

eBPF

eBPF (extended Berkeley Packet Filter) is a Linux kernel subsystem that lets user-space load small, verifier-gated programs that run in kernel context in response to system events (syscalls, network packets, kprobes, uprobes, tracepoints, scheduler hooks, LSM hooks, XDP, etc.). It replaces kernel-module development for most observability, networking, and security use cases: no custom-compiled kernel, no stability risk from a runaway module, safer than ptrace or the Linux Audit framework for equivalent visibility.

Execution model

  • Programs are attached to kernel hooks (kprobes, uprobes, tracepoints, XDP, TC classifiers, socket filters, cgroup attachments, LSM hooks, raw tracepoints).
  • The verifier statically proves termination, memory-safety, and bounded complexity before loading — this is the load-bearing safety mechanism, and the primary source of cross-kernel variability.
  • Programs can read/write eBPF maps — typed key/value data structures (hash, LRU hash, array, per-CPU array, ring buffer, task/inode storage, …) shared with user-space.
  • Output to user-space is typically via ring buffer maps (or the older perf buffer); the consumer mmaps the ring and reads events.
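
The map and ring-buffer flow in the list above can be modelled in plain user-space Python. This is only a sketch of the data flow, not eBPF code: `on_event`, `drain`, `RING_CAPACITY`, and the dict standing in for a hash map are hypothetical names chosen for illustration.

```python
from collections import deque

RING_CAPACITY = 4          # hypothetical size; real ring buffers are sized in bytes
counts = {}                # stands in for a hash map keyed by PID
ring = deque()             # stands in for a ring buffer map
dropped = 0                # events lost because the ring was full

def on_event(pid, syscall):
    """Models a program firing on a hook: update a map, then try to emit."""
    global dropped
    counts[pid] = counts.get(pid, 0) + 1
    if len(ring) >= RING_CAPACITY:
        dropped += 1       # emission fails when the consumer falls behind
        return
    ring.append((pid, syscall))

def drain():
    """Models the user-space consumer reading every pending event."""
    events = list(ring)
    ring.clear()
    return events

for _ in range(6):
    on_event(100, "openat")
first_batch = drain()
on_event(200, "execve")
```

The map keeps an exact count even when the ring overflows, which is one reason counting in maps and emitting only selected events tends to be the cheaper design.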

Why production platforms adopt it

  • Process + container context at syscall granularity — hooks can read task_struct, mount namespace, cgroup, etc., giving event-level attribution that inotify / fanotify lack.
  • Performance. In-kernel execution avoids context switches to user space for the common case (filter / drop / count); generally lower overhead than ptrace or audit.
  • Safety. Verifier + JIT sandbox replaces the correctness + ABI-stability risks of custom kernel modules.
  • Unified visibility. Process + FS + network + LSM through one mechanism — no need to combine inotify + Netlink + tracepoints separately.
  • Namespace / cgroup / container consistency — works uniformly across containerised workloads.
  • BPF LSM enables mandatory-access-control enforcement, not just observation.
  • Programmable data plane. User-space is the control plane that compiles rules into maps; eBPF programs + maps are the data plane (concepts/control-plane-data-plane-separation).
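
The control-plane / data-plane split in the last bullet can be illustrated with a small Python model. It assumes a hypothetical rule format and a dict standing in for an eBPF hash map; `compile_rules` and `data_plane` are illustrative names, not part of any real loader.

```python
import ipaddress

def compile_rules(rules):
    """Control plane: turn readable rules into a flat lookup table (the 'map')."""
    table = {}
    for rule in rules:
        ip = int(ipaddress.ip_address(rule["dst"]))
        table[(ip, rule["port"])] = rule["action"]
    return table

def data_plane(table, dst_ip, dst_port, default="allow"):
    """Data plane: one map lookup per packet, no rule parsing."""
    return table.get((dst_ip, dst_port), default)

policy_map = compile_rules([
    {"dst": "10.0.0.5", "port": 443, "action": "drop"},
])
blocked = data_plane(policy_map, int(ipaddress.ip_address("10.0.0.5")), 443)
passed = data_plane(policy_map, int(ipaddress.ip_address("10.0.0.6")), 443)
```

Updating policy then means rewriting map entries from user space rather than reloading the program.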

Portability: CO-RE

eBPF's biggest operational improvement has been Compile Once – Run Everywhere: BTF-metadata-driven field-offset patching at load time, letting one compiled program run across kernels with different task_struct / sk_buff / etc. layouts. When CO-RE isn't available (older kernels without BTF, some distros), a fallback chain of runtime offset-guessing and hard-coded offsets can carry support back to ~kernel 4.14.
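
A rough Python model of the idea, under stated assumptions: the `BTF` and `HARDCODED` tables, the two-field struct, and the kernel version keys are all invented for illustration, and the runtime offset-guessing step of the fallback chain is omitted for brevity. The point is only that the program names a field while the loader supplies the offset.

```python
import struct

# Hypothetical per-kernel type info standing in for BTF: field -> byte offset.
BTF = {
    "5.15": {"pid": 0, "tgid": 4},
    "6.1":  {"pid": 8, "tgid": 12},   # layout shifted by a new leading field
}
HARDCODED = {"4.14": {"pid": 0, "tgid": 4}}  # last-resort fallback table

def resolve_offset(kernel, field):
    """Fallback chain: BTF first, then hard-coded offsets, else refuse to load."""
    if kernel in BTF:
        return BTF[kernel][field], "btf"
    if kernel in HARDCODED:
        return HARDCODED[kernel][field], "hardcoded"
    raise LookupError(f"no offset source for kernel {kernel}")

def read_u32(blob, kernel, field):
    """The 'program' names a field; the loader supplies the offset."""
    offset, _ = resolve_offset(kernel, field)
    return struct.unpack_from("<I", blob, offset)[0]

task_515 = struct.pack("<II", 1234, 1200)      # pid at 0, tgid at 4
task_61 = struct.pack("<QII", 0, 1234, 1200)   # pid at 8, tgid at 12
```

The same `read_u32(..., "pid")` call returns the right value on both layouts, which is the property CO-RE provides for real kernel structs.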

XDP as DDoS data plane

A different production shape from the Datadog / GitHub stories elsewhere on this page: Cloudflare uses XDP (eXpress Data Path) + eBPF as the per-server kernel drop plane for volumetric-DDoS mitigation for Magic Transit customers. The loop:

  1. XDP/eBPF program on the NIC's rx path samples packets into an eBPF map; samples are read by a user-space daemon.
  2. User-space dosd (denial-of-service daemon) analyses samples for packet-header commonalities / anomalies, enumerates fingerprint permutations, uses a data-streaming algorithm to pick the most-selective match.
  3. The winning fingerprint is compiled to an eBPF program and attached at XDP to drop matching packets before they enter the kernel stack — line-rate drop on modern NICs, at a per-packet cost close to the kernel's free-list write.
  4. Top fingerprints are gossiped/multicast across servers within a data centre and globally.
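
Steps 1 to 3 of the loop can be sketched in Python. Several things here are assumptions: the three header fields, the `min_share` threshold, and the "most fields that still cover enough samples" selection rule are invented for illustration, and an exact `Counter` stands in for the data-streaming algorithm, which the source does not name.

```python
from collections import Counter
from itertools import combinations

FIELDS = ("proto", "dst_port", "ttl")   # hypothetical header fields to fingerprint

def best_fingerprint(samples, min_share=0.5):
    """Count every field-subset fingerprint; prefer the most specific one
    that still covers at least min_share of the sampled packets."""
    counts = Counter()
    for pkt in samples:
        for r in range(1, len(FIELDS) + 1):
            for combo in combinations(FIELDS, r):
                counts[tuple((f, pkt[f]) for f in combo)] += 1
    eligible = [fp for fp, n in counts.items() if n / len(samples) >= min_share]
    return max(eligible, key=lambda fp: (len(fp), counts[fp]), default=None)

def xdp_verdict(pkt, fingerprint):
    """Models the drop program: drop only if every fingerprint field matches."""
    if fingerprint and all(pkt[f] == v for f, v in fingerprint):
        return "XDP_DROP"
    return "XDP_PASS"

attack = {"proto": "udp", "dst_port": 53, "ttl": 64}
legit = {"proto": "tcp", "dst_port": 443, "ttl": 57}
samples = [attack] * 8 + [legit] * 2
fp = best_fingerprint(samples)
```

Preferring the fingerprint with the most matching fields keeps collateral damage low: a legitimate UDP/53 packet with a different TTL would still pass in this model.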

Scale: the 2025-06-20 writeup describes the 7.3 Tbps / 4.8 Bpps attack being autonomously dropped across 477 data centres / 293 locations via this pipeline. First wiki instance of XDP as a DDoS data plane (the Datadog / GitHub instances sit at syscall / cGroup hooks, not XDP).

Operating eBPF at scale is nontrivial

"Despite numerous claims that eBPF is safe, secure, and comes with negligible performance impact, the reality — especially at scale — is nuanced." — Datadog Workload Protection team, 5 years in (Source: sources/2026-01-07-datadog-hardening-ebpf-for-runtime-security)

Six operational pitfall classes the Datadog post documents:

  1. Kernel-version portability — verifier evolution, helper availability, hook-point naming, inlining inconsistencies, dead-code elimination, map-operation restrictions. Mitigated by CI matrices across kernels/distros, CO-RE + fallbacks, shared lifecycle library (systems/ebpf-manager), minimum-viable hook-set gating at startup.
  2. Syscall-hook coverage. Compat syscalls (32-bit on 64-bit), raw_tracepoints vs per-syscall tracepoints, syscall-number interpretation via thread_info, io_uring bypass of traditional syscall paths, new syscalls (e.g. openat2), exotic execution paths (binfmt_misc, call_usermodehelper, cgroup release agents, shebang interpreters).
  3. Hooks failing to trigger. kretprobe maxactive caps, HW-interrupt preemption of kprobes, kernel-module lifecycle (hooks lost on module unload). Mitigation: prefer exported symbols, prefer entry over return probes, watch module lifecycle events.
  4. Data-read correctness. Kernel-struct layout drift (CO-RE), user-space memory reads fail silently (page faults disabled in eBPF), TOCTOU between kernel copy and eBPF read, path resolution races, non-linear skb requiring bpf_skb_pull_data.
  5. eBPF map & cache pitfalls. Hashmap sizing, LRU semantics ("doesn't strictly adhere to traditional LRU"), BPF_F_NO_PREALLOC unbounded footprint + Kubernetes OOM, blocking syscalls pinning map entries, lost / out-of-order events from perf buffers (one reader CPU), cache-key collisions (PID reuse, inode sharing via hard links, mount-namespace resolution).
  6. Detection-rule correctness. Symlinks vs hard links, interpreter visibility (execve(*.py) running python), syscall args ≠ shell commands (env vars, $PATH resolution).
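
The PID-reuse collision in item 5 is easy to reproduce in a small model. Keying the cache on `(pid, start_time_ns)` is one common mitigation and is an assumption here, not something the source attributes to Datadog; all names and values below are hypothetical.

```python
cache_by_pid = {}
cache_by_identity = {}

def record_exec(pid, start_time_ns, binary):
    """Store process context under both a naive key and a reuse-safe key."""
    cache_by_pid[pid] = binary
    cache_by_identity[(pid, start_time_ns)] = binary

def lookup_naive(pid):
    return cache_by_pid.get(pid)

def lookup_safe(pid, start_time_ns):
    return cache_by_identity.get((pid, start_time_ns))

record_exec(4242, 1_000, "/usr/bin/sshd")     # first process with PID 4242
record_exec(4242, 9_000, "/usr/bin/curl")     # PID reused after the first exits

# A delayed event from the first process arrives after the PID was reused.
naive = lookup_naive(4242)
safe = lookup_safe(4242, 1_000)
```

The naive lookup attributes the late event to the wrong binary, while the composite key still resolves to the original process.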

Constraints

  • Verifier limits are sharper on older kernels — complex rule evaluation often has to stay in user-space on a large fraction of real fleets. This is why patterns/two-stage-evaluation (cheap kernel → rich user-space) is a recurring shape.
  • Ring-buffer throughput can still be outpaced by event production, driving the need for concepts/in-kernel-filtering before emit.
  • Map memory is a bounded resource — LRU eviction trades coverage for RAM predictability; non-preallocated maps trade memory predictability for flexibility (bad trade in Kubernetes with enforced cgroup limits).
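
The two-stage shape named in the first two bullets can be sketched as follows. The syscall set, the regex, and the event format are hypothetical; the sketch only shows a cheap filter deciding what is emitted and a richer rule running on what survives.

```python
import re

WATCHED_SYSCALLS = {"execve", "openat"}     # hypothetical kernel-side filter set

def kernel_stage(event):
    """Cheap check of the kind that fits verifier limits: set membership only."""
    return event["syscall"] in WATCHED_SYSCALLS

def user_stage(event):
    """Rich check that stays in user space: regex over the full path."""
    return bool(re.search(r"^/etc/(shadow|sudoers)$", event["path"]))

def evaluate(events):
    emitted = [e for e in events if kernel_stage(e)]      # in-kernel filtering
    alerts = [e for e in emitted if user_stage(e)]        # user-space rules
    return emitted, alerts

events = [
    {"syscall": "read", "path": "/etc/shadow"},
    {"syscall": "openat", "path": "/tmp/x"},
    {"syscall": "openat", "path": "/etc/shadow"},
]
emitted, alerts = evaluate(events)
```

Only two of the three events cross the kernel boundary, and only one becomes an alert, which is the ring-buffer pressure relief the second bullet describes.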

Abuse / attack surface

eBPF's kernel access + map persistence + hook breadth make it a rootkit-capable mechanism when left unrestricted. Named incidents / PoCs / CVEs:

  • ebpfkit (Datadog hackathon → BlackHat 2021 / DEF CON 29) — full eBPF rootkit: process hiding, network scanning, data exfiltration, C2, persistence.
  • CVE-2023-2163, CVE-2024-41003 — real eBPF verifier exploits. The verifier is the last line of defence against unprivileged-eBPF kernel exploitation.
  • Mitigations since: bpf_probe_write_user blocked in Kernel Lockdown integrity mode (default on most distros now).
  • Hardening direction: Microsoft's Hornet LSM proposal for signed eBPF programs analogous to signed kernel modules.

Operational response (Datadog Workload Protection):

  • Dedicated bpf event type in the agent capturing program loads, map ops, attachments fleet-wide.
  • Per-program helper + map inventory → detection rules flag suspicious shapes (e.g. a network program sharing maps with a file-system program, or use of bpf_override_return).
  • Defensive research (BlackHat 2022 "Return to Sender") on protecting eBPF-based detections from malicious disablement.
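
The inventory-based rules in the second bullet could look roughly like this. The inventory schema (`kind`, `helpers`, `maps`) and the program names are assumptions made for the sketch, not the agent's real event format.

```python
def suspicious(programs):
    """Return (program name, reason) pairs for two example rule shapes."""
    findings = []
    for p in programs:
        if "bpf_override_return" in p["helpers"]:
            findings.append((p["name"], "uses bpf_override_return"))
    for a in programs:
        for b in programs:
            if a["kind"] == "network" and b["kind"] == "filesystem":
                if set(a["maps"]) & set(b["maps"]):
                    findings.append((a["name"], f"shares maps with {b['name']}"))
    return findings

inventory = [
    {"name": "tc_egress", "kind": "network",
     "helpers": ["bpf_skb_load_bytes"], "maps": ["m1"]},
    {"name": "vfs_open", "kind": "filesystem",
     "helpers": ["bpf_probe_read_kernel"], "maps": ["m1"]},
    {"name": "kp_patch", "kind": "tracing",
     "helpers": ["bpf_override_return"], "maps": []},
]
findings = suspicious(inventory)
```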

Multi-tenancy with other eBPF tools

Shared kernel resources (TC priorities + handles, cgroup program ordering, XDP slots, LSM hook chains) are effectively an inter-vendor protocol. The 2022 Datadog × systems/cilium outage — two independently-correct products colliding on TC handle 0:1, one of them cleaning up the other's filters — is the named case study. The generalised lesson is patterns/shared-kernel-resource-coordination: safer default priorities, conservative cleanup that never auto-deletes shared resources, and explicit vendor coordination.
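
A toy model of the collision and of conservative cleanup, under a loud assumption: real TC filters carry no owner field, so the ownership tracked here stands for a tool remembering what it attached. The tool names and the filter table are hypothetical.

```python
filters = {}   # models one interface's TC filter table: (priority, handle) -> owner

def attach(owner, priority, handle):
    """Refuse to overwrite a slot that another tool already holds."""
    key = (priority, handle)
    if key in filters and filters[key] != owner:
        raise RuntimeError(f"slot {key} already held by {filters[key]}")
    filters[key] = owner

def aggressive_cleanup():
    """The failure shape: remove everything, including other tools' filters."""
    filters.clear()

def conservative_cleanup(owner):
    """Remove only entries this owner attached."""
    for key in [k for k, v in filters.items() if v == owner]:
        del filters[key]

attach("tool_a", 1, "0:1")
attach("tool_b", 2, "0:1")      # distinct priority avoids the collision
conservative_cleanup("tool_a")
```

After the conservative cleanup, `tool_b`'s filter is still attached; `aggressive_cleanup()` would have removed it as well.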

Performance cost

Overhead depends heavily on:

  • Hook type — uprobes cost far more than kprobes (2 extra context switches); raw tracepoints are more efficient than kprobes (see Cloudflare's ebpf_exporter benchmark).
  • Map type — BPF_MAP_TYPE_LRU_HASH needs cross-CPU sync (slow); BPF_MAP_TYPE_PERCPU_ARRAY is CPU-local (fast).
  • Program complexity.
  • Workload shape. raw_syscalls tracepoints notably affect connection-accept rates on edge nodes at Datadog scale.
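
Why a per-CPU array is cheap on the hot path can be seen in a small model. This Python sketch does not measure anything; `PerCpuCounter` and `NUM_CPUS` are hypothetical names, and the list of cells stands in for the per-CPU slots the kernel maintains.

```python
NUM_CPUS = 4   # hypothetical CPU count

class PerCpuCounter:
    """Models a per-CPU array slot: each CPU writes only its own cell."""
    def __init__(self, cpus):
        self.cells = [0] * cpus

    def inc(self, cpu):
        self.cells[cpu] += 1        # no cross-CPU coordination on the hot path

    def read(self):
        return sum(self.cells)      # user space aggregates when it reads

counter = PerCpuCounter(NUM_CPUS)
for event in range(10):
    counter.inc(event % NUM_CPUS)
```

The cost moves to the reader, which has to sum one value per CPU; that is usually acceptable because reads are far rarer than updates.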

cGroup-attached programs for process-set-scoped policy

A separate family of eBPF program types attaches at the Linux cGroup boundary rather than at host-wide hooks — making per-process-set policy enforceable without container isolation. Load-bearing types:

  • BPF_PROG_TYPE_CGROUP_SKB — egress (and ingress) packet filter scoped to a cGroup. Return 0 to drop, 1 to allow. Operates on IPs/ports, not hostnames.
  • BPF_PROG_TYPE_CGROUP_SOCK_ADDR — hooks socket connect4 / connect6 / bind / sendmsg syscalls; can rewrite the destination IP + port before the kernel proceeds. Composes with CGROUP_SKB to build name-aware policy (rewrite DNS traffic to a userspace proxy + enforce IP-level drops based on what the proxy resolved).
  • BPF_PROG_TYPE_CGROUP_SOCK, cGroup-scoped LSM hooks — similar granularity for socket-creation and mandatory-access-control checks.

Attached via bpf(BPF_PROG_ATTACH) or cilium/ebpf's link.AttachCgroup with e.g. AttachCGroupInet4Connect / AttachCGroupInetEgress. Load-bearing for workloads that need policy tighter than the host but broader than the individual process — GitHub's deployment-safety firewall (Source: sources/2026-04-16-github-ebpf-deployment-safety) is the canonical wiki instance of this shape (see patterns/cgroup-scoped-egress-firewall + patterns/dns-proxy-for-hostname-filtering). Different axis from Datadog's syscall-hook / TC-classifier attachments.
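
How the CGROUP_SOCK_ADDR rewrite and the CGROUP_SKB verdict compose into name-aware policy can be modelled in Python. This is a sketch of the logic only, not GitHub's implementation: the proxy address, the host allow-list, and the function names are hypothetical, and the set `allowed_ips` stands in for an eBPF map that the proxy fills.

```python
import ipaddress

PROXY = ("127.0.0.1", 5353)          # hypothetical user-space DNS proxy address
ALLOWED_HOSTS = {"api.example.com"}  # hypothetical policy
allowed_ips = set()                  # stands in for the map the proxy fills

def connect4(dst_ip, dst_port):
    """Models a CGROUP_SOCK_ADDR hook: redirect DNS to the proxy."""
    if dst_port == 53:
        return PROXY
    return (dst_ip, dst_port)

def proxy_resolve(hostname, answer_ip):
    """Models the proxy: only allowed names get their IP added to the map."""
    if hostname in ALLOWED_HOSTS:
        allowed_ips.add(ipaddress.ip_address(answer_ip))
        return answer_ip
    return None

def cgroup_skb_egress(dst_ip):
    """Models a CGROUP_SKB verdict: 1 allows the packet, 0 drops it."""
    return 1 if ipaddress.ip_address(dst_ip) in allowed_ips else 0

redirected = connect4("8.8.8.8", 53)
proxy_resolve("api.example.com", "203.0.113.10")
proxy_resolve("evil.example.net", "198.51.100.7")
```

The packet filter never sees a hostname; it only consults the set of IPs the proxy approved, which is how an IP-level hook ends up enforcing a name-level policy.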

Seen in

  • sources/2026-04-22-allthingsdistributed-invisible-engineering-behind-lambdas-network — first wiki instance of eBPF as the Lambda data-plane packet-rewrite substrate. Prior wiki framing of eBPF was observability-side (Datadog Workload Protection), deployment-safety side (GitHub), CNI/service-mesh side (Cilium), or DDoS-mitigation side (Cloudflare dosd/Magic Transit). This post is AWS's canonical disclosure of eBPF used for data-plane packet-header rewriting on micro-VM networking: (1) Geneve tunnel VNI rewrite on egress/ingress (patterns/ebpf-header-rewrite-on-egress, concepts/geneve-tunnel-vni) — tunnels pre-created with dummy VNIs, eBPF rewrites to real VNI once function init supplies it, latency drops 150 ms → 200 μs; (2) stateless NAT (concepts/stateless-nat-via-ebpf) — replaces iptables + conntrack dual-stage stateful NAT with eBPF programs that mangle headers from predetermined mappings, 100× setup-latency improvement at 4,000-VM-per-worker density. Names the build-vs-rewrite-kernel decision calculus explicitly: a custom Linux kernel driver was considered and rejected to avoid "maintaining Lambda-specific patches upstream indefinitely" (patterns/upstream-the-fix); eBPF chosen over DPDK on lower-overhead + in-kernel-integration axes. Cites Cilium as the at-scale existence proof that de-risked production adoption — Lambda was "among the first in Lambda to use it in production" with "real questions about whether it would hold up at scale and pass the security reviews." Tenth-ish eBPF-at-AWS instance on the wiki but the first one explicitly at the hypervisor-adjacent data-plane layer. Also pairs with the parallel-batched-attachment optimization disclosed under concepts/rtnl-lock-contention — attaching one eBPF program to N veth devices in a single operation instead of N lock-reacquire calls was a load-bearing part of the boot-time 4,000-slot pre-creation fix.
  • sources/2025-04-08-netflix-how-netflix-accurately-attributes-ebpf-flow-logs — canonical eBPF-as-flow-logging-substrate + attribution-primitive instance on the wiki. FlowExporter sidecars attach to TCP tracepoints on every Netflix AWS host and emit ~5M records/sec on socket close. Local workload identity is resolved in-kernel via an eBPF map populated by IPManAgent on Titus hosts (or a Metatron cert on EC2 hosts); a second (local_ipv4, local_port) → workload eBPF map disambiguates NAT64-free IPv6-to-IPv4 shared-IP sockets. Together with the 2024-09-11 noisy-neighbor run-queue-latency monitor, this cements Netflix's pattern of using eBPF + userspace-daemon + eBPF-map for per-socket and per-task workload attribution on Titus, with per-cgroup run-queue-latency and per-workload-TCP-flow-logs as the two canonical outputs. Complements Lambda's data-plane-mangling variant (2026-04-22) and the observability variants (Datadog, Cloudflare): Netflix's use case is TCP-lifecycle observation + local-identity-attribution, not packet mangling or rule-evaluation — and the backend pipeline is a heartbeat-based ownership map rather than an event stream. 40% → 0 Zuul misattribution over 2-week validation window is the load-bearing before/after evidence.
  • 2026-05-07 — Cloudflare Copy Fail Linux vulnerability response. Canonical wiki first instance of BPF-LSM (Linux Security Module hook) as a runtime CVE-mitigation substrate. Cloudflare's bpf-lsm framework attached an eBPF program to the socket_bind LSM hook that denies AF_ALG binds for every caller except an explicit allow-list of legitimate binaries — surgical mitigation of CVE-2026-31431 (Copy Fail) without unloading the vulnerable algif_aead module (the researchers' recommended modprobe blacklist workaround had failed in staging due to dependency conflicts — patterns/staging-caught-mitigation-failure). Rollout followed patterns/visibility-before-enforcement-rollout: Phase 1 pushed ebpf_exporter config via salt to hook the socket() syscall and emit per-binary AF_ALG usage metrics across hundreds of thousands of servers; Phase 2 pushed the bpf-lsm enforcement program behind a separate gate once the allow-list was empirically validated. First-class distinct from the Datadog (observability), GitHub (deployment safety), Cloudflare Magic Transit (DDoS), AWS Lambda (data-plane mangling), and Netflix (flow logs + run-queue latency) eBPF instances already on this page: this is the BPF-LSM hook-denial shape at allowlist granularity. Canonical pairing with the LTS-kernel-backport-latency-gap concept — bpf-lsm is the runtime lever that covers the window between mainline fix and LTS backport. (Source: sources/2026-05-07-cloudflare-copy-fail-linux-vulnerability-response)

  • sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology — first canonical eBPF-as-fleet-profiling-substrate instance on the wiki. Meta's Strobelight orchestrates 42+ profilers (many eBPF-backed) on every Meta production host. Canonical quote: "Strobelight's profilers are often, but not exclusively, built using eBPF... it's hard to imagine how Strobelight would work without it." Distinct use-case axis from the prior wiki eBPF instances — this is eBPF for production profiling (CPU, memory via jemalloc, off-CPU, request-latency, language-event for Python/Java/Erlang, AI/GPU), completing the triad alongside eBPF-for-security (Datadog Workload Protection), eBPF-for-networking / data-plane (Cloudflare DDoS, Lambda Geneve/NAT, Fly.io Sprites). The specific eBPF features Meta relies on: custom actions at sample time (Strobemeta reads thread-local storage into each sample for request-context tagging), cheap ring-buffer-to-userspace plumbing (raw stacks written to disk off-host, then symbolized via the centralised symbolization service), and user-authored profilers via bpftrace ad-hoc scripts (patterns/ad-hoc-bpftrace-profiler) which drop new-profiler-lead-time from weeks to hours. Economic anchor: the continuous LBR profiler feeds the FDO pipeline → 10-20% fewer servers on Meta's top 200 services — the canonical wiki datum for "eBPF profiling pays for itself at hyperscale".
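
The BPF-LSM allow-list mitigation in the Cloudflare Copy Fail entry above can be modelled in a few lines of Python. This is a simplification under stated assumptions: the source describes phase 1 as a separate socket() hook via ebpf_exporter and phase 2 as the socket_bind LSM program, whereas this sketch folds both into one function with an `enforce` flag. The binary paths are hypothetical.

```python
from collections import Counter
import errno

AF_ALG = 38                                   # Linux address-family constant
ALLOW_LIST = {"/usr/bin/legit-crypto"}        # hypothetical allow-listed binary
usage = Counter()                             # phase 1: per-binary metrics

def socket_bind_hook(binary, family, enforce):
    """Returns 0 to allow or -EPERM to deny, as an LSM hook would."""
    if family != AF_ALG:
        return 0
    usage[binary] += 1                        # always record who uses AF_ALG
    if enforce and binary not in ALLOW_LIST:
        return -errno.EPERM
    return 0

observe = socket_bind_hook("/tmp/exploit", AF_ALG, enforce=False)
deny = socket_bind_hook("/tmp/exploit", AF_ALG, enforce=True)
allow = socket_bind_hook("/usr/bin/legit-crypto", AF_ALG, enforce=True)
```

Running with `enforce=False` first yields the usage counts needed to build the allow-list before anything is denied, which is the visibility-before-enforcement ordering the entry describes.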
