SYSTEM Cited by 8 sources

eBPF¶

eBPF (extended Berkeley Packet Filter) is a Linux kernel subsystem that lets user-space load small, verifier-gated programs that run in kernel context in response to system events (syscalls, network packets, kprobes, uprobes, tracepoints, scheduler hooks, LSM hooks, XDP, etc.). It replaces kernel-module development for most observability, networking, and security use cases: no custom-compiled kernel, no stability risk from a runaway module, safer than ptrace or the Linux Audit framework for equivalent visibility.

Execution model¶

Programs are attached to kernel hooks (kprobes, uprobes, tracepoints, XDP, TC classifiers, socket filters, cgroup attachments, LSM hooks, raw tracepoints).
The verifier statically proves termination, memory-safety, and bounded complexity before loading — this is the load-bearing safety mechanism, and the primary source of cross-kernel variability.
Programs can read/write eBPF maps — typed key/value data structures (hash, LRU hash, array, per-CPU array, ring buffer, task/inode storage, …) shared with user-space.
Output to user-space is typically via ring buffer maps (or the older perf buffer); the consumer mmap's the ring and reads events.
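The flow above (hook fires, program updates a map, program emits to a ring buffer, user-space drains it) can be modelled in a few lines. This is a user-space Python sketch, not eBPF code: `RingBuffer` and `on_openat` are hypothetical names, and dropping the event when the ring is full is an assumption about how a failed reservation is typically handled.

```python
from collections import deque

class RingBuffer:
    """Bounded event queue standing in for a ring buffer map."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0

    def submit(self, event):
        # Producer side: when the ring is full the event is dropped, never blocked on.
        if len(self.events) >= self.capacity:
            self.dropped += 1
            return False
        self.events.append(event)
        return True

    def drain(self):
        # Consumer side: user-space reads everything currently queued.
        out = list(self.events)
        self.events.clear()
        return out

open_counts = {}              # hash map stand-in: pid -> number of opens seen
ring = RingBuffer(capacity=2)

def on_openat(pid, path):
    """Hypothetical program attached to an openat hook."""
    open_counts[pid] = open_counts.get(pid, 0) + 1   # map update
    ring.submit({"pid": pid, "path": path})          # emit to user-space

for pid, path in [(10, "/etc/passwd"), (10, "/tmp/a"), (11, "/tmp/b")]:
    on_openat(pid, path)

events = ring.drain()
```

With a capacity of 2 and three events, the map still counts every open while the ring loses the third event, which is the gap that in-kernel filtering is meant to narrow.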

Why production platforms adopt it¶

Process + container context at syscall granularity — hooks can read task_struct, mount namespace, cgroup, etc., giving event-level attribution that inotify / fanotify lack.
Performance. In-kernel execution avoids context switches to user space for the common case (filter / drop / count); generally lower overhead than ptrace or audit.
Safety. Verifier + JIT sandbox replaces the correctness + ABI-stability risks of custom kernel modules.
Unified visibility. Process + FS + network + LSM through one mechanism — no need to combine inotify + Netlink + tracepoints separately.
Namespace / cgroup / container consistency — works uniformly across containerised workloads.
BPF LSM enables mandatory-access-control enforcement, not just observation.
Programmable data plane. User-space is the control plane that compiles rules into maps; eBPF programs + maps are the data plane (concepts/control-plane-data-plane-separation).

Portability: CO-RE¶

eBPF's biggest operational improvement has been Compile Once – Run Everywhere: BTF-metadata-driven field-offset patching at load time, letting one compiled program run across kernels with different task_struct / sk_buff / etc. layouts. When CO-RE isn't available (older kernels without BTF, some distros), a fallback chain of runtime offset-guessing and hard-coded offsets can carry support back to ~kernel 4.14.
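A rough model of load-time offset resolution with a fallback, assuming a compiled program records only which struct fields it touches. All offsets and the `resolve_offsets` helper are made up for illustration, and the runtime offset-guessing stage of the fallback chain is left out for brevity.

```python
# Field offsets per kernel, as BTF would describe them (made-up numbers).
BTF_BY_KERNEL = {
    "5.15": {("task_struct", "pid"): 2392, ("task_struct", "comm"): 2944},
    "6.1":  {("task_struct", "pid"): 2464, ("task_struct", "comm"): 3016},
}
# Last-resort table for kernels that ship no BTF (also made-up numbers).
HARDCODED = {
    "4.14": {("task_struct", "pid"): 2216, ("task_struct", "comm"): 2632},
}

def resolve_offsets(accesses, kernel):
    """Return ({access: offset}, source) for one target kernel."""
    if kernel in BTF_BY_KERNEL:
        table, source = BTF_BY_KERNEL[kernel], "btf"
    elif kernel in HARDCODED:
        table, source = HARDCODED[kernel], "hardcoded"
    else:
        raise LookupError(f"no offset source for kernel {kernel}")
    missing = [a for a in accesses if a not in table]
    if missing:
        raise LookupError(f"unresolved fields on {kernel}: {missing}")
    return {a: table[a] for a in accesses}, source

# The "compiled once" program only lists the fields it reads.
PROGRAM_ACCESSES = [("task_struct", "pid"), ("task_struct", "comm")]

offsets_new, src_new = resolve_offsets(PROGRAM_ACCESSES, "6.1")
offsets_old, src_old = resolve_offsets(PROGRAM_ACCESSES, "4.14")
```

Failing loudly when no source can resolve a field mirrors the idea of gating on a minimum viable hook set at startup rather than reading from a wrong offset.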

XDP as DDoS data plane¶

A different production shape from the Datadog / GitHub deployments described elsewhere on this page: Cloudflare uses XDP (eXpress Data Path) + eBPF as the per-server kernel drop plane for volumetric-DDoS mitigation for Magic Transit customers. The loop:

XDP/eBPF program on the NIC's rx path samples packets into an eBPF map; samples are read by a user-space daemon.
User-space dosd (denial-of-service daemon) analyses samples for packet-header commonalities / anomalies, enumerates fingerprint permutations, uses a data-streaming algorithm to pick the most-selective match.
The winning fingerprint is compiled to an eBPF program and attached at XDP to drop matching packets before they enter the kernel stack — line-rate drop on modern NICs, at a per-packet cost close to the kernel's free-list write.
Top fingerprints are gossiped/multicast across servers within a data centre and globally.
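The sample, fingerprint, drop loop above can be sketched as follows. This is a toy user-space model and not dosd's algorithm: exact counting with `Counter` is swapped in for the data-streaming algorithm, "most selective" is assumed to mean "most header fields among fingerprints seen in at least `min_share` of samples", and the field names are hypothetical.

```python
from collections import Counter
from itertools import combinations

FIELDS = ("proto", "dst_port", "length")

def best_fingerprint(samples, min_share=0.5):
    """Pick the most specific field/value combination seen in >= min_share of samples."""
    counts = Counter()
    for pkt in samples:
        for r in range(1, len(FIELDS) + 1):
            for fields in combinations(FIELDS, r):
                counts[tuple((f, pkt[f]) for f in fields)] += 1
    threshold = min_share * len(samples)
    candidates = [(len(fp), n, fp) for fp, n in counts.items() if n >= threshold]
    if not candidates:
        return None
    # More fields first (more selective), then higher hit count.
    return dict(max(candidates, key=lambda c: (c[0], c[1]))[2])

def xdp_verdict(pkt, fingerprint):
    """Data-plane side: drop packets that match the installed fingerprint."""
    if fingerprint and all(pkt[f] == v for f, v in fingerprint.items()):
        return "XDP_DROP"
    return "XDP_PASS"

samples = (
    [{"proto": "udp", "dst_port": 53, "length": 1400}] * 8
    + [{"proto": "tcp", "dst_port": 443, "length": 60},
       {"proto": "udp", "dst_port": 53, "length": 80}]
)
fp = best_fingerprint(samples)
```

Preferring the three-field fingerprint over the broader `udp` / port 53 match is what lets the small legitimate DNS packet in the sample set pass.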

Scale: the 2025-06-20 writeup describes the 7.3 Tbps / 4.8 Bpps attack being autonomously dropped across 477 data centres / 293 locations via this pipeline. First wiki instance of XDP as a DDoS data plane (the Datadog / GitHub instances sit at syscall / cGroup hooks, not XDP).

Operating eBPF at scale is nontrivial¶

"Despite numerous claims that eBPF is safe, secure, and comes with negligible performance impact, the reality — especially at scale — is nuanced." — Datadog Workload Protection team, 5 years in (Source: sources/2026-01-07-datadog-hardening-ebpf-for-runtime-security)

Six operational pitfall classes the Datadog post documents:

Kernel-version portability — verifier evolution, helper availability, hook-point naming, inlining inconsistencies, dead-code elimination, map-operation restrictions. Mitigated by CI matrices across kernels/distros, CO-RE + fallbacks, shared lifecycle library (systems/ebpf-manager), minimum-viable hook-set gating at startup.
Syscall-hook coverage. Compat syscalls (32-bit on 64-bit), raw_tracepoints vs per-syscall tracepoints, syscall-number interpretation via thread_info, io_uring bypass of traditional syscall paths, new syscalls (e.g. openat2), exotic execution paths (binfmt_misc, call_usermodehelper, cgroup release agents, shebang interpreters).
Hooks failing to trigger. kretprobe maxactive caps, HW-interrupt preemption of kprobes, kernel-module lifecycle (hooks lost on module unload). Mitigation: prefer exported symbols, prefer entry over return probes, watch module lifecycle events.
Data-read correctness. Kernel-struct layout drift (CO-RE), user-space memory reads fail silently (page faults disabled in eBPF), TOCTOU between kernel copy and eBPF read, path resolution races, non-linear skb requiring bpf_skb_pull_data.
eBPF map & cache pitfalls. Hashmap sizing, LRU semantics ("doesn't strictly adhere to traditional LRU"), BPF_F_NO_PREALLOC unbounded footprint + Kubernetes OOM, blocking syscalls pinning map entries, lost / out-of-order events from perf buffers (one reader CPU), cache-key collisions (PID reuse, inode sharing via hard links, mount-namespace resolution).
Detection-rule correctness. Symlinks vs hard links, interpreter visibility (execve(*.py) running python), syscall args ≠ shell commands (env vars, $PATH resolution).
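One of the cache-key collisions listed above, PID reuse, is easy to reproduce in a model. The sketch below is illustrative Python, not agent code; keying on `(pid, start_ns)` is a commonly used remedy assumed here, not one the source attributes to Datadog.

```python
def naive_key(task):
    return task["pid"]

def safe_key(task):
    # Pairing the PID with the task start time makes a reused PID a new key.
    return (task["pid"], task["start_ns"])

def attribute(events, key_fn):
    """Cache the first executable seen per key and attribute later events to it."""
    cache, result = {}, []
    for task in events:
        exe = cache.setdefault(key_fn(task), task["exe"])
        result.append(exe)
    return result

events = [
    {"pid": 4242, "start_ns": 1_000, "exe": "/usr/bin/sshd"},
    # sshd exits; the kernel later hands PID 4242 to an unrelated process.
    {"pid": 4242, "start_ns": 9_000, "exe": "/usr/bin/curl"},
]
naive = attribute(events, naive_key)
safe = attribute(events, safe_key)
```

With the naive key the second process inherits the cached identity of the first, so a rule scoped to sshd would fire on curl.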

Constraints¶

Verifier limits are sharper on older kernels — complex rule evaluation often has to stay in user-space on a large fraction of real fleets. This is why patterns/two-stage-evaluation (cheap kernel → rich user-space) is a recurring shape.
Ring-buffer throughput can still be outpaced by event production, driving the need for concepts/in-kernel-filtering before emit.
Map memory is a bounded resource — LRU eviction trades coverage for RAM predictability; non-preallocated maps trade memory predictability for flexibility (bad trade in Kubernetes with enforced cgroup limits).
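A minimal sketch of the two-stage shape, assuming a file-access rule: a cheap exact-match check stands in for the kernel stage and a regex plus a UID check stands in for the user-space stage. The rule, the basenames, and the function names are all hypothetical.

```python
import re

# Stage 1 (kernel side): a cheap exact-match prefilter held in a map.
INTERESTING_BASENAMES = {"passwd", "shadow", "sudoers"}

# Stage 2 (user space): a richer rule that need not fit within verifier limits.
RULE = re.compile(r"^/etc/(passwd|shadow|sudoers)$")

def kernel_stage(event):
    """Return True if the event should be emitted to user space."""
    return event["path"].rsplit("/", 1)[-1] in INTERESTING_BASENAMES

def user_stage(event):
    return bool(RULE.match(event["path"])) and event["uid"] != 0

def evaluate(events):
    emitted = [e for e in events if kernel_stage(e)]
    alerts = [e for e in emitted if user_stage(e)]
    return emitted, alerts

events = [
    {"path": "/var/log/syslog", "uid": 0},
    {"path": "/tmp/build/passwd", "uid": 1000},
    {"path": "/etc/shadow", "uid": 1000},
    {"path": "/etc/passwd", "uid": 0},
]
emitted, alerts = evaluate(events)
```

The property that matters is that stage 1 may over-approximate but must never discard an event stage 2 would have matched; here every path the regex accepts has a basename in the prefilter set.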

Abuse / attack surface¶

eBPF's kernel access + map persistence + hook breadth make it a rootkit-capable mechanism when left unrestricted. Named incidents / PoCs / CVEs:

ebpfkit (Datadog hackathon → BlackHat 2021 / DEF CON 29) — full eBPF rootkit: process hiding, network scanning, data exfiltration, C2, persistence.
CVE-2023-2163, CVE-2024-41003 — real eBPF verifier exploits. The verifier is the last line of defence against unprivileged-eBPF kernel exploitation.
Mitigations since: bpf_probe_write_user blocked in Kernel Lockdown integrity mode (default on most distros now).
Hardening direction: Microsoft's Hornet LSM proposal for signed eBPF programs analogous to signed kernel modules.

Operational response (Datadog Workload Protection):

Dedicated bpf event type in the agent capturing program loads, map ops, attachments fleet-wide.
Per-program helper + map inventory → detection rules flag suspicious shapes (e.g. a network program sharing maps with a file-system program, or use of bpf_override_return).
Defensive research (BlackHat 2022 "Return to Sender") on protecting eBPF-based detections from malicious disablement.
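The inventory-based rules described above might look roughly like this. The inventory record shape, the network / file-system classification heuristics, and the `findings` function are assumptions for illustration; only the helper names and the two example shapes come from the text or the kernel.

```python
NETWORK_TYPES = {"xdp", "sched_cls", "cgroup_skb", "socket_filter"}
FILESYSTEM_HOOK_PREFIXES = ("vfs_", "security_file_", "security_inode_")
RISKY_HELPERS = {"bpf_override_return", "bpf_probe_write_user"}

def is_network(prog):
    return prog["type"] in NETWORK_TYPES

def is_filesystem(prog):
    return prog["attach"].startswith(FILESYSTEM_HOOK_PREFIXES)

def findings(inventory):
    out = []
    # Rule 1: any program calling a helper that can alter kernel or user state.
    for prog in inventory:
        for helper in sorted(RISKY_HELPERS & set(prog["helpers"])):
            out.append((prog["name"], f"uses {helper}"))
    # Rule 2: a network program and a file-system program sharing a map.
    for i, a in enumerate(inventory):
        for b in inventory[i + 1:]:
            shared = set(a["maps"]) & set(b["maps"])
            crosses = (is_network(a) and is_filesystem(b)) or (is_filesystem(a) and is_network(b))
            if shared and crosses:
                out.append((f'{a["name"]}+{b["name"]}', f"share maps {sorted(shared)}"))
    return out

inventory = [
    {"name": "pkt_tap", "type": "xdp", "attach": "eth0",
     "helpers": ["bpf_map_lookup_elem"], "maps": ["stash"]},
    {"name": "file_peek", "type": "kprobe", "attach": "vfs_read",
     "helpers": ["bpf_probe_read_kernel", "bpf_map_update_elem"], "maps": ["stash"]},
    {"name": "ret_patch", "type": "kprobe", "attach": "__x64_sys_kill",
     "helpers": ["bpf_override_return"], "maps": []},
]
report = findings(inventory)
```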

Multi-tenancy with other eBPF tools¶

Shared kernel resources (TC priorities + handles, cgroup program ordering, XDP slots, LSM hook chains) are effectively an inter-vendor protocol. The 2022 Datadog × systems/cilium outage — two independently-correct products colliding on TC handle 0:1, one of them cleaning up the other's filters — is the named case study. The generalised lesson is patterns/shared-kernel-resource-coordination: safer default priorities, conservative cleanup that never auto-deletes shared resources, and explicit vendor coordination.
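The difference between aggressive and conservative cleanup can be shown with a toy filter table. This is a model only: real TC filters carry no owner tag, so the `owner` field stands for whatever record a tool keeps of what it attached itself, and the vendor names are placeholders.

```python
class TcFilterTable:
    """Toy model of one interface's TC filter list, keyed by (priority, handle)."""
    def __init__(self):
        self.filters = {}

    def add(self, priority, handle, owner):
        key = (priority, handle)
        if key in self.filters:
            raise FileExistsError(f"{key} already held by {self.filters[key]}")
        self.filters[key] = owner

    def aggressive_cleanup(self, handle):
        # Anti-pattern: delete everything at a handle, whoever installed it.
        for key in [k for k in self.filters if k[1] == handle]:
            del self.filters[key]

    def conservative_cleanup(self, owner):
        # Only remove filters this owner is recorded as having created.
        for key in [k for k, o in self.filters.items() if o == owner]:
            del self.filters[key]

def setup():
    table = TcFilterTable()
    table.add(priority=1, handle="0:1", owner="vendor_a")
    table.add(priority=2, handle="0:1", owner="vendor_b")
    return table

t1 = setup()
t1.aggressive_cleanup("0:1")       # vendor_a tidies up by handle
t2 = setup()
t2.conservative_cleanup("vendor_a")  # vendor_a tidies up only its own filters
```

After the aggressive variant the second vendor's filter is gone as well; after the conservative variant it survives.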

Performance cost¶

Overhead depends heavily on:

Hook type — uprobes cost far more than kprobes (2 extra context switches); raw tracepoints are far more efficient than kprobes (see Cloudflare's ebpf_exporter benchmark).
Map type — BPF_MAP_TYPE_LRU_HASH needs cross-CPU sync (slow); BPF_MAP_TYPE_PERCPU_ARRAY is CPU-local (fast).
Program complexity.
Workload shape. raw_syscalls tracepoints notably affect connection-accept rates on edge nodes at Datadog scale.

cGroup-attached programs for process-set-scoped policy¶

A separate family of eBPF program types attaches at the Linux cGroup boundary rather than at host-wide hooks — making per-process-set policy enforceable without container isolation. Load-bearing types:

BPF_PROG_TYPE_CGROUP_SKB — egress (and ingress) packet filter scoped to a cGroup. Return 0 to drop, 1 to allow. Operates on IPs/ports, not hostnames.
BPF_PROG_TYPE_CGROUP_SOCK_ADDR — hooks socket connect4 / connect6 / bind / sendmsg syscalls; can rewrite the destination IP + port before the kernel proceeds. Composes with CGROUP_SKB to build name-aware policy (rewrite DNS traffic to a userspace proxy + enforce IP-level drops based on what the proxy resolved).
BPF_PROG_TYPE_CGROUP_SOCK, cGroup-scoped LSM hooks — similar granularity for socket-creation and mandatory-access-control checks.

Attached via bpf(BPF_PROG_ATTACH) or cilium/ebpf's link.AttachCgroup with e.g. AttachCGroupInet4Connect / AttachCGroupInetEgress. Load-bearing for workloads that need policy tighter than the host but broader than the individual process — GitHub's deployment-safety firewall (Source: sources/2026-04-16-github-ebpf-deployment-safety) is the canonical wiki instance of this shape (see patterns/cgroup-scoped-egress-firewall + patterns/dns-proxy-for-hostname-filtering). Different axis from Datadog's syscall-hook / TC-classifier attachments.
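How the two cGroup program types compose with a DNS proxy can be modelled in user space. The sketch below is plain Python, not GitHub's implementation: the proxy address, the resolver table (documentation-only IP ranges), and every function name are assumptions; only the 0 = drop / 1 = allow verdict and the rewrite-then-enforce split come from the description above.

```python
PROXY = ("127.0.0.1", 5353)      # hypothetical address of the local DNS proxy
BLOCKED_HOSTS = {"github.com"}
# Illustrative resolver data using documentation-only IP ranges.
DNS_TABLE = {"github.com": "192.0.2.10", "example.org": "198.51.100.20"}

blocked_ips = set()   # map written by the proxy, read by the egress program
txid_to_pid = {}      # map: DNS transaction ID -> PID that sent the query
blocked_log = []      # what the proxy reports for blocked lookups

def connect4(dst):
    """CGROUP_SOCK_ADDR stand-in: redirect DNS to the proxy, leave the rest alone."""
    ip, port = dst
    return PROXY if port == 53 else dst

def record_dns_query(pid, txid):
    """Kernel-side stand-in: remember which process sent this DNS transaction."""
    txid_to_pid[txid] = pid

def dns_proxy(txid, hostname):
    """User-space proxy: resolve, and for blocked names record the IP and the caller."""
    ip = DNS_TABLE[hostname]
    if hostname in BLOCKED_HOSTS:
        blocked_ips.add(ip)
        blocked_log.append((txid_to_pid.get(txid), hostname))
        return None
    return ip

def cgroup_skb_egress(dst_ip):
    """CGROUP_SKB stand-in: 0 drops the packet, 1 lets it through."""
    return 0 if dst_ip in blocked_ips else 1

redirected = connect4(("10.0.0.2", 53))
record_dns_query(pid=777, txid=0x1A2B)
answer = dns_proxy(0x1A2B, "github.com")
```

Recording the resolved IP even though the lookup is refused covers a process that already knows the address and connects to it directly.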

Seen in¶

sources/2025-11-18-datadog-ebpf-fim-filtering — File Integrity Monitoring hooks file syscalls, pushes events via ring buffer, filters ~94% of a ~10B/min event stream in-kernel using approver + discarder eBPF maps.
sources/2026-01-07-datadog-hardening-ebpf-for-runtime-security — 5-year operational retrospective on the full pitfall surface: verifier evolution, hook coverage, data-read consistency, cache pitfalls, rule-writing, eBPF as attack surface, multi-tenancy with systems/cilium, performance cost, safe rollout.
sources/2025-06-20-cloudflare-how-cloudflare-blocked-a-monumental-7-3-tbps-ddos-attack — XDP + eBPF as the per-server kernel drop plane for Magic Transit DDoS mitigation; user-space systems/dosd compiles fingerprints into XDP programs, gossips across POPs; 7.3 Tbps autonomously mitigated across 477 data centres.
sources/2026-04-16-github-ebpf-deployment-safety — GitHub Engineering: cGroup-attached CGROUP_SKB + CGROUP_SOCK_ADDR programs implement per-deploy-script conditional network filtering to block circular dependencies on github.com without affecting customer traffic on the same host. DNS syscalls redirected to a userspace proxy (hostname-based blocklist); blocked queries attributed back to the calling process via a DNS transaction-ID → PID eBPF map. 6-month rollout.
sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf — first wiki instance of eBPF at the Linux scheduler layer. Netflix attaches tp_btf/sched_wakeup + tp_btf/sched_switch tracepoints on Titus hosts, derives per-task run-queue latency in-kernel via a PID-keyed BPF_MAP_TYPE_HASH, tags samples with cgroup ID (read via bpf_rcu_read_lock / _unlock kfuncs — RCU-protected deref of task->cgroups->dfl_cgrp->kn->id), rate-limits per-cgroup per-CPU via a BPF_MAP_TYPE_PERCPU_HASH checked before bpf_ringbuf_reserve (patterns/per-cgroup-rate-limiting-in-ebpf), and ships variable-length records via BPF_MAP_TYPE_RINGBUF to a Go agent that emits Atlas percentile timers and preempt-cause-tagged counters — the patterns/dual-metric-disambiguation shape that distinguishes cross-cgroup noisy neighbors from self CFS-quota throttling (see concepts/cpu-throttling-vs-noisy-neighbor). Explicit eBPF-vs-kernel-module framing: "While implementing this with a kernel module was feasible, we leveraged eBPF for its safety and flexibility." Baseline disclosed: runq.latency p99 ≈ 83.4 µs on an underloaded host. Canonical shape for the patterns/scheduler-tracepoint-based-monitoring pattern.
sources/2026-05-29-netflix-from-silos-to-service-topology-why-netflix-built-a-real-time-service-map — eBPF as the network-layer substrate of Netflix Service Topology. "We capture network flow records at the kernel level using eBPF technology — information about which services are connecting to which other services over the network. This gives us ground truth about actual network-level communication." The eBPF layer contributes the completeness property to the multi-source topology fusion (every service shows up regardless of instrumentation), with the corresponding limitation that "network-level information lacks application context." This use case sits on top of the FlowExporter / FlowCollector flow-attribution layer canonicalised in the 2025-04-08 post.
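The scheduler-tracepoint entry in the list above (wakeup timestamp in a PID-keyed map, latency computed at switch-in, per-cgroup per-CPU rate limit checked before emitting) can be modelled as below. This is a user-space Python sketch, not Netflix's program: the handler signatures, the 1 ms budget, and the record fields are assumptions.

```python
wakeup_ts = {}      # hash map stand-in: pid -> time (ns) the task became runnable
last_emit = {}      # per-CPU hash stand-in: (cpu, cgroup_id) -> last emit time (ns)
ring = []           # ring buffer stand-in
MIN_INTERVAL_NS = 1_000_000   # hypothetical budget: one sample per cgroup per CPU per ms

def sched_wakeup(pid, now_ns):
    wakeup_ts[pid] = now_ns

def sched_switch(cpu, next_pid, next_cgroup, prev_cgroup, now_ns):
    start = wakeup_ts.pop(next_pid, None)
    if start is None:
        return              # never saw the wakeup; nothing to measure
    key = (cpu, next_cgroup)
    if now_ns - last_emit.get(key, -MIN_INTERVAL_NS) < MIN_INTERVAL_NS:
        return              # rate-limited before any ring space is reserved
    last_emit[key] = now_ns
    ring.append({
        "cgroup": next_cgroup,
        "runq_latency_ns": now_ns - start,
        "preempted_cgroup": prev_cgroup,   # differs from "cgroup" for a cross-cgroup switch
    })

sched_wakeup(pid=1, now_ns=100_000)
sched_switch(cpu=0, next_pid=1, next_cgroup=7, prev_cgroup=9, now_ns=183_400)
sched_wakeup(pid=2, now_ns=200_000)
sched_switch(cpu=0, next_pid=2, next_cgroup=7, prev_cgroup=7, now_ns=250_000)
sched_wakeup(pid=3, now_ns=1_300_000)
sched_switch(cpu=0, next_pid=3, next_cgroup=7, prev_cgroup=7, now_ns=1_400_000)
```

The second switch falls inside the 1 ms window for cgroup 7 on CPU 0 and is suppressed; the first and third are emitted, and comparing `cgroup` with `preempted_cgroup` is what separates a noisy neighbor from self-inflicted contention.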

concepts/ebpf-verifier — the safety + variability mechanism
concepts/in-kernel-filtering — the volume-reduction move
concepts/linux-cgroup — process-set isolation unit used as attach point for security-policy eBPF programs
concepts/control-plane-data-plane-separation — rule engine (control) / eBPF maps+programs (data)
systems/co-re — portability across kernel layouts
systems/ebpf-manager — Datadog's OSS lifecycle library
systems/datadog-workload-protection — multi-product consumer
systems/datadog-workload-protection-fim — FIM subsystem
systems/cilium — eBPF CNI; multi-tenancy case study
patterns/approver-discarder-filter — kernel-side compile-time+runtime dual
patterns/two-stage-evaluation — cheap kernel → rich user-space
patterns/shared-kernel-resource-coordination — multi-vendor eBPF coexistence
patterns/cgroup-scoped-egress-firewall — cGroup-attached CGROUP_SKB + CGROUP_SOCK_ADDR for per-process-set network policy
patterns/dns-proxy-for-hostname-filtering — DNS syscall redirect + userspace proxy + TXID↔PID attribution

Seen in (continued)¶

sources/2026-04-22-allthingsdistributed-invisible-engineering-behind-lambdas-network — first wiki instance of eBPF as the Lambda data-plane packet-rewrite substrate. Prior wiki framing of eBPF was observability-side (Datadog Workload Protection), deployment-safety side (GitHub), CNI/service-mesh side (Cilium), or DDoS-mitigation side (Cloudflare dosd/Magic Transit). This post is AWS's canonical disclosure of eBPF used for data-plane packet-header rewriting on micro-VM networking: (1) Geneve tunnel VNI rewrite on egress/ingress (patterns/ebpf-header-rewrite-on-egress, concepts/geneve-tunnel-vni) — tunnels pre-created with dummy VNIs, eBPF rewrites to real VNI once function init supplies it, latency drops 150 ms → 200 μs; (2) stateless NAT (concepts/stateless-nat-via-ebpf) — replaces iptables + conntrack dual-stage stateful NAT with eBPF programs that mangle headers from predetermined mappings, 100× setup-latency improvement at 4,000-VM-per-worker density. Names the build-vs-rewrite-kernel decision calculus explicitly: a custom Linux kernel driver was considered and rejected to avoid "maintaining Lambda-specific patches upstream indefinitely" (patterns/upstream-the-fix); eBPF chosen over DPDK on lower-overhead + in-kernel-integration axes. Cites Cilium as the at-scale existence proof that de-risked production adoption — Lambda was "among the first in Lambda to use it in production" with "real questions about whether it would hold up at scale and pass the security reviews." Tenth-ish eBPF-at-AWS instance on the wiki but the first one explicitly at the hypervisor-adjacent data-plane layer. Also pairs with the parallel-batched-attachment optimization disclosed under concepts/rtnl-lock-contention — attaching one eBPF program to N veth devices in a single operation instead of N lock-reacquire calls was a load-bearing part of the boot-time 4,000-slot pre-creation fix.
sources/2025-04-08-netflix-how-netflix-accurately-attributes-ebpf-flow-logs — canonical eBPF-as-flow-logging-substrate + attribution-primitive instance on the wiki. FlowExporter sidecars attach to TCP tracepoints on every Netflix AWS host and emit ~5M records/sec on socket close. Local workload identity is resolved in-kernel via an eBPF map populated by IPManAgent on Titus hosts (or a Metatron cert on EC2 hosts); a second (local_ipv4, local_port) → workload eBPF map disambiguates NAT64-free IPv6-to-IPv4 shared-IP sockets. Together with the 2024-09-11 noisy-neighbor run-queue-latency monitor, this cements Netflix's pattern of using eBPF + userspace-daemon + eBPF-map for per-socket and per-task workload attribution on Titus, with per-cgroup run-queue-latency and per-workload-TCP-flow-logs as the two canonical outputs. Complements Lambda's data-plane-mangling variant (2026-04-22) and the observability variants (Datadog, Cloudflare): Netflix's use case is TCP-lifecycle observation + local-identity-attribution, not packet mangling or rule-evaluation — and the backend pipeline is a heartbeat-based ownership map rather than an event stream. 40% → 0 Zuul misattribution over 2-week validation window is the load-bearing before/after evidence.
2026-05-07 — Cloudflare Copy Fail Linux vulnerability response. Canonical wiki first instance of BPF-LSM (Linux Security Module hook) as a runtime CVE-mitigation substrate. Cloudflare's bpf-lsm framework attached an eBPF program to the socket_bind LSM hook that denies AF_ALG binds for every caller except an explicit allow-list of legitimate binaries — surgical mitigation of CVE-2026-31431 (Copy Fail) without unloading the vulnerable algif_aead module (the researchers' recommended modprobe blacklist workaround had failed in staging due to dependency conflicts — patterns/staging-caught-mitigation-failure). Rollout followed patterns/visibility-before-enforcement-rollout: Phase 1 pushed ebpf_exporter config via salt to hook the socket() syscall and emit per-binary AF_ALG usage metrics across hundreds of thousands of servers; Phase 2 pushed the bpf-lsm enforcement program behind a separate gate once the allow-list was empirically validated. First-class distinct from the Datadog (observability), GitHub (deployment safety), Cloudflare Magic Transit (DDoS), AWS Lambda (data-plane mangling), and Netflix (flow logs + run-queue latency) eBPF instances already on this page: this is the BPF-LSM hook-denial shape at allowlist granularity. Canonical pairing with the LTS-kernel-backport-latency-gap concept — bpf-lsm is the runtime lever that covers the window between mainline fix and LTS backport. (Source: sources/2026-05-07-cloudflare-copy-fail-linux-vulnerability-response)
sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology — first canonical eBPF-as-fleet-profiling-substrate instance on the wiki. Meta's Strobelight orchestrates 42+ profilers (many eBPF-backed) on every Meta production host. Canonical quote: "Strobelight's profilers are often, but not exclusively, built using eBPF... it's hard to imagine how Strobelight would work without it." Distinct use-case axis from the prior wiki eBPF instances — this is eBPF for production profiling (CPU, memory via jemalloc, off-CPU, request-latency, language-event for Python/Java/Erlang, AI/GPU), completing the triad alongside eBPF-for-security (Datadog Workload Protection), eBPF-for-networking / data-plane (Cloudflare DDoS, Lambda Geneve/NAT, Fly.io Sprites). The specific eBPF features Meta relies on: custom actions at sample time (Strobemeta reads thread-local storage into each sample for request-context tagging), cheap ring-buffer-to-userspace plumbing (raw stacks written to disk off-host, then symbolized via the centralised symbolization service), and user-authored profilers via bpftrace ad-hoc scripts (patterns/ad-hoc-bpftrace-profiler) which drop new-profiler-lead-time from weeks to hours. Economic anchor: the continuous LBR profiler feeds the FDO pipeline → 10-20% fewer servers on Meta's top 200 services — the canonical wiki datum for "eBPF profiling pays for itself at hyperscale".
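The BPF-LSM allow-list entry in the list above, with its visibility-before-enforcement rollout, reduces to a small decision function. The sketch below is a user-space Python model and not Cloudflare's program: it folds both phases into one hook with an `enforce` flag (the source describes phase 1 as a separate socket() hook), and the allow-listed binary is a placeholder.

```python
from collections import Counter

AF_ALG = 38      # Linux address-family number for the kernel crypto API sockets
EPERM = 1
ALLOWED_BINARIES = {"/usr/sbin/cryptsetup"}   # hypothetical allow-list entry

usage = Counter()    # per-binary AF_ALG usage, collected in both phases

def socket_bind_hook(binary, family, enforce):
    """Return 0 to allow the bind, -EPERM to deny it."""
    if family != AF_ALG:
        return 0                       # unrelated sockets are never touched
    usage[binary] += 1                 # visibility first
    if enforce and binary not in ALLOWED_BINARIES:
        return -EPERM                  # enforcement only once the gate is open
    return 0

observed = socket_bind_hook("/opt/tool", AF_ALG, enforce=False)   # phase 1
denied = socket_bind_hook("/opt/tool", AF_ALG, enforce=True)      # phase 2
allowed = socket_bind_hook("/usr/sbin/cryptsetup", AF_ALG, enforce=True)
```

Running with `enforce=False` first is what produces the usage counts needed to validate the allow-list before anything is denied.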