Hardening eBPF for runtime security (Datadog, 2026-01-07)

Datadog Workload Protection's 5-year retrospective on running eBPF at scale — "six hard-won lessons" plus a rollout-safety coda. The post generalises beyond the FIM deep-dive (2025-11-18) into the full eBPF operational picture: kernel-version portability, hook coverage, data-read correctness, eBPF map lifecycles, rule-writing pitfalls, eBPF as attack surface, multi-tenancy with other eBPF tools, performance cost, and safe change rollout.

Why eBPF (vs. alternatives) — five-year hindsight

Datadog evaluated and rejected:

  • Kernel modules — deepest visibility but "invasive"; security/infra teams (rightly) don't want unverified custom code in the kernel.
  • inotify / fanotify / kprobes / tracepoints / perf events alone — need to be combined for holistic visibility.
  • ptrace + seccomp-bpf — user-space only.
  • Linux Audit framework — context-rich but heavy; scalability struggles under load.
  • Netlink / LD_PRELOAD / binfmt_misc — each with reliability and coverage tradeoffs.

eBPF wins on a unique combination: verifier-gated safety + lower perf overhead than audit/ptrace + unified visibility across process / FS / network + consistency across namespaces, cgroups, and containers + CO-RE (Compile Once – Run Everywhere) portability + BPF LSM for mandatory access control enforcement.

"eBPF offers a unique combination of performance, flexibility, and safety that few other kernel technologies can match." — but "despite numerous claims that eBPF is safe, secure, and comes with negligible performance impact, the reality — especially at scale — is nuanced."

Lesson 1 — Getting eBPF programs to load across kernels

eBPF programs that load on a dev machine can be rejected entirely on a production kernel. Compatibility surface:

  • Program type / helper / map evolution. Features come and go. E.g. bpf_get_current_pid_tgid is available in TC / cgroup SKB programs only from kernel 6.10.
  • Hook point availability / naming. Compiler optimisation can rename symbols (isra.* suffixes); distribution patches rename, remove, or inline functions.
  • Function inlining inconsistencies. If you hook a non-exported function, it may be inlined on some builds → a silent blind spot.
  • Verifier sensitivity. Map operation restrictions (using a map value as a key was unsupported before 4.18); instruction-count limit raised from 4,096 to 1 million in 5.2; stricter bounds checks; stack-slot (8-byte) alignment issues with smaller stack variables; tail-call context-type check added in 6.11; helper availability depends on Lockdown mode; dead-code elimination since 4.15.

Consequence: verifier rejection or attach failure → lost telemetry → potential detection bypass.

Mitigations:

  • Comprehensive CI matrix of kernel versions and distros — "not supported unless actively tested in CI."
  • Centralised library — all Datadog eBPF products share systems/ebpf-manager (open-source Go library) for program lifecycle.
  • Minimum-viable hook set. ebpf-manager lets the product declare a minimal set of eBPF programs that must load+attach before Workload Protection is allowed to start; if not met, the product refuses to start with an actionable error.
  • Macros / wrappers abstracting per-kernel-version verifier quirks away from product code.
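The minimum-viable hook set gate can be sketched as follows. This is a hypothetical Python simplification of the idea, not ebpf-manager's actual Go API: `start_with_minimum_viable_set`, `ProbeAttachError`, and the probe names are all invented for illustration.

```python
# Hypothetical sketch of the "minimum-viable hook set" gate: declare which
# probes are critical; if any critical probe fails to attach, refuse to
# start with an actionable error instead of running with silent blind spots.

class ProbeAttachError(RuntimeError):
    pass

def start_with_minimum_viable_set(attach_results, critical):
    """attach_results: {probe_name: attached?}; critical: set of probe names."""
    missing = sorted(p for p in critical if not attach_results.get(p, False))
    if missing:
        raise ProbeAttachError(
            f"refusing to start: critical probes failed to attach: {missing}")
    # Non-critical failures degrade gracefully but are still reported.
    degraded = sorted(p for p, ok in attach_results.items()
                      if not ok and p not in critical)
    return {"started": True, "degraded_probes": degraded}

state = start_with_minimum_viable_set(
    {"sys_enter": True, "sys_exit": True, "io_uring_enter": False},
    critical={"sys_enter", "sys_exit"})
```

The design point is failing loudly at startup: partial coverage is reported, but missing critical coverage aborts.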

Lesson 2 — Hooking syscalls without leaving coverage gaps

Syscall-layer hooks are the bread and butter of runtime security, but the Linux syscall surface is deceptively non-uniform:

  • Compat syscalls. 32-bit binaries on x86_64 hit compat_* paths; missing these = blind spot on legacy / cross-compiled apps.
  • Tracepoint selection. Prefer raw_tracepoints/sys_enter|sys_exit — per-syscall tracepoints (e.g. tracepoints/syscalls/sys_enter_open) don't fire for 32-bit binaries on 64-bit kernels.
  • Syscall-number interpretation. Under raw tracepoints the syscall number depends on process architecture; need to check thread_info.status / .flags — wrong check silently misclassifies calls.
  • io_uring — async path to many syscalls including openat; hooks on traditional syscall entry miss all of it.
  • New syscalls. openat2 (5.6) etc. Continuous tracking of kernel development is required.
  • Exotic execution paths. binfmt_misc custom binary formats, call_usermodehelper, cgroup release agents, shebang-based interpreters — each can slip past a plain execve hook.

Mitigation: helpers in ebpf-manager dynamically identify and attach to all required hook points (not just static symbol names), plus continuous kernel-patch tracking.
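The syscall-number pitfall can be made concrete with a toy lookup. The numbers are real (openat is 257 on x86_64 but 295 on 32-bit x86; execve is 59 vs 11), but the table and helper are hypothetical simplifications: a real tool derives the architecture from thread_info in kernel context rather than taking it as a parameter.

```python
# Illustrative sketch: under raw tracepoints the same raw syscall number
# means different syscalls depending on the calling process's architecture.
SYSCALL_TABLES = {
    "x86_64": {257: "openat", 59: "execve"},
    "i386":   {295: "openat", 11: "execve"},
}

def classify_syscall(arch, nr):
    """Interpret a syscall number relative to the process architecture."""
    return SYSCALL_TABLES.get(arch, {}).get(nr, "unknown")
```

A tool that assumes the 64-bit table would misclassify every syscall from a 32-bit binary, which is exactly the silent misattribution the post warns about.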

Lesson 3 — Hooks that fire unreliably

Even correctly attached hooks can silently fail to trigger:

  • kretprobe maxactive caps concurrent return probes per-CPU; under heavy load extra activations are silently dropped.
  • Hardware-interrupt preemption. A kprobe preempted by a HW interrupt blocks all further kprobes on that CPU until it returns (a safety mechanism, but a cost worth knowing about).
  • Kernel module lifecycle. Hooks on module functions are lost on unload/reload → silent gaps.

Mitigations:

  • Minimise return-probe usage; when used, raise maxactive; use return probes only for enrichment, never for critical signal dispatch.
  • Prefer functions marked EXPORT_SYMBOL(*) over non-exported ones (less risk of inlining / bypass).
  • Actively watch module load/unload and dynamically re-attach.
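The maxactive limit can be modeled with a trivial sketch. The slot model below is a hypothetical simplification of kretprobe behavior: each in-flight return probe needs a free slot, and activations beyond the limit are silently lost.

```python
# Toy model of kretprobe maxactive: a burst of concurrent calls to a hooked
# function can only occupy maxactive return-probe slots; the rest are
# silently dropped, which is why return probes should carry enrichment only.
def simulate_kretprobes(concurrent_calls, maxactive):
    """Return (fired, missed) for a burst of simultaneous calls."""
    fired = min(concurrent_calls, maxactive)
    return fired, concurrent_calls - fired

fired, missed = simulate_kretprobes(concurrent_calls=32, maxactive=16)
```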

Lesson 4 — Reading data consistently

Even when probes trigger, reading the data correctly is hard:

  • Kernel-structure volatility. task_struct field offsets differ across kernel versions and build configurations.
  • User-space memory. eBPF programs run with page faults disabled; reading a paged-out user-space buffer fails silently.
  • TOCTOU. Reading user-space data only after the kernel has copied it avoids paging, but opens a window for an attacker to modify it between check and use.
  • Path resolution. Relative paths + symlinks require accurate per-process CWD tracking + link resolution — error-prone and race-sensitive.
  • Non-linear skb. Reading packet data without first bpf_skb_pull_data returns uninitialised memory.

Mitigations:

  • systems/co-re (Compile Once – Run Everywhere) for structure offsets where the kernel supports it, with runtime offset-guessing and hard-coded-offset fallbacks down to 4.14 (and some even older CentOS kernels).
  • Self-tests at startup reporting results to Datadog's backend → customers get visibility into coverage state.
  • Resolve paths from kernel structures only (traverse kernel file tree to mount point → absolute path), avoiding user-space reads for path fields entirely.
  • For unavoidable user-data reads, wait until after the kernel copy and read from kernel memory.
  • Every TC entry point starts with bpf_skb_pull_data; bpf_skb_load_bytes is used for reads as an extra safeguard.
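The kernel-side path resolution mitigation can be sketched as a parent-pointer walk. The `Dentry` class below is a toy stand-in for the kernel's struct dentry; real code walks dentry / mount structures in eBPF with a bounded loop to satisfy the verifier, which the max_depth cap imitates.

```python
# Toy model of resolving a path from kernel structures: walk dentry parents
# up to the root and join the names, never touching user-space memory.
class Dentry:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

def resolve_path(dentry, max_depth=64):
    parts = []
    while dentry is not None and len(parts) < max_depth:  # verifier-style bound
        if dentry.name:  # the root dentry has an empty name
            parts.append(dentry.name)
        dentry = dentry.parent
    return "/" + "/".join(reversed(parts))

root = Dentry("")
etc = Dentry("etc", root)
path = resolve_path(Dentry("passwd", etc))
```

Because only kernel structures are read, the result is immune to both paging failures and user-space TOCTOU games.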

Lesson 5 — eBPF map & cache pitfalls

Kernel and user-space caches are where subtle bypasses live:

  • Hashmap sizing. Too small → can't track all in-flight processes (e.g. keyed by PID); too big → memory cost; poor lifecycle = leaked entries that lock out new tracking.
  • LRU semantics. eBPF's LRU hashmap "doesn't strictly adhere to traditional LRU semantics" — entries can be evicted even when the map isn't full, for perf reasons. Critical if detection logic relies on persistent context.
  • Preallocation tradeoffs. BPF_F_NO_PREALLOC lets memory grow on demand → unbounded + unpredictable footprint; lethal in Kubernetes with cgroup memory limits → OOM kills = security-coverage gaps. Same risk with object-attached storage map types (BPF_MAP_TYPE_INODE_STORAGE, TASK_STORAGE).
  • Blocking syscalls. The context-in-on-entry, context-out-on-exit pattern breaks for syscalls like connect that can block indefinitely → an attacker can pin map entries, exhaust capacity, and poison or evade.
  • Lost & out-of-order events. Perf/ring buffers can drop under load; only one CPU reads per buffer. Out-of-order open before exec = wrong process attribution = missed detection.
  • Cache key/sync issues. PID is reused rapidly; inode is shared across hard links and can be reused; mount namespaces require host-level resolution.

Mitigations:

  • Instrument maps + ring buffers + user-space caches with detailed metrics for proactive sizing / leak / growth detection.
  • Prefer preallocated maps (predictable footprint over on-demand memory-pressure amplification).
  • Dynamic in-kernel filters computed from the active ruleset reduce user-space pressure (the FIM approver/discarder dual is this technique specialised to file monitoring).
  • Cache-reconciliation mechanisms — if exec is missed, detect via binary inode mismatch and resync from /proc.
  • Event reordering with a few-millisecond sliding window in user space; BPF_MAP_TYPE_RINGBUF (from 5.8) preserves order better than perf buffers where available.
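The event-reordering mitigation can be sketched with a small min-heap buffer. The window size and event shape are assumptions; the post only says "a few milliseconds".

```python
# Sketch of user-space event reordering with a sliding time window: buffer
# events briefly, release them in timestamp order once the window has passed.
import heapq

class Reorderer:
    def __init__(self, window_ns=3_000_000):  # ~3 ms window (assumed)
        self.window_ns = window_ns
        self.heap = []  # min-heap keyed on event timestamp

    def push(self, ts_ns, event):
        heapq.heappush(self.heap, (ts_ns, event))

    def drain(self, now_ns):
        """Release events older than the window, in timestamp order."""
        out = []
        while self.heap and self.heap[0][0] <= now_ns - self.window_ns:
            out.append(heapq.heappop(self.heap)[1])
        return out

r = Reorderer()
r.push(2_000_000, "exec")   # arrived first...
r.push(1_000_000, "open")   # ...but happened earlier
ordered = r.drain(now_ns=10_000_000)
```

With this buffering, the out-of-order "open before exec" case resolves to the correct open-after-exec attribution at the cost of a few milliseconds of latency.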

Lesson 6 — Writing detection rules correctly

Even with perfect telemetry, rules fail on three common mistakes:

  • Symlinks. If the tool resolves symlinks and you write the rule on the symlink path, it never fires; Datadog resolves paths by default so only hard links need per-path enumeration.
  • Interpreters. execve("/tmp/x.py") for a shebanged Python script: the kernel actually runs the interpreter with the script path as an argument — but most eBPF tools capture only the script path, not the interpreter. Datadog surfaces both process.interpreter.* and process.ancestors.interpreter.* for rule authors.
  • Syscall args ≠ shell commands. Env vars are shell-resolved ($FOO never matches a syscall arg); binary paths are $PATH-resolved (curl becomes /usr/bin/curl).

Mitigation: dedicated CI that lets detection engineers test rules before release.
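The "syscall args ≠ shell commands" pitfall can be demonstrated with a minimal resolver. Everything here is a hypothetical simplification of what a shell does before execve; the point is what the syscall layer actually sees.

```python
# Sketch of why rules must match resolved syscall arguments, not shell text:
# the shell expands env vars and $PATH-resolves argv[0] before execve, so a
# rule written against "curl" or "$TARGET" never fires.
import os.path

def resolve_like_shell(cmd, path_dirs, env, exists):
    """Mimic shell resolution: expand env vars, then $PATH-resolve argv[0]."""
    expanded = [env.get(a[1:], a) if a.startswith("$") else a for a in cmd]
    prog = expanded[0]
    if "/" not in prog:
        for d in path_dirs:
            candidate = os.path.join(d, prog)
            if exists(candidate):
                prog = candidate
                break
    return [prog] + expanded[1:]

argv = resolve_like_shell(
    ["curl", "$TARGET"],
    path_dirs=["/usr/local/bin", "/usr/bin"],
    env={"TARGET": "http://example.com"},
    exists=lambda p: p == "/usr/bin/curl")
# The syscall layer sees the fully resolved argv, not the shell command line.
```

A rule matching argv[0] == "curl" or any literal "$TARGET" would never fire against these resolved arguments.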

eBPF as attack surface

A "quietly important" lesson the post doesn't number but develops at length: eBPF's own power makes it a rootkit vector.

Datadog's 2021 hackathon built ebpfkit (BlackHat 2021, DEF CON 29) — a full eBPF rootkit demonstrating process hiding, network scanning, data exfiltration, C2, persistence. Mitigations have landed upstream (e.g. blocking bpf_probe_write_user in Lockdown integrity mode, default on most distros), but eBPF "remains highly privileged and potentially dangerous if left unrestricted."

CVE-2023-2163 and CVE-2024-41003 are cited as real-world verifier exploits.

Datadog's operational response:

  • Dedicated bpf event type captures all BPF activity — program loads, map ops, attachments.
  • Helper/map inventory per loaded program → rules flag suspicious patterns (e.g. a "network" program touching FS-related maps, or use of bpf_override_return / bpf_probe_write_user).
  • Defensive research. Datadog's BlackHat 2022 talk proposes strategies to protect eBPF-based detections from malicious tampering; several are now in the agent.
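The helper/map inventory idea can be sketched as a simple consistency check. The suspicious helper names come from the post; the data model and `flag_program` helper are hypothetical.

```python
# Sketch of the helper-inventory technique: flag loaded eBPF programs whose
# declared purpose doesn't match the helpers or maps they actually use.
SUSPICIOUS_HELPERS = {"bpf_override_return", "bpf_probe_write_user"}

def flag_program(prog_type, helpers, maps):
    findings = []
    findings += [f"suspicious helper: {h}"
                 for h in sorted(set(helpers) & SUSPICIOUS_HELPERS)]
    # A "network" (TC classifier) program has no business in FS-related maps.
    if prog_type == "sched_cls" and any(m.startswith("fs_") for m in maps):
        findings.append("network program touching FS-related maps")
    return findings

alerts = flag_program(
    "sched_cls",
    helpers=["bpf_skb_load_bytes", "bpf_probe_write_user"],
    maps=["fs_open_cache"])
```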

This is concepts/threat-modeling applied to the security tool itself — analogous in shape to S3's durability-review discipline.

Multi-tenancy with other eBPF tools

2022 incident: Datadog × systems/cilium.

Workload Protection attaches TC classifiers (BPF SCHED_CLS) to inspect and block network packets. systems/cilium, the common Kubernetes CNI, also attaches TC programs — with a hardcoded priority (1) and handle (0:1).

Race condition: Datadog's agent sometimes attached first, taking the 0:1 handle Cilium expected to own. When Cilium later loaded and replaced Datadog's filters, Datadog's network-namespace-leak-prevention cleanup logic interpreted the handle change as a signal to delete resources — and deleted Cilium's filters, breaking pod connectivity.

"Both solutions are independently correct and stable" — but uncoordinated use of a shared kernel resource caused a real outage (CiliumCon 2023 post-mortem).

Mitigations generalise as patterns/shared-kernel-resource-coordination:

  • Higher default TC priority (10) to let infra classifiers run first.
  • Conservative cleanup — hardened against races; never auto-delete queuing disciplines.
  • Vendor coordination on priority conventions and hardcoded handles; documentation + active monitoring for other eBPF tools that might disable or interfere.
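The priority convention amounts to an ordering protocol between co-resident tools, which a trivial sketch makes explicit. The priorities and handles mirror the incident above; the model itself is hypothetical (real TC filters are kernel objects, not tuples).

```python
# Sketch of TC filter priority as a de facto protocol between vendors:
# filters run in ascending priority order, so a security classifier at
# priority 10 runs after an infra CNI at priority 1 instead of colliding
# with its hardcoded priority 1 / handle 0:1.
def execution_order(filters):
    """filters: list of (priority, handle, owner); lower priority runs first."""
    return [owner for _, _, owner in sorted(filters)]

order = execution_order([
    (10, "0:2", "datadog-agent"),   # conservative default after the incident
    (1,  "0:1", "cilium"),          # Cilium's hardcoded priority and handle
])
```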

Performance cost

Two cost surfaces:

  • Visible: user-space Agent CPU / memory.
  • Hidden (often larger): per-service runtime overhead from attached programs + workload-dependent eBPF map memory.

Key drivers:

  • Hook type. uprobes cost more than kprobes (two extra context switches); raw tracepoints are far more efficient than kprobes (Cloudflare's ebpf_exporter benchmark).
  • Map type. BPF_MAP_TYPE_LRU_HASH needs cross-CPU sync (slower); BPF_MAP_TYPE_PERCPU_ARRAY is CPU-local (fast).
  • Program complexity — often the dominant cost.

At Datadog scale, specific hooks (like raw_syscalls tracepoints) had to be handled carefully to not hurt connection-accept rates on edge endpoints.

Mitigations:

  • Filter aggressively in-kernel — FIM drops "up to 95% of captured events before they reach user space" under the default ruleset (consistent with the FIM post's 94% figure, see concepts/in-kernel-filtering).
  • Internal observability — dashboards, metrics, SLOs, runbooks so SRE can rule the agent in/out as a kernel-perf-issue source in incidents.
  • Test at scale on diverse environments.
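The in-kernel filtering mitigation can be sketched as a precomputed allowlist derived from the ruleset. The rule shape, event shape, and 95/5 split below are invented for illustration; the actual drop rate depends on the active ruleset.

```python
# Sketch of approver/discarder-style in-kernel filtering: derive filters
# from the active ruleset and drop non-matching events before they cross
# into user space.
def build_approvers(rules):
    """Precompute the path prefixes any rule could match."""
    return tuple(r["path_prefix"] for r in rules)

def kernel_filter(events, watched_prefixes):
    kept = [e for e in events if e["path"].startswith(watched_prefixes)]
    dropped_pct = 100 * (len(events) - len(kept)) / len(events)
    return kept, dropped_pct

rules = [{"path_prefix": "/etc/"}]
events = [{"path": f"/tmp/scratch-{i}"} for i in range(95)]
events += [{"path": "/etc/passwd"} for _ in range(5)]
kept, dropped = kernel_filter(events, build_approvers(rules))
```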

Safe rollout

The post's coda: eBPF + rule changes can cause incidents worse than the ones they prevent. "Although the specific issue that hit CrowdStrike in 2024 might have been avoided with eBPF, there are many other ways for eBPF to trigger a similar incident."

  • Detection-engineering risks. Bad rule → active response (e.g. kill-process) on wrong traffic.
  • Engineering risks. Wrong-hook / oversized program → kernel-wide throttling.
  • Environment variability. Each customer env is unique; small changes cascade unpredictably.

Mitigations — explicitly a specialisation of patterns/staged-rollout:

  • Extensive CI matrix validates kernels + distros.
  • Dogfood first — every version lands on Datadog internal infra before customers.
  • Gradual controlled deployment for agent versions and detection content.
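The staged-rollout discipline can be sketched as a gated state machine. The stage names and percentages are hypothetical; the post only specifies CI, internal dogfooding, then gradual customer deployment.

```python
# Sketch of staged rollout gating: a version advances one stage at a time
# only while health checks pass, and rolls back on any failure.
STAGES = ["ci", "dogfood", "canary_1pct", "fleet_25pct", "fleet_100pct"]

def next_stage(current, healthy):
    if not healthy:
        return "rolled_back"
    i = STAGES.index(current)
    return STAGES[min(i + 1, len(STAGES) - 1)]
```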

Outlook

  • CO-RE, BPF links, modern TC classifier programs have already smoothed the worst 2020-era pitfalls.
  • Open question: third-party eBPF access on managed / serverless platforms. GKE Dataplane V2 and Azure × Isovalent embrace eBPF for infra; AWS Fargate still ambiguous on third-party eBPF.
  • Ecosystem-hardening direction: Microsoft's Hornet LSM proposal introduces signature verification for eBPF programs, analogous to signed kernel modules.
  • Datadog continues to rely on eBPF across Workload Protection, Cloud Network Monitoring, Universal Service Monitoring.

Architectural takeaways

  1. "Safe and low-overhead" is a property of disciplined operation, not of eBPF itself. Datadog needed a kernel-version CI matrix, a shared lifecycle library, CO-RE + fallbacks, self-tests, and dedicated detection-rule CI to deliver the reputation eBPF has.
  2. Minimum-viable-capability gate is the reusable idea inside ebpf-manager: declare a critical subset of programs that must load+attach; if not, fail loudly at startup rather than silently serve with partial coverage.
  3. Shared kernel resources need explicit coordination (patterns/shared-kernel-resource-coordination). TC priorities and handles are effectively a protocol between eBPF vendors on the same host.
  4. The security tool is part of the attack surface. Threat-model your own agent's use of eBPF; inventory what maps/helpers each loaded program uses; detect tampering / malicious co-residents.
  5. Generalisation of the FIM post. The ~94% in-kernel filter rate is one instance of a broader stance: do as much as the verifier lets you in-kernel, make the user-space stage expressive and correct, and make the handoff observable.

Caveats

  • Post is a curated list of "hit most often" issues — the cited 6-lesson structure undersells the attack-surface and multi-tenancy sections, both of which are their own critical themes.
  • Hiring CTA at the end; tier-3-equivalent source (Datadog not in AGENTS.md formal list, treated as Tier 3 per companies/datadog.md). On-topic per scope filter: kernel-level distributed-systems internals, concrete production scaling trade-offs, named production incident (Datadog × Cilium).
