
Datadog Workload Protection — File Integrity Monitoring

Datadog Workload Protection File Integrity Monitoring (FIM) is the file-monitoring subsystem of systems/datadog-workload-protection, built on systems/ebpf. The published challenge: detect unauthorized changes to sensitive files in real time across Datadog's entire infrastructure, with enough context to attribute each change to a process and container, at a scale of more than 10 billion file-related events per minute — all without dropping events or degrading host performance.

Architecture

  • Agent, co-resident on each host, loads eBPF programs into kernel hooks covering file-related syscalls.
  • eBPF programs push events through a ring buffer to the Agent.
  • Agent runs a user-space rule engine; rule-matching events are serialized (~5 KB per event, including process and container context) and forwarded to the Datadog backend for detection and notification.
  • Agent-side rules discard noise before it ever crosses the network (concepts/edge-filtering).
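The flow above can be sketched as a small simulation. This is an illustration only — the real Agent is written in Go and the kernel side in eBPF C, and every rule name and field here is hypothetical:

```python
# Minimal simulation of the FIM pipeline: kernel hooks emit file events,
# the Agent's user-space rule engine matches them, and only matching
# events (with process/container context) ever leave the host.
# All rules, fields, and names are hypothetical illustrations.

RULES = [
    {"id": "passwd_write", "path": "/etc/passwd", "op": "open"},
    {"id": "shadow_write", "path": "/etc/shadow", "op": "open"},
]

def match_rules(event):
    """Agent-side rule engine: return ids of rules the event matches."""
    return [r["id"] for r in RULES
            if event["path"] == r["path"] and event["op"] == r["op"]]

def forward(event, rule_ids):
    """Serialize the event with process + container context for the backend."""
    return {"rules": rule_ids,
            "path": event["path"],
            "pid": event["pid"],
            "container": event["container"]}

# Events as the eBPF ring buffer would deliver them to the Agent.
events = [
    {"op": "open", "path": "/etc/passwd", "pid": 101, "container": "web-1"},
    {"op": "open", "path": "/tmp/scratch", "pid": 102, "container": "web-1"},
    {"op": "open", "path": "/etc/shadow", "pid": 103, "container": "db-0"},
]

sent = [forward(e, m) for e in events if (m := match_rules(e))]
print(len(sent))  # only the 2 rule-matching events cross the network
```

The noise event (`/tmp/scratch`) dies on the host; only rule matches consume network and backend capacity.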

The key design moves

  1. Why eBPF over alternatives. Periodic filesystem scans miss tamper-then-revert and lack change context. inotify has no process/container correlation. auditd has the context but struggles under heavy system load. eBPF gave the team real-time observability with full context and verifier-gated kernel safety.
  2. Agent-side rule evaluation (concepts/edge-filtering). Naïve forwarding would be multi-TB/s fleet-wide; evaluating rules locally drops the stream from ~10B events/min to ~1M/min before it leaves the host.
  3. In-kernel filtering (concepts/in-kernel-filtering). The ring buffer itself becomes the bottleneck at ~5K syscalls/sec on sensitive workloads. Moving as much rule evaluation as eBPF verifier limits allow into kernel space drastically reduces user-space pressure.
  4. Two-stage evaluation (patterns/two-stage-evaluation). Cheap kernel pass using approver/discarder eBPF maps, then a second deeper pass in user space with rich correlations. The kernel stage protects the user-space stage; the user-space stage protects the backend.
  5. Approvers + discarders (patterns/approver-discarder-filter).
     • Approvers — concrete values extracted at rule compile time (e.g. /etc/passwd from an open.file.path == "/etc/passwd" clause), loaded into an eBPF map; matching events are forwarded.
     • Discarders — runtime-learned values the rule engine can prove will never match any active rule (e.g. /tmp under a /etc/*-only ruleset), loaded into an LRU eBPF map so the hottest noise stays resident within bounded memory.
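Moves 4 and 5 compose: approvers and discarders give the cheap kernel stage its fast path, and the user-space stage learns new discarders on the fly. A minimal Python simulation (the real maps are eBPF maps populated from C and Go; the `/etc/*`-only ruleset and all names here are hypothetical):

```python
from collections import OrderedDict

# Compile-time approvers: concrete values extracted from the ruleset.
APPROVERS = {"/etc/passwd", "/etc/shadow"}

class LRUDiscarders:
    """Bounded LRU set standing in for the discarder eBPF LRU map."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()

    def add(self, path):
        self.entries[path] = True
        self.entries.move_to_end(path)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the coldest entry

    def __contains__(self, path):
        if path in self.entries:
            self.entries.move_to_end(path)   # hot noise stays resident
            return True
        return False

discarders = LRUDiscarders()

def kernel_stage(path):
    """Cheap in-kernel pass: approve, discard, or escalate to user space."""
    if path in APPROVERS:
        return "forward"
    if path in discarders:
        return "drop"
    return "user_space"

def user_space_stage(path):
    """Deeper pass: if no active rule can ever match, teach the kernel."""
    if not path.startswith("/etc/"):  # hypothetical /etc/*-only ruleset
        discarders.add(path)
        return "drop"
    return "evaluate"

# First /tmp event reaches user space; repeats die in the kernel stage.
first = kernel_stage("/tmp/x")
user_space_stage("/tmp/x")
second = kernel_stage("/tmp/x")
print(first, second)  # user_space drop
```

Each stage shields the next: repeated noise never leaves the kernel after the first escalation, and the LRU bound keeps discarder memory fixed no matter how varied the noise is.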

Reported outcome

  • ~94% of events pre-filtered directly in the kernel.
  • Input: >10B events/min → Output: ~1M events/min crossing the network to the backend.
  • No dropped events.
  • "Dramatically lower CPU usage" vs. forwarding everything.
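Taking the reported figures at face value, a quick consistency check shows why both filtering stages are needed — in-kernel filtering alone still leaves hundreds of millions of events per minute for user space:

```python
# Sanity-check the reported figures (reported values, not measurements).
total = 10_000_000_000        # >10B events/min entering kernel hooks
kernel_filtered = 0.94        # ~94% pre-filtered directly in the kernel
to_user_space = total * (1 - kernel_filtered)
to_backend = 1_000_000        # ~1M events/min crossing the network

print(int(to_user_space))     # ~600M events/min still reach user space
print(to_backend / total)     # overall reduction factor of ~1e-4
```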
