PATTERN Cited by 1 source

Shared kernel resource coordination

Shared kernel resource coordination is the pattern of treating certain kernel-object namespaces — TC priorities and handles, cgroup program attach ordering, XDP program slots, LSM hook chains — as an inter-vendor protocol when multiple independent eBPF (or eBPF-like) tools coexist on the same host.

Without explicit coordination, independently correct products can collide on these namespaces and cause real outages — the 2022 Datadog × Cilium TC handle-collision incident is the named case study (Source: sources/2026-01-07-datadog-hardening-ebpf-for-runtime-security).

The shape of the collision

Two eBPF tools attach programs to the same kernel attachment surface:

  • Same program type (e.g. both use SCHED_CLS TC classifiers).
  • Same hook point (e.g. pod network interface in Kubernetes).
  • Shared identifier space (e.g. TC priority, TC handle, cgroup program array slot) — contended without explicit protocol.

Timing + implicit assumptions about who owns what then decide the outcome:

  • If one tool hard-codes priority=1 and handle=0:1, and the other loads first at the same priority, the second is silently "above" or "below" the first — which breaks assumptions on either side.
  • If any tool reacts to unexpected handle changes (e.g. as a namespace-leak signal) by deleting those resources, it will delete the other vendor's programs.
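Both failure modes above can be captured in a toy model. This is a hypothetical sketch — plain Python standing in for `tc filter replace` / delete semantics, not real tc or netlink calls — showing how "replace on matching (priority, handle)" plus naive cleanup deletes a peer vendor's program:

```python
# Toy model (hypothetical, no real tc/netlink APIs) of two tools both
# claiming TC priority 1 / handle 0:1 on the same interface.

class Interface:
    def __init__(self):
        self.filters = {}  # (prio, handle) -> owning tool

    def attach(self, tool, prio=1, handle="0:1"):
        # `tc filter replace` semantics: the same (prio, handle) slot is
        # silently overwritten, regardless of who currently owns it.
        self.filters[(prio, handle)] = tool

    def detach(self, prio=1, handle="0:1"):
        self.filters.pop((prio, handle), None)

def naive_cleanup(iface, tool, prio=1, handle="0:1"):
    # Unsafe: "my filter changed, so this slot must be a leak" —
    # deletes whoever now owns (prio, handle), including another vendor.
    if iface.filters.get((prio, handle)) != tool:
        iface.detach(prio, handle)

eth0 = Interface()
eth0.attach("datadog")          # security agent wins the race on pod bring-up
eth0.attach("cilium")           # CNI loads later, replacing datadog's filter
naive_cleanup(eth0, "datadog")  # datadog reacts to the unexpected change...
assert eth0.filters == {}       # ...and deletes cilium's filter: pods go dark
```

The model is deliberately minimal, but the final state — an interface with no CNI filter at all — is exactly the outage shape described in the case study below.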

The Datadog × Cilium case

  • Setup. Datadog's workload-protection agent uses TC classifier programs to inspect pod-network packets. Cilium uses TC programs for pod connectivity, with a hardcoded priority (1) and handle (0:1).
  • Race. On pod bring-up, Datadog's Agent sometimes loaded its TC filters before Cilium, claiming priority 1 / handle 0:1 out from under Cilium.
  • Cleanup misfire. When Cilium later loaded and replaced Datadog's filters, Datadog's cleanup logic — designed to prevent network-namespace resource leaks — saw the handle change as a cleanup signal and deleted Cilium's filters.
  • Outage. Pods lost connectivity entirely until manual restart.

Mitigations

From Datadog's post, generalisable as the pattern:

  1. Safer defaults for shared namespaces. Datadog raised its TC priority default to 10 so infrastructure (CNI) classifiers run first. Similar defaults should be picked knowing what infrastructure / platform tooling tends to claim.
  2. Conservative cleanup. Datadog hardened its cleanup logic against races; the generalisable default is to never auto-delete queuing disciplines or other shared kernel resources — the worst case of "leak until manual intervention" is still better than the worst case of "break another vendor".
  3. Vendor coordination. Document priority conventions and hardcoded handles; talk to peer vendors proactively (Cilium, Isovalent, Falco, etc.). Same shape as xDS-style coordination on a different substrate.
  4. Co-resident detection + warnings. Inventory who else is using bpf(2) on the host; alert when a new eBPF vendor appears or when another process is suspected of disabling / interfering with monitoring.
  5. Published interface. Ideally, shared kernel-resource namespaces grow a public "here's who uses what, in what order, on what hook" convention — implicit protocols should become explicit.
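Mitigations 1 and 2 can be combined into a simple ownership rule. A hedged sketch — the marker convention and function names here are hypothetical, not Datadog's actual implementation — where cleanup only ever deletes filters it can prove it created:

```python
# Sketch of "safer defaults" + "conservative cleanup" (hypothetical names,
# not a real vendor API). Filters are dicts as a stand-in for tc state.

DEFAULT_TC_PRIORITY = 10  # leave priority 1 to infrastructure (CNI) classifiers

def should_delete(filt, our_marker="datadog"):
    """Delete only filters whose name carries our ownership marker.

    Anything else, including a filter that replaced ours at the same
    (prio, handle), is left alone: leaking a filter until manual
    intervention beats breaking a peer vendor's datapath.
    """
    return filt.get("name", "").startswith(our_marker + "/")

filters = [
    {"prio": 1,  "handle": "0:1", "name": "cilium/bpf_lxc"},
    {"prio": 10, "handle": "0:1", "name": "datadog/classifier"},
]
to_delete = [f for f in filters if should_delete(f)]
assert [f["name"] for f in to_delete] == ["datadog/classifier"]
```

The design choice is asymmetry: attachment uses a non-contended default priority, while deletion requires positive proof of ownership rather than absence of the expected state.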

Applicability beyond TC

The same pattern applies on:

  • Cgroup-attached program ordering — documented kernel rules already govern ordering, overrides, and chaining. The operational lesson: read those rules before shipping cgroup programs alongside other vendors.
  • XDP program slots — one program per device without special chaining infrastructure.
  • LSM hook chains — BPF LSM programs share stacking with other LSMs.
  • BPF map pinning in /sys/fs/bpf/* — shared pin paths across vendors need a naming protocol.
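For the last item, one way to make the implicit pinning protocol explicit is a vendor-namespaced path convention. This is an illustrative sketch of an assumed convention, not an established standard:

```python
# Hypothetical pin-path convention for /sys/fs/bpf: namespace pins by
# vendor and product so co-resident tools cannot clobber each other's maps.

from pathlib import PurePosixPath

BPFFS = PurePosixPath("/sys/fs/bpf")

def pin_path(vendor: str, product: str, obj: str) -> PurePosixPath:
    # Reject empty, hidden, or path-traversing components so one vendor's
    # pin name can never escape its own subtree.
    for part in (vendor, product, obj):
        if not part or "/" in part or part.startswith("."):
            raise ValueError(f"unsafe pin component: {part!r}")
    return BPFFS / vendor / product / obj

assert str(pin_path("datadog", "cws", "exec_events")) == \
    "/sys/fs/bpf/datadog/cws/exec_events"
```

Whatever the exact layout, the point is the same as for TC priorities: a published "who pins what, where" convention turns an accidental shared namespace into an explicit protocol.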

When to reach for it

Any time your product attaches eBPF programs to hooks that are realistically shared on a customer host with third-party eBPF tooling. In modern Kubernetes that is "always" for anything touching pod networking or cgroup hooks.
