PATTERN

Per-cgroup rate limiting in eBPF

Rate-limit event emission from an eBPF program per attribution dimension (e.g. cgroup ID) per CPU, inside the kernel, by maintaining a (cgroup_id, cpu) → last_ts map and comparing the current timestamp against the stored one before doing any further work — critically, before allocating space in the output ring buffer.

The pattern makes high-frequency in-kernel observability sustainable by dropping redundant events at the source, keyed by the dimension that will become the output metric's tag.

Motivation

Linux scheduler tracepoints (sched_wakeup, sched_switch) fire millions of times per second on a busy multi-core host. An eBPF program that forwards every event to userspace via a ring buffer will:

  • Burn kernel ring-buffer bandwidth on events whose information is duplicated by neighbouring events microseconds away.
  • Saturate the userspace consumer's CPU with redundant samples.
  • Deliver no additional signal beyond what a sampled stream already gives.

"The sheer number of data points was causing the userspace program to use too much CPU, so we implemented a rate limiter in eBPF to sample the data." (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf)
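To make the cost concrete, a back-of-envelope sketch; the 2M events/s rate and 48-byte record size are illustrative assumptions, not figures from the Netflix post:

```c
#include <assert.h>
#include <stdint.h>

/* Raw ring-buffer bandwidth if every tracepoint hit is forwarded:
 * events per second times bytes per serialized record.
 * (Both inputs are illustrative, not measured values.) */
static uint64_t ringbuf_bytes_per_sec(uint64_t events_per_sec,
                                      uint64_t event_bytes)
{
    return events_per_sec * event_bytes;
}
```

At 2,000,000 events/s and 48-byte records that is ~96 MB/s of ring-buffer traffic before the userspace consumer does any work; a 10 ms per-(cgroup, cpu) rate limit caps each key at 100 events/s instead.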

Shape

          Leading tracepoint (sched_switch, after computing runq_lat)
   ┌─────────────────────────────────────────────────────┐
   │  last_ts = map_lookup(cgroup_id_to_last_event_ts,   │
   │                       &cgroup_id);                  │
   │  if (now - last_ts < RATE_LIMIT_NS)                 │
   │      return 0;   // drop: cheap, no allocation      │
   └──────────────────────────┬──────────────────────────┘
                              bpf_ringbuf_reserve(...)
                              fill event fields
                              bpf_ringbuf_submit(...)
                              map_update(last_ts, cgroup_id, now)
                                    userspace ring buffer

Why a PERCPU_HASH, not a plain HASH

Netflix keys the limiter last_ts map as BPF_MAP_TYPE_PERCPU_HASH — each CPU has its own view of the cgroup_id → last_ts state. This is deliberate:

  • No concurrent writes across CPUs — eliminates the BPF atomic / bpf_spin_lock cost on every hot event.
  • Per-CPU rate bucketing matches the metric's shape. The output runq.latency percentile is per-container across the whole host; preserving one sample per cgroup per CPU still gives the percentile estimator adequate points.
  • Sampling independence across cores. A hot cgroup running on many cores will still emit at least one sample per core per window, which matches where the information diversity is.

A single shared HASH would either require locking (contention on hot cgroups, exactly the cgroups we most want to measure) or race (concurrent events on different CPUs both pass the check, silently exceeding the rate limit).
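A minimal userspace model of the limiter (plain C arrays standing in for the PERCPU_HASH; `should_emit` and the constants are illustrative, not from the Netflix code) shows both properties: no cross-CPU write contention, and one sample per cgroup per core per window:

```c
#include <assert.h>
#include <stdint.h>

#define RATE_LIMIT_NS 10000000ULL   /* 10 ms window, illustrative */
#define NCPU    4
#define NCGROUP 8

/* Per-CPU last-emit timestamps: each CPU only ever reads and writes
 * its own row, so no atomic or bpf_spin_lock is needed. */
static uint64_t last_ts[NCPU][NCGROUP];

/* Returns 1 if an event for (cpu, cgroup) at time `now` should be
 * emitted, 0 if it falls inside the current rate-limit window.
 * (The kernel version updates the map only after a successful
 * ringbuf submit; folded together here for brevity.) */
static int should_emit(int cpu, int cgroup, uint64_t now)
{
    if (now - last_ts[cpu][cgroup] < RATE_LIMIT_NS)
        return 0;                   /* drop */
    last_ts[cpu][cgroup] = now;     /* emit, open a new window */
    return 1;
}
```

Two CPUs running the same hot cgroup each keep their own window, so the cgroup still contributes one sample per core per window, which is where the information diversity lives.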

Why the check happens before ringbuf_reserve

Placing the check before the ring-buffer reserve is load-bearing:

  1. No ring-buffer bytes are consumed for dropped events.
  2. Reserve + submit + discard still imposes ring-buffer pressure; a pre-check is free.
  3. The eBPF program short-circuits cheaply. A map lookup + comparison is ~nanoseconds; an unnecessary reserve path is much more expensive.
  4. It gates all subsequent work, including any field enrichment (cgroup-id resolution for the previous task in sched_switch, timestamp derivation, etc.).

Reference code

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __type(key, u64);    /* cgroup id */
    __type(value, u64);  /* last emitted-event timestamp, ns */
} cgroup_id_to_last_event_ts SEC(".maps");

// Inside the sched_switch handler, after computing runq_lat
// (now was taken with bpf_ktime_get_ns() during that computation):
u64 *last_ts = bpf_map_lookup_elem(&cgroup_id_to_last_event_ts,
                                   &cgroup_id);
u64 last_ts_val = last_ts ? *last_ts : 0;
if (now - last_ts_val < RATE_LIMIT_NS)
    return 0;  // drop

struct runq_event *event =
    bpf_ringbuf_reserve(&events, sizeof(*event), 0);
if (event) {
    event->prev_cgroup_id = prev_cgroup_id;
    event->cgroup_id      = cgroup_id;
    event->runq_lat       = runq_lat;
    event->ts             = now;
    bpf_ringbuf_submit(event, 0);
    bpf_map_update_elem(&cgroup_id_to_last_event_ts, &cgroup_id,
                        &now, BPF_ANY);
}

Tuning knobs

  • RATE_LIMIT_NS. The sampling window. Lower = more fidelity + more CPU; higher = lower load + coarser samples. Netflix doesn't disclose their value. A sane starting point is "percentile export interval ÷ target samples per percentile."
  • MAX_TASK_ENTRIES. Bounds memory. Pick well above the expected concurrent cgroup count: a plain PERCPU_HASH has no eviction, so once the map is full, updates fail and new cgroups silently stop being tracked. Use BPF_MAP_TYPE_LRU_PERCPU_HASH instead if the cgroup population churns.
  • Dimension choice. For other observability use cases, the key could be (pid), (cpu), (sock_cookie), etc. Match the key to the aggregation dimension of the output metric: you're keeping one sample per output-bucket per window.
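The suggested starting point for RATE_LIMIT_NS can be written out; the 60 s export interval and 100-sample target below are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* If percentiles are exported every `export_interval_s` seconds and
 * ~`target_samples` points per (cgroup, cpu) per interval keep the
 * estimator stable, the window is simply their ratio. */
static uint64_t rate_limit_ns(uint64_t export_interval_s,
                              uint64_t target_samples)
{
    return export_interval_s * 1000000000ULL / target_samples;
}
```

A 60 s export interval with a 100-sample target gives a 600 ms window; tighten it if burst tails matter (see the caveats below).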

Tradeoffs & caveats

  • Burst under-counting. A cgroup experiencing a very bursty noisy-neighbor episode (e.g. 10 ms of heavy preemption) yields at most 10 ms / RATE_LIMIT_NS samples per CPU; if the window is too coarse, the percentile estimator may miss the burst's tail.
  • Cold-cgroup first-event latency. First event for a cgroup has last_ts = 0, so it is always emitted; this is correct — the first event is precisely where you want signal.
  • Eviction races. A cgroup evicted from the map between events re-enters as a "cold" cgroup, which is usually fine.
  • Not the same as sampling randomly. Time-window rate limit biases samples toward the start of each window; if that matters for your statistic, use a per-event coin-flip instead.
  • Userspace version of the same idea: concepts/rate-limited-cache is the analogous shape at the application layer.
  • Without the per-dimension key, you have simple uniform sampling — strictly worse for attribution-dimensioned output metrics because hot cgroups drown out cold ones.
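The per-event coin flip mentioned above maps onto eBPF's `bpf_get_prandom_u32()` helper; here is a sketch of the predicate in plain C, with the random draw passed in as a parameter so the logic is testable (`keep_sample` and the probability value are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* In the eBPF program this becomes:
 *     if (bpf_get_prandom_u32() >= threshold)
 *         return 0;   // drop
 * Keeping a sample with probability p means accepting uniform
 * 32-bit draws below p * UINT32_MAX. */
static int keep_sample(uint32_t rnd, double p)
{
    uint32_t threshold = (uint32_t)(p * (double)UINT32_MAX);
    return rnd < threshold;
}
```

Unlike the time-window limiter, this draws uniformly across the window (no start-of-window bias), but it no longer guarantees at least one sample per cgroup per window.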

Seen in

  • sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf
