PATTERN
Per-cgroup rate limiting in eBPF¶
Rate-limit event emission from an eBPF program per attribution
dimension (e.g. cgroup ID) per CPU, inside the kernel, by
maintaining a (cgroup_id, cpu) → last_ts map and comparing the
current timestamp against the stored one before doing any further
work — critically, before allocating space in the output ring buffer.
The pattern makes high-frequency in-kernel observability sustainable by dropping redundant events at the source, keyed by the dimension that will become the output metric's tag.
Motivation¶
Linux scheduler tracepoints (sched_wakeup, sched_switch) fire
millions of times per second on a busy multi-core host. An eBPF
program that forwards every event to userspace via a ring buffer
will:
- Burn kernel ring-buffer bandwidth for events whose information is duplicated by neighbours microseconds away.
- Saturate the userspace consumer's CPU with redundant samples.
- Deliver no additional signal beyond what a sampled stream already gives.
"The sheer number of data points was causing the userspace program to use too much CPU, so we implemented a rate limiter in eBPF to sample the data." (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf)
Shape¶
Tracepoint handler (sched_switch, after the run-queue latency is computed)
│
▼
┌──────────────────────────────────────────────────────────┐
│ last_ts = map_lookup(cgroup_id_to_last_event_ts, │
│ &cgroup_id); │
│ if (now - last_ts < RATE_LIMIT_NS) │
│ return 0; // drop — cheap, no allocation │
└──────────────────────────────────────┬───────────────────┘
│
▼
bpf_ringbuf_reserve(...)
fill event fields
bpf_ringbuf_submit(...)
map_update(last_ts, cgroup_id, now)
│
▼
user-space ring buffer
Why a PERCPU_HASH, not a plain HASH¶
Netflix keys the limiter last_ts map as
BPF_MAP_TYPE_PERCPU_HASH — each CPU has its own view of the
cgroup_id → last_ts state. This is deliberate:
- No concurrent writes across CPUs — eliminates the BPF atomic / bpf_spin_lock cost on every hot event.
- Per-CPU rate bucketing matches the metric's shape. The output runq.latency percentile is per-container across the whole host; preserving one sample per cgroup per CPU still gives the percentile estimator adequate points.
- Sampling independence across cores. A hot cgroup running on many cores will still emit at least one sample per core per window, which matches where the information diversity is.
A single shared HASH would either require locking (contention on
hot cgroups, the exact cgroups we most want to measure) or race
(silently missing the rate limit on concurrent events).
Why the check happens before ringbuf_reserve¶
Placing the check before the ring-buffer reserve is load-bearing:
- No ring-buffer bytes are consumed for dropped events. Reserving and then discarding still imposes ring-buffer pressure; a pre-check is free.
- The eBPF program short-circuits cheaply. A map lookup plus a comparison costs nanoseconds; the reserve path is much more expensive.
- It gates all subsequent work, including any field
enrichment (cgroup-id resolution for the previous task in
sched_switch, timestamp derivation, etc.).
Reference code¶
```c
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __uint(key_size, sizeof(u64));
    __uint(value_size, sizeof(u64));
} cgroup_id_to_last_event_ts SEC(".maps");

// Inside the sched_switch handler, after computing runq_lat:
u64 *last_ts = bpf_map_lookup_elem(&cgroup_id_to_last_event_ts, &cgroup_id);
u64 last_ts_val = last_ts ? *last_ts : 0;

if (now - last_ts_val < RATE_LIMIT_NS)
    return 0; // drop — cheap, no ring-buffer allocation

struct runq_event *event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
if (event) {
    event->prev_cgroup_id = prev_cgroup_id;
    event->cgroup_id = cgroup_id;
    event->runq_lat = runq_lat;
    event->ts = now;
    bpf_ringbuf_submit(event, 0);
    bpf_map_update_elem(&cgroup_id_to_last_event_ts, &cgroup_id, &now, BPF_ANY);
}
```
Tuning knobs¶
- RATE_LIMIT_NS. The sampling window. Lower = more fidelity + more CPU; higher = lower load + coarser samples. Netflix doesn't disclose their value. A sane starting point is "percentile-timer export interval ÷ target samples per percentile".
- MAX_TASK_ENTRIES. Bounds memory. Pick well above the expected concurrent cgroup count; note that a plain (PERCPU_)HASH map does not evict old entries, it fails updates once full, so use BPF_MAP_TYPE_LRU_PERCPU_HASH if cgroup churn is expected.
- Dimension choice. For other observability use cases, the key could be (pid), (cpu), (sock_cookie), etc. Match the key to the aggregation dimension of the output metric: you're keeping one sample per output-bucket per window.
Tradeoffs & caveats¶
- Burst under-counting. A cgroup experiencing a very bursty noisy-neighbor event (e.g. 10 ms of high preemption) will yield at most 10ms / RATE_LIMIT_NS samples per CPU; if the window is too coarse, the percentile estimator may miss the burst's tail.
- Cold-cgroup first-event latency. The first event for a cgroup has last_ts = 0, so it is always emitted; this is correct, since the first event is precisely where you want signal.
- Eviction races. A cgroup evicted from the map between events re-enters as a "cold" cgroup and emits immediately, which is usually fine.
- Not the same as sampling randomly. A time-window rate limit biases samples toward the start of each window; if that matters for your statistic, use a per-event coin flip instead.
Related primitives on the wiki¶
- Userspace version of the same idea: concepts/rate-limited-cache is the analogous shape at the application layer.
- Without the per-dimension key, you have a single global rate limit: whichever cgroup fires first each window wins the slot, so hot cgroups drown out cold ones — strictly worse for attribution-dimensioned output metrics.
Seen in¶
- sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf — canonical instance. Per-cgroup-per-CPU hash keyed by cgroup ID, window RATE_LIMIT_NS, check placed before bpf_ringbuf_reserve. Motivated by userspace CPU saturation on the raw event stream.