PATTERN
# Scheduler tracepoint-based monitoring
Attach an eBPF program to a pair of Linux scheduler tracepoints that bracket a state transition; record a timestamp at the leading edge keyed by task identifier; subtract at the trailing edge to derive the queueing / dispatch / migration latency; emit the (enriched) delta to userspace via a ring buffer.
This is the canonical shape for any scheduler-queue observability metric on Linux — the primitive that turns invisible in-kernel queueing into a first-class per-task / per-cgroup metric.
## Shape

```text
┌───────────────────────┐        ┌────────────────────────┐
│  leading tracepoint   │        │  trailing tracepoint   │
│  (state transition    │        │  (state transition     │
│   that starts the     │        │   that ends the        │
│   measured interval)  │        │   interval)            │
└───────────┬───────────┘        └───────────┬────────────┘
            │ bpf_map_update_elem:           │ bpf_map_lookup_elem:
            │ key = pid, val = ktime         │ key = pid → ts
            │                                │ delta = now - ts
            │                                │ bpf_map_delete_elem
            ▼                                ▼
        ┌──────────────────────────────────────┐
        │          BPF_MAP_TYPE_HASH           │
        │    key = u32 pid    val = u64 ts     │
        │   size bounded by MAX_TASK_ENTRIES   │
        └──────────────────────────────────────┘
                           │
                           │ enrich: cgroup_id, preempt-cause, cpu,
                           │         kind, prev_cgroup_id, ...
                           ▼
         (optional) per-cgroup-per-CPU rate limiter
                           │
                           ▼
                 BPF_MAP_TYPE_RINGBUF
                           │
                           ▼
                  userspace agent
```

Rate limiter: [patterns/per-cgroup-rate-limiting-in-ebpf](<./per-cgroup-rate-limiting-in-ebpf.md>)
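The shape above assumes two map declarations on the kernel side. A minimal libbpf-style sketch — the names `runq_enqueued` and `events`, and the sizes, are illustrative, not taken from Netflix's code:

```c
// Declaration fragment in the standard libbpf SEC(".maps") convention;
// assumes the usual vmlinux.h + bpf_helpers.h includes.
#define MAX_TASK_ENTRIES 8192          /* assumption: above peak runnable-task count */

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __type(key, u32);                  /* pid */
    __type(value, u64);                /* enqueue timestamp, ns */
} runq_enqueued SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);   /* illustrative ring size, in bytes */
} events SEC(".maps");
```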
## Netflix's instance — run queue latency

Tracepoints:

- Leading: `tp_btf/sched_wakeup` and `tp_btf/sched_wakeup_new` — fire when a task becomes runnable. Timestamp keyed by `pid`.
- Trailing: `tp_btf/sched_switch` — fires when a CPU switches between tasks. Look up the incoming task's PID, subtract, emit.
```c
SEC("tp_btf/sched_wakeup")
int tp_sched_wakeup(u64 *ctx)
{
    struct task_struct *task = (void *)ctx[0];
    u32 pid = task->pid;
    u64 ts = bpf_ktime_get_ns();

    bpf_map_update_elem(&runq_enqueued, &pid, &ts, BPF_NOEXIST);
    return 0;
}

SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx)
{
    struct task_struct *next = (struct task_struct *)ctx[2];
    u32 next_pid = next->pid;

    u64 *tsp = bpf_map_lookup_elem(&runq_enqueued, &next_pid);
    if (!tsp)
        return 0; // missed the enqueue

    u64 runq_lat = bpf_ktime_get_ns() - *tsp;
    bpf_map_delete_elem(&runq_enqueued, &next_pid);
    // ... enrich + rate-limit + ringbuf submit ...
    return 0;
}
```
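The elided "enrich + rate-limit + ringbuf submit" step might look like this sketch — the `struct runq_event` layout and the `events` ringbuf name are assumptions, and the cgroup-id read relies on BTF-typed pointer access being available in `tp_btf` programs:

```c
struct runq_event {
    u64 runq_lat_ns;
    u64 cgroup_id;
    u32 pid;
    u32 cpu;
};

/* inside tp_sched_switch, after computing runq_lat: */
struct runq_event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
if (e) {
    e->runq_lat_ns = runq_lat;
    e->cgroup_id   = next->cgroups->dfl_cgrp->kn->id;  /* BTF-typed walk */
    e->pid         = next_pid;
    e->cpu         = bpf_get_smp_processor_id();
    bpf_ringbuf_submit(e, 0);
}
```

Reserving before filling means a full ring buffer costs one failed reservation rather than a partial record; the per-cgroup rate limiter would gate this block before the reserve.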
Output: per-container `runq.latency` percentile histograms — see concepts/run-queue-latency.
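On the other side of the ring buffer, a minimal consumer loop in C might look like the sketch below (Netflix's actual agent is in Go; the map/fd wiring here is assumed, and `handle_event` is a hypothetical callback):

```c
#include <bpf/libbpf.h>

/* Called once per ringbuf record; data points at the event the
 * kernel side submitted. Feed the latency into a histogram here. */
static int handle_event(void *ctx, void *data, size_t len)
{
    /* const struct runq_event *e = data; ... */
    return 0;
}

/* rb_fd is the fd of the BPF_MAP_TYPE_RINGBUF map, e.g. from
 * bpf_map__fd() after loading the skeleton. */
static void consume(int rb_fd)
{
    struct ring_buffer *rb = ring_buffer__new(rb_fd, handle_event, NULL, NULL);
    if (!rb)
        return;
    while (ring_buffer__poll(rb, 100 /* timeout, ms */) >= 0)
        ;   /* drain until error or exit */
    ring_buffer__free(rb);
}
```

Note the throughput consequence: this poll loop is the ring buffer's ceiling, which is why the in-kernel rate limiter sits upstream of it.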
## Why this pattern wins over the alternatives

- vs. `perf sched` / `ftrace` with userspace post-processing. No kernel → userspace hop per event; no trace-buffer loss at high event rates; no offline analysis lag; computation happens beside the event.
- vs. a custom kernel module. eBPF programs are verifier-gated, so a bug fails to load rather than panicking the host. Critical for fleet deployment — see systems/netflix-titus.
- vs. application-level latency histograms. Application latency can't distinguish scheduler queueing from lock contention / I/O / GC. Scheduler tracepoints measure queueing directly, at the exact layer where noisy neighbors live.
- vs. polling `/proc/<pid>/schedstat`. Polling misses microsecond-scale spikes; tracepoints are event-driven at the exact transition.
## Variations

- Different tracepoint pairs measure different things:
    - `sched_wakeup` + `sched_switch` (incoming task) → run queue latency (this pattern's canonical instance).
    - `sched_switch` (outgoing task) + `sched_wakeup` (same task) → sleep duration / off-CPU time.
    - `sched_migrate_task` → migration events (per cgroup / per NUMA node).
    - `sched_process_exec` / `sched_process_fork` → task lifetime accounting.
- Tracepoint type. `tp_btf/` (BTF-typed) is preferred over raw tracepoints — the BPF program receives typed `task_struct *` pointers, not opaque context arrays, so the code reads as plain C.
- Key choice. PID is the natural key; for migration latency the key becomes `(task_struct *)` (same task across CPUs). concepts/cgroup-id is an enrichment dimension, not the key.
## Implementation discipline

- Always clean up the map entry after consumption. `bpf_map_delete_elem` on the lookup-and-subtract path, or the map fills up with orphaned PIDs that never transitioned to the trailing state (races with process exit, unlikely wake-without-switch sequences, etc.).
- Use `BPF_NOEXIST` on update. If a task is somehow re-enqueued before dispatch, don't clobber the older timestamp — the first one is when queueing actually started.
- Budget the map size. `MAX_TASK_ENTRIES` bounds memory; pick it above the expected concurrent runnable-task count.
- Prefer the ring buffer over the perf event array. Variable-length records, no per-CPU buffers, no copy-to-userspace syscall — but its throughput ceiling is the consumer, not the producer, so pair it with in-kernel rate limiting (patterns/per-cgroup-rate-limiting-in-ebpf).
- Extract attribution dimensions on the trailing edge, where the `task_struct` is live. Pull `cgroup_id`, preempt-cause, `cpu`, etc. in the `sched_switch` handler and emit them in the ringbuf record, so the userspace agent doesn't need to re-look them up and race against state changes.
## Seen in

- sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf — Netflix's `runq.latency` monitor — canonical instance. `sched_wakeup`/`sched_switch` tracepoints, PID-keyed hash map, per-cgroup rate limiter, ring buffer to a Go agent that emits Atlas percentile timers.