
Scheduler tracepoint-based monitoring

Attach an eBPF program to a pair of Linux scheduler tracepoints that bracket a state transition; record a timestamp at the leading edge keyed by task identifier; subtract at the trailing edge to derive the queueing / dispatch / migration latency; emit the (enriched) delta to userspace via a ring buffer.

This is the canonical shape for any scheduler-queue observability metric on Linux — the primitive that turns invisible in-kernel queueing into a first-class per-task / per-cgroup metric.

Shape

┌───────────────────────┐        ┌────────────────────────┐
│ leading tracepoint    │        │ trailing tracepoint    │
│ (state transition     │        │ (state transition      │
│  that starts the      │        │  that ends the         │
│  measured interval)   │        │  interval)             │
└───────────┬───────────┘        └───────────┬────────────┘
            │ bpf_map_update_elem:           │ bpf_map_lookup_elem:
            │   key = pid, val = ktime       │   key = pid → ts
            │                                │ delta = now - ts
            │                                │ bpf_map_delete_elem
            ▼                                ▼
         ┌──────────────────────────────────────┐
         │  BPF_MAP_TYPE_HASH                   │
         │    key=u32 pid  val=u64 ts           │
         │    size bounded by MAX_TASK_ENTRIES  │
         └──────────────────────────────────────┘
                            │  enrich: cgroup_id, preempt-cause, cpu,
                            │          kind, prev_cgroup_id, ...
                            │  (optional) per-cgroup-per-CPU rate limiter
                            │    — see patterns/per-cgroup-rate-limiting-in-ebpf
                            ▼
                  BPF_MAP_TYPE_RINGBUF
                            │
                            ▼
                   userspace agent
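The record that crosses the ring buffer is just the delta plus whatever attribution was pulled on the trailing edge. A minimal sketch of such a record — field names invented here to mirror the enrichment dimensions in the diagram, not any particular implementation's layout:

```c
#include <stdint.h>

/* Illustrative ringbuf record; field names mirror the enrichment
 * dimensions above, not any specific implementation's layout. */
struct runq_event {
    uint64_t runq_lat_ns;    /* trailing-edge ts minus leading-edge ts */
    uint64_t cgroup_id;      /* per-container attribution */
    uint64_t prev_cgroup_id; /* who held the CPU before (preemptor) */
    uint32_t pid;
    uint32_t cpu;
    uint32_t kind;           /* which tracepoint pair produced the event */
};
```

Keeping all fixed-width integer fields makes the record trivially portable across the kernel/userspace boundary.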

Netflix's instance — run queue latency

Tracepoints:

  • Leading: tp_btf/sched_wakeup and tp_btf/sched_wakeup_new — fire when a task becomes runnable. Timestamp keyed by pid.
  • Trailing: tp_btf/sched_switch — fires when a CPU switches between tasks. Look up the incoming task's PID, subtract, emit.

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __type(key, u32);
    __type(value, u64);
} runq_enqueued SEC(".maps");

SEC("tp_btf/sched_wakeup")
int tp_sched_wakeup(u64 *ctx)
{
    struct task_struct *task = (void *)ctx[0];
    u32 pid = task->pid;
    u64 ts = bpf_ktime_get_ns();

    // BPF_NOEXIST: keep the earliest enqueue timestamp
    bpf_map_update_elem(&runq_enqueued, &pid, &ts, BPF_NOEXIST);
    return 0;
}

SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx)
{
    // tp_btf args: (bool preempt, struct task_struct *prev,
    //               struct task_struct *next)
    struct task_struct *next = (struct task_struct *)ctx[2];
    u32 next_pid = next->pid;
    u64 *tsp = bpf_map_lookup_elem(&runq_enqueued, &next_pid);
    if (!tsp)
        return 0;                   // missed the enqueue
    u64 runq_lat = bpf_ktime_get_ns() - *tsp;
    bpf_map_delete_elem(&runq_enqueued, &next_pid);
    // ... enrich + rate-limit + ringbuf submit ...
    return 0;
}

Output: per-container runq.latency percentile histograms — see concepts/run-queue-latency.
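To produce those percentile histograms, the consuming agent typically folds each latency delta into power-of-two buckets rather than storing raw samples. A minimal bucketing sketch (the function name is invented for this illustration):

```c
#include <stdint.h>

/* Map a latency in ns to a power-of-two bucket index: bucket k holds
 * deltas in [2^k, 2^(k+1)). Sketch only; real agents vary. */
unsigned log2_bucket(uint64_t lat_ns)
{
    unsigned k = 0;
    while (lat_ns >>= 1)
        k++;
    return k;
}
```

With ~64 buckets this covers the full u64 nanosecond range in constant memory, and percentiles are recovered by walking the cumulative bucket counts.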

Why this pattern wins over the alternatives

  • vs. perf sched / ftrace with userspace post-processing. No kernel → userspace hop per event; no trace-buffer loss at high event rates; no offline analysis lag; computation happens beside the event.
  • vs. a custom kernel module. eBPF programs are verifier-gated, so a bug fails load rather than panics the host. Critical for fleet deployment — see systems/netflix-titus.
  • vs. application-level latency histograms. Application latency can't distinguish scheduler queueing from lock contention / I/O / GC. Scheduler tracepoints measure queueing directly, at the exact layer that noisy neighbors live.
  • vs. polling /proc/<pid>/schedstat. Polling misses microsecond-scale spikes; tracepoints are event-driven at the exact transition.

Variations

  • Different tracepoint pairs measure different things.
    • sched_wakeup + sched_switch (incoming task) → run queue latency (this pattern's canonical instance).
    • sched_switch (outgoing task) + sched_wakeup (same task) → sleep duration / off-CPU time.
    • sched_migrate_task → migration events (per cgroup / per NUMA node).
    • sched_process_exec / sched_process_fork → task lifetime accounting.
  • Tracepoint type. tp_btf/ (BTF-typed) is preferred over raw tracepoints — the BPF program receives typed task_struct * pointers, not opaque context arrays, so the code reads as plain C.
  • Key choice. PID is the natural key; for migration latency the key can instead be the task_struct pointer, since the same task is tracked across CPUs. concepts/cgroup-id is an enrichment dimension, not the key.

Implementation discipline

  1. Always clean up the map entry after consumption. Call bpf_map_delete_elem on the lookup-and-subtract path; otherwise the map fills with orphaned PIDs whose tasks never reached the trailing state (process-exit races, wake-without-switch sequences, etc.).
  2. Use BPF_NOEXIST on update. If a task is somehow re-enqueued before dispatch, don't clobber the older timestamp — the first one is when queueing actually started.
  3. Budget the map size. MAX_TASK_ENTRIES bounds memory; pick it above expected concurrent runnable-task count.
  4. Prefer ring buffer over perf event array. Variable-length records, no per-CPU buffer, no copy-to-userspace syscall — but its throughput ceiling is the consumer, not the producer, so pair with in-kernel rate limiting (patterns/per-cgroup-rate-limiting-in-ebpf).
  5. Extract attribution dimensions on the trailing edge, where the task_struct is live. Pull cgroup_id, preempt-cause, cpu, etc. in the sched_switch handler and emit them in the ringbuf record, so the userspace agent doesn't need to re-look them up and race against state changes.
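Rules 1–3 can be modeled in plain userspace C: an ordinary fixed-size array stands in for the BPF hash map, enqueue() mimics bpf_map_update_elem with BPF_NOEXIST, and on_switch_in() mimics the lookup-subtract-delete path (all names here are invented for this sketch):

```c
#include <stdint.h>

#define MAX_TASK_ENTRIES 8          /* toy capacity for the sketch */

struct entry { uint32_t pid; uint64_t ts; int used; };
static struct entry runq_enqueued[MAX_TASK_ENTRIES];

/* Model of bpf_map_update_elem(..., BPF_NOEXIST): keep the first ts. */
int enqueue(uint32_t pid, uint64_t ts)
{
    for (int i = 0; i < MAX_TASK_ENTRIES; i++)
        if (runq_enqueued[i].used && runq_enqueued[i].pid == pid)
            return -1;              /* already queued: don't clobber */
    for (int i = 0; i < MAX_TASK_ENTRIES; i++)
        if (!runq_enqueued[i].used) {
            runq_enqueued[i] = (struct entry){ pid, ts, 1 };
            return 0;
        }
    return -1;                      /* map full: drop, memory stays bounded */
}

/* Model of lookup + delta + delete on the trailing edge. */
int64_t on_switch_in(uint32_t pid, uint64_t now)
{
    for (int i = 0; i < MAX_TASK_ENTRIES; i++)
        if (runq_enqueued[i].used && runq_enqueued[i].pid == pid) {
            int64_t delta = (int64_t)(now - runq_enqueued[i].ts);
            runq_enqueued[i].used = 0;   /* always clean up */
            return delta;
        }
    return -1;                      /* missed the enqueue */
}
```

Re-enqueueing a PID before dispatch leaves the first timestamp intact, and the entry is freed on consumption, so the map stays bounded — exactly the invariants rules 1–3 demand of the in-kernel version.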
