
Scheduler tracepoint-based monitoring

Attach an eBPF program to a pair of Linux scheduler tracepoints that bracket a state transition; record a timestamp at the leading edge keyed by task identifier; subtract at the trailing edge to derive the queueing / dispatch / migration latency; emit the (enriched) delta to userspace via a ring buffer.

This is the canonical shape for any scheduler-queue observability metric on Linux — the primitive that turns invisible in-kernel queueing into a first-class per-task / per-cgroup metric.

Shape

┌───────────────────────┐        ┌────────────────────────┐
│ leading tracepoint    │        │ trailing tracepoint    │
│ (state transition     │        │ (state transition      │
│  that starts the      │        │  that ends the         │
│  measured interval)   │        │  interval)             │
└───────────┬───────────┘        └───────────┬────────────┘
            │ bpf_map_update_elem:           │ bpf_map_lookup_elem:
            │   key = pid, val = ktime       │   key = pid → ts
            │                                │ delta = now - ts
            │                                │ bpf_map_delete_elem
            ▼                                ▼
         ┌──────────────────────────────────────┐
         │  BPF_MAP_TYPE_HASH                   │
         │    key=u32 pid  val=u64 ts           │
         │    size bounded by MAX_TASK_ENTRIES  │
         └──────────────────────────────────────┘
                            │  enrich: cgroup_id, preempt-cause, cpu,
                            │          kind, prev_cgroup_id, ...
                            │  (optional) per-cgroup-per-CPU rate limiter
                            │    — see patterns/per-cgroup-rate-limiting-in-ebpf
                            ▼
                  BPF_MAP_TYPE_RINGBUF
                            │
                            ▼
                   userspace agent
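The record that crosses the ring buffer is just the delta plus whatever attribution was pulled on the trailing edge. A minimal sketch of such a record — field names invented here to mirror the enrichment dimensions in the diagram, not any particular implementation's layout:

```c
#include <stdint.h>

/* Illustrative ringbuf record; field names mirror the enrichment
 * dimensions above, not any specific implementation's layout. */
struct runq_event {
    uint64_t runq_lat_ns;    /* trailing-edge ts minus leading-edge ts */
    uint64_t cgroup_id;      /* per-container attribution */
    uint64_t prev_cgroup_id; /* who held the CPU before (preemptor) */
    uint32_t pid;
    uint32_t cpu;
    uint32_t kind;           /* which tracepoint pair produced the event */
};
```

Keeping all fixed-width integer fields makes the record trivially portable across the kernel/userspace boundary.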

Netflix's instance — run queue latency

Tracepoints:

  • Leading: tp_btf/sched_wakeup and tp_btf/sched_wakeup_new — fire when a task becomes runnable. Timestamp keyed by pid.
  • Trailing: tp_btf/sched_switch — fires when a CPU switches between tasks. Look up the incoming task's PID, subtract, emit.

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __type(key, u32);
    __type(value, u64);
} runq_enqueued SEC(".maps");

SEC("tp_btf/sched_wakeup")
int tp_sched_wakeup(u64 *ctx)
{
    struct task_struct *task = (void *)ctx[0];
    u32 pid = task->pid;
    u64 ts = bpf_ktime_get_ns();

    // BPF_NOEXIST: keep the earliest enqueue timestamp
    bpf_map_update_elem(&runq_enqueued, &pid, &ts, BPF_NOEXIST);
    return 0;
}

SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx)
{
    // tp_btf args: (bool preempt, struct task_struct *prev,
    //               struct task_struct *next)
    struct task_struct *next = (struct task_struct *)ctx[2];
    u32 next_pid = next->pid;
    u64 *tsp = bpf_map_lookup_elem(&runq_enqueued, &next_pid);
    if (!tsp)
        return 0;                   // missed the enqueue
    u64 runq_lat = bpf_ktime_get_ns() - *tsp;
    bpf_map_delete_elem(&runq_enqueued, &next_pid);
    // ... enrich + rate-limit + ringbuf submit ...
    return 0;
}

Output: per-container runq.latency percentile histograms — see concepts/run-queue-latency.
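To produce those percentile histograms, the consuming agent typically folds each latency delta into power-of-two buckets rather than storing raw samples. A minimal bucketing sketch (the function name is invented for this illustration):

```c
#include <stdint.h>

/* Map a latency in ns to a power-of-two bucket index: bucket k holds
 * deltas in [2^k, 2^(k+1)). Sketch only; real agents vary. */
unsigned log2_bucket(uint64_t lat_ns)
{
    unsigned k = 0;
    while (lat_ns >>= 1)
        k++;
    return k;
}
```

With ~64 buckets this covers the full u64 nanosecond range in constant memory, and percentiles are recovered by walking the cumulative bucket counts.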

Why this pattern wins over the alternatives

  • vs. perf sched / ftrace with userspace post-processing. No kernel → userspace hop per event; no trace-buffer loss at high event rates; no offline analysis lag; computation happens beside the event.
  • vs. a custom kernel module. eBPF programs are verifier-gated, so a bug fails load rather than panics the host. Critical for fleet deployment — see systems/netflix-titus.
  • vs. application-level latency histograms. Application latency can't distinguish scheduler queueing from lock contention / I/O / GC. Scheduler tracepoints measure queueing directly, at the exact layer that noisy neighbors live.
  • vs. polling /proc/<pid>/schedstat. Polling misses microsecond-scale spikes; tracepoints are event-driven at the exact transition.

Variations

  • Different tracepoint pairs measure different things.
    • sched_wakeup + sched_switch (incoming task) → run queue latency (this pattern's canonical instance).
    • sched_switch (outgoing task) + sched_wakeup (same task) → sleep duration / off-CPU time.
    • sched_migrate_task → migration events (per cgroup / per NUMA node).
    • sched_process_exec / sched_process_fork → task lifetime accounting.
  • Tracepoint type. tp_btf/ (BTF-typed) is preferred over raw tracepoints — the BPF program receives typed task_struct * pointers, not opaque context arrays, so the code reads as plain C.
  • Key choice. PID is the natural key; for migration latency the key can instead be the task_struct pointer, since the same task is tracked across CPUs. concepts/cgroup-id is an enrichment dimension, not the key.

Implementation discipline

  1. Always clean up the map entry after consumption. Call bpf_map_delete_elem on the lookup-and-subtract path; otherwise the map fills with orphaned PIDs whose tasks never reached the trailing state (process-exit races, wake-without-switch sequences, etc.).
  2. Use BPF_NOEXIST on update. If a task is somehow re-enqueued before dispatch, don't clobber the older timestamp — the first one is when queueing actually started.
  3. Budget the map size. MAX_TASK_ENTRIES bounds memory; pick it above expected concurrent runnable-task count.
  4. Prefer ring buffer over perf event array. Variable-length records, no per-CPU buffer, no copy-to-userspace syscall — but its throughput ceiling is the consumer, not the producer, so pair with in-kernel rate limiting (patterns/per-cgroup-rate-limiting-in-ebpf).
  5. Extract attribution dimensions on the trailing edge, where the task_struct is live. Pull cgroup_id, preempt-cause, cpu, etc. in the sched_switch handler and emit them in the ringbuf record, so the userspace agent doesn't need to re-look them up and race against state changes.
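Rules 1–3 can be modeled in plain userspace C: an ordinary fixed-size array stands in for the BPF hash map, enqueue() mimics bpf_map_update_elem with BPF_NOEXIST, and on_switch_in() mimics the lookup-subtract-delete path (all names here are invented for this sketch):

```c
#include <stdint.h>

#define MAX_TASK_ENTRIES 8          /* toy capacity for the sketch */

struct entry { uint32_t pid; uint64_t ts; int used; };
static struct entry runq_enqueued[MAX_TASK_ENTRIES];

/* Model of bpf_map_update_elem(..., BPF_NOEXIST): keep the first ts. */
int enqueue(uint32_t pid, uint64_t ts)
{
    for (int i = 0; i < MAX_TASK_ENTRIES; i++)
        if (runq_enqueued[i].used && runq_enqueued[i].pid == pid)
            return -1;              /* already queued: don't clobber */
    for (int i = 0; i < MAX_TASK_ENTRIES; i++)
        if (!runq_enqueued[i].used) {
            runq_enqueued[i] = (struct entry){ pid, ts, 1 };
            return 0;
        }
    return -1;                      /* map full: drop, memory stays bounded */
}

/* Model of lookup + delta + delete on the trailing edge. */
int64_t on_switch_in(uint32_t pid, uint64_t now)
{
    for (int i = 0; i < MAX_TASK_ENTRIES; i++)
        if (runq_enqueued[i].used && runq_enqueued[i].pid == pid) {
            int64_t delta = (int64_t)(now - runq_enqueued[i].ts);
            runq_enqueued[i].used = 0;   /* always clean up */
            return delta;
        }
    return -1;                      /* missed the enqueue */
}
```

Re-enqueueing a PID before dispatch leaves the first timestamp intact, and the entry is freed on consumption, so the map stays bounded — exactly the invariants rules 1–3 demand of the in-kernel version.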
