
CONCEPT Cited by 1 source

Stack-trace sampling profiling

Stack-trace sampling profiling is a production profiling technique in which a profiler periodically (e.g., at 100 Hz) snapshots the call stack of each running thread, then estimates per-function CPU utilization as:

function CPU % ≈ (samples containing the function) / (total samples)
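As a minimal sketch of that ratio, assume each sample is available as a list of function names (the function names and the four samples below are hypothetical, invented for illustration). A function is counted once per sample even if it appears several times in one stack, which keeps recursion from inflating its share.

```python
from collections import Counter


def cpu_share_by_function(stacks):
    """Estimate per-function CPU share from sampled call stacks.

    `stacks` is a list of samples; each sample is a list of function
    names, outermost caller first. A function counts once per sample,
    even if it occurs more than once in that stack (recursion).
    """
    total = len(stacks)
    if total == 0:
        return {}
    hits = Counter()
    for stack in stacks:
        hits.update(set(stack))  # de-duplicate within one sample
    return {fn: count / total for fn, count in hits.items()}


# Hypothetical samples, for illustration only.
samples = [
    ["main", "serve", "clear_headers"],
    ["main", "serve", "proxy"],
    ["main", "serve", "proxy"],
    ["main", "idle"],
]
shares = cpu_share_by_function(samples)
print(shares["clear_headers"])  # 0.25
print(shares["serve"])          # 0.75
```

The shares are inclusive: a function is credited for every sample in which it is anywhere on the stack, so callers always score at least as high as their callees.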

Why it works

If a function F is on 19 out of 1,111 sampled stacks, then the process was executing somewhere inside F (in F itself or in one of its callees) ~1.71 % of the time, which is a direct estimate of F's CPU share. The estimate converges to the true share as sample count grows.
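How quickly the estimate converges can be sketched with a binomial standard error. This assumes samples are independent and uses the normal approximation, both of which are simplifications rather than properties of any particular profiler. For 19 of 1,111 samples the approximate 95 % interval comes out at roughly 0.9 % to 2.5 %, and it narrows by about 10x when the sample count grows 100x at the same ratio.

```python
import math


def share_with_interval(hits, total, z=1.96):
    """Return (share, low, high) using a normal-approximation interval.

    Assumes independent samples; z=1.96 corresponds to roughly 95 %.
    """
    p = hits / total
    stderr = math.sqrt(p * (1.0 - p) / total)
    return p, p - z * stderr, p + z * stderr


p, low, high = share_with_interval(19, 1111)
print(f"{p:.2%} (approx. 95% interval {low:.2%} to {high:.2%})")

# Same ratio with 100x the samples: the interval is about 10x narrower.
p2, low2, high2 = share_with_interval(1900, 111100)
print(f"{p2:.2%} (approx. 95% interval {low2:.2%} to {high2:.2%})")
```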

Why it matters at scale

At CDN scales:

  • systems/pingora-origin: 40,000 saturated CPU cores globally. 1 % of CPU = 400 cores. Helper functions contributing 1-2 % become worthwhile optimization targets.
  • Without stack-trace sampling you'd never flag a one-line header-cleanup helper as worth rewriting. With it, the target surfaces itself.
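The arithmetic behind "1 % of CPU = 400 cores" can be written as a one-line conversion. The fleet size comes from the figure above; the function name is hypothetical.

```python
FLEET_CORES = 40_000  # saturated CPU cores, per the figure quoted above


def cores_for_share(cpu_share_percent, fleet_cores=FLEET_CORES):
    """Convert a sampled CPU share (in percent) into an equivalent core count."""
    return fleet_cores * cpu_share_percent / 100.0


for pct in (1.0, 1.71, 2.0):
    print(f"{pct} % of CPU is about {cores_for_share(pct):.0f} cores")
```

At this scale a helper at 1.71 % corresponds to several hundred cores, which is why functions in the 1-2 % range become worth rewriting.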

Operational properties

  • Very low overhead (eBPF, Linux perf, Datadog Continuous Profiler): even a few kHz of sampling adds negligible cost.
  • Production-representative — works against real traffic patterns, not synthetic benches.
  • No code changes needed: no manual span / trace instrumentation; the profiler attaches from outside the application via the kernel scheduler, signals, or eBPF primitives.
  • Statistical — noisy in the short run; trust the converged numbers over minutes-to-hours.
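The "no code changes" and "statistical" properties can be illustrated with a toy in-process sampler. This is only a sketch under stated assumptions: it uses CPython's `sys._current_frames()` from a sampling loop, it samples wall-clock time rather than on-CPU time, and it is not how eBPF or perf-based profilers work internally. `busy_helper` is a hypothetical hot function that contains no profiling code.

```python
import sys
import threading
import time
from collections import Counter


def sample_stacks(duration_s=0.5, hz=100):
    """Snapshot the call stack of every other thread about `hz` times per second."""
    me = threading.get_ident()
    interval = 1.0 / hz
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for thread_id, frame in sys._current_frames().items():
            if thread_id == me:
                continue  # do not profile the sampler itself
            names = []
            while frame is not None:  # walk from innermost frame to outermost
                names.append(frame.f_code.co_name)
                frame = frame.f_back
            samples.append(names)
        time.sleep(interval)
    return samples


def busy_helper(stop):
    # Hypothetical hot function; it carries no instrumentation.
    while not stop.is_set():
        acc = 0
        for i in range(1000):
            acc += i * i


stop = threading.Event()
worker = threading.Thread(target=busy_helper, args=(stop,))
worker.start()
samples = sample_stacks()
stop.set()
worker.join()

hits = Counter()
for stack in samples:
    hits.update(set(stack))
total = len(samples)
share = hits["busy_helper"] / total if total else 0.0
print(f"{total} samples, busy_helper on {share:.0%} of them")
```

Half a second at 100 Hz yields only a few dozen samples, so the share printed here is coarse; the longer the sampler runs, the more stable the number becomes.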

Closing the loop with microbenchmarks

The canonical Cloudflare pattern is flame-graph → criterion microbench → production-sampling verification:

  1. Stack-trace sampling identifies the function worth optimizing (CPU share above the threshold).
  2. Criterion microbench measures the candidate fixes in isolation at nanosecond resolution.
  3. Predicted CPU % is extrapolated linearly from microbench timings: the function's sampled CPU share is scaled by the ratio of candidate time to baseline time.
  4. Ship the winner; re-sample production; verify predicted vs measured match. If they do, the methodology is trustworthy for the next optimization.
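Step 3 can be sketched as a simple proportional scaling. The sketch assumes the call count stays the same after the change, so CPU share scales with per-call time. The 1.71 % starting share is the figure used earlier on this page; the microbenchmark timings and candidate names are hypothetical placeholders, not measurements from the source.

```python
def predict_cpu_share(current_share_pct, baseline_ns, candidate_ns):
    """Linear extrapolation: scale the sampled share by the per-call time ratio."""
    return current_share_pct * candidate_ns / baseline_ns


current_share_pct = 1.71  # sampled CPU share of the existing implementation
baseline_ns = 4000.0      # hypothetical microbench time of the existing code
candidates_ns = {         # hypothetical microbench times of candidate fixes
    "candidate_a": 2000.0,
    "candidate_b": 1000.0,
}

predictions = {
    name: predict_cpu_share(current_share_pct, baseline_ns, ns)
    for name, ns in candidates_ns.items()
}
best = min(predictions, key=predictions.get)

for name, pct in predictions.items():
    print(f"{name}: predicted {pct:.2f} % CPU")
print(f"ship: {best}")
```

The prediction is only as good as the assumption that the microbenchmark inputs resemble production inputs, which is what the re-sampling in step 4 checks.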

The 2024-09-10 trie-hard rollout matched predicted CPU (0.43 %) against measured CPU (0.34 %) within ~0.1 percentage points, tight enough to trust criterion as the decision substrate for the next helper (sources/2024-09-10-cloudflare-a-good-day-to-trie-hard).
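The predicted-versus-measured comparison can be expressed as a small check. The 0.43 % and 0.34 % figures are the ones quoted above; the 0.1 percentage-point acceptance threshold is a hypothetical choice for illustration, not a rule stated by the source.

```python
def prediction_error(predicted_pct, measured_pct):
    """Return (absolute gap in percentage points, gap relative to the prediction)."""
    gap_pp = abs(predicted_pct - measured_pct)
    return gap_pp, gap_pp / predicted_pct


TOLERANCE_PP = 0.1  # hypothetical acceptance threshold, in percentage points

gap_pp, relative = prediction_error(0.43, 0.34)
trusted = gap_pp <= TOLERANCE_PP
print(f"gap: {gap_pp:.2f} percentage points ({relative:.0%} of the prediction)")
print("methodology trusted" if trusted else "investigate the mismatch")
```

Reporting the gap in percentage points alongside the relative error makes it clear that a small absolute miss can still be a noticeable fraction of a small predicted share.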

Seen in

  • sources/2024-09-10-cloudflare-a-good-day-to-trie-hard: the per-service / per-team altitude. Cloudflare's Pingora-Origin team uses 100 Hz stack-trace sampling to find 1-2 % helper functions worth rewriting (1 % = 400 cores at Cloudflare scale), then verifies candidate fixes with criterion.
  • sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology: hyperscale / fleet-orchestrated altitude. Meta's Strobelight runs 42+ profilers (many stack-trace-sampling-based) on every Meta production host. Canonical fleet-orchestrated instance on this wiki; adds dynamic sampling rate tuning (daily re-tune of run probability + weighted aggregation for valid cross-host + cross-service comparison) as the mechanism that makes cross-service "horizontal efficiency wins" tractable. Canonical economic datum: the LBR-based sampling profiler feeds the FDO pipeline → up to 20% CPU-cycles reduction on Meta's top 200 services. A single-ampersand (&) fix spotted via per-file-line-symbolized std::vector stack filtering saved ~15,000 servers/year on one ads service.