
PATTERN

Signal-triggered heap snapshot-diff

Problem

You suspect a memory leak in a long-running server process in production. Restarting the process loses the leaking state. Running a full heap profiler continuously is too expensive. External tools (core dumps, gdb) are heavy and disruptive, and capture the heap at native-memory granularity — not at the Python / Ruby / language level where the leak actually lives.

Solution

Install a custom signal handler in the running server, then drive an on-demand profiler from a state machine across two signal deliveries:

  1. Signal #1 → start tracing; take baseline snapshot; yield (pause the state machine until the next signal).
  2. Signal #2 → take second snapshot; diff against baseline; dump the ranked allocation-growth list to a file; stop tracing.

Between signal #1 and signal #2 the process continues to serve normal production traffic, only paying the tracing overhead for the capture window. The diff output shows the growth between the two points in time — if the service is leaking, the leaking allocation site floats to the top of the list.

# After Lyft's MemoryProfiler (simplified; dump helper and path are illustrative)
import signal
import tracemalloc

def dump_top_diff(stats, path="/tmp/heap_diff.txt", top=25):
    # Persist the ranked allocation-growth list for later retrieval.
    with open(path, "w") as f:
        for stat in stats[:top]:
            f.write(f"{stat}\n")

class MemoryProfiler:
    def __init__(self):
        self._state_machine = self._profiling_state_machine()
        next(self._state_machine)  # prime: park the generator at the idle yield

    def register_handlers(self):
        # In pre-fork servers this must run post-fork (see Gotchas).
        signal.signal(signal.SIGUSR2, self.handle_signal)

    def handle_signal(self, signum, frame):
        # Trivial handler body: just step the state machine.
        next(self._state_machine)

    def _profiling_state_machine(self):
        while True:
            yield  # idle: no tracing, no overhead
            tracemalloc.start()
            snap1 = tracemalloc.take_snapshot()  # baseline
            try:
                yield  # capture window: serve traffic under tracing
                snap2 = tracemalloc.take_snapshot()
                dump_top_diff(snap2.compare_to(snap1, "lineno"))
            finally:
                if tracemalloc.is_tracing():
                    tracemalloc.stop()

The generator idiom is load-bearing: the signal handler body stays trivial (just next(...) on the iterator), while the "started tracing and took snap1" state stays alive across signal deliveries inside the generator's suspended frame.
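The idiom can be exercised without signals by stepping the generator directly. A minimal, self-contained sketch (the helper names and the `results` list are illustrative, not Lyft's actual code):

```python
import tracemalloc

def profiling_state_machine(results):
    """One capture per pair of next() calls: arm, then diff."""
    while True:
        yield                                   # idle: tracing is off
        tracemalloc.start()
        baseline = tracemalloc.take_snapshot()  # "signal #1" work
        yield                                   # capture window
        snap2 = tracemalloc.take_snapshot()     # "signal #2" work
        tracemalloc.stop()
        results.append(snap2.compare_to(baseline, "lineno"))

results = []
sm = profiling_state_machine(results)
next(sm)                                        # prime to the idle yield

next(sm)                                        # "signal #1": baseline taken
leak = [bytearray(1024) for _ in range(100)]    # simulated leak, ~100 KiB
next(sm)                                        # "signal #2": diff recorded

top = results[0][0]   # StatisticDiff entries come sorted, largest growth first
print(top)
```

Because `compare_to` sorts by absolute growth, the line allocating the simulated leak surfaces at or near the top of the diff, which is exactly the property the pattern relies on in production.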

Operational workflow

  1. kubectl exec (or ssh) into the target pod / host.
  2. ps aux | grep <svc> → identify the worker PID of interest.
  3. kill -USR2 <pid> → first signal, tracing begins.
  4. Let the process serve traffic for T seconds (however long it takes for the leak to manifest).
  5. kill -USR2 <pid> → second signal, diff is captured and dumped.
  6. Retrieve the dump file; rank by allocation-growth; investigate the top N lines of Python code.
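Steps 3 and 5 can be rehearsed in-process on a POSIX system: `os.kill` delivers the same SIGUSR2 that `kill -USR2` would, and CPython runs the Python-level handler at the next bytecode boundary. A sketch (the `received` list stands in for the real state machine):

```python
import os
import signal

received = []

def handle_signal(signum, frame):
    # Keep the body trivial; in the real pattern this is next(state_machine).
    received.append(signum)

signal.signal(signal.SIGUSR2, handle_signal)   # POSIX only: no SIGUSR2 on Windows

os.kill(os.getpid(), signal.SIGUSR2)           # step 3: first signal, tracing begins
assert received == [signal.SIGUSR2]            # handler has already run
# ... the real process serves traffic here for T seconds ...
os.kill(os.getpid(), signal.SIGUSR2)           # step 5: second signal, diff captured
assert received == [signal.SIGUSR2] * 2
```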

Why this shape

  • No process restart. The process is the subject of the investigation; restarting it destroys the state you want to profile.
  • Zero-cost when idle. Between capture windows the profiler is just a signal handler — a pointer in the kernel-managed signal-disposition table. No allocator hooks, no measurement overhead.
  • Targeted capture window. You pay the tracing cost only for the time you're willing to pay it; production SLOs survive.
  • Language-level attribution. The output is "line X of file Y grew by N bytes" — directly actionable, unlike gdb-level output.
  • Scales down trivially. A single engineer with kubectl exec can do this. No fleet-wide profiling infrastructure required.

Gotchas

  • Signal handlers must be registered post-fork in pre-fork servers like gunicorn (systems/gunicorn). Registering at import time under preload=True installs the handler in the master process, and (depending on how the supervisor resets signal dispositions in its workers) it may not be active in the worker that receives kill -USR2. If the handler is absent, the default disposition for SIGUSR1 / SIGUSR2 is to terminate the process, so the "profiling signal" kills the worker. Lyft hit this exactly. See concepts/signal-handler-fork-inheritance and concepts/pre-fork-copy-on-write.
  • Signal-safety rules for the handler body. POSIX restricts what may run inside a signal handler (async-signal-safe functions only). CPython papers over this by deferring: the C-level handler just sets a flag, and the Python-level handler runs at the next bytecode boundary, not inside the signal's own execution context. Even so, the handler body should stay trivial. The Lyft pattern keeps it to next(state_machine), which is effectively free.
  • Don't accidentally nest captures. If the state machine yields and a second "start" signal arrives before the "capture" signal, you leak state. The Lyft generator simply re-loops, which restarts tracing and discards snap1; alternative implementations might latch into a state variable and reject re-entry.
  • Allocator coverage gaps. Language-level tracers (tracemalloc in Python, ObjectSpace allocation-tracing in Ruby) see only allocations that go through the language runtime's allocator. C-extension heap (NumPy arrays, PyTorch tensors) is invisible. Pair with a native-heap tool if the leak might be below the language line.
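The "don't accidentally nest captures" gotcha above can be addressed by latching the state in an explicit variable instead of a generator. A sketch assuming separate "start" and "capture" signals (the SIGUSR1/SIGUSR2 split is an illustrative choice, not Lyft's design):

```python
import tracemalloc

class LatchingProfiler:
    """Rejects a redundant 'start' instead of silently restarting the capture."""

    def __init__(self):
        self._baseline = None           # None means idle

    def start_capture(self):            # e.g. bound to SIGUSR1
        if self._baseline is not None:
            return False                # already armed: reject re-entry
        tracemalloc.start()
        self._baseline = tracemalloc.take_snapshot()
        return True

    def finish_capture(self):           # e.g. bound to SIGUSR2
        if self._baseline is None:
            return None                 # capture without start: ignore
        snap2 = tracemalloc.take_snapshot()
        diff = snap2.compare_to(self._baseline, "lineno")
        tracemalloc.stop()
        self._baseline = None           # back to idle
        return diff

p = LatchingProfiler()
assert p.start_capture() is True
assert p.start_capture() is False       # second "start" rejected, baseline kept
leak = [bytearray(1024) for _ in range(50)]
assert p.finish_capture() is not None   # diff produced
assert p.finish_capture() is None       # idle again: stray signal is a no-op
```

The trade-off versus the generator: the latch makes the re-entry policy explicit and auditable, at the cost of spreading the capture state across methods instead of keeping it in one suspended frame.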

Contrast with

  • Stack-trace sampling profiling — same philosophy ("capture in production, pay only for the capture window"), different axis (CPU, not heap). Both patterns work well side by side.
  • Continuous allocator hooks (jemalloc's always-on profiling) — trades zero-cost idle for zero-cost capture; useful in different operational postures.
  • Core-dump analysis — captures everything at one instant but at higher operational cost and without Python-level attribution.
