PATTERN
Signal-triggered heap snapshot-diff¶
Problem¶
You suspect a memory leak in a long-running server process in
production. Restarting the process loses the leaking state. Running
a full heap profiler continuously is too expensive. External tools
(core dumps, gdb) are heavy and disruptive, and capture the heap
at native-memory granularity — not at the Python / Ruby / language
level where the leak actually lives.
Solution¶
Install a custom signal handler in the running server, then drive an on-demand profiler from a state machine across two signal deliveries:
- Signal #1 → start tracing; take baseline snapshot; yield (pause the state machine until the next signal).
- Signal #2 → take second snapshot; diff against baseline; dump the ranked allocation-growth list to a file; stop tracing.
Between signal #1 and signal #2 the process continues to serve normal production traffic, only paying the tracing overhead for the capture window. The diff output shows the growth between the two points in time — if the service is leaking, the leaking allocation site floats to the top of the list.
```python
# pseudocode after Lyft's MemoryProfiler
import signal
import tracemalloc


class MemoryProfiler:
    def __init__(self):
        self._state_machine = self._profiling_state_machine()

    def register_handlers(self):
        signal.signal(signal.SIGUSR2, self.handle_signal)

    def handle_signal(self, signum, frame):
        next(self._state_machine)  # handler body stays trivial

    def _profiling_state_machine(self):
        while True:
            try:
                tracemalloc.start()
                snap1 = tracemalloc.take_snapshot()
                yield  # suspend until the second signal arrives
                snap2 = tracemalloc.take_snapshot()
                dump_top_diff(snap2.compare_to(snap1, "lineno"))
            except GeneratorExit:
                raise
            finally:
                if tracemalloc.is_tracing():
                    tracemalloc.stop()
```
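The snippet calls `dump_top_diff`, which it never defines. A minimal sketch, assuming the goal is the ranked allocation-growth file described above (the path and `top_n` defaults are illustrative, not from the source):

```python
def dump_top_diff(stats, path="/tmp/memdiff.txt", top_n=25):
    """Write the top-N allocation-growth entries to a file.

    `stats` is the list of tracemalloc.StatisticDiff returned by
    snapshot2.compare_to(snapshot1, "lineno"), which already comes
    back sorted with the largest growth first.
    """
    with open(path, "w") as f:
        for stat in stats[:top_n]:
            # Each entry: file:lineno, bytes grown, allocation-count delta
            f.write(f"{stat.traceback}: +{stat.size_diff} B "
                    f"({stat.count_diff:+d} blocks)\n")
```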
The generator idiom is load-bearing — it keeps the signal handler body trivial (just `next(...)` on the iterator) while keeping the "started tracing and took snap1" state alive across signal deliveries.
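That state-preserving property can be seen in isolation with a toy two-phase generator (names are illustrative):

```python
# Local state created before the yield survives the suspension and is
# still visible after the generator is resumed by the next next() call.
events = []


def two_phase():
    while True:
        baseline = "captured-at-signal-1"  # state from "phase 1"
        yield                              # suspend until the next next()
        events.append(f"phase 2 still sees: {baseline}")


sm = two_phase()
next(sm)  # "signal #1": create baseline, then suspend at yield
next(sm)  # "signal #2": resume with baseline intact, record it, loop
```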
Operational workflow¶
0. Step 0, done ahead of time: the profiler's signal handler is registered in the target worker process (see Gotchas).
1. `kubectl exec` (or `ssh`) into the target pod / host.
2. `ps aux | grep <svc>` → identify the worker PID of interest.
3. `kill -USR2 <pid>` → first signal, tracing begins.
4. Let the process serve traffic for T seconds (however long it takes for the leak to manifest).
5. `kill -USR2 <pid>` → second signal, diff is captured and dumped.
6. Retrieve the dump file; rank by allocation growth; investigate the top N lines of Python code.
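The two-signal flow can be rehearsed end to end in a single process by signaling yourself with `os.kill`. A sketch, assuming a generator-driven profiler like the one above; the `leak` workload and in-memory `results` list (standing in for the dump file) are illustrative:

```python
import os
import signal
import tracemalloc

results = []  # collected diff entries (stand-in for the dump file)


def state_machine():
    while True:
        try:
            tracemalloc.start()
            snap1 = tracemalloc.take_snapshot()
            yield  # wait for the second signal
            snap2 = tracemalloc.take_snapshot()
            results.extend(snap2.compare_to(snap1, "lineno")[:10])
        finally:
            if tracemalloc.is_tracing():
                tracemalloc.stop()


sm = state_machine()
signal.signal(signal.SIGUSR2, lambda signum, frame: next(sm))

os.kill(os.getpid(), signal.SIGUSR2)          # "kill -USR2 <pid>": baseline
leak = [bytearray(4096) for _ in range(500)]  # simulated leak, ~2 MB
os.kill(os.getpid(), signal.SIGUSR2)          # second signal: diff captured

top = results[0]  # entries come back sorted, largest growth first
```

The simulated leak should dominate the top of the diff, since nothing else between the two snapshots allocates on that scale.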
Why this shape¶
- No process restart. The process is the subject of the investigation; restarting it destroys the state you want to profile.
- Zero-cost when idle. Between capture windows the profiler is just a signal handler — a pointer in the kernel-managed signal-disposition table. No allocator hooks, no measurement overhead.
- Targeted capture window. You pay the tracing cost only for the time you're willing to pay it; production SLOs survive.
- Language-level attribution. The output is "line X of file Y grew by N bytes" — directly actionable, unlike gdb-level output.
- Scales down trivially. A single engineer with `kubectl exec` can do this. No fleet-wide profiling infrastructure required.
Gotchas¶
- Signal handlers must be registered post-fork in pre-fork servers like systems/gunicorn. Registering at import time under `preload=True` means the handler is installed in the leader and — depending on how the supervisor resets signals in workers — may not be active in the worker that receives `kill -USR2`. If absent, the default disposition for `SIGUSR1`/`SIGUSR2` is terminate, so the "profiling signal" kills the process. Lyft hit this exactly. See concepts/signal-handler-fork-inheritance and concepts/pre-fork-copy-on-write.
- Signal-safety rules for the handler body. POSIX restricts what you can do inside a signal handler (async-signal-safe functions only). Python papers over this via `PyErr_SetInterrupt` — handlers run at the next bytecode boundary, not inside the signal's own execution context — but the handler body should still be trivial. The Lyft pattern keeps it to `next(state_machine)`, which is effectively free.
- Don't accidentally nest captures. If the state machine yields and a second "start" signal arrives before the "capture" signal, you leak state. The Lyft generator simply re-loops, which restarts tracing and discards snap1; alternative implementations might latch into a state variable and reject re-entry.
- Allocator coverage gaps. Language-level tracers (tracemalloc in Python, ObjectSpace allocation-tracing in Ruby) see only allocations that go through the language runtime's allocator. C-extension heap (NumPy arrays, PyTorch tensors) is invisible. Pair with a native-heap tool if the leak might be below the language line.
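For gunicorn specifically, the post-fork registration the first gotcha calls for usually lives in the server's `post_fork` hook. A sketch of a `gunicorn.conf.py` fragment, assuming the `MemoryProfiler` class above is importable from a hypothetical `myapp.profiling` module:

```python
# gunicorn.conf.py fragment — a sketch, not the only arrangement.
# post_fork runs in each worker *after* fork(), so the handler lands
# in the process that will actually receive kill -USR2.

def post_fork(server, worker):
    # `myapp.profiling` is a hypothetical module path for the
    # MemoryProfiler class from the pseudocode above.
    from myapp.profiling import MemoryProfiler

    profiler = MemoryProfiler()
    profiler.register_handlers()
    # Keep a reference so the profiler (and its generator) stay alive.
    worker.memory_profiler = profiler
```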
Contrast with¶
- Stack-trace sampling profiling — same philosophy ("capture in production, pay only for the capture window"), different axis (CPU, not heap). Both patterns work well side by side.
- Continuous allocator hooks (jemalloc's always-on profiling) — trades zero-cost idle for zero-cost capture; useful in different operational postures.
- Core-dump analysis — captures everything at one instant but at higher operational cost and without Python-level attribution.
Seen in¶
- sources/2025-12-15-lyft-from-python38-to-python310-memory-leak — Lyft's `MemoryProfiler` is a canonical instance of the pattern, implemented on top of tracemalloc inside a gunicorn pre-fork worker. The article also serves as a cautionary tale about Step 0: where you register the signal handler.
Related¶
- systems/tracemalloc — the stdlib module the Python implementation typically wraps.
- systems/gunicorn — the server this pattern most commonly lives inside in Python shops.
- concepts/signal-handler-fork-inheritance — the registration footgun this pattern inevitably surfaces.
- concepts/pre-fork-copy-on-write — why `preload=True` exists in the first place, which is what makes the footgun tempting.
- patterns/measurement-driven-micro-optimization — the broader "capture in production, decide from data" discipline.