Lyft - From Python 3.8 to Python 3.10: Our Journey Through a Memory Leak
Summary
A Lyft engineer (Jay Patel) walks through a production memory-leak
hunt on a Python service after its Python 3.8 → Python 3.10 upgrade.
The post's published body focuses on the tooling the team built to
chase the leak — a lightweight, in-process memory profiler based on
the CPython standard-library tracemalloc module, activated by
sending SIGUSR2 to a running gunicorn worker in a Kubernetes
pod. The first real attempt to capture a trace killed the worker;
debugging why uncovered a subtle but general footgun at the
intersection of gunicorn's preload=True option, Linux fork +
copy-on-write semantics, and POSIX signal-handler inheritance.
The article is an operational primer on signal-driven heap profiling
in a pre-fork Python HTTP server, not a postmortem of the eventual
leak itself (the body cuts off mid-investigation).
Key takeaways
- Pre-fork WSGI servers share memory via copy-on-write. Gunicorn with preload=True imports the application once in the leader process; forked workers inherit the pages read-only and only copy on write. Lyft observed worker PSS (proportional set size) drop from ~203 MB (no preload) to ~41 MB (preload), nearly 5×, because the memory taken by the imports and the application object is now shared. (Source: body + two smem screenshots.)
- A signal handler registered at import time is owned by whoever ran the import. With preload=True, the signal.signal(SIGUSR2, handler) call executes in the leader process before fork. The forked workers inherit the handler table as-is from the leader, yet Lyft found that kill -USR2 <worker-pid> terminated the worker process anyway. The post frames this as "the worker process did not register due to copy-on-write" and "that causes any kill -USR2 to actually kill the process", attributing the kill to the worker not having re-registered its own signal handler after fork. The lesson is operational even if the underlying mechanism is more subtle than COW strictly implies (see Caveats): register signal handlers post-fork, not at import time, if you need per-worker in-process tooling under preload=True.
- tracemalloc is production-viable as an on-demand profiler. Lyft wraps it in a generator-based state machine driven by signals: the first SIGUSR2 calls tracemalloc.start() and captures snapshot1; the second SIGUSR2 captures snapshot2, diffs it against snapshot1, dumps the ranked top-allocation-growth lines to a file, then calls tracemalloc.stop(). The cost of tracing is paid only during the capture window, between the two signals. (Source: MemoryProfiler pseudocode + _profiling_state_machine generator.)
- The USR2/SIGUSR2 convention is intentional. SIGUSR1 and SIGUSR2 are POSIX "user-defined" signals that the kernel does not use; they exist precisely so application code can own them. Gunicorn itself already defines USR2 semantics on the leader process (in-place upgrade), which is part of why Lyft targeted the worker PID directly via kill -USR2 <worker-pid> and why, when the worker didn't in fact own a handler, the default disposition (terminate) applied.
- Debugging the profiler took longer than expected. The author mentions "several hours of debugging" before the connection to preload was made, emphasising that the failure mode (tracing the process kills the process) is non-obvious and that the link to gunicorn's pre-fork model is easy to miss if you think of preload purely as a memory optimisation.
- Python 3.8 → 3.10 was the triggering upgrade. The title frames the journey explicitly as "From Python 3.8 to Python 3.10": the leak appeared after the version bump. The published body does not identify the root cause of the leak itself (the post is truncated before that section), so the post is cited here for the profiler design and the preload footgun, not for a resolved Python 3.10 regression.
Architectural primitives extracted
- Gunicorn: Python WSGI pre-fork HTTP server. Leader process forks N workers; two fork modes: no-preload (workers each import the app) vs. preload=True (leader imports once, workers inherit via COW). Runs under Kubernetes at Lyft.
- tracemalloc: CPython stdlib module that tracks memory allocations by traceback. Supports snapshotting and snapshot-diff; per-frame granularity; intended for debugging, not always-on observability. Lyft's profiler is a thin wrapper.
- Pre-fork copy-on-write: the kernel-level mechanism preload=True exploits. A forked child process shares the parent's pages read-only; the kernel copies a page only when the child writes to it. Saved Lyft ~162 MB PSS per worker. Central to both the memory win and the signal footgun.
- Signal-handler fork inheritance: the concept. A signal handler installed before fork(2) is inherited by the child process according to POSIX rules, but application frameworks (notably gunicorn) may reset handlers in the worker post-fork. Registering inside an import that runs pre-fork with preload=True is therefore ambiguous: the handler table is inherited, but whoever comes along after fork (the gunicorn worker-startup code) can replace it, leaving the intended application handler absent in the worker.
- Signal-triggered heap snapshot-diff: the pattern. Register a custom signal handler in a long-running server; on first signal, start tracing and take a baseline; on second signal, take a second snapshot, diff against the baseline, dump ranked allocation deltas; stop tracing. Pays profiling cost only for the capture window. Generator / state-machine implementation keeps the handler side-effect-free.
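The inherit-then-reset behaviour described for signal handlers can be reproduced without gunicorn. The sketch below is a stand-in under stated assumptions: child_fate is a hypothetical helper, and the explicit reset to SIG_DFL in the child imitates what this page attributes to gunicorn's worker start-up rather than calling gunicorn itself. It is Unix only.

```python
import os
import signal
import time


def child_fate(reset_in_child):
    """Fork a child, send it SIGUSR2, and report how the child ended."""
    # Installed before fork, like an import-time registration under preload.
    previous = signal.signal(signal.SIGUSR2, lambda signum, frame: os._exit(0))
    ready_r, ready_w = os.pipe()
    try:
        pid = os.fork()
        if pid == 0:  # child process
            try:
                if reset_in_child:
                    # Stand-in for a framework resetting worker signals.
                    signal.signal(signal.SIGUSR2, signal.SIG_DFL)
                os.write(ready_w, b"x")  # tell the parent setup is done
                deadline = time.monotonic() + 5
                while time.monotonic() < deadline:
                    time.sleep(0.01)
            finally:
                os._exit(2)  # never return into the parent's code path
        os.read(ready_r, 1)  # wait until the child has finished its setup
        os.kill(pid, signal.SIGUSR2)
        _, status = os.waitpid(pid, 0)
    finally:
        # Restore the parent's original disposition and close the pipe.
        restore = previous if previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGUSR2, restore)
        os.close(ready_r)
        os.close(ready_w)
    if os.WIFSIGNALED(status):
        return "terminated by signal %d" % os.WTERMSIG(status)
    return "exited with code %d" % os.WEXITSTATUS(status)
```

Calling child_fate(False) reports a clean exit through the inherited handler, while child_fate(True) reports termination by SIGUSR2, which matches the symptom described in the post.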
Operational numbers from the post
- Worker PSS, no preload: ~203 MB.
- Worker PSS, with preload: ~41 MB (about 80 % lower, a ~5× reduction).
- "Several hours" to localise the problem to preload.
- No disclosed leak size, leak source, or fixed p99 numbers: the published body stops at the preload discovery and does not reach resolution.
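The derived figures quoted on this page (about 162 MB saved per worker, about 80 % lower, about 5×) follow directly from the two PSS readings. A quick check, with variable names chosen for the example:

```python
# Approximate per-worker PSS readings reported in the post, in MB.
no_preload_mb = 203
preload_mb = 41

saved_mb = no_preload_mb - preload_mb            # absolute saving per worker
reduction_pct = 100 * saved_mb / no_preload_mb   # relative drop, in percent
ratio = no_preload_mb / preload_mb               # "how many times smaller"

print(saved_mb, round(reduction_pct, 1), round(ratio, 2))
```

Because the inputs are themselves approximate, the results are only as precise as the rounded figures used on this page.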
Caveats
- Body is truncated. The published post ends at "the worker process did not register due to copy-on-write and that causes any kill -USR2 to actually kill the process". Whatever the team did next (patch: register in a post-fork hook; patch: use a gunicorn worker hook such as post_fork or post_worker_init; patch: stop using preload for this service) is not in the ingested markdown. Mitigation patterns on this page are inferred from the problem statement, not quoted from Lyft.
- COW is not strictly the reason the handler "didn't register". A forked child inherits the parent's signal dispositions (man 7 signal: "A child created via fork(2) inherits a copy of its parent's signal dispositions"). Gunicorn's worker initialisation explicitly resets the worker's signals to their defaults during start-up, so in practice what Lyft is describing is most likely gunicorn's post-fork reset overwriting the pre-fork handler, not the COW mechanism itself failing to propagate it. The operational takeaway (register post-fork) is the same; the mechanistic framing in the post is slightly loose. Flagged ⚠️ mechanism-framing rather than contradiction: the fix is identical either way.
- No postmortem on the actual 3.8→3.10 leak. Do not cite this source for a Python 3.10 memory-leak regression; cite it for the profiler design and the preload footgun only.
- Tier-2 Lyft source; in scope for the wiki because it documents a production memory-debugging tool + a general pre-fork server footgun, both of which generalise well beyond Python.
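Since the post stops before the fix, the following is only one plausible shape for the register-post-fork mitigation, written as a gunicorn config file. It is not Lyft's patch. The handler dump_memory_profile is a hypothetical placeholder, and the choice of the post_worker_init hook over post_fork rests on the assumption that gunicorn resets worker signals between those two hooks; check the hook ordering in the gunicorn version in use.

```python
# gunicorn.conf.py (illustrative; not taken from the Lyft post)
import signal

# gunicorn's config-file spelling of the preload option discussed above.
preload_app = True


def dump_memory_profile(signum, frame):
    """Hypothetical placeholder for the profiler's signal entry point."""


def post_worker_init(worker):
    """Server hook: gunicorn calls this in each worker once it has initialised."""
    # Assumption: this runs after gunicorn's own per-worker signal setup,
    # so the registration below is not overwritten.
    signal.signal(signal.SIGUSR2, dump_memory_profile)
```

Registering from a per-worker hook keeps the memory benefit of preload while making each worker the owner of its own SIGUSR2 handler.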
Source
- Original: https://eng.lyft.com/from-python3-8-to-python3-10-our-journey-through-a-memory-leak-1fd9b43cc01e?source=rss----25cd379abb8---4
- Raw markdown: raw/lyft/2025-12-15-from-python38-to-python310-our-journey-through-a-memory-leak-1791ba2e.md
Related
- systems/gunicorn: pre-fork WSGI server at the centre of the footgun.
- systems/tracemalloc: CPython stdlib heap tracer the profiler wraps.
- concepts/pre-fork-copy-on-write: the mechanism that makes preload a memory win.
- concepts/signal-handler-fork-inheritance: the mechanism that makes preload a signal-handler footgun.
- patterns/signal-triggered-heap-snapshot-diff: the reusable on-demand profiling pattern.
- concepts/gil-contention: adjacent Python-serving hazard; different failure mode, same python + gunicorn + k8s deployment shape.
- companies/lyft: company page.