
LYFT 2025-12-15


Lyft — From Python 3.8 to Python 3.10: Our Journey Through a Memory Leak

Summary

A Lyft engineer (Jay Patel) walks through a production memory-leak hunt on a Python service after its Python 3.8 → Python 3.10 upgrade. The post's published body focuses on the tooling the team built to chase the leak — a lightweight, in-process memory profiler based on the CPython standard-library tracemalloc module, activated by sending SIGUSR2 to a running gunicorn worker in a Kubernetes pod. The first real attempt to capture a trace killed the worker; debugging why uncovered a subtle but general footgun at the intersection of gunicorn's preload=True option, Linux fork + copy-on-write semantics, and POSIX signal-handler inheritance. The article is an operational primer on signal-driven heap profiling in a pre-fork Python HTTP server, not a postmortem of the eventual leak itself (the body cuts off mid-investigation).

Key takeaways

  1. Pre-fork WSGI servers share memory via copy-on-write. Gunicorn with preload=True imports the application once in the leader process; forked workers inherit the pages read-only and only copy on write. Lyft observed worker PSS (proportional set size) drop from ~203 MB (no preload) to ~41 MB (preload) — nearly 5× — because the imports-and-application-object mass is now shared. (Source: body + two smem screenshots.)
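The preload behaviour described above is a single gunicorn setting. A minimal config sketch (file contents and worker count are illustrative, not from the post):

```python
# gunicorn.conf.py -- illustrative sketch; values are not Lyft's.
# With preload_app = True, gunicorn imports the WSGI application once in
# the leader process; workers fork() from it and share the imported pages
# copy-on-write until a worker writes to them.
preload_app = True
workers = 4

# Equivalent CLI form: gunicorn --preload --workers 4 myapp.wsgi:application
```

With `preload_app = False` (the default), each worker re-imports the application and carries its own full copy of that memory, which is the ~203 MB-per-worker case Lyft measured.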

  2. A signal handler registered at import time is owned by whichever process ran the import. With preload=True, the signal.signal(SIGUSR2, handler) call executes in the leader process before fork. The forked workers inherit the handler table as-is from the leader — yet Lyft observed kill -USR2 <worker-pid> terminating the worker anyway. The post frames this as "the worker process did not register due to copy-on-write" and "that causes any kill -USR2 to actually kill the process", attributing the kill to the worker never re-registering its own signal handler after fork. The lesson is operational even if the underlying mechanism is more subtle than COW strictly implies (see caveats): if you need per-worker in-process tooling under preload=True, register signal handlers post-fork, not at import time.

  3. tracemalloc is production-viable as an on-demand profiler. Lyft wraps it in a generator-based state machine driven by signals: the first SIGUSR2 → tracemalloc.start() + capture snapshot1; the second SIGUSR2 → capture snapshot2, diff against snapshot1, dump the ranked top-allocation-growth lines to a file, then tracemalloc.stop(). The cost of tracing is paid only during the capture window between the two signals. (Source: MemoryProfiler pseudocode + _profiling_state_machine generator.)

  4. The USR2 / SIGUSR2 convention is intentional. SIGUSR1 and SIGUSR2 are POSIX "user-defined" signals that the kernel does not use; they exist precisely so application code can own them. Gunicorn itself already defines USR2 semantics on the leader process (in-place upgrade), which is part of why Lyft targeted the worker PID directly via kill -USR2 <worker-pid> and why, when the worker didn't in fact own a handler, the default disposition (terminate) applied.
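The "default disposition (terminate) applied" behaviour is easy to demonstrate standalone: a forked child with no SIGUSR2 handler dies on delivery, which is exactly what a worker that never owned the handler would do. A self-contained sketch (not from the post):

```python
import os
import signal
import time

# Fork a child that installs no SIGUSR2 handler. The POSIX default
# disposition for SIGUSR2 is to terminate the process.
pid = os.fork()
if pid == 0:
    time.sleep(30)   # the signal kills the child long before this returns
    os._exit(0)

time.sleep(0.2)      # give the child a moment to reach sleep()
os.kill(pid, signal.SIGUSR2)
_, status = os.waitpid(pid, 0)
killed_by_usr2 = os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGUSR2
```

This is the same observable symptom Lyft hit: `kill -USR2 <worker-pid>` against a process that does not own a SIGUSR2 handler terminates it.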

  5. Debugging the profiler took longer than expected. The author mentions "several hours of debugging" before the connection to preload was made — emphasising that the failure mode (tracing the process kills the process) is non-obvious and that the link to gunicorn's pre-fork model is easy to miss if you think of preload purely as a memory optimisation.

  6. Python 3.8 → 3.10 was the triggering upgrade. The title frames the journey explicitly as "From Python 3.8 to Python 3.10" — the leak appeared after the version bump. The published body does not identify the root cause of the leak itself (the post is truncated before that section), so the post is cited here for the profiler design and the preload footgun, not for a resolved Python 3.10 regression.

Architectural primitives extracted

  • Gunicorn — Python WSGI pre-fork HTTP server. Leader process forks N workers; two fork modes: no-preload (workers each import the app) vs. preload=True (leader imports once, workers inherit via COW). Runs under Kubernetes at Lyft.
  • tracemalloc — CPython stdlib module that tracks memory allocations by traceback. Supports snapshotting and snapshot-diff; per-frame granularity; intended for debugging, not always-on observability. Lyft's profiler is a thin wrapper.
  • Pre-fork copy-on-write — the kernel-level mechanism preload=True exploits. Forked child process shares the parent's pages read-only; the kernel copies a page only when the child writes to it. Saved Lyft ~162 MB PSS per worker. Central to the memory win and the signal footgun.
  • Signal-handler fork inheritance — the concept. A signal handler installed before fork(2) is inherited by the child process according to POSIX rules, but application frameworks (notably gunicorn) may reset handlers in the worker post-fork. Registering inside an import that runs pre-fork with preload=True is therefore ambiguous: the handler table is inherited, but whoever comes along after fork (the gunicorn worker-startup code) can replace it, leaving the intended application handler absent in the worker.
  • Signal-triggered heap snapshot-diff — the pattern. Register a custom signal handler in a long-running server; on the first signal, start tracing and take a baseline snapshot; on the second signal, take a second snapshot, diff it against the baseline, dump ranked allocation deltas, and stop tracing. Pays the profiling cost only during the capture window. A generator / state-machine implementation keeps the handler itself side-effect-free.
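The signal-handler fork inheritance rule in the fourth bullet can be verified in a few lines — a standalone sketch, not gunicorn code. A handler installed before fork() is visible in the child unless something (such as gunicorn's worker initialisation) resets it:

```python
import os
import signal

def _noop(signum, frame):
    pass

# Install a handler in the parent *before* fork -- as an import-time
# registration under preload=True effectively does.
signal.signal(signal.SIGUSR2, _noop)

pid = os.fork()
if pid == 0:
    # Child: POSIX says the signal disposition table is inherited
    # across fork(), so the handler should still be installed here.
    inherited = signal.getsignal(signal.SIGUSR2) is _noop
    os._exit(0 if inherited else 1)

_, status = os.waitpid(pid, 0)
handler_inherited = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

In a bare fork like this the handler survives into the child; the ambiguity in a pre-fork server comes from the worker-startup code that runs afterwards and may replace it.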

Operational numbers from the post

  • Worker PSS, no preload: ~203 MB.
  • Worker PSS, with preload: ~41 MB (−~80 %, ~5× reduction).
  • "Several hours" to localise the problem to preload.
  • No disclosed leak size, leak source, or fixed p99 numbers — the published body stops at the preload discovery and does not reach resolution.

Caveats

  • Body is truncated. The published post ends at "the worker process did not register due to copy-on-write and that causes any kill -USR2 to actually kill the process". Whatever the team did next (re-register the handler post-fork via a gunicorn worker hook, or stop using preload for this service) is not in the ingested markdown. Mitigation patterns below are inferred from the problem statement, not quoted from Lyft.
  • COW is not strictly the reason the handler "didn't register". POSIX says a forked child inherits the parent's signal dispositions (man 2 fork: "the child inherits … the set of signal handlers … from its parent"). Gunicorn's worker initialisation explicitly resets signals to defaults for a worker before invoking user hooks — so in practice what Lyft is describing is gunicorn's post-fork reset overwriting the pre-fork handler, not the COW mechanism itself failing to propagate it. The operational takeaway (register post-fork) is the same; the mechanistic framing in the post is slightly loose. Flagged ⚠️ mechanism-framing rather than contradiction — the fix is identical either way.
  • No postmortem on the actual 3.8→3.10 leak. Do not cite this source for a Python 3.10 memory-leak regression; cite it for the profiler design and the preload footgun only.
  • Tier-2 Lyft source; in scope for the wiki because it documents a production memory-debugging tool + a general pre-fork server footgun, both of which generalise well beyond Python.
