
Gunicorn¶

Summary¶

Gunicorn ("Green Unicorn") is a Python WSGI HTTP server built on a pre-fork worker model: a single leader (master) process reads configuration, binds the listening socket, and then forks N worker processes that serve the HTTP requests. It is a common production serving tier for Flask, Django, and other WSGI apps, including at Lyft.

Pre-fork model¶

One leader process: reads config, manages worker lifecycle (spawn, replace on crash, graceful worker reload on HUP).
N worker processes: each inherits the listening socket from the leader; the kernel load-balances accept(2) across them ("thundering herd" mitigated by modern kernels).
Worker count tuned to cores (--workers 2*NCPU+1 is the usual starting heuristic); each worker runs its own CPython interpreter, sidestepping the GIL at the process level.
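
The leader/worker split above can be sketched with nothing but os.fork and a shared listening socket. This is an illustration of the pre-fork idea, not gunicorn's implementation; run_prefork_demo and its parameters are names invented for this example, and it assumes a Unix-like OS.

```python
import os
import signal
import socket


def run_prefork_demo(n_workers=2, n_requests=6):
    """Bind once in the leader, fork workers, and report which PIDs served requests."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
    listener.listen(16)
    port = listener.getsockname()[1]

    worker_pids = []
    for _ in range(n_workers):
        pid = os.fork()
        if pid == 0:
            # Worker: accept on the socket inherited from the leader.
            try:
                signal.signal(signal.SIGTERM, signal.SIG_DFL)
                while True:
                    conn, _addr = listener.accept()
                    with conn:
                        conn.sendall(str(os.getpid()).encode())
            finally:
                os._exit(0)           # never fall back into the leader's code path
        worker_pids.append(pid)

    served_by = set()
    try:
        for _ in range(n_requests):
            with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
                served_by.add(int(client.recv(64).decode()))
    finally:
        # The leader owns the worker lifecycle: stop and reap every worker.
        for pid in worker_pids:
            os.kill(pid, signal.SIGTERM)
            os.waitpid(pid, 0)
        listener.close()
    return worker_pids, served_by
```

Every reply carries the PID of whichever worker the kernel woke for that connection, so served_by is always a subset of worker_pids; how evenly the connections spread depends on the kernel.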

`preload=True` — the copy-on-write memory optimisation¶

By default each forked worker re-imports the application, so the Python objects (code, modules, large constants, warm caches) are duplicated N times in resident memory.

With preload_app = True (the --preload command-line flag, written preload=True as shorthand on this page), the leader imports the application once, before forking. Workers forked afterwards benefit from copy-on-write: they share the leader's physical pages until they write to them. The result is a large drop in per-worker PSS (proportional set size).
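
A minimal gunicorn.conf.py sketch of this setup follows. The bind address is a placeholder assumption, and the worker count simply reuses the 2*NCPU+1 starting heuristic; neither value is taken from Lyft's configuration.

```python
# gunicorn.conf.py: illustrative sketch, not a tuned production config
import multiprocessing

bind = "127.0.0.1:8000"                        # placeholder address (assumption)
workers = 2 * multiprocessing.cpu_count() + 1  # usual starting heuristic
preload_app = True                             # import the app once in the leader, then fork
```

It would be loaded with something like gunicorn -c gunicorn.conf.py myapp:app, where myapp:app is a hypothetical WSGI entry point.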

Lyft, 2025-12-15 (sources/2025-12-15-lyft-from-python38-to-python310-memory-leak):

| Mode | Worker PSS |
|---|---|
| No preload | ~203 MB |
| `preload=True` | ~41 MB |

That is a roughly 5× reduction per worker (203 / 41 ≈ 4.95) with no application code changes. The cost is a set of sharp edges around anything imported before fork, particularly signal handlers (see below).
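
Per-worker PSS can be read on Linux from /proc/<pid>/smaps_rollup. The helper below is a sketch under that assumption; the source does not say how Lyft collected its figures, and parse_pss_kb and worker_pss_mb are names invented here.

```python
from pathlib import Path


def parse_pss_kb(smaps_text):
    """Sum the 'Pss:' lines (reported in kB) of smaps or smaps_rollup text."""
    total_kb = 0
    for line in smaps_text.splitlines():
        # Match only the aggregate 'Pss:' field, not Pss_Anon, Pss_File, etc.
        if line.startswith("Pss:"):
            total_kb += int(line.split()[1])
    return total_kb


def worker_pss_mb(pid):
    """Return the PSS of a live process in MB (Linux only; needs smaps_rollup)."""
    text = Path(f"/proc/{pid}/smaps_rollup").read_text()
    return parse_pss_kb(text) / 1024
```

Calling worker_pss_mb on each worker PID with and without preload gives numbers comparable to the table above.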

Signal-handler caveats under `preload=True`¶

If a module imported before fork calls signal.signal(SIGUSR2, handler), the handler is installed in the leader process. POSIX specifies that a child inherits its parent's signal dispositions across fork(2), but gunicorn's worker-initialisation code explicitly resets signals to their defaults in the worker, which discards the inherited handler. The net effect at Lyft was that kill -USR2 <worker-pid> hit a worker with the default disposition (terminate), killing the worker instead of triggering the expected handler.

The fix (operational, not quoted from Lyft): register signal handlers from a gunicorn server hook that runs inside the worker after that reset, not at module-import time. Hook order matters here: post_fork(server, worker) runs immediately after the fork, whereas post_worker_init(worker) runs once the worker has finished initialising, so post_worker_init is the safer place when the handler must survive gunicorn's own signal setup. See concepts/signal-handler-fork-inheritance for the general concept and patterns/signal-triggered-heap-snapshot-diff for the specific pattern this enables.
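
A sketch of worker-side registration in gunicorn.conf.py is shown below. It uses gunicorn's post_worker_init(worker) hook, which fires once the worker has finished initialising, so the registration lands after the worker's own signal setup. The handler body and the received list are placeholders invented for this example; a real handler would trigger a profiler or a dump.

```python
# gunicorn.conf.py (sketch): install the SIGUSR2 handler inside each worker
import signal

received = []  # hypothetical stand-in for real profiling side effects


def handle_usr2(signum, frame):
    """Placeholder handler; only records that the signal arrived."""
    received.append(signum)


def post_worker_init(worker):
    # Runs in the worker process, after gunicorn has initialised the worker,
    # so this registration is not undone by the worker's signal reset.
    signal.signal(signal.SIGUSR2, handle_usr2)
```

With this in place, kill -USR2 <worker-pid> reaches handle_usr2 rather than the default terminate disposition.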

Leader-process signal semantics (selected)¶

TERM → graceful shutdown.
HUP → reload config + workers.
USR1 → reopen log files (rotation).
USR2 → in-place upgrade of the gunicorn binary (fork a new leader that forks new workers; old leader waits).
USR2 on a worker → the default POSIX disposition unless the application has installed a handler post-fork.

The overload of USR2 between leader (in-place upgrade) and worker (application-owned) is part of what makes the Lyft footgun easy to trip: engineers assume USR2 is "the Python signal" when it's really just a user-defined signal with multiple consumers.
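
The difference between the default disposition and an installed handler can be checked with a small fork experiment. In this sketch the child stands in for a gunicorn worker and signals itself in place of an external kill -USR2 <pid>; usr2_outcome is a hypothetical helper and the code assumes a Unix-like OS.

```python
import os
import signal


def usr2_outcome(install_handler):
    """Fork a child, deliver SIGUSR2 to it, and report how the child ended."""
    pid = os.fork()
    if pid == 0:
        # Child: start from the default disposition, as a freshly reset worker would.
        try:
            handled = []
            signal.signal(signal.SIGUSR2, signal.SIG_DFL)
            if install_handler:
                signal.signal(signal.SIGUSR2, lambda signum, frame: handled.append(signum))
            os.kill(os.getpid(), signal.SIGUSR2)
            code = 0 if handled else 1
        except BaseException:
            code = 2
        os._exit(code)

    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return ("killed", os.WTERMSIG(status))
    return ("exited", os.WEXITSTATUS(status))
```

Without a handler the child is terminated by SIGUSR2 itself; with a handler installed after the reset, it survives the signal and exits normally.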

Observability / memory debugging¶

Signal-driven in-process profilers are a common gunicorn idiom:

SIGUSR1 or SIGUSR2 → dump a heap snapshot (e.g., tracemalloc) or a thread stack trace (faulthandler.dump_traceback) to disk.
Send the signal from a sidecar / debug shell (kill -USR2 <worker-pid>) without restarting the worker or pausing traffic.
Diff two snapshots to surface allocation-growth hotspots.

Lyft's MemoryProfiler is an instance of this pattern, wired as a generator state machine over two signals (start + capture+diff). See patterns/signal-triggered-heap-snapshot-diff.
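
One way to wire such a generator state machine is sketched below. It is not Lyft's MemoryProfiler: the names are invented, and it assumes the two steps are driven by two successive deliveries of the same signal (the first starts tracing, the second captures and diffs).

```python
import signal
import tracemalloc


def memory_profiler(limit=5):
    """Generator state machine: each next() call advances one profiling step."""
    while True:
        # Step 1 (first signal): start tracing and record a baseline.
        tracemalloc.start()
        baseline = tracemalloc.take_snapshot()
        yield "started"
        # Step 2 (second signal): capture, diff against the baseline, stop tracing.
        current = tracemalloc.take_snapshot()
        tracemalloc.stop()
        yield current.compare_to(baseline, "lineno")[:limit]


def install_profiler(signum=signal.SIGUSR2, report=print):
    """Advance the state machine once per delivered signal."""
    machine = memory_profiler()

    def handler(_signum, _frame):
        report(next(machine))

    signal.signal(signum, handler)
    return machine
```

Keeping the state in a generator means the handler itself stays a single next() call, and a third signal simply begins a new start/capture cycle.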

Deployment shape at Lyft¶

Each gunicorn leader + N workers runs in a Kubernetes pod; per-pod resource limits gate worker count.
preload=True is standard to keep per-pod memory in budget.
On-demand profiling is done by running kubectl exec into the pod, finding the target worker PID with ps aux, and sending kill -USR2 <pid> to that worker.

Related¶

concepts/pre-fork-copy-on-write: the kernel mechanism that makes preload=True a memory win.
concepts/signal-handler-fork-inheritance: the concept behind the preload signal-handler footgun.
patterns/signal-triggered-heap-snapshot-diff: the profiling pattern gunicorn services commonly implement.
systems/tracemalloc: the stdlib allocation tracer typically wrapped.
concepts/gil-contention: why gunicorn uses processes instead of threads for CPU-bound Python workloads.

Seen in¶

sources/2025-12-15-lyft-from-python38-to-python310-memory-leak — Lyft's debug journey through preload=True + SIGUSR2 + tracemalloc after a Python 3.8 → 3.10 upgrade. Quantifies the preload memory win (~203 MB → ~41 MB PSS / worker) and documents the signal-handler footgun.