Gunicorn¶
Summary¶
Gunicorn ("Green Unicorn") is a Python WSGI HTTP server with a pre-fork worker model: a single leader (master) process reads configuration, binds the listening socket, then forks N worker processes that serve the HTTP requests. It is a widely used production serving tier for Flask / Django / generic WSGI apps, including at Lyft.
Pre-fork model¶
- One leader process: reads config, manages worker lifecycle (spawn, replace on crash, rolling restart on HUP).
- N worker processes: each inherits the listening socket from the leader; the kernel load-balances accept(2) across them ("thundering herd" mitigated by modern kernels).
- Worker count tuned to cores (--workers 2*NCPU+1 is the usual starting heuristic); each worker runs its own CPython interpreter, sidestepping the GIL at the process level.
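
The pre-fork shape described above can be illustrated with a toy, assuming a Unix host where os.fork is available. This is not gunicorn's code: `prefork_demo` is a hypothetical name, each worker serves exactly one connection, and the leader doubles as the client only to keep the sketch in one file.

```python
import os
import socket


def prefork_demo(n_workers=2):
    """Toy pre-fork server; returns the set of worker PIDs that answered."""
    # Leader: bind and listen once, before forking.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the kernel pick a free port
    listener.listen(n_workers)
    address = listener.getsockname()

    children = []
    for _ in range(n_workers):
        pid = os.fork()
        if pid == 0:
            # Worker: the listening socket was inherited across fork().
            try:
                conn, _peer = listener.accept()  # kernel hands the connection to one worker
                with conn:
                    conn.sendall(str(os.getpid()).encode())
            finally:
                os._exit(0)  # never fall back into the leader's code path
        children.append(pid)

    # Leader acts as the client: one connection per worker.
    answered = set()
    for _ in range(n_workers):
        with socket.create_connection(address, timeout=5) as client:
            data = b""
            while chunk := client.recv(64):
                data += chunk
        answered.add(int(data))

    for pid in children:
        os.waitpid(pid, 0)  # reap workers, as a leader would
    listener.close()
    return answered
```

Calling `prefork_demo(2)` should return two distinct PIDs, neither of them the caller's, because every connection is accepted by a forked worker rather than by the process that bound the socket.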
preload=True — the copy-on-write memory optimisation¶
By default each worker imports the application itself after it is forked, so the Python objects (code, modules, large constants, warm caches) are duplicated N times in resident memory.
With preload_app = True, the leader imports the application
once, and the OS exploits
pre-fork copy-on-write:
forked workers share the leader's physical pages until they write
to them. The result is a large drop in per-worker PSS
(proportional set size).
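
A minimal gunicorn config sketch for the setting described above. `preload_app` and `workers` are gunicorn settings; the file name and the worker formula (the starting heuristic from the pre-fork section) are illustrative choices, not a recommendation from the source.

```python
# gunicorn.conf.py (sketch)
import multiprocessing

# Import the WSGI app in the leader before forking, so workers start out
# sharing its memory pages copy-on-write.
preload_app = True

# Starting heuristic only; tune against measured CPU and memory.
workers = 2 * multiprocessing.cpu_count() + 1
```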
Lyft, 2025-12-15 (sources/2025-12-15-lyft-from-python38-to-python310-memory-leak):
| Mode | Worker PSS |
|---|---|
| No preload | ~203 MB |
| preload=True | ~41 MB |
That is a roughly 5× reduction per worker for a single config setting. The cost is sharp edges around anything imported before fork, particularly signal handlers (see below).
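
One way to check per-worker PSS on a Linux host is to read /proc/<pid>/smaps_rollup, which reasonably recent kernels expose. The sketch below is an assumption about how such a measurement could be taken, not a description of how Lyft produced the figures above; `parse_pss_kb` and `worker_pss_mb` are hypothetical helper names.

```python
import re


def parse_pss_kb(smaps_rollup_text):
    """Return the Pss value in kB from the text of /proc/<pid>/smaps_rollup."""
    # Anchored on "Pss:" so that Pss_Anon / Pss_File lines are not matched.
    match = re.search(r"^Pss:\s+(\d+)\s+kB", smaps_rollup_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Pss line found")
    return int(match.group(1))


def worker_pss_mb(pid):
    """Read PSS for one process (needs permission to read its /proc entry)."""
    with open(f"/proc/{pid}/smaps_rollup") as handle:
        return parse_pss_kb(handle.read()) / 1024
```

Comparing `worker_pss_mb` across workers with and without preload gives the kind of before/after numbers shown in the table.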
Signal-handler caveats under preload=True¶
If a module imported before fork calls signal.signal(SIGUSR2, handler), the handler is installed in the leader process. POSIX says a child inherits signal dispositions from its parent across fork(2), but gunicorn's worker-initialisation code explicitly resets the signals it manages, SIGUSR2 among them, to their defaults in the worker. The net effect at Lyft was that kill -USR2 <worker-pid> hit a worker with the default disposition (terminate), killing the worker instead of triggering the expected handler.
The fix (operational, not quoted from Lyft): register signal handlers from a gunicorn server hook that runs in the worker after that reset, not at module-import time. post_worker_init(worker) fits, since it is called once the worker has finished initialising. post_fork(server, worker) is the more obvious choice, but in gunicorn's source it is invoked before the worker sets up its signals, so a handler installed there may be reset as well; verify the ordering against the gunicorn version in use. See concepts/signal-handler-fork-inheritance for the general concept and patterns/signal-triggered-heap-snapshot-diff for the specific pattern this enables.
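
A minimal sketch of worker-side registration. It uses gunicorn's post_worker_init(worker) hook, which is called after the worker has finished initialising (application loaded, worker signal handling set up); in gunicorn's source the post_fork call precedes that setup, so the ordering is worth checking for the version in use. `handle_usr2` is a hypothetical application handler.

```python
# gunicorn.conf.py (sketch)
import signal


def handle_usr2(signum, frame):
    """Hypothetical application handler; replace the body with real work."""
    print(f"worker received signal {signum}")


def post_worker_init(worker):
    # Runs inside the worker once initialisation is complete, so this
    # registration is not undone by gunicorn's worker signal setup.
    signal.signal(signal.SIGUSR2, handle_usr2)
```

With this in place, kill -USR2 <worker-pid> should invoke `handle_usr2` in that worker rather than terminating it.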
Leader-process signal semantics (selected)¶
- TERM → graceful shutdown.
- HUP → reload config + workers.
- USR1 → reopen log files (rotation).
- USR2 → in-place upgrade of the gunicorn binary (fork a new leader that forks new workers; old leader waits).
- USR2 on a worker → the default POSIX disposition unless the application has installed a handler in the worker after gunicorn's signal reset.
The overload of USR2 between leader (in-place upgrade) and worker
(application-owned) is part of what makes the Lyft footgun easy to
trip: engineers assume USR2 is "the Python signal" when it's
really just a user-defined signal with multiple consumers.
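
The leader-side signals listed above can be sent from Python with os.kill. In this sketch the action names and the `signal_leader` helper are invented labels for illustration; only the signal-to-behaviour mapping comes from the list.

```python
import os
import signal

# Leader-process signals, keyed by a made-up action name.
LEADER_SIGNALS = {
    "graceful-shutdown": signal.SIGTERM,
    "reload": signal.SIGHUP,
    "reopen-logs": signal.SIGUSR1,
    "upgrade-binary": signal.SIGUSR2,
}


def signal_leader(leader_pid, action):
    """Send the signal for `action` to the gunicorn leader process."""
    os.kill(leader_pid, LEADER_SIGNALS[action])
```

Pointing `signal_leader` at a worker PID instead of the leader is exactly the mix-up described above: the same SIGUSR2 means something different depending on which process receives it.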
Observability / memory debugging¶
Signal-driven in-process profilers are a common gunicorn idiom:
- SIGUSR1 or SIGUSR2 → dump a heap snapshot (e.g., tracemalloc) or a thread stack trace (faulthandler.dump_traceback) to disk.
- Send the signal from a sidecar / debug shell (kill -USR2 <worker-pid>) without restarting the worker or pausing traffic.
- Diff two snapshots to surface allocation-growth hotspots.
Lyft's MemoryProfiler is an instance of this pattern, wired as a
generator state machine over two signals (start + capture+diff). See
patterns/signal-triggered-heap-snapshot-diff.
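
A simplified sketch of the snapshot-and-diff idea using the stdlib tracemalloc module. It is not Lyft's MemoryProfiler: it uses a single signal (the first delivery records a baseline, later deliveries write a diff) instead of a two-signal generator state machine, and the handler name and output path are hypothetical.

```python
import os
import signal
import tracemalloc

_baseline = None  # snapshot recorded by the first signal


def snapshot_or_diff(signum=None, frame=None, out_path=None, top_n=10):
    """First call records a baseline; each later call writes a diff against it."""
    global _baseline
    if _baseline is None:
        if not tracemalloc.is_tracing():
            tracemalloc.start()
        _baseline = tracemalloc.take_snapshot()
        return None
    current = tracemalloc.take_snapshot()
    # StatisticDiff entries, largest absolute size change first.
    stats = current.compare_to(_baseline, "lineno")[:top_n]
    path = out_path or f"/tmp/heap-diff-{os.getpid()}.txt"  # hypothetical location
    with open(path, "w") as out:
        out.writelines(f"{stat}\n" for stat in stats)
    return stats


def post_worker_init(worker):
    # gunicorn config hook: register inside the worker, after its own signal setup.
    signal.signal(signal.SIGUSR2, snapshot_or_diff)
```

Sending kill -USR2 <worker-pid> twice, with traffic in between, would leave a per-worker file listing the source lines whose allocations grew the most.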
Deployment shape at Lyft¶
- Each gunicorn leader + N workers runs in a Kubernetes pod; per-pod resource limits gate worker count.
- preload=True is standard to keep per-pod memory in budget.
- On-demand profiling via kubectl exec into the pod and kill -USR2 <pid> against the target worker PID from ps aux.
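
Picking the right PID matters because the leader treats USR2 as an upgrade request. The sketch below separates workers from the leader by parent PID; it assumes a ps that supports `-eo pid,ppid` (the text above uses ps aux, which does not show parent PIDs), and both helper names are hypothetical.

```python
import subprocess


def worker_pids(ps_output, leader_pid):
    """Return PIDs whose parent is `leader_pid`, given `ps -eo pid,ppid` output."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not two numeric columns.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            if int(fields[1]) == leader_pid:
                pids.append(int(fields[0]))
    return pids


def list_worker_pids(leader_pid):
    """Run ps inside the pod and return the leader's direct children."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid"], capture_output=True, text=True, check=True
    )
    return worker_pids(result.stdout, leader_pid)
```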
Related¶
- concepts/pre-fork-copy-on-write: the kernel mechanism that makes preload=True a memory win.
- concepts/signal-handler-fork-inheritance: the concept behind the preload signal-handler footgun.
- patterns/signal-triggered-heap-snapshot-diff: the profiling pattern gunicorn services commonly implement.
- systems/tracemalloc: the stdlib allocation tracer typically wrapped.
- concepts/gil-contention: why gunicorn uses processes instead of threads for CPU-bound Python workloads.
Seen in¶
- sources/2025-12-15-lyft-from-python38-to-python310-memory-leak: Lyft's debug journey through preload=True + SIGUSR2 + tracemalloc after a Python 3.8 → 3.10 upgrade. Quantifies the preload memory win (~203 MB → ~41 MB PSS / worker) and documents the signal-handler footgun.