
Lameduck mode

Definition

Lameduck mode is a node state in which in-flight requests are allowed to complete but no new requests are accepted. It is the drain primitive under any graceful leadership-transition or planned-shutdown path: it bounds the residual work on a node about to be demoted or decommissioned to what was already admitted, gives the server time to flush that work cleanly, and signals upstream callers that new traffic should go elsewhere.

Sougoumarane's canonical wiki framing: "If a PRS is issued, the low level vttablet component of vitess goes into a lameduck mode where it allows in-flight transactions to complete, but rejects any new ones." (Source: sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-4-establishment-and-revocation).

Why it is load-bearing for graceful transitions

A graceful leader demotion has three concerns that must be handled simultaneously:

  1. Drain in-flight work safely — the old leader still has open transactions; killing them would surface errors to clients.
  2. Stop accepting new work — once demotion has been initiated, any newly admitted work on the old leader would have to be undone or re-routed to the new leader mid-flight.
  3. Hold new traffic somewhere until the new leader is ready — otherwise the application sees errors during the gap.

Lameduck mode handles (1) and (2). The paired primitive at the proxy tier — query buffering — handles (3). Together they make application-transparent reparenting possible.
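
Concerns (1) and (2), the vttablet side, can be sketched as a small admission gate. This is an illustrative Python sketch with hypothetical names (`LameduckGate`, `admit`, `drain`), not Vitess internals:

```python
import threading

class LameduckGate:
    """Illustrative sketch (hypothetical names, not Vitess code):
    admit in-flight work, reject new work once draining has begun,
    and let the drainer wait for the last admitted request."""

    def __init__(self):
        self._cond = threading.Condition()
        self._draining = False
        self._in_flight = 0

    def admit(self):
        """Try to admit one request; False once lameduck has begun."""
        with self._cond:
            if self._draining:
                return False             # concern (2): no new work
            self._in_flight += 1
            return True

    def done(self):
        """Mark one admitted request as finished."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout=None):
        """Enter lameduck and wait for in-flight work to finish.
        Returns True iff everything drained within the timeout."""
        with self._cond:
            self._draining = True
            # concern (1): already-admitted work may run to completion
            return self._cond.wait_for(lambda: self._in_flight == 0,
                                       timeout)
```

Callers pair each successful `admit()` with a `done()`; `drain()` blocks the demotion until the in-flight count reaches zero or the timeout fires.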

Composed with query buffering at the proxy tier

Vitess's reparenting mechanism is a two-tier composition:

   vtgate (proxy)                  vttablet (storage)
   ┌──────────────┐                ┌──────────────┐
   │   buffer     │                │   lameduck   │
   │  new tx      │                │  drain in-   │
   │              │                │  flight tx   │
   └──────┬───────┘                └──────┬───────┘
          │                               │
          │   PRS completes               │
          │   ←────────────────────────── │
          │                               │
          │   flush buffer to new primary  │

"At the same time, the front-end proxies (vtgate) begin to buffer such new transactions. Once PRS completes, all buffered transactions are sent to the new primary, and the system resumes without serving any errors to the application."

Neither layer is sufficient alone: vttablet lameduck without vtgate buffering would drop new traffic; vtgate buffering without vttablet lameduck could queue traffic for a sick primary.
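
The proxy side, concern (3), can be sketched the same way. `ReparentBuffer` and its methods are hypothetical names, not vtgate's actual buffering implementation; the sketch buffers new transactions while the reparent runs, replays them once the new primary is known, and surfaces errors only when capacity runs out:

```python
import queue

class ReparentBuffer:
    """Illustrative proxy-side sketch (hypothetical names, not vtgate
    code): hold new transactions while a reparent is in progress,
    then flush them to the new primary once it is known."""

    def __init__(self, max_pending):
        self._pending = queue.Queue(maxsize=max_pending)
        self._buffering = False

    def start_buffering(self):
        """Called when the reparent begins."""
        self._buffering = True

    def route(self, tx, primary):
        """Route a transaction: buffer during reparent, else send."""
        if self._buffering:
            try:
                self._pending.put_nowait(tx)
                return "buffered"
            except queue.Full:
                return "rejected"    # capacity exhausted: this is where
                                     # the no-errors guarantee breaks
        return primary(tx)

    def finish(self, new_primary):
        """Reparent done: replay buffered work on the new primary."""
        self._buffering = False
        while not self._pending.empty():
            new_primary(self._pending.get_nowait())
```
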

Only meaningful for planned transitions

Lameduck is a graceful-path-only primitive: it requires the node being drained to be reachable, responsive, and cooperative. In crash or partition scenarios, the old primary cannot enter lameduck mode because it cannot be reached. That failure-mode class is handled by follower fencing + ERS-style emergency reparent, not by lameduck drain.

Broader applicability

Lameduck mode generalises beyond consensus and reparenting:

  • Kubernetes pod termination — a pod receives SIGTERM, stops readiness probes (load balancer stops routing new traffic), drains open requests, then exits.
  • Load balancer node removal — unhealthy or rotating-out hosts stop accepting new connections while finishing existing ones.
  • Connection pool reconfiguration — connections in the old config pool close gracefully after current transactions commit.

The common structure: bounded residual work + no new admission + external traffic shifted elsewhere.
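
The Kubernetes variant of that structure can be sketched as follows (`PodLifecycle` and its methods are hypothetical names, not client-go or kubelet code): SIGTERM flips the readiness answer, open requests drain, then the process may exit:

```python
import signal
import threading

class PodLifecycle:
    """Illustrative sketch of the pod-termination pattern above;
    hypothetical names, not a real Kubernetes API."""

    def __init__(self):
        self.ready = True                # what a readiness probe reports
        self._in_flight = 0
        self._cond = threading.Condition()

    def install(self):
        # Wire the drain to SIGTERM, the signal delivered at pod
        # termination. Must be called from the main thread.
        signal.signal(signal.SIGTERM, lambda signum, frame: self.drain())

    def handle_request(self, work):
        with self._cond:
            if not self.ready:
                raise RuntimeError("draining: not accepting new work")
            self._in_flight += 1
        try:
            return work()
        finally:
            with self._cond:
                self._in_flight -= 1
                self._cond.notify_all()

    def drain(self):
        with self._cond:
            self.ready = False           # probe now fails, so the load
                                         # balancer stops routing here
            self._cond.wait_for(lambda: self._in_flight == 0)
        # safe to exit: nothing in flight, nothing new admitted
```
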

Trade-offs

  • Maximum drain time bounds rollout speed. If in-flight transactions take minutes, the reparent stalls for minutes. Production deployments cap the drain window and force-terminate stragglers after a timeout.
  • Lameduck state must be communicated upstream, not just implemented locally. Proxies must know to route elsewhere, or clients must be told via explicit protocol signals — otherwise "lameduck" at the server just looks like a slow or unresponsive server to its callers.
  • The no-errors guarantee depends on buffer capacity. vtgate can only buffer so many pending transactions; if PRS stalls, the buffer fills and the guarantee breaks.
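
The first trade-off, a capped drain window with forced termination of stragglers, might look like this (hypothetical helper, not a real API; `cancel` is a caller-supplied kill function):

```python
import threading
import time

def drain_with_deadline(in_flight, cancel, deadline_s):
    """Illustrative sketch of a capped drain window: wait up to
    deadline_s for in-flight work to finish, then force-terminate
    stragglers. `in_flight` is a set of task handles."""
    deadline = time.monotonic() + deadline_s
    while in_flight and time.monotonic() < deadline:
        time.sleep(0.01)                 # polling keeps the sketch short;
                                         # a real drain would use a condvar
    stragglers = list(in_flight)
    for task in stragglers:
        cancel(task)                     # forced termination after the cap
    return len(stragglers)               # how many missed the window
```
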
