PATTERN Cited by 1 source

Hot reload over restart replicas¶

Intent¶

When a cluster's shared state changes (schema update, new index, live-settings tweak), propagate the change to all replicas by having them hot-reload the new state from the state backend rather than by a rolling restart of the replica fleet. A restart takes minutes per replica; a hot reload takes milliseconds. When cluster state changes frequently, the restart path dominates change-propagation time and indirectly limits how often anyone is willing to make changes.

Shape¶

  change flow:
    1. client issues state-change request to primary
    2. primary constructs new immutable state = old + change
    3. primary commits new state to state backend (S3 / EBS)
    4. primary atomically swaps in-memory state reference
    5. primary notifies replicas of new state version
    6. each replica fetches the new state object
    7. each replica atomically swaps its own state reference
       (no restart; in-flight requests continue on old state,
        subsequent requests see new)
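
The change flow above can be sketched in a few lines of Python. This is a minimal, single-process illustration under stated assumptions, not Nrtsearch's implementation: `ClusterState`, `StateBackend`, `Primary`, and `Replica` are hypothetical names, the backend is an in-memory dict standing in for S3, and "notify" is a direct method call rather than a network message.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class ClusterState:
    """Immutable snapshot of cluster state (hypothetical shape)."""
    version: int
    fields: Mapping[str, str]  # field name -> field type


class StateBackend:
    """In-memory stand-in for a durable store such as S3 (assumption)."""

    def __init__(self) -> None:
        self._objects = {}  # version -> ClusterState

    def put(self, state: ClusterState) -> None:
        self._objects[state.version] = state

    def get(self, version: int) -> ClusterState:
        return self._objects[version]


class Replica:
    def __init__(self, initial: ClusterState, backend: StateBackend) -> None:
        self._backend = backend
        self.state = initial  # the single reference that gets swapped

    def on_new_version(self, version: int) -> None:
        new_state = self._backend.get(version)  # step 6: fetch
        self.state = new_state                  # step 7: swap the reference

    def begin_request(self) -> ClusterState:
        # A request captures the reference once and uses it for its lifetime.
        return self.state


class Primary:
    def __init__(self, initial: ClusterState, backend: StateBackend, replicas) -> None:
        self._backend = backend
        self._replicas = list(replicas)
        self.state = initial

    def apply_change(self, field: str, field_type: str) -> ClusterState:
        old = self.state
        new_fields = dict(old.fields)
        new_fields[field] = field_type
        # step 2: new immutable state = old + change
        new = ClusterState(old.version + 1, MappingProxyType(new_fields))
        self._backend.put(new)                  # step 3: commit to the backend
        self.state = new                        # step 4: swap on the primary
        for replica in self._replicas:          # step 5: notify replicas
            replica.on_new_version(new.version)
        return new


backend = StateBackend()
v1 = ClusterState(1, MappingProxyType({"title": "text"}))
backend.put(v1)
replica = Replica(v1, backend)
primary = Primary(v1, backend, [replica])

in_flight = replica.begin_request()      # request started before the change
primary.apply_change("rating", "float")  # step 1: client request
later = replica.begin_request()          # request started after the change
```

Here `in_flight` still refers to version 1 and never observes the `rating` field, while `later` sees version 2. Rebinding one attribute to a fully built object is what stands in for the atomic swap in this sketch.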

Preconditions:

State is immutable and atomically swapped, so in-flight requests aren't affected by the change.
The state backend is durable and has version/etag semantics, so that the notify and fetch steps are idempotent and race-free.
The running process can rebuild internal structures that depend on state (analyzer chains, field mappings, codec instances, etc.) without restart — i.e. code is written to reload its own state rather than caching at init time.
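
The second and third preconditions can be illustrated together. The sketch below assumes monotonically increasing integer versions and a hypothetical `fetch(version)` callable; `Snapshot`, `build_analyzers`, and `ReloadingReplica` are illustrative names, and the "analyzers" are toy tokenizers rather than real Lucene analyzer chains. The replica ignores duplicate or out-of-order notifications, and rebuilds state-derived structures as part of the same snapshot instead of caching them at init time.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Snapshot:
    """State plus everything derived from it, built together and swapped as one."""
    version: int
    fields: Mapping[str, str]
    analyzers: Mapping[str, Callable[[str], list]]  # derived from fields


def build_analyzers(fields: Mapping[str, str]) -> dict:
    # Hypothetical derivation: "text" fields get a lowercasing whitespace
    # tokenizer; every other field type is kept as a single token.
    def text(value: str) -> list:
        return value.lower().split()

    def keyword(value: str) -> list:
        return [value]

    return {name: (text if ftype == "text" else keyword)
            for name, ftype in fields.items()}


class ReloadingReplica:
    def __init__(self, fetch: Callable[[int], Mapping[str, str]]) -> None:
        self._fetch = fetch
        self._lock = threading.Lock()  # serializes reloads; readers take no lock
        self.snapshot = Snapshot(0, {}, {})
        self.reload_count = 0

    def on_notify(self, version: int) -> bool:
        with self._lock:
            if version <= self.snapshot.version:
                return False  # duplicate or out-of-order notification: no-op
            fields = dict(self._fetch(version))
            # Build the derived structures first, then publish in one assignment.
            self.snapshot = Snapshot(version, fields, build_analyzers(fields))
            self.reload_count += 1
            return True


store = {
    1: {"title": "text"},
    2: {"title": "text", "sku": "keyword"},
}
replica = ReloadingReplica(store.__getitem__)
results = [
    replica.on_notify(2),  # newer version: reload
    replica.on_notify(1),  # stale notification arriving late: ignored
    replica.on_notify(2),  # duplicate notification: ignored
]
```

Because each state object is complete on its own, skipping straight to version 2 is safe in this sketch; a design based on deltas would need to apply every intermediate version instead.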

Canonical instance: Yelp Nrtsearch 1.0.0¶

Pre-1.0 Nrtsearch state propagation was:

"1. Issue requests to the primary to change index state 2. Issue an index commit request on the primary, which makes the data and state durable on local storage (EBS) 3. Issue a backup request on the primary, which makes the latest committed state and index data durable in remote storage (S3) 4. Restart all cluster replicas to load the latest remote state and data" (sources/2025-05-08-yelp-nrtsearch-100-incremental-backups-lucene-10)

Four steps, the last of which is a rolling restart of the whole replica fleet.

The 1.0 redesign collapses this to:

"The state update process is now simplified to: Issue requests to the primary to change state / Hot reload state on all replicas. This greatly reduces the time needed to apply a change to the whole cluster."

Combined with:

patterns/decoupled-state-commit-from-data-commit — state commit happens per-request (not batched with a data commit);
concepts/immutable-index-state — readers always see a stable, committed snapshot of state for their request's lifetime.

Why this matters¶

Rolling-restart state propagation creates a silent operational pressure: operators batch schema changes so that they pay the restart cost only once a day or once a week, which slows schema evolution across the product. Hot reload makes schema changes cheap enough to happen at application-request rate, which in turn unlocks things like per-tenant schema customisation, auto-generated field mappings for new data types, and runtime tuning of live-settings without a change window.

Tradeoffs¶

Complexity of reload paths. Every piece of code that caches state-derived structures must be refactored to observe the reload signal. Hot reload is a whole-codebase discipline, not a single-module change.
In-flight consistency must be handled carefully — this is where the immutable-state discipline pays for itself. Without it, hot reload is race-prone.
Reload failures. If a replica fails to load new state, it must either (a) keep serving on the old state and alert, or (b) self-remove from the cluster. Silently serving stale state is the worst outcome.
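
One possible policy for the reload-failure case combines both options: keep serving the old state and log loudly on each failure, then leave rotation after repeated failures. The sketch below is an assumption-laden illustration, not a description of Nrtsearch: `GuardedReplica`, the `max_failures` threshold, and the `in_rotation` flag (which a health check would expose) are all hypothetical.

```python
import logging
from typing import Callable, Mapping

log = logging.getLogger("hot_reload")


class GuardedReplica:
    def __init__(self, fetch: Callable[[int], Mapping[str, str]],
                 max_failures: int = 3) -> None:
        self._fetch = fetch
        self._max_failures = max_failures
        self.current = (0, {})   # (version, state), replaced as one tuple
        self.failures = 0        # consecutive failed reloads
        self.in_rotation = True  # hypothetical flag read by a health check

    def reload(self, version: int) -> bool:
        try:
            new_state = dict(self._fetch(version))
        except Exception:
            # Option (a): keep serving the old state, but make the failure visible.
            self.failures += 1
            log.exception("reload to version %d failed; still serving version %d",
                          version, self.current[0])
            if self.failures >= self._max_failures:
                # Option (b): stop taking traffic rather than serve stale state.
                self.in_rotation = False
            return False
        self.current = (version, new_state)
        self.failures = 0
        return True


def flaky_fetch(version: int) -> Mapping[str, str]:
    # Hypothetical backend that cannot serve version 2.
    if version == 2:
        raise IOError("backend unavailable")
    return {"title": "text"}


replica = GuardedReplica(flaky_fetch, max_failures=2)
first = replica.reload(1)   # succeeds
second = replica.reload(2)  # fails; replica keeps serving version 1
third = replica.reload(2)   # fails again; replica leaves rotation
```

The important property is that no failure path is silent: every failed reload is logged with both the target and the currently served version, and the replica's rotation status changes once the threshold is reached.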

Adjacent patterns¶

patterns/hot-reloadable-configuration — the broader shape, applied to config reload rather than cluster state.
patterns/settings-aware-connection-pool — hot-reload discipline applied to database connection pools as settings change.

Seen in¶

sources/2025-05-08-yelp-nrtsearch-100-incremental-backups-lucene-10 — canonical wiki instance. Nrtsearch 1.0.0's state-management overhaul replaces replica-restart-on-state-change with hot reload, collapsing a four-step propagation process into two steps.