PATTERN Cited by 1 source

State machine as query lifecycle manager

Summary

Build a distributed-query engine's query lifecycle manager (the component that schedules, tracks, cancels, and tears down in-flight queries) as a deterministic state machine per query, with every transition logged and explicit teardown at every terminal state. The alternative (ad-hoc shared-state coordination, implicit cleanup, and cancellation that spawns threads to escape deadlocks) produces stuck queries, state disagreement across components, silent resource leaks, and cancellation paths that are hard to reason about.

Canonical wiki instance: Oxla's 2026-01-27 query-manager rewrite (Source: sources/2026-01-27-redpanda-engineering-den-query-manager-implementation-demo).

Three composed properties

The pattern requires three properties holding together:

  1. Deterministic state machine — explicit states, enumerated events, total transition function. See concepts/deterministic-state-machine-for-lifecycle.
  2. Every transition logged — state-trajectory reconstructable from logs alone, enabling post-hoc debugging without reproduction. See concepts/state-transition-logging.
  3. Explicit teardown at terminal states — reaching Finished / Cancelled / Failed atomically releases resources. See concepts/explicit-teardown-on-completion.
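
The three properties can be sketched together in a few lines. This is a minimal illustration, not Oxla's implementation: the state and event names, the TRANSITIONS table, and the QueryLifecycle class are all assumptions made for the example, resources are stand-in strings, and thread safety is left out.

```python
import enum
import logging

logger = logging.getLogger("query_lifecycle")


class State(enum.Enum):
    SCHEDULED = "scheduled"
    RUNNING = "running"
    FINISHED = "finished"
    CANCELLED = "cancelled"
    FAILED = "failed"


class Event(enum.Enum):
    START = "start"
    COMPLETE = "complete"
    CANCEL = "cancel"
    ERROR = "error"


TERMINAL = frozenset({State.FINISHED, State.CANCELLED, State.FAILED})

# One entry per (non-terminal state, event) pair, so the function is total.
# Events that make no sense in a state are treated as protocol violations.
TRANSITIONS = {
    (State.SCHEDULED, Event.START): State.RUNNING,
    (State.SCHEDULED, Event.COMPLETE): State.FAILED,  # protocol violation
    (State.SCHEDULED, Event.CANCEL): State.CANCELLED,
    (State.SCHEDULED, Event.ERROR): State.FAILED,
    (State.RUNNING, Event.START): State.FAILED,  # protocol violation
    (State.RUNNING, Event.COMPLETE): State.FINISHED,
    (State.RUNNING, Event.CANCEL): State.CANCELLED,
    (State.RUNNING, Event.ERROR): State.FAILED,
}


class QueryLifecycle:
    """Hypothetical per-query state machine (single-threaded sketch)."""

    def __init__(self, query_id, resources):
        self.query_id = query_id
        self.state = State.SCHEDULED
        self.resources = list(resources)  # stand-ins for memory, sockets, etc.

    def handle(self, event):
        old = self.state
        # Terminal states are absorbing: later events are logged but change nothing.
        new = old if old in TERMINAL else TRANSITIONS[(old, event)]
        logger.info("query=%s %s --%s--> %s",
                    self.query_id, old.name, event.name, new.name)
        self.state = new
        if new in TERMINAL and old not in TERMINAL:
            self._teardown()  # runs exactly once, on entry to a terminal state
        return new

    def _teardown(self):
        released = len(self.resources)
        self.resources.clear()
        logger.info("query=%s teardown released=%d", self.query_id, released)
```

For example, a query that is started and then cancelled ends in CANCELLED with its resources released, and a late COMPLETE event is logged without changing the state.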

Any one property alone doesn't deliver the pattern. All three together produce Oxla's reported outcome: no stuck queries, no state disagreement, no resource leaks, debug-in-days-not-weeks.

Contrast: what it replaces

Oxla's pre-rewrite substrate exhibited five failure modes, quoted verbatim:

  1. "Queries getting stuck without a clean cancellation path"
  2. "Inconsistent state reporting (scheduled vs. finished)"
  3. "Resources leaking indefinitely"
  4. "Cancellation logic spawning new threads to avoid deadlocks"
  5. "No visibility into what's happening"

Each failure maps to a missing property:

  • Stuck queries → no enumerated states; "stuck" is an implicit state not in the state machine.
  • Inconsistent state reporting → state is maintained in multiple places without an authoritative source.
  • Resource leaks → no explicit teardown on terminal-state transition.
  • Thread-spawn cancellation → see concepts/async-cancellation-thread-spawn-antipattern — cancellation isn't a state-machine event, so it has to reconstruct state at cancellation time.
  • No visibility → no transition logging — the state machine's trajectory isn't recorded.

The pattern's three properties structurally address all five.
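
As a sketch of how cancellation can be an ordinary event rather than a thread-spawning special case, the example below gives each query a single owner that applies events from an inbox. It is an assumption-laden illustration, not the design described in the source: QueryActor, post, drain, and the string state names are hypothetical. The caller that cancels only enqueues an event and never touches query state, so it has no query lock to deadlock on.

```python
import queue

TERMINAL = {"FINISHED", "CANCELLED", "FAILED"}

# Hypothetical transition table; unlisted pairs are protocol violations.
TABLE = {
    ("SCHEDULED", "START"): "RUNNING",
    ("SCHEDULED", "CANCEL"): "CANCELLED",
    ("RUNNING", "COMPLETE"): "FINISHED",
    ("RUNNING", "CANCEL"): "CANCELLED",
    ("RUNNING", "ERROR"): "FAILED",
}


class QueryActor:
    """Single authoritative owner of one query's state (illustrative only)."""

    def __init__(self, query_id):
        self.query_id = query_id
        self.state = "SCHEDULED"
        self.inbox = queue.Queue()  # thread-safe FIFO from the standard library
        self.log = []               # (old, event, new) records, one per event

    def post(self, event):
        """Callable from any thread: enqueue only, never read or write state."""
        self.inbox.put(event)

    def drain(self):
        """Apply pending events in arrival order; run on the owning thread."""
        while True:
            try:
                event = self.inbox.get_nowait()
            except queue.Empty:
                return self.state
            self._apply(event)

    def _apply(self, event):
        old = self.state
        if old in TERMINAL:
            new = old  # absorbing: late events cannot resurrect a query
        else:
            new = TABLE.get((old, event), "FAILED")
        self.log.append((old, event, new))
        self.state = new
```

Because only drain mutates the state, every component that asks for the query's status reads the same value, and the log records each event that was applied.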

Outcomes claimed

Oxla's 2026-01-27 post names six outcomes of the rewrite verbatim:

  • "Zero stuck queries"
  • "Complete logging of all state transitions"
  • "Clear visibility into current state and events"
  • "Fast debugging (issues fixed in days instead of weeks)"
  • "25,000 queries tested successfully across 1-3 node cluster"
  • "Ready for rollout!"

The debuggability claim verbatim: "Bugs still happened, as they always do with new code, but they were much easier to track down. Being able to trace state transitions made fixes straightforward instead of exploratory."

Applicability

This pattern applies when you are managing:

  • Long-lived concurrent workloads where each workload has a non-trivial lifecycle (not just start → end).
  • External-to-the-workload cancellation: the lifecycle must respond to cancellation from outside the workload itself (timeouts, admin kill, parent cancellation).
  • Multiple terminal states with different cleanup semantics (normal completion, user cancellation, error failure).
  • Debuggability requirements — you need to reconstruct what happened after the fact.

Query managers are the canonical instance. Other applicable substrates:

Cost / trade-offs

  • Upfront design cost: enumerating all states and transitions is harder than letting state emerge. The same trade-off applies to strong typing, stateful workflow DSLs, and explicit protocol design.
  • Transition overhead: every transition costs one log record plus event-dispatch coordination, and each terminal transition adds a teardown routine invocation. For high-QPS workloads (many short queries), this is non-trivial. Oxla's rewrite didn't disclose throughput numbers.
  • State-explosion risk: naive state machine design can produce exponentially many states as concurrent factors multiply. The discipline is to keep the state space minimal and orthogonal, using hierarchical state machines or composition of smaller machines.
  • Logging storage cost: transition logging generates one record per transition. At scale, this can be significant.
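
The state-explosion point is easy to see with a little arithmetic. The factor names and state counts below are made up for illustration and do not come from the source: a flat machine needs one state per combination of factors, while composed orthogonal machines only need the sum of their individual states.

```python
import math

# Hypothetical orthogonal factors and how many states each one has.
factors = {
    "lifecycle": 5,           # e.g. scheduled, running, finished, cancelled, failed
    "memory_reservation": 3,  # illustrative
    "result_streaming": 4,    # illustrative
}

# A single flat machine enumerates every combination of factor states.
flat_states = math.prod(factors.values())

# Composing one small machine per factor only enumerates each factor's states.
composed_states = sum(factors.values())


def flat_size(states_per_factor, factor_count):
    """States in a flat machine built from equally sized independent factors."""
    return states_per_factor ** factor_count


def composed_size(states_per_factor, factor_count):
    """Total states across one small machine per factor."""
    return states_per_factor * factor_count
```

With these numbers the flat machine has 60 states against 12 for the composed design, and six three-state factors give 729 flat states against 18.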

Composition with other patterns

  • Composes with durable event log as audit envelope: if the state machine's transition log is written to a durable streaming log, the audit trail + replay substrate emerges naturally. This is the shape ADP converges to for agent interactions.
  • Composes with two-phase tentative-then-complete: query manager states can be modelled as tentative (in flight) and durable (committed), with cancellation as a state-machine event.
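
If transitions are written to a durable log, current state can be rebuilt by replaying it. The sketch below assumes a JSON-lines record shape with query_id, from, event, and to fields; that schema and the function names are hypothetical, not something the source specifies.

```python
import json

TERMINAL = {"FINISHED", "CANCELLED", "FAILED"}


def replay(lines):
    """Fold transition records into the last known state of each query.

    Raises ValueError if a record's "from" state disagrees with the state
    reached by the previous record for the same query, which would indicate
    a gap or corruption in the log.
    """
    current = {}
    for line in lines:
        rec = json.loads(line)
        qid = rec["query_id"]
        prev = current.get(qid)
        if prev is not None and prev != rec["from"]:
            raise ValueError(
                f"query {qid}: log says from={rec['from']} but replay is at {prev}"
            )
        current[qid] = rec["to"]
    return current


def in_flight(current):
    """Queries whose last logged state is not terminal."""
    return sorted(q for q, state in current.items() if state not in TERMINAL)


sample_log = [
    '{"query_id": "q1", "from": "SCHEDULED", "event": "START", "to": "RUNNING"}',
    '{"query_id": "q2", "from": "SCHEDULED", "event": "START", "to": "RUNNING"}',
    '{"query_id": "q1", "from": "RUNNING", "event": "COMPLETE", "to": "FINISHED"}',
]
```

Replaying sample_log yields q1 as FINISHED and q2 as RUNNING, so in_flight reports only q2. The same fold is what makes post-hoc debugging possible without reproducing the run.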

Seen in
