CONCEPT Cited by 1 source
Query lifecycle manager¶
Definition¶
A query lifecycle manager is the component of a distributed SQL engine "responsible for the lifecycle of currently-running queries": scheduling them onto executors, tracking their progress, cancelling them on request, handling their completion, and releasing their resources (Source: sources/2026-01-27-redpanda-engineering-den-query-manager-implementation-demo).
Distinct from:
- Query planner / optimiser — which produces the plan from SQL text; the manager executes the plan. Planner work completes before manager work starts.
- Query executor — the per-node / per-thread worker that runs a piece of the plan. The manager orchestrates executors; it doesn't do the data processing itself.
The manager sits between the planner (upstream) and the executors (downstream), owning the query's identity and lifecycle.
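A minimal sketch of this division of labour, under loud assumptions: `plan_query`, `run_stage`, `Manager`, and the representation of a plan as a list of stage names are all hypothetical illustrations, not Oxla's API. The point is only where each responsibility lives: the planner finishes first, the manager assigns an identity and dispatches, and the executors do the actual work.

```python
from typing import Callable, List, Tuple

Plan = List[str]                 # assumption: a plan is just a list of stage names
Executor = Callable[[str], str]  # assumption: an executor runs one stage, returns a result


def plan_query(sql: str) -> Plan:
    """Stand-in planner: its work is complete before the manager starts."""
    return [f"scan:{sql}", "aggregate"]


def run_stage(stage: str) -> str:
    """Stand-in executor: the only place where data processing would happen."""
    return f"done({stage})"


class Manager:
    """Owns query identity and orchestration; delegates the processing."""

    def __init__(self, executors: List[Executor]):
        self._executors = executors
        self._next_id = 0

    def run(self, plan: Plan) -> Tuple[int, List[str]]:
        # The manager, not the planner or executor, assigns the query's identity.
        self._next_id += 1
        query_id = self._next_id
        # Round-robin each stage onto an executor; the manager never
        # touches the data itself.
        results = [
            self._executors[i % len(self._executors)](stage)
            for i, stage in enumerate(plan)
        ]
        return query_id, results
```

For example, `Manager([run_stage]).run(plan_query("orders"))` hands back a fresh query id together with one result per planned stage.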
Canonical wiki instance¶
Oxla's 2026-01-27 query-manager rewrite. The Oxla team's own post is the first canonical wiki disclosure of a query lifecycle manager as a named component; prior wiki coverage of distributed query engines (Presto, Trino, Spark SQL, Snowflake warehouses, Vitess evalengine) describes planners and executors but names their lifecycle-management components only implicitly.
The Oxla rewrite framed the manager as the component where the previous implementation's reliability bugs concentrated:
- Scheduling bugs: queries stuck in the scheduling phase without progress.
- Cancellation bugs: "spawned async work per thread, retried cancellation from a different thread entirely" — the cancellation code path was the worst-affected substrate. See concepts/async-cancellation-thread-spawn-antipattern.
- State-reporting bugs: "a query might show as scheduled in one place and finished in another", so two status lookups for the same query could return different answers.
- Resource-leak bugs: "queries could get stuck in 'finished' or 'executing' while still holding onto resources" — the manager's teardown path was not reliably triggered.
Primary responsibilities¶
A production-grade query lifecycle manager must handle:
- Schedule: place a new query into the lifecycle (assign executors, allocate resources).
- Execute: track the query's progress through its plan (stages completing, intermediate results flowing).
- Cancel: respond to an external cancellation request (timeout, user abort, admin kill) by terminating execution and reclaiming resources — a non-trivial concurrency problem (see concepts/request-cancellation).
- Complete: handle normal termination by collecting the final result, marking the query done, and tearing down executors (see concepts/explicit-teardown-on-completion).
- Report: expose query state to external observers (admin tools, health checks, observability pipelines).
- Restart (optional): handle partial-failure recovery — some engines retry failed stages, some retry the whole query, some fail fast.
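The responsibilities above can be sketched as a small in-memory interface. This is an illustrative toy, not Oxla's implementation: the class name `InMemoryLifecycleManager`, the `Phase` values, and the executor-pool model are all assumptions, and the optional restart responsibility is left out. It keeps one record per query as the single source of truth, so `report` cannot disagree with itself, and it makes cancellation and completion idempotent so a repeated request cannot release resources twice.

```python
import itertools
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Phase(Enum):
    SCHEDULED = "scheduled"
    EXECUTING = "executing"
    CANCELLED = "cancelled"
    FINISHED = "finished"


TERMINAL = (Phase.CANCELLED, Phase.FINISHED)


@dataclass
class QueryRecord:
    query_id: int
    phase: Phase
    executors: List[str] = field(default_factory=list)
    stages_done: int = 0


class InMemoryLifecycleManager:
    """Hypothetical single-threaded manager; one record per query."""

    def __init__(self, executor_pool: List[str]):
        self._free = list(executor_pool)
        self._queries: Dict[int, QueryRecord] = {}
        self._ids = itertools.count(1)

    def schedule(self, executors_needed: int) -> int:
        """Schedule: assign executors and register the query."""
        if executors_needed > len(self._free):
            raise RuntimeError("not enough free executors")
        assigned = [self._free.pop() for _ in range(executors_needed)]
        qid = next(self._ids)
        self._queries[qid] = QueryRecord(qid, Phase.SCHEDULED, assigned)
        return qid

    def stage_completed(self, qid: int) -> None:
        """Execute: record progress; ignored once the query is terminal."""
        rec = self._queries[qid]
        if rec.phase in TERMINAL:
            return
        rec.phase = Phase.EXECUTING
        rec.stages_done += 1

    def cancel(self, qid: int) -> None:
        """Cancel: terminate and reclaim resources."""
        self._finish(qid, Phase.CANCELLED)

    def complete(self, qid: int) -> None:
        """Complete: normal termination with teardown."""
        self._finish(qid, Phase.FINISHED)

    def report(self, qid: int) -> Phase:
        """Report: every observer reads the same record."""
        return self._queries[qid].phase

    def free_executors(self) -> int:
        return len(self._free)

    def _finish(self, qid: int, terminal: Phase) -> None:
        rec = self._queries[qid]
        if rec.phase in TERMINAL:
            return  # already terminal: do not release executors twice
        rec.phase = terminal
        self._free.extend(rec.executors)
        rec.executors = []
```

A real manager would also need to guard these methods against concurrent callers; the sketch sidesteps that by assuming a single thread.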
The Oxla architectural claim¶
The 2026-01-27 post's load-bearing claim: the manager should be built as a deterministic state machine, with every transition logged (see concepts/state-transition-logging) and explicit teardown at terminal states (see concepts/explicit-teardown-on-completion). This composition forms the patterns/state-machine-as-query-lifecycle-manager pattern.
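As a rough illustration of that composition, the sketch below combines an explicit transition table, a log entry on every transition, and a teardown callback fired on entry to any terminal state. The state names, the `ALLOWED` table, and the `QueryStateMachine` class are assumptions made for the example; the source does not publish Oxla's actual states or transitions.

```python
import logging
from enum import Enum, auto
from typing import Callable, List, Tuple

log = logging.getLogger("query_lifecycle")


class State(Enum):
    SCHEDULED = auto()
    EXECUTING = auto()
    FINISHED = auto()
    CANCELLED = auto()
    FAILED = auto()


TERMINAL = {State.FINISHED, State.CANCELLED, State.FAILED}

# Hypothetical transition table: anything not listed here is illegal.
ALLOWED = {
    State.SCHEDULED: {State.EXECUTING, State.CANCELLED, State.FAILED},
    State.EXECUTING: {State.FINISHED, State.CANCELLED, State.FAILED},
}


class IllegalTransition(Exception):
    pass


class QueryStateMachine:
    def __init__(self, query_id: str, teardown: Callable[[], None]):
        self.query_id = query_id
        self.state = State.SCHEDULED
        self.history: List[Tuple[State, State, str]] = []
        self._teardown = teardown

    def transition(self, new_state: State, reason: str) -> None:
        # Terminal states have no entry in ALLOWED, so nothing can leave them.
        if new_state not in ALLOWED.get(self.state, set()):
            raise IllegalTransition(
                f"{self.query_id}: {self.state.name} -> {new_state.name}"
            )
        old, self.state = self.state, new_state
        # Every transition is recorded, both in memory and in the log.
        self.history.append((old, new_state, reason))
        log.info("query=%s %s -> %s (%s)",
                 self.query_id, old.name, new_state.name, reason)
        # Teardown is tied to entering a terminal state, so it runs
        # exactly once per query in this single-threaded sketch.
        if new_state in TERMINAL:
            self._teardown()
```

Because the only way to change `state` is through `transition`, a query cannot reach a terminal state without the teardown callback running, and a late request against a terminal query raises instead of silently mutating state. Thread safety is out of scope here.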
The reliability argument: ad-hoc shared-state management produces the pathologies Oxla named (stuck queries, state disagreement, resource leaks, cancellation failures). An explicit state machine is meant to rule them out by construction, because every state change has to pass through a defined transition.
The debuggability argument: when bugs do happen, trajectory reconstruction from the transition log makes them "fixed in days instead of weeks".
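To make trajectory reconstruction concrete, here is a small sketch that rebuilds each query's path from a transition log and flags queries that never reached a terminal state. The log line format (`<timestamp> <query_id> <from_state> <to_state>`), the state names, and the function names are assumptions for illustration; the source does not describe Oxla's log format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical terminal state names.
TERMINAL = {"FINISHED", "CANCELLED", "FAILED"}

Step = Tuple[int, str, str]  # (timestamp, from_state, to_state)


def trajectories(log_text: str) -> Dict[str, List[Step]]:
    """Group transition-log lines by query id, ordered by timestamp."""
    result: Dict[str, List[Step]] = defaultdict(list)
    for line in log_text.splitlines():
        if not line.strip():
            continue
        ts, qid, src, dst = line.split()
        result[qid].append((int(ts), src, dst))
    for steps in result.values():
        steps.sort()
    return dict(result)


def stuck_queries(trajs: Dict[str, List[Step]]) -> List[str]:
    """Queries whose last logged transition did not reach a terminal state."""
    return sorted(q for q, steps in trajs.items() if steps[-1][2] not in TERMINAL)


def broken_chains(trajs: Dict[str, List[Step]]) -> List[str]:
    """Queries where one step's target differs from the next step's source,
    which would mean a state change happened without being logged."""
    bad = []
    for q, steps in trajs.items():
        for prev, nxt in zip(steps, steps[1:]):
            if prev[2] != nxt[1]:
                bad.append(q)
                break
    return sorted(bad)
```

Given a log in which `q1` goes `SCHEDULED -> EXECUTING -> FINISHED` and `q2` only logs `SCHEDULED -> EXECUTING`, `stuck_queries` would report `q2`, which is the kind of question an unlogged, ad-hoc design cannot answer after the fact.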
Composability¶
The query lifecycle manager is one instance of a broader pattern: a stateful resource managed through its full lifetime by a per-resource state machine. Related lifecycle-management substrates:
- Consensus request lifecycle: see concepts/two-phase-completion-protocol + the tentative/durable/cancelled states.
- Leader lifecycle: see concepts/leader-establishment + concepts/leader-revocation.
- Workflow lifecycle: see concepts/fault-tolerant-long-running-workflow.
- Connection lifecycle: TCP state machine (classic formal example).
All share the deterministic-state-machine + explicit-transitions + terminal-teardown structure.
Seen in¶
- sources/2026-01-27-redpanda-engineering-den-query-manager-implementation-demo — canonical wiki instance. Oxla 2026-01-27 query-manager rewrite disclosure. First wiki canonicalisation of query-lifecycle-manager as a named substrate.
Related¶
- concepts/deterministic-state-machine-for-lifecycle
- concepts/state-transition-logging
- concepts/explicit-teardown-on-completion
- concepts/async-cancellation-thread-spawn-antipattern
- concepts/query-planner
- concepts/request-cancellation
- concepts/two-phase-completion-protocol
- concepts/fault-tolerant-long-running-workflow
- patterns/state-machine-as-query-lifecycle-manager
- systems/oxla