Redpanda — Engineering Den: Query manager implementation demo

Summary

First post in Redpanda's new Engineering Den series, a short (~600 words) post-acquisition disclosure from the Oxla team covering their rewrite of the query manager — the component "responsible for the lifecycle of currently-running queries" (scheduling, cancelling, restarting). The old manager suffered from ambiguous state (queries stuck in finished or executing while still holding resources; different parts of the system disagreed about what was happening) and a pathological cancellation path (spawned async work per thread, retried cancellation from a different thread to avoid deadlocks). Instead of patching, the team rebuilt the manager as a deterministic state machine with every transition logged, explicit teardown on completion, and no ambiguity about current state. Tested on ~25,000 queries across one- and three-node clusters without reproducing the prior pathologies. Expected in production "within days" of the post.

Key takeaways

  1. State management was the root cause, not cancellation. The cancellation bugs in the old manager were downstream of ambiguous state reporting: "Queries could get stuck in 'finished' or 'executing' while still holding onto resources. Different parts of the system disagreed about what was actually happening. A query might show as scheduled in one place and finished in another." From the outside, the cluster looked healthy; internally, state was inconsistent across components. Canonicalises the query lifecycle manager as the missing substrate.

  2. Cancellation-spawning-threads-to-avoid-deadlocks is an anti-pattern. Verbatim: "To avoid deadlocks, the old code gathered running queries, spawned async work per thread, and sometimes had to retry cancellation from a different thread entirely. That approach had already caused problems in the past, and it made the system hard to reason about and harder to debug." Canonicalises async-cancellation-thread-spawn as an anti-pattern — spawning more concurrency to avoid deadlock makes the system harder to reason about, not easier.

  3. Deterministic state machine replaces the ad-hoc manager. Verbatim: "The new scheduler is built as a deterministic state machine. At any point, it's in a known state, handling a specific event, and transitioning predictably. Every transition is logged." Canonicalises deterministic state machine for lifecycle and the composed pattern patterns/state-machine-as-query-lifecycle-manager. The substrate shift is from emergent behaviour to enumerated transitions — the set of states and events is fixed, transitions are total functions.
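The post gives no code or state diagram, so here is a minimal hypothetical sketch of the pattern it names: states and events as fixed enums, a transition table that only accepts enumerated pairs, and every transition appended to a log. The state and event names are illustrative, not Oxla's actual set.

```python
from enum import Enum, auto


class State(Enum):
    SCHEDULED = auto()
    EXECUTING = auto()
    CANCELED = auto()
    FINISHED = auto()


class Event(Enum):
    START = auto()
    CANCEL = auto()
    COMPLETE = auto()


# Enumerated transitions: any (state, event) pair not listed here is
# rejected outright, so the machine can never drift into ambiguous state.
TRANSITIONS = {
    (State.SCHEDULED, Event.START): State.EXECUTING,
    (State.SCHEDULED, Event.CANCEL): State.CANCELED,
    (State.EXECUTING, Event.CANCEL): State.CANCELED,
    (State.EXECUTING, Event.COMPLETE): State.FINISHED,
}


class QueryLifecycle:
    """One query's lifecycle: always in a known state, every transition logged."""

    def __init__(self, query_id: str):
        self.query_id = query_id
        self.state = State.SCHEDULED
        self.log: list[str] = []

    def handle(self, event: Event) -> State:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition: {self.state.name} on {event.name}")
        new_state = TRANSITIONS[key]
        # Log before committing, so the trace shows exactly where we were.
        self.log.append(
            f"{self.query_id}: {self.state.name} --{event.name}--> {new_state.name}"
        )
        self.state = new_state
        return new_state
```

Usage: `QueryLifecycle("q1").handle(Event.START)` moves the query to `EXECUTING` and records the transition; replaying `log` afterwards reconstructs the whole lifecycle, which is the "fixes straightforward instead of exploratory" payoff the post claims.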

  4. Every transition logged = debuggability is a first-class property. Verbatim: "Every transition is logged. That means when something goes wrong, you can look at the logs and see exactly where the scheduler was and what it was doing. There's no ambiguity about whether a query is running, scheduled, canceled, or done. The system always knows, and you can see it." Canonicalises state-transition logging as the reliability property that converts lifecycle debugging from exploratory to deterministic. Claimed payoff: "Bugs still happened, as they always do with new code, but they were much easier to track down. Being able to trace state transitions made fixes straightforward instead of exploratory." — issues "fixed in days instead of weeks".

  5. Explicit teardown is the resource-leak fix. Verbatim: "Cleanup is also explicit now. When a query finishes, the scheduler and executors are torn down cleanly. Finished queries stay finished. Canceled queries are accounted for. Nothing hangs around quietly consuming resources anymore." Canonicalises explicit teardown on completion as the companion to deterministic state — the terminal state is not just a label; it's also a scheduled resource-release point. Resolves the "stuck in 'finished' while holding resources" pathology.

  6. ~25,000-query correctness test on 1- and 3-node clusters. Verbatim: "We've run around 25,000 queries on one- and three- node clusters with the new implementation and haven't seen the kinds of issues that were common before. No stuck queries, no confusing state, no guessing what the system thinks is happening." The validation frame is absence of pathologies, not throughput/latency benchmarks — reliability-first testing. Production rollout "within days" of post.

Systems / concepts / patterns extracted

Operational numbers

  • ~25,000 queries run on the new implementation without reproducing prior pathologies.
  • 1- and 3-node clusters — validation cluster shapes at post time.
  • Issues fixed "in days instead of weeks" vs the prior implementation — self-reported debugging-time delta (not measured rigorously).
  • "Within days" — production rollout target from post-publication.
  • No throughput, latency, or concurrency numbers disclosed.
  • No failure-injection / chaos test shape disclosed.

Caveats

  • Short engineering-diary voice, not technical deep-dive. ~600 words total; no code snippets, no state diagram, no enumeration of states or events, no discussion of the exact shape of the state machine (Mealy vs Moore; who drives transitions; how events are serialised). Canonicalises the reliability doctrine but not its mechanism.
  • "Deterministic" is claimed, not shown. The post asserts "deterministic state machine" — no formalisation (TLA+, Alloy, property-based test, model-check). Determinism at the transition-function level is consistent with non-determinism at the event-ordering level (concurrent event arrivals, thread scheduling). The post doesn't address this.
  • Cancellation path not fully detailed. The old problem — "spawned async work per thread, retried cancellation from a different thread" — is named. The new solution substrate (state machine + logging) is named. But the actual new cancellation protocol (does the manager thread handle all state transitions? how does a worker thread signal cancellation to the manager? what's the concurrency model?) is elided.
  • Failure modes not enumerated. No discussion of what happens if the manager thread itself crashes, or if the state-transition log outpaces disk / the observability pipeline. All the existing-pathology framings are clean; the new-failure-modes framing is absent.
  • 25K-query sample is modest. One- and three-node clusters are developer-scale, not production-scale. Larger-cluster testing is named as "better prepared for scale by building confidence with large node number clusters" — aspirational.
  • Compared against its own prior implementation only. No comparison to other distributed SQL engines' query managers (Presto/Trino, Spark SQL driver, Snowflake warehouse manager, Dremel). The architectural claim (deterministic state machine for query lifecycle) is a classic shape; the novelty is internal to Oxla's rewrite, not the industry.
  • Engineering Den is a new series. First post announced "a new series where our engineers give you a quick peek under the hood at how they're upgrading the Redpanda Streaming engine and Agentic Data Plane" — future posts expected; the series promises lightweight engineering-diary altitude, not retrospectives or deep-dives.
  • Post-acquisition context. Oxla was acquired in October 2025 as part of the Agentic Data Plane launch; the product was already shipping pre-acquisition. Unclear whether the query-manager rewrite predates or postdates the acquisition. Post reads as continuation of pre-existing Oxla roadmap.
  • Unsigned. Default Redpanda (or Oxla-team) attribution; no named author.
  • No mention of agentic workload coupling. Oxla is framed elsewhere as the agentic-SQL surface for ADP; this post is about the query manager substrate, agent-workload-neutral.

Source

Last updated · 470 distilled / 1,213 read