REDPANDA 2026-06-02

Redpanda — How OmniNode uses Redpanda to scale AI agent workflows

A Redpanda Blog guest post (2026-06-02) by Jonah Gray, founder and CEO of OmniNode, on the migration of OmniNode's multi-agent coordination bus from Redis Streams to Redpanda at the ~100 event types / 12 repositories scale point, and on the contract-driven topic-naming discipline the team adopted when the topic name became the load-bearing coordination surface across independently-developed agents. Tier-3 source (Redpanda blog, guest-post genre) included on architecture-content grounds: the post is structurally an engineering narrative — "we used Redis Streams, then scale changed the problem, then we moved to Redpanda, then naming became the bug surface, then we made the contract own the names" — with concrete code, a named bug class (silent wiring failure from hyphen vs underscore), a disclosed extractor-and-three-call-sites topology, and a candid scope boundary ("creation is contract-owned; reconciliation is a different problem" — drift is acknowledged but unsolved).

A second strand of the post discloses OmniNode's routing layer: every AI-agent task is classified and routed to "the cheapest model that can actually do it" with auto-escalation on quality failures; every routing decision produces a receipt (model chosen, tokens, cost, compliance check). Disclosed week-of-numbers: "75% of tokens never left the building" (routed to four on-prem hosts at zero marginal cost), "$3.37 in cloud spend was avoided, compared to $2.43 actually spent", "1.3% of delegations" escalated to a stronger model.

Summary

OmniNode is "a contract-driven runtime for AI agent orchestration" that coordinates fleets of agents that build, test, review, and deploy software across multiple repositories simultaneously. Every unit of work is a node; every node ships a contract.yaml that declares "exactly what the node consumes and produces, what events it listens to, and what topics it publishes on the bus." The runtime uses these contracts to provision infrastructure, route tasks, validate completions, and enforce ordering across agents "that would otherwise have no idea what each other is doing." The event bus is "the spine of the whole system" — agents publish events when work is done and subscribe to events that trigger their next step; "each agent knows only its own inputs and outputs, while the bus handles the rest."

OmniNode originally used Redis Streams behind a transport-layer abstraction (publish via XADD, consume via XREADGROUP, topic names kept "in Python constants near the code that used them"). Apache Kafka was "explicitly deferred in the roadmap because the system was still small." Then "the system grew from 5 repositories to 12, and the event catalog surpassed 100 event types." The migration trigger was not throughput — it was coordination: consumer groups, partition-level parallelism, durable replay semantics, topic introspection, programmatic provisioning. Redpanda gave them "Kafka-API compatibility in a single binary"; the transport abstraction made the broker swap "straightforward."
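
Why a transport abstraction makes a broker swap "straightforward" can be sketched as below. This is a minimal illustration, not OmniNode's code: the `Transport` protocol, `InMemoryTransport`, and `run_agent` are hypothetical names, and an in-memory stand-in replaces Redis Streams and Redpanda so the example runs without a broker.

```python
from collections import defaultdict
from typing import Callable, Protocol

Handler = Callable[[dict], None]


class Transport(Protocol):
    """Hypothetical transport-layer interface; not OmniNode's actual API."""

    def publish(self, topic: str, event: dict) -> None: ...

    def subscribe(self, topic: str, group: str, handler: Handler) -> None: ...


class InMemoryTransport:
    """Stand-in broker so the sketch runs without Redis or Redpanda.

    A Redis-backed class would wrap XADD / XREADGROUP behind the same two
    methods; a Kafka-API-backed class would wrap a producer and a consumer.
    """

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def publish(self, topic: str, event: dict) -> None:
        # Deliver synchronously to every handler registered on this topic.
        for handler in self._handlers.get(topic, []):
            handler(event)

    def subscribe(self, topic: str, group: str, handler: Handler) -> None:
        # The group name is accepted but ignored by this toy implementation.
        self._handlers[topic].append(handler)


def run_agent(bus: Transport) -> list[dict]:
    """Agent code sees only the interface, so the broker behind it can change."""
    seen: list[dict] = []
    topic = "onex.evt.router.routing_complete.v1"
    bus.subscribe(topic, "reviewers", seen.append)
    bus.publish(topic, {"task": "t-1"})
    return seen
```

Because `run_agent` depends only on `Transport`, replacing the broker means adding one new class rather than touching agent code.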

The unexpected challenge was topic identity. With Redis Streams, "topics were usually owned end-to-end by a single developer": producer and consumer lived close together. With 100 Kafka-shaped topics spread across independent repositories, "the topic name became the critical coordination surface", the only thing connecting one agent's output to another agent's input. The disclosed bug: a producer published to onex.evt.router.routing-complete.v1; the consumer subscribed to onex.evt.router.routing_complete.v1. "Both services started cleanly, both topic names were accepted, and nothing failed — yet the routing pipeline silently stopped working." The same shape kept reappearing: pluralization differences, underscores versus hyphens, version suffix mismatches, renamed event segments, and old topics left behind after refactors; "the silence was the failure mode." The team realised this was "a naming problem before it was a schema problem."
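
The hyphen-versus-underscore bug described above can be reproduced with a toy bus. `LenientBus` and `orphaned_topics` are hypothetical names for illustration; the assumption is only that the bus accepts any topic name on both sides, which is the behaviour the post describes.

```python
from collections import defaultdict


class LenientBus:
    """Toy bus that accepts any topic name from producers and consumers."""

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.published = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        self.published[topic].append(event)
        for handler in self.subscribers.get(topic, []):
            handler(event)


def orphaned_topics(bus):
    """Report topics with traffic but no listener, and listeners with no traffic."""
    return {
        "published_unheard": sorted(
            t for t in bus.published if not bus.subscribers.get(t)
        ),
        "subscribed_silent": sorted(
            t for t in bus.subscribers if not bus.published.get(t)
        ),
    }


bus = LenientBus()
received = []

# Consumer uses an underscore, producer uses a hyphen (names from the post).
bus.subscribe("onex.evt.router.routing_complete.v1", received.append)
bus.publish("onex.evt.router.routing-complete.v1", {"task": "t-1"})

# Both calls succeeded and no exception was raised, yet nothing was delivered.
print(received)
print(orphaned_topics(bus))
```

Nothing in either call can fail, which is why the mismatch only becomes visible when something compares the two sides, as `orphaned_topics` does here.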

The fix: the contract owns the topic names. Every node ships a contract.yaml with a subscribe_topics list and a publish_topics list — "the contract owns the node's bus surface." Topic names follow a regex-validated shape onex.{kind}.{producer}.{event}.v{N}; "a regex validates the structure, and a StrEnum backs the canonical registry. The regex catches malformed names, while the enum catches names that are syntactically valid but not canonical." Crucially, "the contract becomes the only reviewed location where wire-format topic names live. There is no second operator-maintained registry, separate constant list hidden inside the runtime, or manually synchronized provisioning config."
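
The two-layer check (regex for shape, enum for canonical membership) might look like the sketch below. The post does not disclose the actual regex or registry, so the character classes in `TOPIC_SHAPE`, the single enum member, and the choice of the underscore spelling as canonical are all assumptions. A `str`-mixin `Enum` is used so the example runs on older interpreters; on Python 3.11+ `enum.StrEnum` is the direct equivalent.

```python
import re
from enum import Enum

# Assumed shape for onex.{kind}.{producer}.{event}.v{N}; the allowed
# characters per segment are a guess, not taken from the post.
TOPIC_SHAPE = re.compile(
    r"onex\.(?P<kind>[a-z]+)"
    r"\.(?P<producer>[a-z0-9_-]+)"
    r"\.(?P<event>[a-z0-9_-]+)"
    r"\.v(?P<version>[1-9][0-9]*)"
)


class CanonicalTopic(str, Enum):
    """Hypothetical registry; which spelling is canonical is an assumption."""

    ROUTING_COMPLETE = "onex.evt.router.routing_complete.v1"


def check_topic(name: str) -> str:
    """Return 'malformed', 'not canonical', or 'ok' for a topic name."""
    if TOPIC_SHAPE.fullmatch(name) is None:
        return "malformed"
    try:
        CanonicalTopic(name)  # lookup by value; raises ValueError if unknown
    except ValueError:
        return "not canonical"
    return "ok"
```

Under these assumptions the hyphenated name from the bug report passes the regex but fails the enum lookup, which is the case the second layer exists to catch.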

The implementation enforces this discipline via a single ContractTopicExtractor "that discovers approved packages, loads each contract.yaml, and returns the union of declared topics", run in three independent places: (1) Pre-deploy CI ("creates the declared topics against the broker"; smoke tests fail before the change merges), (2) Runtime boot ("there is no fallback constant list… If the contracts are silent, the provisioner is silent"), (3) Post-boot validation ("if a declared topic is missing, say because topic creation partially failed during startup, the validator re-invokes provisioning"). The slogan: "a topic name can only be wrong in the contract. While the processes and lifecycle stages may differ, there is only one parser. Ultimately, being disciplined matters more than the infrastructure."
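
The one-extractor, three-call-site topology can be sketched as follows. Only the class name `ContractTopicExtractor` and the `event_bus.subscribe_topics` / `event_bus.publish_topics` keys come from the post; package discovery and YAML loading are replaced by already-parsed dictionaries, and `FakeBroker`, `provision`, the three call-site functions, and the node names are hypothetical.

```python
from typing import Iterable, Mapping


class ContractTopicExtractor:
    """Sketch: receives parsed contracts instead of discovering packages."""

    def __init__(self, contracts: Iterable[Mapping]) -> None:
        self._contracts = list(contracts)

    def declared_topics(self) -> set:
        topics = set()
        for contract in self._contracts:
            bus = contract.get("event_bus", {})
            topics.update(bus.get("subscribe_topics", []))
            topics.update(bus.get("publish_topics", []))
        return topics


class FakeBroker:
    """In-memory stand-in for the broker's admin interface."""

    def __init__(self) -> None:
        self.topics = set()

    def create_topic(self, name: str) -> None:
        self.topics.add(name)

    def list_topics(self) -> set:
        return set(self.topics)


def provision(extractor, broker) -> list:
    """Create declared topics that the broker lacks; return the names created."""
    missing = sorted(extractor.declared_topics() - broker.list_topics())
    for name in missing:
        broker.create_topic(name)
    return missing


# Three call sites, one parser: each stage asks the same extractor.
def ci_predeploy(extractor, broker) -> list:
    return provision(extractor, broker)


def runtime_boot(extractor, broker) -> list:
    return provision(extractor, broker)


def post_boot_validate(extractor, broker) -> list:
    missing = extractor.declared_topics() - broker.list_topics()
    return provision(extractor, broker) if missing else []


CONTRACTS = [
    {"name": "router", "event_bus": {"publish_topics": ["onex.evt.router.routing_complete.v1"]}},
    {"name": "reviewer", "event_bus": {"subscribe_topics": ["onex.evt.router.routing_complete.v1"]}},
]
```

If topic creation is lost after boot, `post_boot_validate` recreates it from the same contract data; with an empty contract list every stage does nothing, mirroring "if the contracts are silent, the provisioner is silent".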

The provisioner has "a deliberately narrow scope: it creates missing topics." Provisioning is "async, best-effort, and non-blocking." If a topic exists, "the provisioner leaves it alone. It does not: reconcile partition counts, reconcile replication factors, reconcile retention policies, mutate existing topics." If a topic was originally created with 6 partitions and the contract later requests 12, "the provisioner will not notice." This boundary is intentional: "creation is contract-owned; reconciliation is a different problem." The post explicitly names configuration drift (partition count, replication factor, retention policy) as the next gap and sketches a future drift-detection-as-query layer ("the contract describes what should exist, while the broker reports what does exist. A materialized projection over both turns drift detection into a query instead of a script. That reconciliation layer is not built yet.").
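
The gap between creation and reconciliation can be made concrete with the 6-versus-12 partition example. The post states that this layer is not built, so the sketch below is purely illustrative: `TopicSpec`, `create_missing`, and `drift` are hypothetical names, and the replication and retention values are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicSpec:
    partitions: int
    replication_factor: int
    retention_ms: int


def create_missing(declared: dict, actual: dict) -> list:
    """Narrow provisioner: adds absent topics and never touches existing ones."""
    created = []
    for name, spec in declared.items():
        if name not in actual:
            actual[name] = spec
            created.append(name)
    return created


def drift(declared: dict, actual: dict) -> list:
    """Drift as a query: rows of (topic, field, wanted, found) for mismatches."""
    rows = []
    for name, want in sorted(declared.items()):
        have = actual.get(name)
        if have is None:
            continue  # a missing topic is the provisioner's job, not drift
        for field in ("partitions", "replication_factor", "retention_ms"):
            wanted, found = getattr(want, field), getattr(have, field)
            if wanted != found:
                rows.append((name, field, wanted, found))
    return rows


TOPIC = "onex.evt.router.routing_complete.v1"
contract_view = {TOPIC: TopicSpec(partitions=12, replication_factor=1, retention_ms=604_800_000)}
broker_view = {TOPIC: TopicSpec(partitions=6, replication_factor=1, retention_ms=604_800_000)}

print(create_missing(contract_view, broker_view))  # the provisioner does not notice
print(drift(contract_view, broker_view))
```

Here `create_missing` returns an empty list because the topic already exists, while `drift` surfaces the partition mismatch as a single row.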

A second strand of the post discloses OmniNode's routing discipline: every AI-agent task gets classified and routed to "the cheapest model that can actually do it, and the result gets checked before it counts as done." For most agent work (classification, code generation, refactoring, summarization), "a local model running on hardware I already own handles them fine. The expensive cloud models are a fallback for the hard cases, not the default." Disclosed week-of metrics from the OmniNode dashboard:

  • 75% of tokens never left the building (routed to four on-prem hosts at zero marginal cost).
  • $3.37 in cloud spend was avoided, compared to $2.43 actually spent. "At a larger scale, that ratio is the whole business case."
  • 1.3% of delegations escalated automatically to a stronger model when the local model couldn't meet the bar (output too short, missing citations, or hallucinated identifiers).
  • Every routing decision "produces a receipt: which model was chosen, how many tokens it took, what it cost, and whether the output passed its compliance checks"; "the receipt is the evidence. Neither lives in someone's head."

The architectural through-line connecting both strands is single-source-of-truth discipline: "the same discipline that keeps topic names from drifting — one canonical source, validated, with no second hidden copy — is what lets me hand work to the cheapest model without hoping it went well. The decision is a contract. The receipt is the evidence."
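
The routing discipline described above (cheapest model first, check the result, escalate on failure, record a receipt each time) can be sketched as a short loop. The post does not disclose the router or the receipt schema, so `Model`, `Receipt`, `passes_bar`, and `route` are hypothetical, the whitespace word count is a crude stand-in for tokens, and the length check covers only the "output too short" case.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float
    run: Callable[[str], str]


@dataclass(frozen=True)
class Receipt:
    model: str
    tokens: int
    cost: float
    passed: bool


def passes_bar(output: str, min_chars: int = 20) -> bool:
    """Toy quality check: only rejects output shorter than min_chars."""
    return len(output) >= min_chars


def route(task: str, ladder: list) -> tuple:
    """Try models in the given order (cheapest first), escalating on failure.

    Returns the last output produced and one receipt per attempt. If every
    model fails the check, the final receipt has passed=False.
    """
    receipts = []
    output = ""
    for model in ladder:
        output = model.run(task)
        tokens = len(output.split())  # crude proxy, not a real tokenizer
        ok = passes_bar(output)
        cost = tokens / 1000 * model.cost_per_1k_tokens
        receipts.append(Receipt(model.name, tokens, cost, ok))
        if ok:
            break
    return output, receipts


# Stub models for illustration; names and prices are invented.
local = Model("local-small", 0.0, lambda task: "too short")
cloud = Model("cloud-large", 3.0, lambda task: "A longer answer that clears the minimum length.")

answer, receipts = route("summarize the diff", [local, cloud])
for receipt in receipts:
    print(receipt)
```

The receipts list is the audit trail: the first entry records the failed local attempt at zero cost, the second records the escalated attempt that passed.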

Key takeaways

  1. Redis Streams → Kafka migration trigger was coordination, not throughput. "We outgrew Redis Streams not because of throughput, but because coordination itself became difficult." At 5 repos / single-developer-per-topic ownership, Redis Streams handled the message patterns fine. At 12 repos / 100+ event types with cross-repo subscribe/publish patterns, the team needed consumer groups, partition-level parallelism, durable replay semantics, topic introspection, and programmatic provisioning: the load-bearing capabilities the team looked to the Kafka model to provide together. Redpanda was chosen specifically because "Kafka-API compatibility in a single binary" means the broker "boots in a single container, fits comfortably into an 8 GB development profile, and uses the same compose file everywhere", so "the broker exists everywhere that code executes: local development, CI, dev containers, the homelab runtime". The architectural argument: "if the broker is operationally heavy, teams eventually stop running it locally. They fake the bus, mock topic creation, or maintain a second development path that doesn't actually validate topic identity."

  2. At Kafka-shaped scale across independent repos, the topic name IS the coordination surface. "In our architecture, the topic name was the only thing connecting one agent's output to another agent's input." This is the wiki's canonical disclosure of the topic name as a coordination surface: when a system scales from co-located producer/consumer pairs to distributed ones, the topic name takes on a load-bearing role it didn't have before, and naming drift becomes a silent failure mode.

  3. Silent wiring failure is the canonical bug class. onex.evt.router.routing-complete.v1 (producer) vs onex.evt.router.routing_complete.v1 (consumer): "both services started cleanly, both topic names were accepted, and nothing failed — yet the routing pipeline silently stopped working. The silence was the failure mode." Five disclosed sub-shapes: pluralization differences, underscores vs hyphens, version-suffix mismatches, renamed event segments, and "old topics left behind after refactors". "Every instance had the same shape, where both names were well-formed and both operations succeeded." See concepts/silent-wiring-failure for the named bug class.

  4. Contract.yaml owns the bus surface — the contract is the only reviewed location where wire-format topic names live. Per-node contract.yaml declares subscribe_topics: and publish_topics: lists; topic names follow a regex-validated shape onex.{kind}.{producer}.{event}.v{N}; a StrEnum backs the canonical registry. "The regex catches malformed names, while the enum catches names that are syntactically valid but not canonical." Critically: "there is no second operator-maintained registry, separate constant list hidden inside the runtime, or manually synchronized provisioning config. If a node wants the system to provision and validate a topic, it must put the topic name in its contract." See patterns/contract-driven-topic-provisioning for the pattern.

  5. One extractor, three independent call sites — the topology is the point. A single ContractTopicExtractor discovers approved packages, loads each contract.yaml, and returns the union of declared topics. "That extractor runs in 3 independent places": (1) Pre-deploy CI (creates declared topics against the broker; smoke tests fail before merge), (2) Runtime boot (provisioner reads the same extractor; "if the contracts are silent, the provisioner is silent"), (3) Post-boot validation (queries broker, compares against extractor output, re-invokes provisioning if a declared topic is missing). "With multiple independent passes, a topic name can only be wrong in the contract. While the processes and lifecycle stages may differ, there is only one parser." See patterns/single-extractor-multi-call-site.

  6. Provisioner scope is intentionally narrow — creation only, no reconciliation. "It creates missing topics." Provisioning is "async, best-effort, and non-blocking." If a topic exists, "the provisioner leaves it alone. It does not: reconcile partition counts, reconcile replication factors, reconcile retention policies, mutate existing topics." "That boundary is intentional. Creation is contract-owned; reconciliation is a different problem." This is a candid scope-boundary disclosure about configuration drift: the system catches naming drift but not partition / replication / retention drift — that's the next gap, sketched as a future "diff against the contract spec, decide which mismatches are auto-correctable, surface the rest as explicit drift" layer.

  7. Cheapest-capable model routing with auto-escalation on quality failure. Most agent work doesn't need a frontier model; "classification, code generation, refactoring, and summarization — a local model running on hardware I already own handles them fine." Every task gets classified and routed to the cheapest model that can do it; "the expensive cloud models are a fallback for the hard cases, not the default." Routing decisions produce receipts (model chosen, token count, cost, compliance check). When the local model can't meet the bar ("output is too short, missing citations, or hallucinated identifiers"), the task automatically escalates to a stronger model. Disclosed week-of metrics: 75% of tokens never left the building (routed to four on-prem hosts at zero marginal cost); $3.37 cloud spend avoided vs $2.43 actually spent; 1.3% of delegations escalated. See concepts/cheapest-capable-model-routing + patterns/auto-escalation-on-quality-failure.

  8. Architectural through-line: single source of truth, validated, with no second hidden copy. "The same discipline that keeps topic names from drifting — one canonical source, validated, with no second hidden copy — is what lets me hand work to the cheapest model without hoping it went well. The decision is a contract. The receipt is the evidence. Neither lives in someone's head." The post pairs the topic-naming-contract and the routing-receipt as two instances of the same principle: when distributed independently-developed components have to agree on something, make the agreement reviewable (in code, in a contract file, on a receipt) rather than tribal knowledge.

Operational numbers disclosed

  • Scale at migration trigger: "5 repositories to 12"; "event catalog surpassed 100 event types".
  • Routing economics (last 7 days): 75% of tokens routed to on-prem hosts (zero marginal cost); $3.37 avoided cloud spend versus $2.43 actually spent; 1.3% of delegations escalated to stronger model when local couldn't meet bar.
  • On-prem fleet: "four on-prem hosts" (model class, GPU type, and scheduler are not further characterised).
  • Development footprint: Redpanda "fits comfortably into an 8 GB development profile" in a single container.
  • Topic name shape: regex validates onex.{kind}.{producer}.{event}.v{N}.
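
As a worked reading of the routing-economics figures above, the two dollar amounts can be turned into a ratio and a share. The one assumption, not stated in the post, is that an all-cloud week would have cost the avoided amount plus the amount actually spent; `routing_economics` is a hypothetical helper name.

```python
def routing_economics(avoided: float, spent: float) -> dict:
    """Derive ratio and share from the two disclosed dollar figures."""
    baseline = avoided + spent  # assumed all-cloud cost for the same week
    return {
        "avoided_to_spent_ratio": round(avoided / spent, 2),
        "avoided_share_of_baseline": round(avoided / baseline, 3),
    }


figures = routing_economics(avoided=3.37, spent=2.43)
print(figures)  # {'avoided_to_spent_ratio': 1.39, 'avoided_share_of_baseline': 0.581}
```

Under that assumption roughly 58% of the hypothetical cloud bill was avoided. This is a share of dollars and is not directly comparable to the 75% figure, which counts tokens.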

Caveats

  • Tier-3 source, guest post: Redpanda blog publishing a customer's post; the Redpanda product is the framing, but the architecture content is OmniNode's. Tier-3 inclusion gate: the post passes scope on the contract-driven topic-naming discipline + extractor topology disclosure (substantive, reusable architecture content), not on the Redpanda angle per se.
  • No latency / throughput numbers: the post deliberately frames the migration trigger as coordination, not throughput, but also discloses no broker-side performance numbers — partition counts, replication factors, message rates, cluster topology are all undisclosed.
  • Single-binary / homelab deployment context: "the broker boots in a single container, fits comfortably into an 8 GB development profile." The contract topology is described in a context where Redpanda is intentionally a single binary running on a developer laptop / homelab; the post does not characterise whether the same discipline scales unchanged at a multi-broker / multi-AZ Kafka cluster footprint.
  • No reconciliation = drift is unsolved: explicitly named as the next gap. The system catches naming drift but not partition / replication / retention drift. Reconciliation is sketched (broker metadata diff vs contract spec) but unbuilt.
  • Routing receipt schema not disclosed: routing decisions produce receipts (model / tokens / cost / compliance), but the receipt's storage substrate, retention, and downstream-consumer shape (audit-only? feedback-loop input? billing input?) are not characterised.
  • Quality-bar mechanism only sketched: auto-escalation triggers on "output too short, missing citations, or hallucinated identifiers" — these are categorical examples, not a complete rubric. The bar's calibration / per-task-class threshold tuning / false-escalation rate are not disclosed.
  • No comparison to alternative migration targets: the post asserts Redpanda was chosen for "Kafka-API compatibility in a single binary" and lightweight dev-mode footprint, but does not compare against alternatives the team considered (Kafka itself, WarpStream, NATS JetStream) — the migration-target choice is asserted, not benchmarked.
  • Contract.yaml format / loader not disclosed: the post shows a YAML snippet (event_bus.subscribe_topics, event_bus.publish_topics) but does not disclose the schema or how ContractTopicExtractor enumerates "approved packages" — package allowlist, discovery mechanism, registry-of-registries shape are undisclosed.
