Redpanda — How OmniNode uses Redpanda to scale AI agent workflows¶
A Redpanda Blog guest post (2026-06-02) by Jonah Gray, founder and CEO of OmniNode, on the migration of OmniNode's multi-agent coordination bus from Redis Streams to Redpanda at the ~100 event types / 12 repositories scale point, and on the contract-driven topic-naming discipline the team adopted when the topic name became the load-bearing coordination surface across independently-developed agents. Tier-3 source (Redpanda blog, guest-post genre) included on architecture-content grounds: the post is structurally an engineering narrative — "we used Redis Streams, then scale changed the problem, then we moved to Redpanda, then naming became the bug surface, then we made the contract own the names" — with concrete code, a named bug class (silent wiring failure from hyphen vs underscore), a disclosed extractor-and-three-call-sites topology, and a candid scope boundary ("creation is contract-owned; reconciliation is a different problem" — drift is acknowledged but unsolved).
A second strand of the post discloses OmniNode's routing layer: every AI-agent task is classified and routed to "the cheapest model that can actually do it" with auto-escalation on quality failures; every routing decision produces a receipt (model chosen, tokens, cost, compliance check). Disclosed week-of-numbers: "75% of tokens never left the building" (routed to four on-prem hosts at zero marginal cost), "$3.37 in cloud spend was avoided, compared to $2.43 actually spent", "1.3% of delegations" escalated to a stronger model.
Summary¶
OmniNode is "a contract-driven runtime for AI agent orchestration"
that coordinates fleets of agents that build, test, review, and
deploy software across multiple repositories simultaneously. Every
unit of work is a node; every node ships a contract.yaml that
declares "exactly what the node consumes and produces, what events
it listens to, and what topics it publishes on the bus." The
runtime uses these contracts to provision infrastructure, route
tasks, validate completions, and enforce ordering across agents
"that would otherwise have no idea what each other is doing." The
event bus is "the spine of the whole system" — agents publish
events when work is done and subscribe to events that trigger their
next step; "each agent knows only its own inputs and outputs, while
the bus handles the rest."
OmniNode originally used Redis Streams behind a transport-layer
abstraction (publish via XADD, consume via XREADGROUP, topic
names kept "in Python constants near the code that used them").
Apache Kafka was "explicitly deferred in the roadmap because the
system was still small." Then "the system grew from 5 repositories
to 12, and the event catalog surpassed 100 event types." The
migration trigger was not throughput — it was coordination:
consumer groups, partition-level parallelism, durable replay
semantics, topic introspection, programmatic provisioning. Redpanda
gave them "Kafka-API compatibility in a single binary"; the
transport abstraction made the broker swap "straightforward."
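The reason the swap was cheap can be sketched in a few lines. This is a minimal illustration, not OmniNode's interface: the `Transport` protocol, `InMemoryTransport`, and `announce_done` are hypothetical names, and the in-memory backend stands in for a real broker client.

```python
from collections import defaultdict
from typing import Protocol


class Transport(Protocol):
    """Hypothetical bus interface; node code depends only on this."""

    def publish(self, topic: str, payload: bytes) -> None: ...

    def consume(self, topic: str, group: str) -> list[bytes]: ...


class InMemoryTransport:
    """Stand-in backend. A Redis Streams backend would map these calls to
    XADD / XREADGROUP; a Kafka-API backend would wrap a producer and consumer."""

    def __init__(self) -> None:
        self._log: dict[str, list[bytes]] = defaultdict(list)
        self._offsets: dict[tuple[str, str], int] = defaultdict(int)

    def publish(self, topic: str, payload: bytes) -> None:
        self._log[topic].append(payload)

    def consume(self, topic: str, group: str) -> list[bytes]:
        # Each consumer group keeps its own read position in the topic log.
        start = self._offsets[(topic, group)]
        messages = self._log[topic][start:]
        self._offsets[(topic, group)] = start + len(messages)
        return messages


def announce_done(bus: Transport, task_id: str) -> None:
    # Node code never names the broker, so changing backends is a wiring change.
    bus.publish("onex.evt.router.routing-complete.v1", task_id.encode())
```

Because `announce_done` only sees the protocol, replacing the backend does not touch node code. The topic string, however, travels through the abstraction unchanged, which is where the next problem appears.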
The unexpected challenge was topic identity. With Redis Streams,
"topics were usually owned end-to-end by a single developer" —
producer and consumer lived close together. With 100 Kafka-shaped
topics spread across independent repositories, "the topic name
became the critical coordination surface" — the only thing
connecting one agent's output to another agent's input. The
disclosed bug: a producer published to
onex.evt.router.routing-complete.v1; the consumer subscribed to
onex.evt.router.routing_complete.v1. "Both services started
cleanly, both topic names were accepted, and nothing failed — yet
the routing pipeline silently stopped working." Same shape kept
reappearing: pluralization differences, underscores versus
hyphens, version suffix mismatches, renamed event segments, old
topics left behind after refactors — "the silence was the failure
mode." The team realised this was "a naming problem before it was
a schema problem."
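The failure shape is easy to reproduce with a toy bus. The sketch below assumes a bus that accepts any well-formed topic name (for example, because topics are created on first use); `AutoCreateBus` is a hypothetical stand-in, not a real broker client.

```python
from collections import defaultdict


class AutoCreateBus:
    """Toy in-memory bus that accepts any topic name without complaint."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, event):
        self.topics[topic].append(event)

    def poll(self, topic):
        return list(self.topics[topic])


bus = AutoCreateBus()

# Producer uses a hyphen in the event segment.
bus.publish("onex.evt.router.routing-complete.v1", {"task": 42})

# Consumer uses an underscore. No exception is raised; it simply sees nothing.
received = bus.poll("onex.evt.router.routing_complete.v1")

print(received)            # []
print(sorted(bus.topics))  # two distinct topics now exist
```

Both calls succeed and both names are well-formed, so nothing in the bus itself can flag the mismatch. Any check has to compare names against a shared source.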
The fix: the contract owns the topic names. Every node ships a
contract.yaml with a subscribe_topics list and a
publish_topics list — "the contract owns the node's bus
surface." Topic names follow a regex-validated shape
onex.{kind}.{producer}.{event}.v{N}; "a regex validates the
structure, and a StrEnum backs the canonical registry. The regex
catches malformed names, while the enum catches names that are
syntactically valid but not canonical." Crucially, "the contract
becomes the only reviewed location where wire-format topic names
live. There is no second operator-maintained registry, separate
constant list hidden inside the runtime, or manually synchronized
provisioning config."
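A minimal sketch of the two-layer check follows. The post does not disclose the actual regex or enum members, so `TOPIC_RE`, `CanonicalTopic`, and `check_topic` are assumptions, as is the choice of the hyphenated spelling as the canonical one.

```python
import re

try:
    from enum import StrEnum  # Python 3.11+
except ImportError:  # older interpreters: equivalent str-backed enum
    from enum import Enum

    class StrEnum(str, Enum):
        pass


# Assumed shape for onex.{kind}.{producer}.{event}.v{N}; the real pattern is undisclosed.
TOPIC_RE = re.compile(r"onex\.[a-z]+\.[a-z0-9_-]+\.[a-z0-9_-]+\.v[1-9][0-9]*")


class CanonicalTopic(StrEnum):
    """Hypothetical registry with a single canonical entry."""

    ROUTING_COMPLETE = "onex.evt.router.routing-complete.v1"


def check_topic(name: str) -> str:
    if not TOPIC_RE.fullmatch(name):
        return "malformed"        # the regex layer rejects bad structure
    if name not in {topic.value for topic in CanonicalTopic}:
        return "not-canonical"    # the enum layer rejects valid-looking strangers
    return "ok"


print(check_topic("onex.evt.router.routing-complete.v1"))  # ok
print(check_topic("onex.evt.router.routing_complete.v1"))  # not-canonical
print(check_topic("onex.evt.router.routing-complete"))     # malformed
```

The underscore variant passes the regex and fails only at the enum, which is the case the regex alone cannot catch.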
The implementation enforces this discipline via a single
ContractTopicExtractor "that discovers approved packages, loads
each contract.yaml, and returns the union of declared topics" —
run in three independent places: (1) Pre-deploy CI
("creates the declared topics against the broker"; smoke tests
fail before the change merges), (2) Runtime boot ("there is
no fallback constant list… If the contracts are silent, the
provisioner is silent"), (3) Post-boot validation ("if a
declared topic is missing, say because topic creation partially
failed during startup, the validator re-invokes provisioning").
The slogan: "a topic name can only be wrong in the contract.
While the processes and lifecycle stages may differ, there is only
one parser. Ultimately, being disciplined matters more than the
infrastructure."
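The one-extractor, three-call-site topology can be sketched as follows. Only the `ContractTopicExtractor` name and the `event_bus.subscribe_topics` / `event_bus.publish_topics` keys come from the post; package discovery and YAML loading are not disclosed, so this sketch assumes contracts arrive as already-parsed dicts, models the broker as a plain set of topic names, and uses a hypothetical `route-task` topic.

```python
from collections.abc import Iterable, Mapping

Contract = Mapping[str, Mapping[str, list[str]]]


class ContractTopicExtractor:
    """Returns the union of topics declared across all contracts."""

    def __init__(self, contracts: Iterable[Contract]) -> None:
        self._contracts = list(contracts)

    def extract(self) -> frozenset[str]:
        topics: set[str] = set()
        for contract in self._contracts:
            bus = contract.get("event_bus", {})
            topics.update(bus.get("subscribe_topics", []))
            topics.update(bus.get("publish_topics", []))
        return frozenset(topics)


def ci_provision(extractor: ContractTopicExtractor, broker: set[str]) -> None:
    # (1) Pre-deploy CI: create the declared topics against the broker.
    broker.update(extractor.extract())


def boot_provision(extractor: ContractTopicExtractor, broker: set[str]) -> None:
    # (2) Runtime boot: no fallback list; silent contracts mean nothing is created.
    broker.update(extractor.extract())


def post_boot_validate(extractor: ContractTopicExtractor, broker: set[str]) -> frozenset[str]:
    # (3) Post-boot validation: find declared-but-missing topics and re-provision them.
    missing = extractor.extract() - broker
    broker.update(missing)
    return missing


EXAMPLE_CONTRACTS = [
    {"event_bus": {"subscribe_topics": ["onex.cmd.router.route-task.v1"],
                   "publish_topics": ["onex.evt.router.routing-complete.v1"]}},
    {"event_bus": {"subscribe_topics": ["onex.evt.router.routing-complete.v1"],
                   "publish_topics": []}},
]
```

All three functions read topic names through the same `extract()` call, so a wrong name can only originate in a contract, never in a call site.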
The provisioner has "a deliberately narrow scope: it creates missing topics." Provisioning is "async, best-effort, and non-blocking." If a topic exists, "the provisioner leaves it alone. It does not: reconcile partition counts, reconcile replication factors, reconcile retention policies, mutate existing topics." If a topic was originally created with 6 partitions and the contract later requests 12, "the provisioner will not notice." This boundary is intentional — "creation is contract-owned; reconciliation is a different problem." The post explicitly names configuration drift (partition count, replication factor, retention policy) as the next gap and sketches a future drift-detection-as-query layer ("the contract describes what should exist, while the broker reports what does exist. A materialized projection over both turns drift detection into a query instead of a script. That reconciliation layer is not built yet.").
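The scope boundary can be made concrete with a small sketch. It assumes the broker is a dict of topic name to partition count and ignores the async, best-effort aspects; `TopicSpec`, `provision_missing`, and `partition_drift` are hypothetical names, and the drift query illustrates the unbuilt reconciliation idea rather than anything OmniNode ships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicSpec:
    name: str
    partitions: int


def provision_missing(declared: list[TopicSpec], broker: dict[str, int]) -> list[str]:
    """Create declared topics that are absent; never touch existing ones."""
    created = []
    for spec in declared:
        if spec.name not in broker:
            broker[spec.name] = spec.partitions
            created.append(spec.name)
    return created


def partition_drift(declared: list[TopicSpec], broker: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Map each drifted topic to (declared, actual) partition counts."""
    return {
        spec.name: (spec.partitions, broker[spec.name])
        for spec in declared
        if spec.name in broker and broker[spec.name] != spec.partitions
    }


# The post's example: created with 6 partitions, contract later requests 12.
broker = {"onex.evt.router.routing-complete.v1": 6}
declared = [TopicSpec("onex.evt.router.routing-complete.v1", 12)]

print(provision_missing(declared, broker))  # [] because the topic already exists
print(partition_drift(declared, broker))    # {'onex.evt.router.routing-complete.v1': (12, 6)}
```

The provisioner reports nothing to do, while the separate query surfaces the mismatch. Keeping the two apart is what lets creation stay simple.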
A second strand of the post discloses OmniNode's routing discipline: every AI-agent task gets classified and routed to "the cheapest model that can actually do it, and the result gets checked before it counts as done." For most agent work (classification, code generation, refactoring, summarization), "a local model running on hardware I already own handles them fine. The expensive cloud models are a fallback for the hard cases, not the default." Disclosed week-of metrics from the OmniNode dashboard:
- 75% of tokens never left the building (routed to four on-prem hosts at zero marginal cost).
- $3.37 in cloud spend was avoided, compared to $2.43 actually spent. "At a larger scale, that ratio is the whole business case."
- 1.3% of delegations escalated automatically to a stronger model when the local model couldn't meet the bar (output too short, missing citations, or hallucinated identifiers).
- Every routing decision "produces a receipt: which model was chosen, how many tokens it took, what it cost, and whether the output passed its compliance checks". In the post's words, "the receipt is the evidence. Neither lives in someone's head."
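A minimal sketch of cheapest-first routing with escalation and receipts. The post does not disclose the classifier, the quality rubric, or the receipt schema, so everything here is assumed: the `Model` and `Receipt` shapes, the prices, the whitespace token count, and the `long_enough` check. The sketch also skips the classification step and simply tries models in cost order.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_1k_tokens: float  # 0.0 models an on-prem host with no marginal cost
    run: Callable[[str], str]


@dataclass(frozen=True)
class Receipt:
    model: str
    tokens: int
    cost_usd: float
    passed: bool


def long_enough(output: str) -> bool:
    # Stand-in quality bar: the post names "output too short" as one trigger.
    return len(output.split()) >= 5


def route(task: str, models: list[Model],
          passes: Callable[[str], bool] = long_enough) -> tuple[Optional[str], list[Receipt]]:
    """Try models from cheapest to most expensive; record a receipt per attempt."""
    receipts: list[Receipt] = []
    for model in sorted(models, key=lambda m: m.usd_per_1k_tokens):
        output = model.run(task)
        tokens = len(output.split())  # crude whitespace token count
        ok = passes(output)
        receipts.append(Receipt(model.name, tokens, tokens / 1000 * model.usd_per_1k_tokens, ok))
        if ok:
            return output, receipts
    return None, receipts  # every model failed the bar


LOCAL = Model("local-small", 0.0, lambda task: "done")
CLOUD = Model("cloud-large", 0.03, lambda task: f"Refactored {task} into three smaller functions")

output, receipts = route("parse_args", [CLOUD, LOCAL])
for receipt in receipts:
    print(receipt)
```

Here the local model's one-word answer fails the bar, so the task escalates and the receipt list records both attempts, including the failed one.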
The architectural through-line connecting both strands is single-source-of-truth discipline: "the same discipline that keeps topic names from drifting — one canonical source, validated, with no second hidden copy — is what lets me hand work to the cheapest model without hoping it went well. The decision is a contract. The receipt is the evidence."
Key takeaways¶
- Redis Streams → Kafka migration trigger was coordination, not throughput. "We outgrew Redis Streams not because of throughput, but because coordination itself became difficult." At 5 repos with single-developer-per-topic ownership, Redis Streams handled the message patterns fine. At 12 repos and 100+ event types with cross-repo subscribe/publish patterns, the team needed consumer groups, partition-level parallelism, durable replay semantics, topic introspection, and programmatic provisioning as a package. (Redis Streams does offer consumer groups via XREADGROUP, so the gap is the combination the Kafka model provides, not every item on the list.) Redpanda was chosen specifically for "Kafka-API compatibility in a single binary": the broker "boots in a single container, fits comfortably into an 8 GB development profile, and uses the same compose file everywhere", so "the broker exists everywhere that code executes: local development, CI, dev containers, the homelab runtime". The architectural argument: "if the broker is operationally heavy, teams eventually stop running it locally. They fake the bus, mock topic creation, or maintain a second development path that doesn't actually validate topic identity."
- At Kafka-shaped scale across independent repos, the topic name IS the coordination surface. "In our architecture, the topic name was the only thing connecting one agent's output to another agent's input." This is the canonical topic-name-as-coordination-surface disclosure on the wiki: when a system scales from co-located producer/consumer to distributed producer/consumer pairs, the topic name takes on a load-bearing role it didn't have before, and naming drift becomes a silent failure mode.
- Silent wiring failure is the canonical bug class. onex.evt.router.routing-complete.v1 (producer) vs onex.evt.router.routing_complete.v1 (consumer): "both services started cleanly, both topic names were accepted, and nothing failed", yet "the routing pipeline silently stopped working." The silence was the failure mode. Five disclosed sub-shapes: pluralization differences, underscores vs hyphens, version-suffix mismatches, renamed event segments, and "old topics left behind after refactors". "Every instance had the same shape, where both names were well-formed and both operations succeeded." See concepts/silent-wiring-failure for the named bug class.
- Contract.yaml owns the bus surface: the contract is the only reviewed location where wire-format topic names live. Per-node contract.yaml declares subscribe_topics: and publish_topics: lists; topic names follow a regex-validated shape onex.{kind}.{producer}.{event}.v{N}; a StrEnum backs the canonical registry. "The regex catches malformed names, while the enum catches names that are syntactically valid but not canonical." Critically: "there is no second operator-maintained registry, separate constant list hidden inside the runtime, or manually synchronized provisioning config. If a node wants the system to provision and validate a topic, it must put the topic name in its contract." See patterns/contract-driven-topic-provisioning for the pattern.
- One extractor, three independent call sites: the topology is the point. A single ContractTopicExtractor discovers approved packages, loads each contract.yaml, and returns the union of declared topics. "That extractor runs in 3 independent places": (1) Pre-deploy CI (creates declared topics against the broker; smoke tests fail before merge), (2) Runtime boot (provisioner reads the same extractor; "if the contracts are silent, the provisioner is silent"), (3) Post-boot validation (queries broker, compares against extractor output, re-invokes provisioning if a declared topic is missing). "With multiple independent passes, a topic name can only be wrong in the contract. While the processes and lifecycle stages may differ, there is only one parser." See patterns/single-extractor-multi-call-site.
- Provisioner scope is intentionally narrow: creation only, no reconciliation. "It creates missing topics." Provisioning is "async, best-effort, and non-blocking." If a topic exists, "the provisioner leaves it alone. It does not: reconcile partition counts, reconcile replication factors, reconcile retention policies, mutate existing topics." "That boundary is intentional. Creation is contract-owned; reconciliation is a different problem." This is a candid scope-boundary disclosure about configuration drift: the system catches naming drift but not partition / replication / retention drift. That is the next gap, sketched as a future "diff against the contract spec, decide which mismatches are auto-correctable, surface the rest as explicit drift" layer.
- Cheapest-capable model routing with auto-escalation on quality failure. Most agent work doesn't need a frontier model: for classification, code generation, refactoring, and summarization, "a local model running on hardware I already own handles them fine." Every task gets classified and routed to the cheapest model that can do it; "the expensive cloud models are a fallback for the hard cases, not the default." Routing decisions produce receipts (model chosen, token count, cost, compliance check). When the local model can't meet the bar ("output is too short, missing citations, or hallucinated identifiers"), the task automatically escalates to a stronger model. Disclosed week-of metrics: 75% of tokens never left the building (routed to four on-prem hosts at zero marginal cost); $3.37 cloud spend avoided vs $2.43 actually spent; 1.3% of delegations escalated. See concepts/cheapest-capable-model-routing + patterns/auto-escalation-on-quality-failure.
- Architectural through-line: single source of truth, validated, with no second hidden copy. The post says the discipline "that keeps topic names from drifting", which it glosses as "one canonical source, validated, with no second hidden copy", "is what lets me hand work to the cheapest model without hoping it went well. The decision is a contract. The receipt is the evidence. Neither lives in someone's head." The post pairs the topic-naming contract and the routing receipt as two instances of the same principle: when independently developed components have to agree on something, make the agreement reviewable (in code, in a contract file, on a receipt) rather than leaving it as tribal knowledge.
Operational numbers disclosed¶
- Scale at migration trigger: "5 repositories to 12"; "event catalog surpassed 100 event types".
- Routing economics (last 7 days): 75% of tokens routed to on-prem hosts (zero marginal cost); $3.37 avoided cloud spend versus $2.43 actually spent; 1.3% of delegations escalated to stronger model when local couldn't meet bar.
- On-prem fleet: "four on-prem hosts" (not further characterised: model class, GPU type, and scheduler are all undisclosed).
- Development footprint: Redpanda "fits comfortably into an 8 GB development profile" in a single container.
- Topic name shape: regex validates onex.{kind}.{producer}.{event}.v{N}.
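The routing-economics figures can be turned into the ratio the post alludes to. This is derived arithmetic, not a number from the post, and it rests on one assumption: that "avoided" spend is what the locally routed work would have cost in the cloud, so an all-cloud week would have cost the sum of the two figures.

```python
avoided = 3.37  # USD of cloud spend avoided (from the post)
spent = 2.43    # USD of cloud spend actually incurred (from the post)

# Assumed counterfactual: everything routed to the cloud.
counterfactual = avoided + spent
avoided_share = avoided / counterfactual
avoided_per_dollar_spent = avoided / spent

print(f"counterfactual all-cloud spend: ${counterfactual:.2f}")      # $5.80
print(f"share of that spend avoided: {avoided_share:.1%}")           # 58.1%
print(f"avoided per dollar spent: {avoided_per_dollar_spent:.2f}")   # 1.39
```

At these absolute values the saving is a few dollars; the point of the ratio is that it would hold at larger volumes if the routing mix stays the same, which the post does not demonstrate.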
Caveats¶
- Tier-3 source, guest post: Redpanda blog publishing a customer's post; the Redpanda product is the framing, but the architecture content is OmniNode's. Tier-3 inclusion gate: the post passes scope on the contract-driven topic-naming discipline + extractor topology disclosure (substantive, reusable architecture content), not on the Redpanda angle per se.
- No latency / throughput numbers: the post deliberately frames the migration trigger as coordination, not throughput, but also discloses no broker-side performance numbers — partition counts, replication factors, message rates, cluster topology are all undisclosed.
- Single-binary / homelab deployment context: "the broker boots in a single container, fits comfortably into an 8 GB development profile." The contract topology is described in a context where Redpanda is intentionally a single binary running on a developer laptop / homelab; the post does not characterise whether the same discipline scales unchanged at a multi-broker / multi-AZ Kafka cluster footprint.
- No reconciliation = drift is unsolved: explicitly named as the next gap. The system catches naming drift but not partition / replication / retention drift. Reconciliation is sketched (broker metadata diff vs contract spec) but unbuilt.
- Routing receipt schema not disclosed: routing decisions produce receipts (model / tokens / cost / compliance), but the receipt's storage substrate, retention, and downstream-consumer shape (audit-only? feedback-loop input? billing input?) are not characterised.
- Quality-bar mechanism only sketched: auto-escalation triggers on "output too short, missing citations, or hallucinated identifiers" — these are categorical examples, not a complete rubric. The bar's calibration / per-task-class threshold tuning / false-escalation rate are not disclosed.
- No comparison to alternative migration targets: the post asserts Redpanda was chosen for "Kafka-API compatibility in a single binary" and lightweight dev-mode footprint, but does not compare against alternatives the team considered (Kafka itself, WarpStream, NATS JetStream) — the migration-target choice is asserted, not benchmarked.
- Contract.yaml format / loader not disclosed: the post shows a YAML snippet (event_bus.subscribe_topics, event_bus.publish_topics) but does not disclose the schema or how ContractTopicExtractor enumerates "approved packages"; the package allowlist, discovery mechanism, and registry-of-registries shape are undisclosed.
Source¶
- Original: https://www.redpanda.com/blog/omninode-scale-ai-agent-workflows
- Raw markdown: raw/redpanda/2026-06-02-how-omninode-uses-redpanda-to-scale-ai-agent-workflows-39c31bb4.md
Related¶
- systems/omninode — the contract-driven AI agent runtime this post canonicalises.
- systems/redpanda — destination broker; the Kafka-API single-binary affordability is the load-bearing property the topic-identity discipline depends on.
- systems/redis-streams — the originally-deployed messaging primitive, retired at the 100-topic / 12-repo scale point.
- concepts/topic-name-as-coordination-surface — when topic names become the only thing connecting independently-developed agent producers and consumers.
- concepts/silent-wiring-failure — the named bug class (hyphen vs underscore, plural drift, version-suffix mismatch).
- concepts/contract-yaml-as-bus-surface — per-node manifest declaring subscribe / publish topics.
- concepts/regex-plus-enum-validation — regex catches malformed, enum catches non-canonical.
- concepts/cheapest-capable-model-routing — per-task routing to the cheapest model that can do it.
- concepts/routing-receipt — every routing decision produces an audit-grade record.
- patterns/contract-driven-topic-provisioning — contract.yaml is the single source of truth for topic provisioning across CI, runtime boot, and post-boot validation.
- patterns/single-extractor-multi-call-site — one parser, many consumers; correctness via topology.
- patterns/auto-escalation-on-quality-failure — local-model output checked against quality bar; escalates on failure.