SYSTEM Cited by 1 source

OmniNode

OmniNode is "a contract-driven runtime for AI agent orchestration" (Jonah Gray, founder/CEO) that coordinates fleets of agents that build, test, review, and deploy software across multiple repositories simultaneously. Every unit of work in the system is a node; every node ships a contract.yaml that declares "exactly what the node consumes and produces, what events it listens to, and what topics it publishes on the bus." The runtime uses these contracts to provision infrastructure, route tasks, validate completions, and enforce ordering across agents "that would otherwise have no idea what each other is doing."

The system is canonical on the wiki for two architectural disciplines distilled from the 2026-06-02 Redpanda guest post:

  1. Contract-driven topic naming — the contract.yaml is the only reviewed location where wire-format topic names live. Names follow a regex-validated shape onex.{kind}.{producer}.{event}.v{N}, with a StrEnum-backed canonical registry. A single ContractTopicExtractor parses the contracts and runs in three independent places (CI, runtime boot, post-boot validation) — see patterns/contract-driven-topic-provisioning + patterns/single-extractor-multi-call-site.
  2. Cheapest-capable model routing with auto-escalation — every AI-agent task is classified and routed to "the cheapest model that can actually do it"; quality failure auto-escalates to a stronger model; every routing decision produces a receipt (model / tokens / cost / compliance check). See concepts/cheapest-capable-model-routing + concepts/routing-receipt + patterns/auto-escalation-on-quality-failure.

Origin

The post discloses a candid founder origin: 16 years as an iOS developer (Objective-C through Swift) ended when neuropathy made 8–10 hours of precise keyboard work per day medically impossible. AI code generation solved code-writing but "writing code only makes up about a third of what software engineering actually involves" — PR review, log tracing, ticket-acceptance verification, rename-tracking still required keyboard time. Building agents to automate the rest exposed a coordination problem: "all of the agents I'd built were independently fast but chaotic. They would step on each other's changes, duplicate work, claim incomplete tasks were finished, or merge code that broke another agent's in-flight branch." OmniNode is the system that grew to coordinate those agents.

Architecture

The node + contract model

Every unit of work is a node. Every node ships a contract.yaml that owns the node's bus surface:

event_bus:
  subscribe_topics:
    - "onex.cmd.router.route-request.v1"
    - "onex.evt.router.scoring-decision.v1"
  publish_topics:
    - "onex.evt.router.routing-complete.v1"
    - "onex.evt.router.routing-failed.v1"

The runtime uses these contracts to:

  • Provision infrastructure (broker topics) — see patterns/contract-driven-topic-provisioning.
  • Route tasks — agents publish events when work is done and subscribe to events that trigger their next step.
  • Validate completions — pre/post compliance checks via the routing receipt system.
  • Enforce ordering — agents subscribe in the order their workflow requires; the bus serialises through topic offsets.

Topic-name shape is validated by regex + enum (concepts/regex-plus-enum-validation): onex.{kind}.{producer}.{event}.v{N}. The regex catches malformed names; a StrEnum-backed canonical registry catches names that are "syntactically valid but not canonical" (hyphen-vs-underscore drift, pluralisation drift, renamed event segments). "The contract becomes the only reviewed location where wire-format topic names live. There is no second operator-maintained registry, separate constant list hidden inside the runtime, or manually synchronized provisioning config. If a node wants the system to provision and validate a topic, it must put the topic name in its contract."
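
The two-layer check described above can be sketched roughly as follows. This is a minimal illustration, not OmniNode's code: the exact regex, the set of allowed kinds (only cmd and evt appear in the post), the class name CanonicalTopic, and the function check_topic are all assumptions. The enum uses a str mixin so the sketch runs on older interpreters; a StrEnum on Python 3.11+ would fill the same role.

```python
import re
from enum import Enum

# Assumed pattern for onex.{kind}.{producer}.{event}.v{N}. Only "cmd" and "evt"
# appear in the post, so the kind alternation is a guess. Hyphens and
# underscores are both allowed here so that separator drift is syntactically
# valid and has to be caught by the registry instead.
TOPIC_RE = re.compile(
    r"^onex\.(cmd|evt)\.[a-z][a-z0-9_-]*\.[a-z][a-z0-9_-]*\.v[1-9][0-9]*$"
)


class CanonicalTopic(str, Enum):
    """Hypothetical registry holding the four topics from the example contract."""

    ROUTE_REQUEST = "onex.cmd.router.route-request.v1"
    SCORING_DECISION = "onex.evt.router.scoring-decision.v1"
    ROUTING_COMPLETE = "onex.evt.router.routing-complete.v1"
    ROUTING_FAILED = "onex.evt.router.routing-failed.v1"


CANONICAL = {topic.value for topic in CanonicalTopic}


def check_topic(name: str) -> str:
    """Return "ok", "malformed" (regex failure), or "not-canonical" (registry miss)."""
    if not TOPIC_RE.match(name):
        return "malformed"
    if name not in CANONICAL:
        return "not-canonical"
    return "ok"
```

In this sketch, onex.evt.router.routing_complete.v1 passes the regex but is rejected by the registry, which is the hyphen-vs-underscore drift case the post mentions; a name with no version suffix fails at the regex stage.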

The event bus

The post: "The event bus is the spine of the whole system. Instead of agents calling each other directly, they publish events when work is done and subscribe to events that trigger their next step. Each agent knows only its own inputs and outputs, while the bus handles the rest." Disclosed example workflow:

A code review agent finishes reviewing a pull request and publishes a review-complete event → a merge agent is subscribed to that topic and picks it up → a CI watcher is subscribed to merge events and starts monitoring the pipeline.
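
The workflow above can be mimicked with a toy in-memory bus. This is only a sketch of the publish/subscribe shape: InMemoryBus, the two topic names, and the handler functions are hypothetical, and a real deployment would use the broker's consumer groups rather than a local queue.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Handler = Callable[[dict], None]


class InMemoryBus:
    """Toy stand-in for the broker: events are delivered in publish order."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Handler]] = defaultdict(list)
        self._queue: Deque[Tuple[str, dict]] = deque()

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        self._queue.append((topic, event))

    def run(self) -> None:
        # Drain the queue, including follow-up events published by handlers.
        while self._queue:
            topic, event = self._queue.popleft()
            for handler in self._subs[topic]:
                handler(event)


# Hypothetical topic names in the onex.{kind}.{producer}.{event}.v{N} shape.
REVIEW_COMPLETE = "onex.evt.reviewer.review-complete.v1"
MERGE_COMPLETE = "onex.evt.merger.merge-complete.v1"

bus = InMemoryBus()
log: List[str] = []


def merge_agent(event: dict) -> None:
    log.append(f"merge PR {event['pr']}")
    bus.publish(MERGE_COMPLETE, {"pr": event["pr"]})


def ci_watcher(event: dict) -> None:
    log.append(f"watch CI for PR {event['pr']}")


bus.subscribe(REVIEW_COMPLETE, merge_agent)
bus.subscribe(MERGE_COMPLETE, ci_watcher)

# The review agent only publishes; it never calls the merge agent directly.
bus.publish(REVIEW_COMPLETE, {"pr": 42})
bus.run()
```

Each handler knows only the topic it consumes and the topic it produces, so adding a further downstream agent means adding a subscription, not editing the review or merge agent.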

The bus originally ran on Redis Streams (systems/redis-streams) behind a transport-layer abstraction. The post says the system "published with XADD, consumed with XREADGROUP, and kept topic names in Python constants near the code that used them. Apache Kafka was explicitly deferred in the roadmap because the system was still small."

The bus migrated to Redpanda as the system grew from 5 to 12 repos and passed 100 event types. The migration trigger was coordination, not throughput: the system needed consumer groups, partition-level parallelism, durable replay semantics, topic introspection, and programmatic provisioning. Redpanda was chosen specifically for "Kafka-API compatibility in a single binary": the broker boots in a single container, "fits comfortably into an 8 GB development profile", and uses the same compose file everywhere, so local development, CI, dev containers, and the homelab runtime all run the real broker. The architectural argument: "if the broker is operationally heavy, teams eventually stop running it locally. They fake the bus, mock topic creation, or maintain a second development path that doesn't actually validate topic identity."

The ContractTopicExtractor

A single extractor "discovers approved packages, loads each contract.yaml, and returns the union of declared topics." That extractor runs in three independent places — see patterns/single-extractor-multi-call-site:

  1. Pre-deploy CI — invokes the extractor and creates declared topics against the broker. Smoke tests fail before the change merges if a node adds a topic but forgets to declare it. "This is the earliest enforcement point: the contracts are validated against a real broker before the runtime ever starts."
  2. Runtime boot — the provisioner invokes the same extractor. "There is no fallback constant list or duplicated topic configuration. If the contracts are silent, the provisioner is silent."
  3. Post-boot validation — a startup validator queries the broker, compares against extractor output, and re-invokes provisioning if a declared topic is missing. "This is the cheapest recovery path in the system."

The slogan: "With multiple independent passes, a topic name can only be wrong in the contract. While the processes and lifecycle stages may differ, there is only one parser. Ultimately, being disciplined matters more than the infrastructure."
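
The one-parser, three-call-sites arrangement might look something like the sketch below. It works on contracts already parsed into dicts and uses a plain set in place of the broker; every function name here is hypothetical, and the way CI learns which topics the code actually uses (topics_used_by_code) is an assumption, since the post does not describe it.

```python
from typing import Dict, Iterable, List, Set

Contract = Dict[str, dict]  # one contract.yaml, already parsed into a dict


def extract_topics(contracts: Iterable[Contract]) -> Set[str]:
    """The single parser: union of all declared subscribe and publish topics."""
    topics: Set[str] = set()
    for contract in contracts:
        bus = contract.get("event_bus", {})
        topics.update(bus.get("subscribe_topics", []))
        topics.update(bus.get("publish_topics", []))
    return topics


def create_missing(declared: Set[str], broker_topics: Set[str]) -> List[str]:
    """Create declared topics the broker lacks and return what was created."""
    missing = sorted(declared - broker_topics)
    broker_topics.update(missing)  # stands in for an admin-client create call
    return missing


def ci_smoke_test(contracts, broker_topics, topics_used_by_code) -> List[str]:
    """Call site 1: provision from contracts, then report undeclared topics."""
    declared = extract_topics(contracts)
    create_missing(declared, broker_topics)
    return sorted(set(topics_used_by_code) - declared)


def boot_provision(contracts, broker_topics) -> List[str]:
    """Call site 2: runtime boot provisions from the same extractor output."""
    return create_missing(extract_topics(contracts), broker_topics)


def post_boot_validate(contracts, broker_topics) -> List[str]:
    """Call site 3: compare broker state with the contracts and repair gaps."""
    missing = extract_topics(contracts) - broker_topics
    return boot_provision(contracts, broker_topics) if missing else []


ROUTER_CONTRACT: Contract = {
    "event_bus": {
        "subscribe_topics": [
            "onex.cmd.router.route-request.v1",
            "onex.evt.router.scoring-decision.v1",
        ],
        "publish_topics": [
            "onex.evt.router.routing-complete.v1",
            "onex.evt.router.routing-failed.v1",
        ],
    }
}

ci_broker: Set[str] = set()
undeclared = ci_smoke_test(
    [ROUTER_CONTRACT],
    ci_broker,
    ["onex.evt.router.routing-complete.v1", "onex.evt.router.routing-retried.v1"],
)

runtime_broker: Set[str] = set()
created = boot_provision([ROUTER_CONTRACT], runtime_broker)
runtime_broker.discard("onex.evt.router.routing-failed.v1")  # simulate a lost topic
repaired = post_boot_validate([ROUTER_CONTRACT], runtime_broker)
```

All three entry points call extract_topics and nothing else, so none of them can disagree about which names exist; a wrong name can only come from the contract itself.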

Provisioner scope

The runtime provisioner has "a deliberately narrow scope: it creates missing topics."

new_topic = NewTopic(
    name=spec.suffix,
    num_partitions=partitions,
    replication_factor=spec.replication_factor,
)
await admin.create_topics([new_topic])

Provisioning is "async, best-effort, and non-blocking." If a topic exists, the provisioner "leaves it alone." It does not:

  • reconcile partition counts
  • reconcile replication factors
  • reconcile retention policies
  • mutate existing topics
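
The create-only boundary can be illustrated with a small model in which a dict stands in for the broker. TopicSpec, provision_missing, and find_drift are hypothetical names, and the partition and replication values are made up. The second function is a read-only comparison included only to show that noticing a mismatch is a separate concern from creating topics; the post does not describe any such function as built.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TopicSpec:
    partitions: int
    replication_factor: int


def provision_missing(
    declared: Dict[str, TopicSpec], broker: Dict[str, TopicSpec]
) -> List[str]:
    """Create-only: add topics the broker lacks, never touch existing ones."""
    created = []
    for name, spec in declared.items():
        if name not in broker:
            broker[name] = spec
            created.append(name)
    return created


def find_drift(
    declared: Dict[str, TopicSpec], broker: Dict[str, TopicSpec]
) -> List[Tuple[str, str, int, int]]:
    """Read-only report of (topic, field, declared value, actual value)."""
    rows = []
    for name, want in declared.items():
        have = broker.get(name)
        if have is None:
            continue  # a missing topic is a provisioning gap, not drift
        if want.partitions != have.partitions:
            rows.append((name, "partitions", want.partitions, have.partitions))
        if want.replication_factor != have.replication_factor:
            rows.append(
                (name, "replication_factor",
                 want.replication_factor, have.replication_factor)
            )
    return rows


COMPLETE = "onex.evt.router.routing-complete.v1"
FAILED = "onex.evt.router.routing-failed.v1"

# The broker already holds one topic with 6 partitions; the contract now asks for 12.
broker = {COMPLETE: TopicSpec(partitions=6, replication_factor=1)}
declared = {
    COMPLETE: TopicSpec(partitions=12, replication_factor=1),
    FAILED: TopicSpec(partitions=12, replication_factor=1),
}

created = provision_missing(declared, broker)
drift = find_drift(declared, broker)
```

After the run, only the missing topic has been created and the existing topic still has 6 partitions; the mismatch shows up in the drift report but nothing acts on it.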

"If a topic was originally created with 6 partitions and the contract later requests 12, the provisioner will not notice. That boundary is intentional. Creation is contract-owned; reconciliation is a different problem." The post explicitly names partition count drift, replication-factor drift, and retention-policy drift as the next gap, sketching a future drift-detection-as-query layer: "the contract describes what should exist, while the broker reports what does exist. A materialized projection over both turns drift detection into a query instead of a script. That reconciliation layer is not built yet."

Routing layer (cheapest-capable model selection)

Every AI-agent task is classified and routed to "the cheapest model that can actually do it, and the result gets checked before it counts as done."

  • Local-first: classification, code generation, refactoring, and summarisation are handled by a local model on hardware the team already owns ("four on-prem hosts at zero marginal cost").
  • Cloud as fallback: "the expensive cloud models are a fallback for the hard cases, not the default."
  • Auto-escalation on quality failure: when the local model can't meet the bar ("output is too short, missing citations, or hallucinated identifiers"), the task automatically escalates to a stronger model. See patterns/auto-escalation-on-quality-failure.
  • Receipt per decision: every routing decision produces a receipt — "which model was chosen, how many tokens it took, what it cost, and whether the output passed its compliance checks." See concepts/routing-receipt.
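
A rough sketch of cheapest-first routing with escalation and receipts follows. Everything in it is illustrative: the Model and Receipt shapes, the per-token prices, the word-count token proxy, and the quality check (which covers only the "too short" case) are assumptions, as is the choice to emit one receipt per attempt.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float  # hypothetical pricing; 0.0 models on-prem hardware
    run: Callable[[str], str]


@dataclass(frozen=True)
class Receipt:
    model: str
    tokens: int
    cost: float
    passed: bool


def passes_quality_bar(output: str, min_words: int = 5) -> bool:
    """Placeholder check for the "too short" failure only."""
    return len(output.split()) >= min_words


def route(task: str, tiers: List[Model]) -> Tuple[str, List[Receipt]]:
    """Try models cheapest-first; escalate whenever the output fails the check."""
    receipts: List[Receipt] = []
    output = ""
    for model in sorted(tiers, key=lambda m: m.cost_per_1k_tokens):
        output = model.run(task)
        tokens = len(task.split()) + len(output.split())  # crude token proxy
        passed = passes_quality_bar(output)
        cost = tokens / 1000 * model.cost_per_1k_tokens
        receipts.append(Receipt(model.name, tokens, cost, passed))
        if passed:
            break
    return output, receipts


# Stub models: the local one answers too briefly, forcing an escalation.
local = Model("local-small", 0.0, lambda task: "ok")
cloud = Model(
    "cloud-large",
    0.01,
    lambda task: "Renamed the handler and updated all three call sites.",
)

answer, receipts = route("rename handler", [cloud, local])
```

Here the local model is tried first regardless of list order, fails the length check, and the task escalates to the cloud model; both attempts leave a receipt, so the escalation is visible after the fact.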

Disclosed week-of metrics (last 7 days, OmniNode dashboard):

  • 75% of tokens never left the building (on-prem-routed).
  • $3.37 cloud spend avoided vs $2.43 actually spent.
  • 1.3% of delegations escalated to a stronger model.

The architectural through-line: "the same discipline that keeps topic names from drifting — one canonical source, validated, with no second hidden copy — is what lets me hand work to the cheapest model without hoping it went well. The decision is a contract. The receipt is the evidence. Neither lives in someone's head."

Disclosed agent fleet

The post describes, by task, the agents the founder built before OmniNode existed as a system:

  • Open pull requests
  • Run the test suite and make sense of the output
  • Review the code written by the first agent and flag potential issues
  • Verify that a ticket's acceptance criteria were actually met

Plus the post's downstream example:

  • Code review agent (publishes review-complete)
  • Merge agent (subscribes to review-complete)
  • CI watcher (subscribes to merge events, monitors pipeline)

The set is illustrative, not exhaustive — "soon I had dozens of agents."

Operational scale

  • Repositories: 5 (initial) → 12 (current).
  • Event types: "surpassed 100" — the inflection point that triggered the Redis Streams → Redpanda migration.
  • On-prem fleet: four hosts (model class / GPU type / scheduler not disclosed).
  • Broker footprint: single Redpanda binary in single container, "fits comfortably into an 8 GB development profile" — same compose file in local dev / CI / dev containers / homelab runtime.

Caveats

  • The post is a Redpanda-blog guest post; the architectural disclosure is OmniNode's, but the framing emphasises Redpanda's role.
  • Broker-side scale numbers (partition counts per topic, message rates, replication factor, multi-AZ topology) are undisclosed.
  • The contract.yaml schema beyond the event_bus.subscribe_topics / event_bus.publish_topics block is undisclosed; the post shows the bus surface only.
  • The ContractTopicExtractor's package-discovery / allowlist mechanism is undisclosed — what makes a package "approved" is not characterised.
  • Reconciliation is acknowledged as the next gap and unbuilt.
  • The routing receipt's storage / retention / downstream-consumer shape is undisclosed.
  • The auto-escalation quality bar is sketched ("too short, missing citations, hallucinated identifiers"), but no full rubric is given.

Seen in

  • sources/2026-06-02-redpanda-how-omninode-uses-redpanda-to-scale-ai-agent-workflows (2026-06-02, Redpanda Blog guest post by Jonah Gray) — canonical disclosure source for OmniNode on the wiki. Provides: founder-origin framing (neuropathy → AI code generation → agent proliferation → coordination problem → OmniNode); Redis Streams → Redpanda migration with scale, not throughput, as the trigger; contract-driven topic naming with regex+enum validation; one-extractor-three-call-sites topology (CI / runtime boot / post-boot validation); narrow-scope provisioner with explicit no-reconciliation boundary; cheapest-capable model routing with routing receipts; concrete week-of metrics (75% on-prem, 1.3% escalation, $3.37 vs $2.43).