
SYSTEM Cited by 24 sources

Redpanda

Redpanda is a ground-up C++ rewrite of a Kafka-API-compatible streaming broker, built on the thread-per-core Seastar framework with Raft-based replication. Because Redpanda implements Kafka's wire protocol, every Kafka client (Java KafkaProducer, librdkafka, kafka-python, etc.) interacts with Redpanda identically to Apache Kafka — including the producer-side batching, partitioning, and acknowledgment semantics.


Architecture (stub — expand)

  • Thread-per-core runtime. Built on Seastar; each CPU core owns a shard of partitions and runs a single thread with cooperative task scheduling. No shared state across cores.
  • Raft replication. Per-partition Raft groups replace Kafka's ISR-based replication model. Leader election is bounded by Raft election timeouts rather than by a ZooKeeper/KRaft control plane.
  • Kafka wire protocol. Full client compatibility. Producer semantics — linger.ms, batch.size, buffer.memory, max.in.flight.requests.per.connection, sticky partitioner, acks=0/1/all — match Kafka client-side behaviour identically.
  • Tiered storage. Offload historical segments to object stores (S3, GCS). Stub — deferred to future source ingests.
  • Iceberg topics (2024; GA 25.1, 2025-04-07, multi-cloud on AWS/Azure/GCP). Topic-level integration with Apache Iceberg — a single logical entity is both a Kafka-protocol topic and an Iceberg table. See systems/redpanda-iceberg-topics for the substantive entry; concepts/iceberg-topic for the concept.

Redpanda 25.3 (2025-11, preview)

The 25.3 release preview post (2025-11-06) introduces four headline features across three architectural axes:

(Source: sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more)

Iceberg topics (lakehouse-native Bronze sink)

Redpanda's Iceberg topics let a topic double as an Iceberg table without any external ETL job: producers write records via the normal Kafka producer API; the broker transparently projects records into columnar Parquet on object storage and updates an external Iceberg REST catalog (Databricks Unity, Snowflake Polaris). Downstream Iceberg-aware engines — ClickHouse, Snowflake, Databricks, Trino, Spark, Flink — query the tables directly.

Architecturally, this positions Redpanda as the Bronze tier of a Medallion Architecture lakehouse without an intermediate integration cluster (Kafka Connect / Redpanda Connect / custom Airflow jobs). Canonical wiki pattern: patterns/streaming-broker-as-lakehouse-bronze-sink.

(Source: sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda)

Iceberg Topics GA (25.1, 2025-04-07)

Iceberg Topics were promoted from preview to General Availability in Redpanda's 25.1 release, with simultaneous availability on AWS, Azure, and GCP. The GA post (sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available) discloses nine named properties that distinguish the GA surface from the pre-GA preview: four table-management capabilities (custom hierarchical bucketed partitioning, built-in dead-letter queues, full Iceberg-spec-compliant schema evolution, automatic snapshot expiry) and five catalog-integration capabilities (secure REST catalog sync via OIDC+TLS, transactional writes, automatic table discovery and registration, built-in object-store catalog fallback, tunable workload management for the snapshot-to-topic lag ceiling). The GA disclosure retires two prior wiki caveats (snapshot-expiry ownership is broker-owned; Iceberg-spec schema evolution is fully supported); small-file compaction ownership remains an open question.

The 25.1 release bundles several additional features adjacent to the streaming substrate: native consumer group lag metrics (Prometheus-exposed; replaces a previously documented PromQL computation), Protobuf schema normalization in the Schema Registry, SASL/PLAIN authentication, unified Console+cluster identity with fine-grained RBAC, and platform-centric versioning for Kubernetes deployments with FluxCD removal to reduce conflicts with customer FluxCD installations.

Kafka-API-compatible batching semantics

From the 2024-11-19 batch-tuning explainer:

"Just as in Apache Kafka, a batch in Redpanda is a group of one or more messages written to the same partition, which are bundled together and sent to a broker in a single request. Rather than each message being sent and acknowledged separately, requiring multiple calls to Redpanda, the client buffers messages for a short time, optionally compresses the whole batch, and then sends them later as a single request." (Source: sources/2024-11-19-redpanda-batch-tuning-in-redpanda-for-optimized-performance-part-1)

The three producer knobs — linger.ms, batch.size, buffer.memory — compose identically on Redpanda and Kafka. The seven-factor effective-batch-size framework (message rate, batch.size, linger.ms, partitioning, producer fan-out, client buffer memory, backpressure) applies identically to both systems.
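As a rough illustration of how these knobs interact, the sketch below estimates which limit closes a batch first. It is a back-of-the-envelope model, not the client's actual accumulator logic: the function name `effective_batch`, the steady arrival rate, and the even spread over producers and partitions are assumptions introduced here, and buffer.memory and backpressure are ignored.

```python
def effective_batch(msg_rate_per_s, msg_bytes, producers, partitions,
                    linger_ms, batch_size):
    """Estimate the per-partition batch size (bytes) and the limit that closes it.

    Toy model (assumption): messages arrive at a steady rate and spread
    evenly over producers and partitions. Real clients also react to
    buffer.memory pressure and in-flight limits, which this ignores.
    """
    # Each producer keeps its own buffer per partition, so fan-out on either
    # axis thins out the stream that feeds any single batch.
    rate_per_buffer = msg_rate_per_s / (producers * partitions)
    bytes_in_linger = rate_per_buffer * (linger_ms / 1000.0) * msg_bytes
    if bytes_in_linger >= batch_size:
        # The size limit fills before the linger window expires.
        return batch_size, "batch.size"
    # The linger window expires first; a batch carries at least one message.
    return max(bytes_in_linger, msg_bytes), "linger.ms"


# 10,000 msg/s of 1 KB messages, 1 producer, 100 partitions, 16 KB batch.size.
print(effective_batch(10_000, 1000, 1, 100, linger_ms=5, batch_size=16_384))
print(effective_batch(10_000, 1000, 1, 100, linger_ms=200, batch_size=16_384))
```

In this example the 5 ms window collects less than one message per partition, so batches degenerate to single messages, while the 200 ms window fills the 16 KB limit before it expires. Spreading the same traffic over more producers or partitions pushes the result back toward the linger-bound case.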

The CPU-saturation latency inversion (concepts/batching-latency-tradeoff) — where increasing linger.ms reduces tail latency under broker saturation by shrinking the internal work-queue backlog — is canonicalised from the Redpanda explainer but is equally true of Kafka brokers.

High availability: multi-region stretch clusters

From the 2025-02-11 stretch-clusters post, the canonical wiki statement of Redpanda's region-spanning HA/DR shape:

"A multi-region Redpanda cluster is a deployment topology that allows customers to run a single Redpanda cluster across multiple data centers or multiple cloud regions. It's often referred to as a stretch cluster, where a single cluster stretches across multiple geographic regions with data distributed across all deployment regions. Data is replicated synchronously via raft protocol between brokers distributed across multiple regions."

(Source: sources/2025-02-11-redpanda-high-availability-deployment-multi-region-stretch-clusters)

Redpanda's stretch-cluster shape — single control plane, one per-partition Raft group spanning regions — achieves RPO=0 on region loss via automatic Raft re-election in surviving regions. The canonical wiki concept is concepts/multi-region-stretch-cluster; the pattern is patterns/multi-region-raft-quorum; the alternative lower-cost shape is MirrorMaker2 between two independent clusters (non-zero RPO).
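The RPO=0 claim rests on a Raft majority surviving the loss of any one region. A minimal sketch of that check follows; the function name and the region labels are hypothetical, and the model only counts replicas (it says nothing about election timing or client failover).

```python
from collections import Counter


def survives_region_loss(replica_regions):
    """Return True if a Raft majority remains after losing any single region.

    replica_regions lists the region of each replica in one Raft group,
    for example ["us-east", "us-west", "eu-west"].
    """
    total = len(replica_regions)
    majority = total // 2 + 1
    per_region = Counter(replica_regions)
    # Worst case: the region holding the most replicas is the one that fails.
    return total - max(per_region.values()) >= majority


# Hypothetical placements for one partition.
print(survives_region_loss(["us-east", "us-west", "eu-west"]))   # one per region
print(survives_region_loss(["us-east", "us-east", "us-west"]))   # two in one region
```

With one replica per region, any two survivors still form a majority of three. Placing two of three replicas in the same region breaks the guarantee, because losing that region leaves a single replica.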

Four operator knobs mitigate cross-region cost — all canonicalised from the same post:

Deployment uses region-as-rack via Ansible, reusing the same rack-awareness machinery as multi-AZ. Performance testing substrate: OMB + tc inter-broker latency injection. Kubernetes-operator gap as of the post: multi-region stretch is not supported on K8s (only VMs / bare metal / cloud compute / Redpanda Cloud).

Agent infrastructure (2025-04)

As of the 2025-04-03 founder-voice autonomy post, Redpanda positions the broker as the durable-log substrate for enterprise AI agents: agent-to-agent communication, human-in-the-loop workflows, trace capture, evaluation replay, message sampling, collaborative threads, and time-travel debugging are all backed by the distributed log. The canonical wiki statement of this positioning is "the truth is the log", Alex Gallego's citation of Kleppmann's 2015 "database inside out" framing as Redpanda's founding premise.

Three product-surface components ship alongside the broker:

(Source: sources/2025-04-03-redpanda-autonomy-is-the-future-of-infrastructure)

Agentic Data Plane (2025-10-28 productization)

Seven months after the autonomy-essay founder-voice framing, Gallego's 2025-10-28 post, Introducing the Agentic Data Plane, names the commercial packaging of enterprise autonomy: the Agentic Data Plane (ADP), "a unified runtime and control plane that safely exposes enterprise data to AI agents". Four layers are composed over the existing Redpanda streaming substrate:

  • (A) Streaming: the Redpanda broker itself, the substrate for durable execution, HITL async mailboxes, durable model replay, and observability event capture.
  • (B) Query engine: Oxla, a newly acquired C++ distributed query engine with a PostgreSQL wire protocol, separated compute-storage, and Iceberg-native workload targeting. "SQL is the best mechanism to filter and aggregate while the model summarizes." Early preview mid-December 2025; rolling integration into the product.
  • (C) Connectors: the existing 300+ Redpanda Connect catalog rebadged as ADP's integration layer.
  • (D) Governance: net-new global policy + observability layer enforcing governed agent data access. Concrete substrate: "OBO to task-based authentication, DLP hooks, per-agent consent workflows, and immutable audit trails with configurable retention." The first shipped feature is "Remote MCP + authentication + authorization for OBO (on-behalf-of) workloads with IdP integration", the canonical wiki instance of OBO agent authorization.

Product roadmap announced in the same post:

  • Agent templates for common enterprise data sources (Git for code repos, Jira, GDrive).
  • Declarative Agent Runtime as opinionated layer above Redpanda Agents SDK.
  • Oxla acquisition — integrated operationally via rpk oxla CLI.

Gallego's governance-first framing inverts typical agent-product marketing: "The fear from CIOs is not the code of the agent itself, it is governance. In simple terms, it is access controls: can I trust that data is accessed by the right things? And observability: when things go wrong, can I understand what happened?" This is canonicalised on the wiki as concepts/governed-agent-data-access.

Redpanda SQL (Oxla productisation, GA 2026-05-27)

The third pillar of the Redpanda Data Platform — "Streaming, Connect, and SQL" — reaches GA on 2026-05-27 (Source: 2026-05-27 Redpanda SQL is GA). Redpanda SQL is the productised GA face of the Oxla MPP query engine acquired 2025-10-28; the acquisition → mid-December 2025 preview → 2026-05-27 GA arc is complete.

GA scope: Redpanda BYOC on AWS, consumption-based plans only. GCP BYOC + BYOVPC: "coming soon". Self-Managed: 2H FY27. Activation takes three steps from the Redpanda Console cluster overview page and requires no cluster restart.

Four GA properties (full canonicalisation on systems/redpanda-sql):

  • In-cluster, in-VPC (concepts/in-cluster-streaming-sql / patterns/in-vpc-query-engine-on-streaming-substrate). Redpanda SQL runs on the same BYOC infrastructure as the brokers and Iceberg storage, inside the customer's VPC; "every query accesses data in-place". This closes the analytical-compute gap in BYOC compliance stories: before Redpanda SQL, BYOC kept storage in-VPC but analytical queries required egress to a third-party warehouse.
  • Postgres wire protocol (concepts/postgres-wire-protocol-as-streaming-sql-surface). "It's just Postgres." Connect with psql, DBeaver, DataGrip, or Redpanda Console SQL Studio. This is the same architectural move Redpanda made with the Kafka wire protocol on the broker side, applied to the SQL surface.
  • Transparent two-tier query bridge (concepts/two-tier-stream-iceberg-query-bridge / patterns/transparent-hot-cold-tier-query). A single SQL statement reads transparently across the live broker tier and the Iceberg Topics cold-tier Parquet files; the engine plans the unified read path. Substrate-dependent on Iceberg Topics' simultaneous-write property.
  • MPP execution from Oxla: a "Massively Parallel Processing" C++ engine, written in the same implementation language as the streaming broker and designed for OLAP throughput with extreme memory efficiency.

Five workload classes named at launch (full enumeration on systems/redpanda-sql): streaming-app debugging, real-time operational analytics, ad-hoc analytics, compliance queries, agent-driven query fan-out (concepts/agent-driven-query-fan-out — humans serial, agents parallel; "hundreds of queries simultaneously").

Explicit foil against ksqlDB at the ad-hoc vs predefined axis: "ksqlDB is a handy tool, but it requires you to decide what questions you're going to ask before the events arrive."

The launch reframes the Redpanda Data Platform from a streaming vendor into a complete data-platform vendor: "One architecture. One operational model. One vendor." — the positioning answer to Confluent's Kora + Flink + Tableflow and to Kafka + ETL + Snowflake.

Performance tuning checklist (2025-04)

The 2025-04-23 "Need for speed: 9 tips to supercharge Redpanda" post (Source: sources/2025-04-23-redpanda-need-for-speed-9-tips-to-supercharge-redpanda) provides an omnibus performance-tuning checklist for Redpanda clusters, organised across three dependency layers:

Infrastructure (tips 1–2) — deploy on local NVMe; run brokers on dedicated hardware with no noisy neighbors; give Redpanda 95% of available resources (leave 5% for OS / k8s host). When NVMe isn't available (SSD, spinning disks, SAN, remote storage), enable broker-side write caching — always paired with acks=all to preserve the quorum-memory durability guarantee. Canonical framing of write-caching as the hardware-shortfall mitigation (complementing the earlier Kinley 2024-11-26 organisational framing).

Data architecture (tips 3, 8, 9) — partition skew kills parallelisation (Amdahl's Law): use the sticky partitioner for unkeyed records; only use keyed partitioning when required (CDC); pick high-cardinality keys when keys are unavoidable (patterns/high-cardinality-partition-key). Don't compress compacted topics unless you accept the decompress/recompress CPU tax (concepts/compression-compaction-cpu-cost). Use tiered storage not just for capacity but for orders-of-magnitude faster decommission and recommission — data already in object storage doesn't need to re-replicate.
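The effect of key cardinality on partition skew can be sketched with a few lines of simulation. This is an illustration under stated assumptions: `zlib.crc32` stands in for the client's real key hash (Kafka's default partitioner uses murmur2), and the key names and record counts are made up.

```python
import zlib
from collections import Counter


def partition_load(keys, num_partitions):
    """Count records per partition for a hash-by-key partitioner.

    Assumption: crc32 is a stand-in for the client's real hash function.
    """
    counts = Counter(zlib.crc32(k.encode()) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew(load):
    """Ratio of the busiest partition to the mean load; 1.0 is perfectly even."""
    mean = sum(load) / len(load)
    return max(load) / mean


records, partitions = 12_000, 12
# Low cardinality: only three distinct keys, so at most three partitions get data.
low = partition_load([f"region-{i % 3}" for i in range(records)], partitions)
# High cardinality: one distinct key per record.
high = partition_load([f"order-{i}" for i in range(records)], partitions)

print("low-cardinality load: ", low, "skew", round(skew(low), 2))
print("high-cardinality load:", high, "skew", round(skew(high), 2))
```

With three distinct keys, at least nine of the twelve partitions sit idle, so the busiest partition carries at least four times the mean load and caps the parallelism the cluster can deliver. The high-cardinality run spreads records across all partitions.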

Application design (tips 4–7) — tune producer batching via linger.ms + batch.size; tune consumer fetches with a four-parameter matrix (fetch.min.bytes, fetch.max.wait.ms, max.partition.fetch.bytes, max.poll.records) pivoted on low-latency vs high-throughput regime; control offset-commit cost — each commit is a write to __consumer_offsets, so auto.commit.interval.ms ≥ 1 s (low-ms is "right out") and one consumer group per service. Compress on the client, not the broker (patterns/client-side-compression-over-broker-compression); prefer ZSTD or LZ4 as the codec-CPU-vs-ratio sweet spot.
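The offset-commit cost is easy to estimate. The sketch below is an approximation introduced here, not a formula from the post: it assumes each consumer issues one auto-commit request per interval and counts requests rather than per-partition records in __consumer_offsets.

```python
def offset_commit_requests_per_sec(consumers, auto_commit_interval_ms):
    """Approximate auto-commit requests per second across a set of consumers.

    Assumption: every consumer commits exactly once per interval.
    """
    if auto_commit_interval_ms <= 0:
        raise ValueError("auto_commit_interval_ms must be positive")
    return consumers * 1000.0 / auto_commit_interval_ms


# 200 consumers at the Kafka client default of 5000 ms, at 1 s, and at 10 ms.
for interval_ms in (5000, 1000, 10):
    rate = offset_commit_requests_per_sec(200, interval_ms)
    print(f"{interval_ms:>5} ms -> {rate:,.0f} commits/s")
```

For 200 consumers, moving from a 1 s interval to 10 ms raises the commit load from 200 to 20,000 requests per second, which is the kind of write amplification the tip warns against.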

Kubernetes deployment

Redpanda supports two deployment paths on Kubernetes: the Helm chart (simple, template-driven, limited lifecycle automation) or the production-grade Redpanda Operator (CRD-driven, managed upgrades + dynamic configuration + lifecycle automation + multi-tenancy). The operator is the default recommendation.

The operator was consolidated during 2025. Prior state: two separate operators, an internal one for Redpanda Cloud + BYOC and a customer-facing one for Self-Managed. The customer operator initially bundled FluxCD internally to wrap the Helm chart, the canonical wiki instance of the bundled-GitOps-dependency anti-pattern. The 2025 consolidation retired that structure across three branches:

  • v2.3.x — FluxCD optional (spec.chartRef.useFlux).
  • v2.4.x (Jan 2025) — FluxCD disabled by default.
  • v25.1.x — FluxCD + Helm-chart wrapping removed; unified operator serving both Cloud and Self-Managed. Adopts the version-aligned compatibility scheme — operator/chart version matches Redpanda core version with ±1 minor window, retiring the compatibility matrix.

Canonical pattern: patterns/unified-operator-for-cloud-and-self-managed.
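The version-aligned scheme in the v25.1.x branch reduces a compatibility matrix to a simple comparison. The sketch below shows one possible reading of the "±1 minor window" rule; the function names are hypothetical, and it only compares versions within the same major series, since the source does not say how the window behaves across a major boundary.

```python
def parse_major_minor(version):
    """Extract (major, minor) from strings like 'v25.1.3' or '25.2.0'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def is_compatible(operator_version, core_version):
    """True if the operator is within one minor version of Redpanda core.

    Assumption: both versions must share the same major number.
    """
    op_major, op_minor = parse_major_minor(operator_version)
    core_major, core_minor = parse_major_minor(core_version)
    return op_major == core_major and abs(op_minor - core_minor) <= 1


print(is_compatible("v25.1.3", "25.2.1"))  # one minor apart
print(is_compatible("v25.1.3", "25.3.0"))  # two minors apart
```

Under this reading, an operator on 25.1 can manage a 25.2 core but not a 25.3 core.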

Deployment-shape limitation: per the 2025-02-11 stretch-cluster post, "Self-Managed on K8s currently supports only multi-AZ deployments" — multi-region stretch is VMs / bare metal / cloud compute / Redpanda Cloud only.

(Source: sources/2025-05-06-redpanda-a-guide-to-redpanda-on-kubernetes)

FIPS compliance (broker-level, 2025-05-20)

As of Redpanda's 2025-05-20 Implementing FIPS compliance in Redpanda post, Redpanda brokers can operate under a FIPS cryptographic boundary for deployments into US federal / regulated environments.

  • Substrate: OpenSSL 3.0.9, FIPS 140-2 validated; 140-3 validation under NIST review at post publication. Late-2025 upgrade target: OpenSSL 3.1.2 (FIPS 140-3 validated) ahead of the 140-2 sunset. Both the redpanda broker binary and the rpk CLI consume the validated module.
  • Artefact distribution: two packages install alongside base Redpanda — redpanda-fips (OpenSSL FIPS module) and redpanda-rpk-fips (FIPS-compliant rpk). RPM + Debian at post publication.
  • Config dial: three-state fips_mode in redpanda.yaml: disabled / enabled / permissive, plus openssl_config_file + openssl_module_directory paths. enabled is the production setting; permissive is a dev-ergonomics affordance allowing broker-level FIPS logic without requiring OS-level FIPS (canonical warning: "anything crypto-related that relies on the operating system (such as sourcing entropy) may not be in full compliance").
  • Enforcement: broker startup fail-fast. "Redpanda will log an error and exit if the underlying operating system isn't properly configured." No silent downgrade; the cluster either passes the boundary at startup or hard-fails. This is structurally stronger than the logging-then-enforcement progressive-rollout shape by design: regulated workloads have no warn-only regime.
  • OS precondition (RHEL 8+): fips-mode-setup --enable → reboot → fips-mode-setup --check reports "FIPS mode is enabled". Only then does fips_mode: enabled in Redpanda succeed.
  • Deployment automation: Redpanda Ansible Collection accepts -e "enable_fips=true" -e "fips_mode=enabled" on the provision-cluster.yml playbook to pull FIPS binaries and write the FIPS redpanda.yaml.
  • Boundary scope at publication: self-managed RPM / Debian only. Redpanda Cloud, Kubernetes deployments, and systems/redpanda-connect are on the roadmap — a canonical wiki instance of the FIPS boundary being narrower than a product's full deployment surface because validated-module distribution is deployment-shape-specific.

License-gated enterprise feature.
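The three-state fips_mode dial and the fail-fast startup rule can be summarised as a small decision function. This is a sketch of the behaviour described above, not Redpanda's implementation: the function names, the status strings, and the exception type are hypothetical. The only real interface used is the Linux kernel flag file /proc/sys/crypto/fips_enabled.

```python
VALID_MODES = {"disabled", "enabled", "permissive"}


class FipsStartupError(RuntimeError):
    """Raised when fips_mode is 'enabled' but the OS is not in FIPS mode."""


def read_os_fips_flag(path="/proc/sys/crypto/fips_enabled"):
    """Return True if the Linux kernel reports FIPS mode; False if unavailable."""
    try:
        with open(path) as flag:
            return flag.read().strip() == "1"
    except OSError:
        return False


def fips_startup_status(fips_mode, os_fips_enabled):
    """Model the startup decision for each fips_mode value (hypothetical statuses)."""
    if fips_mode not in VALID_MODES:
        raise ValueError(f"unknown fips_mode: {fips_mode!r}")
    if fips_mode == "disabled":
        return "fips-off"
    if os_fips_enabled:
        return "fips-enforced"
    if fips_mode == "enabled":
        # Fail fast: no silent downgrade to a weaker mode.
        raise FipsStartupError("OS is not in FIPS mode; refusing to start")
    # permissive: broker-level FIPS logic runs, but OS-level crypto is unverified.
    return "fips-permissive-os-unverified"


print(fips_startup_status("permissive", os_fips_enabled=False))
print(fips_startup_status("disabled", os_fips_enabled=read_os_fips_flag()))
```

The design point the sketch tries to capture is that "enabled" has no warn-only outcome: the only alternatives are an enforced boundary or a refused start.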

(Source: sources/2025-05-20-redpanda-implementing-fips-compliance-in-redpanda)

GCP outage response (Redpanda Cloud, 2025-06-20 retrospective)

Redpanda's 2025-06-20 retrospective on the 2025-06-12 GCP global outage (Source: sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage) discloses Redpanda Cloud's substrate posture during a cascading cloud-provider event. The post's load-bearing disclosures:

  • Cell-based architecture as an explicit product principle. "Redpanda Cloud clusters do not externalize their metadata or any other critical services. All the services needed to write and read data, manage topics, ACLs, and other Kafka entities are co-located, with Redpanda core leading the way with its single-binary architecture." Explicit contrast: "other products boasting centralized metadata and a diskless architecture likely experienced the full weight of this global outage."
  • Availability SLA structural composition. The 99.99% SLA (with ≥99.999% design target) decomposes to six concrete substrate choices:
    • Replication factor ≥ 3 enforced on all topics (customers cannot lower, only increase).
    • Local NVMe primary storage + async tiered storage as fallback, not primary. Object-store errors don't block writes.
    • Redundant Kafka API + Schema Registry + Kafka HTTP Proxy.
    • No critical-path external dependencies beyond VPC + compute nodes + locally-attached disks (with PSC-enabled deployments as the named exception).
    • Continuous chaos + load testing of each cluster tier.
    • Release-engineering discipline with feedback-control-loop-guarded phased rollouts: "we try to close our feedback control loops by watching Redpanda metrics as the phased rollout progresses and stopping when user-facing issues are detected."
  • Private Service Connect (PSC) is the named dependency exception. When PSC is enabled, it becomes part of the critical path for read / write. Canonicalises the one deployment shape where the "no external dependencies" claim does not strictly hold.
  • Deliberate disk reserve — unused + used-but-reclaimable NVMe space kept available for reclamation during tiered-storage stress.
  • Hedged observability — self-hosted data, third-party for dashboarding and alerting; the 2024 migration paid off during the 2025-06-12 cascading outage where the third-party was partially affected but self-hosted substrate stayed queryable.
  • Single node lost during the outage: a staging cluster in us-central1. "An uncommon interaction between internal infrastructure components" produced a node failure with no replacement until GCP recovered ~2 hours later. One cluster out of hundreds.
  • Customer stack context changes the urgency calculus. "For some of them, GCP's Pub/Sub served as the data source for their Redpanda BYOC clusters, so they needed to recover that first." Redpanda's position downstream of GCP-native sources meant upstream outages limited even a counterfactually-affected Redpanda cluster's customer urgency.
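The gap between the 99.99% SLA and the ≥99.999% design target in the list above is easier to appreciate as a downtime budget. The helper below is plain arithmetic; the function name and the 365-day year are assumptions made for the example.

```python
def downtime_budget_minutes(availability_pct, days=365):
    """Minutes of allowed downtime over a period for a given availability target."""
    return (1 - availability_pct / 100.0) * days * 24 * 60


for target in (99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes per year")
```

A 99.99% target allows roughly 52.6 minutes of downtime per year, while 99.999% allows roughly 5.3 minutes, so a single node outage of about two hours would exceed either budget if it were customer-facing rather than absorbed by replication.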

Profile-guided optimization (26.1, 2026-04-02)

Redpanda Streaming 26.1 enabled clang PGO for the broker binary, delivering ~10-15% overall efficiency improvement on small-batch CPU-intensive workloads — announced as a one-line feature in the 2026-03-31 26.1 launch post, then unpacked mechanism-by-mechanism in the 2026-04-02 engineering deep-dive (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization).

Measured wins on the canonical small-batch regression benchmark:

  • ~50% reduction in p50 latency.
  • Up to 47% reduction in p999 latency.
  • 15% reduction in CPU reactor utilization.

The amplification asymmetry (15% CPU → 47% p999) is the canonical batching-under-saturation shape — less CPU per request → shorter broker queue depth → disproportionately lower tail latency.
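Why a modest CPU saving can produce a much larger latency drop can be illustrated with a textbook M/M/1 queue, where mean time in the system is S / (1 - ρ). This is a toy model chosen for illustration: it describes mean latency rather than p999, it is not how the broker schedules work, and it is not meant to reproduce the measured figures.

```python
def mm1_latency(service_time, arrival_rate):
    """Mean time in system for an M/M/1 queue: S / (1 - rho), rho = S * lambda."""
    rho = service_time * arrival_rate
    if rho >= 1:
        raise ValueError("queue is unstable: utilization must be below 1")
    return service_time / (1 - rho)


def latency_reduction(cpu_saving, baseline_utilization):
    """Fractional latency drop when per-request CPU cost falls by cpu_saving.

    The baseline service time is normalised to 1.0, so the arrival rate
    equals the baseline utilization.
    """
    base = mm1_latency(1.0, baseline_utilization)
    tuned = mm1_latency(1.0 - cpu_saving, baseline_utilization)
    return 1 - tuned / base


for utilization in (0.3, 0.6, 0.9):
    drop = latency_reduction(0.15, utilization)
    print(f"utilization {utilization:.0%}: latency falls by {drop:.0%}")
```

In this model a 15% cut in per-request cost lowers latency by about 20% at 30% utilization but by more than 60% at 90% utilization, the same direction of amplification the benchmark shows near saturation.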

Diagnostic methodology: Redpanda used top-down microarchitecture analysis via Linux perf stat --topdown to identify the workload as 51% frontend-bound on baseline — "definitely on the higher end, even for database or distributed applications." PGO reduced frontend-bound to 37.9%, with 6 percentage points shifting to retiring (useful work) and 7 to backend-bound (revealed next bottleneck). Canonical example of TMA-guided optimisation target selection.

PGO vs BOLT evaluation: Redpanda evaluated both PGO and LLVM BOLT and chose PGO citing stability:

"PGO is a proven and widely deployed technology, so with this in mind and considering some outstanding BOLT bugs, we decided to stick with PGO."

BOLT performance was "similar to PGO. Most of the time, it came in just slightly behind." Redpanda hit LLVM bug llvm-project#169899 during their BOLT evaluation — the first wiki-canonical non-Meta BOLT brittleness datum (contrast with Meta's fleet-scale success via BOLT + Strobelight). Combining both gave "another small bump in performance"; the post preserves the option of "adding BOLT on top of PGO at some point."

Mechanisms applied: PGO enables hot-cold code splitting, basic-block reordering, and profile-driven inlining — all targeting instruction-cache locality. BOLT's heatmap visualisation tool confirmed the PGO-optimised binary packs hot functions tightly at the binary's start; the baseline distributed them across the binary.

See the canonical apply pattern at patterns/pgo-for-frontend-bound-application and the per-binary (non-fleet-scale) variant of patterns/feedback-directed-optimization-fleet-pipeline as the substrate framing.

