ClickHouse plus Snowflake dual storage tier¶
Problem¶
An event / IoT pipeline needs both the strengths of Snowflake (enterprise governance, SQL richness, managed compute-storage separation, seamless integration with analytics / BI tools) and the strengths of ClickHouse (columnar compression, sub-second-latency queries on billions of rows, aggressive cold-tier compression, lower marginal cost on long-term retention).
The naive approach is ETL-chained — land in Snowflake first, then export to ClickHouse on a schedule. This adds latency, creates two schemas to keep in sync, and loses the flexibility to use each system for what it's good at.
Pattern¶
Write each event simultaneously to both Snowflake and ClickHouse through a single streaming broker + fan-out-aware connector layer, with each storage tier configured for its comparative advantage:
- Redpanda (or any Kafka-API broker) is the durable log.
- Redpanda Connect reads the topic once and writes to both sinks via parallel outputs — either `broker` (fan out to multiple outputs in parallel) or multiplexing (route messages per-record to a specific sink).
- The Snowflake leg uses the first-class `snowflake_streaming` output connector for low-latency row-level ingest.
- The ClickHouse leg uses the generic `sql_raw` / `sql_insert` processors (no dedicated connector at the time of publication) targeting a time-partitioned MergeTree schema.
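The wiring above can be sketched as a single Redpanda Connect config. Topic names, credentials, table names, and some connector field values below are illustrative assumptions, not taken from the source post — verify field names against the Connect reference for your version:

```yaml
# Sketch only: broker fan-out to a Snowflake leg and a ClickHouse leg.
# All names/credentials are placeholders; check your Connect version's docs.
input:
  kafka_franz:
    seed_brokers: ["redpanda:9092"]
    topics: ["iot.events"]
    consumer_group: dual-write

output:
  broker:
    pattern: fan_out                # every message goes to BOTH sinks
    outputs:
      - snowflake_streaming:        # Snowflake leg: low-latency row ingest
          account: "MYORG-MYACCOUNT"
          user: "PIPELINE_USER"
          private_key_file: "./rsa_key.p8"
          database: "EVENTS_DB"
          schema: "PUBLIC"
          table: "IOT_EVENTS"
          channel_prefix: "iot"     # per-table parallelism knobs
          max_in_flight: 4
      - sql_insert:                 # ClickHouse leg: generic SQL ingest
          driver: clickhouse
          dsn: "clickhouse://clickhouse:9000/events"
          table: iot_events
          columns: [device_id, ts, payload]
          args_mapping: 'root = [this.device_id, this.ts, this.payload]'
```

Swapping `pattern: fan_out` for a `switch` output would give the multiplexing variant, where each record is routed to only one of the two sinks.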
Verbatim framing from the 2025-12-09 Redpanda IoT pipeline
post: "Using ClickHouse and Snowflake as two parts of your
storage layer and simultaneously synchronizing data across
both offers a more straightforward implementation and
provides your pipeline with some modularity. Instead of
using one to feed the other, ClickHouse can archive data on
a local machine or persistent cloud storage (for
experimentation, debugging, or troubleshooting). At the same
time, Snowflake provides dashboards and real-time alerts."
And: "You can easily set your pipeline to stream data into
both Snowflake and ClickHouse simultaneously, without a
broker. You just need to pair the sql_raw or sql_insert
processor with the snowflake_streaming output connector."
(Source: sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse)
Which tier gets what¶
The post also discusses the inverted assignment for compliance-sensitive workloads:
- Default: Snowflake = long-term archive + BI; ClickHouse = real-time dashboard.
- Inverted (compliance): Snowflake = streaming access logs + financial-trigger short-lived event storage (where governance matters); ClickHouse = long-term compressed retention (where column compression wins).
The point is that the two tiers cover orthogonal axes of the design space, and the pattern keeps both options open by writing to both.
Why dual-write rather than chain¶
| | Chained (A → B) | Dual-write (both from broker) |
|---|---|---|
| Latency to second tier | A-ingest + A-query + B-ingest | B-ingest only |
| Schema drift surface | A's schema shapes B's schema | Independent schemas per tier |
| A-outage blast radius | B stops getting data | B unaffected |
| B-outage blast radius | A unaffected | A unaffected |
| Broker retention requirement | Low (A is source) | Higher (can replay to either sink) |
| Use case for each tier | Forced by chain order | Chosen per tier's strengths |
Dual-write trades the broker's retention footprint for substantially more flexibility and fault isolation.
Composes with¶
- patterns/time-partitioned-mergetree-for-time-series — the ClickHouse leg's canonical schema shape.
- concepts/snowpipe-streaming-channel — the Snowflake leg's ingest protocol. `channel_prefix` + `max_in_flight` tune the per-table parallelism.
- patterns/streaming-broker-as-lakehouse-bronze-sink — a sibling / generalisation: "broker as bronze → multiple downstream tiers", where each tier is itself a lakehouse zone. This dual-tier pattern is the two-tier materialisation of the same idea on a non-Iceberg substrate.
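A minimal sketch of the time-partitioned MergeTree schema the ClickHouse leg targets; the database, table, and column names are illustrative, not from the source:

```sql
-- Sketch of the ClickHouse leg's target table (names are placeholders).
CREATE TABLE events.iot_events
(
    device_id String,
    ts        DateTime,
    payload   String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)      -- time partitioning: old months drop cheaply
ORDER BY (device_id, ts)
TTL ts + INTERVAL 24 MONTH;    -- long-term retention bound
```

Monthly partitions keep long-retention storage manageable: expiry is a partition drop rather than a row-by-row delete.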
Caveats¶
- No exactly-once across both sinks unless both sinks provide it. Snowpipe Streaming has per-channel offset tokens; `sql_insert` against ClickHouse is at-least-once by default, so duplicate handling needs idempotent inserts (e.g. ReplacingMergeTree, UUID primary keys).
- Schema changes apply to both sinks — a column addition upstream must be handled independently in Snowflake (automatic via `schema_evolution: true`) and in ClickHouse (via `ALTER TABLE`).
- Cost accounting is double — Snowflake warehouse compute + ClickHouse cluster compute + both storage volumes. The pattern only makes sense when each tier-specific strength is actually used by downstream consumers.
- The ClickHouse connector is generic SQL (`sql_raw` / `sql_insert`), not a columnar-native ingest path. Native ClickHouse bulk-insert protocols (HTTP / gRPC with RowBinary framing) would outperform it, but are not wired up in Redpanda Connect as of the source date.
- Multiplexing vs `broker`: if the same event should land in both tiers, use `broker` (fan-out). If different events should go to different tiers, use multiplexing. The choice depends on whether the two sinks store identical or disjoint event sets.
Seen in¶
- sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse — canonical wiki introduction. Redpanda IoT-pipeline tutorial post frames Snowflake + ClickHouse as two legs of a simultaneous dual-write pipeline orchestrated by Redpanda Connect `broker` / multiplexing output composition, with each tier chosen for its comparative-advantage property.