PATTERN

ClickHouse plus Snowflake dual storage tier

Problem

An event / IoT pipeline needs both the strengths of Snowflake (enterprise governance, SQL richness, managed compute-storage separation, seamless integration with analytics / BI tools) and the strengths of ClickHouse (columnar compression, sub-second-latency queries on billions of rows, aggressive cold-tier compression, lower marginal cost on long-term retention).

The naive approach is ETL-chained — land in Snowflake first, then export to ClickHouse on a schedule. This adds latency, creates two schemas to keep in sync, and loses the flexibility to use each system for what it's good at.

Pattern

Write each event simultaneously to both Snowflake and ClickHouse through a single streaming broker + fan-out-aware connector layer, with each storage tier configured for its comparative advantage:

  • Redpanda (or any Kafka-API broker) is the durable log.
  • Redpanda Connect reads the topic once and writes to both sinks via parallel outputs — either broker (fan out to multiple outputs in parallel) or multiplexing (route messages per-record to a specific sink).
  • Snowflake leg uses the first-class snowflake_streaming output connector for low-latency row-level ingest.
  • ClickHouse leg uses generic sql_raw / sql_insert processors (no dedicated connector at the time of publication) targeting a time-partitioned MergeTree schema.

Verbatim framing from the 2025-12-09 Redpanda IoT pipeline post: "Using ClickHouse and Snowflake as two parts of your storage layer and simultaneously synchronizing data across both offers a more straightforward implementation and provides your pipeline with some modularity. Instead of using one to feed the other, ClickHouse can archive data on a local machine or persistent cloud storage (for experimentation, debugging, or troubleshooting). At the same time, Snowflake provides dashboards and real-time alerts." And: "You can easily set your pipeline to stream data into both Snowflake and ClickHouse simultaneously, without a broker. You just need to pair the sql_raw or sql_insert processor with the snowflake_streaming output connector." (Source: sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse)

Which tier gets what

The post also discusses the inverted assignment for compliance-sensitive workloads:

  • Default: Snowflake = long-term archive + BI; ClickHouse = real-time dashboard.
  • Inverted (compliance): Snowflake = streaming access logs + financial-trigger short-lived event storage (where governance matters); ClickHouse = long-term compressed retention (where column compression wins).

The point is that the two tiers cover orthogonal axes of the design space, and the pattern keeps both options open by writing to both.

Why dual-write rather than chain

                                Chained (A → B)                  Dual-write (both from broker)
  Latency to second tier        A-ingest + A-query + B-ingest    B-ingest only
  Schema drift surface          A's schema shapes B's schema     Independent schemas per tier
  A-outage blast radius         B stops getting data             B unaffected
  B-outage blast radius         A unaffected                     A unaffected
  Broker retention requirement  Low (A is source)                Higher (can replay to either sink)
  Use case for each tier        Forced by chain order            Chosen per tier's strengths

Dual-write trades the broker's retention footprint for substantially more flexibility and fault isolation.

Caveats

  • No exactly-once across both sinks unless both sinks provide it. Snowpipe Streaming has per-channel offset tokens. sql_insert against ClickHouse is at-least-once by default — duplicate handling needs idempotent inserts (e.g. a ReplacingMergeTree keyed on an event UUID).
  • Schema changes apply to both sinks — a column addition upstream must be handled independently in Snowflake (auto via schema_evolution: true) and ClickHouse (via ALTER TABLE).
  • Cost accounting is double — Snowflake warehouse compute + ClickHouse cluster compute + both storage volumes. Only makes sense when the tier-specific strength is actually used by downstream consumers.
  • ClickHouse connector is generic SQL (sql_raw / sql_insert), not a columnar-native ingest path. Native ClickHouse bulk-insert protocols (HTTP / gRPC with RowBinary framing) would outperform this, but are not wired up in Redpanda Connect as of the source date.
  • Multiplexing vs broker: if the same event should land in both tiers, use broker (fan-out). If different events should go to different tiers, use multiplexing. The choice depends on whether the two sinks store disjoint or identical event sets.
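The multiplexing alternative from the last caveat can be sketched with a switch output, routing each record to exactly one sink. The routing field (this.kind), the "access_log" value, table names, and credentials are all illustrative assumptions; an empty check acts as the catch-all case:

```yaml
# Sketch: per-record routing (disjoint event sets) instead of fan-out.
output:
  switch:
    cases:
      - check: this.kind == "access_log"   # governance-sensitive records -> Snowflake
        output:
          snowflake_streaming:
            account: "MYORG-MYACCOUNT"     # placeholder
            user: "INGEST_USER"            # placeholder
            database: "EVENTS"
            schema: "PUBLIC"
            table: "ACCESS_LOGS"
            private_key_file: "./rsa_key.p8"
      - check: ""                          # everything else -> ClickHouse archive tier
        output:
          sql_insert:
            driver: "clickhouse"
            dsn: "clickhouse://localhost:9000/events"
            table: "iot_events_archive"
            columns: ["device_id", "ts", "payload"]
            args_mapping: 'root = [ this.device_id, this.ts, this.payload ]'
```

Swapping the broker/fan_out output for switch is the only structural change between the two variants, which is what makes the identical-vs-disjoint event-set question the deciding factor.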
