Redpanda — Streaming IoT and event data into Snowflake and ClickHouse¶
Summary¶
Unsigned Redpanda Blog how-to / vendor-tutorial post (2025-12-09)
framing a reference pipeline for IoT + event data streaming:
Redpanda → Redpanda Connect → both Snowflake (short-term
real-time) and ClickHouse (long-term columnar archive)
simultaneously. Product-marketing voice — heavy on Redpanda
promotion and how-to config examples — but with a substantial
architectural core: the inverted storage-layer trade-off
(Snowflake for compliance-governed short-lived events,
ClickHouse for long-term columnar retention), the
time-partitioned MergeTree idiom for time-series data,
DETACH PARTITION archival as a cold-tier offload
mechanism, native TTL
policies for automatic deletion/movement of old rows,
LZ4-on-hot /
ZSTD-on-cold compression tiering via column-level CODEC(),
MergeTree
granule skipping via ORDER BY-key min-value storage per
granule,
Snowflake MATCH_RECOGNIZE pattern-based sessionization,
and the absence of a dedicated ClickHouse output connector in
Redpanda Connect (forcing the use of generic sql_raw /
sql_insert processors). Also discloses specific
Snowpipe Streaming
batching recommendations (500–1,000 records for low-latency
time series; 10,000+ for bulk; byte_size: 0 to disable
size-based batching; period 10–30s for real-time dashboards).
Key takeaways¶
- Inverted storage-tier trade-off for compliance-sensitive IoT workloads: the default pairing (Snowflake = long-term archive, ClickHouse = real-time) "may be better off reversed" for compliance-sensitive workloads. Snowflake's "enterprise-level data governance makes it well-suited for streaming access logs and financial triggers, while ClickHouse's compression capabilities are great for long-term storage and analysis." The wiki canonicalises this as the pattern simultaneous-write-to-both, each covering its trade-off. (Source: this ingest)
- ClickHouse `PARTITION BY toYYYYMM(timestamp)` is the canonical time-series idiom. Verbatim: "`PARTITION BY toYYYYMM(timestamp)` or `toDate(timestamp)` will allow you to segment data monthly or daily". Enables fast partition pruning + whole-partition archival via `ALTER TABLE telemetry_events DETACH PARTITION '202501'`; detached partitions are "physically detached from the active dataset and moved to external storage (like S3 or NAS)". (Source: this ingest)
- Native TTL policies automate deletion/movement — canonical statement: `ALTER TABLE telemetry_events MODIFY TTL timestamp + INTERVAL 12 MONTH DELETE;`. TTL composes with partition-level archival — whole partitions can either be detached manually to cold storage or dropped automatically by TTL after a retention window. (Source: this ingest)
- Column-level `CODEC()` enables compression tiering: hot data under a lightweight codec (LZ4) on local SSD; cold data under an aggressive codec (ZSTD) on S3-backed storage. "Keep in mind that higher compression levels reduce storage footprint but increase data access latency. For this reason, they're best reserved for long-term archival tiers where data is infrequently queried." Canonicalised as concepts/hot-cold-tier-compression-codec-split. (Source: this ingest)
- MergeTree indexing is implicit min-value-per-granule skipping on the `ORDER BY` key: "ClickHouse stores data in parts, each containing granules (blocks of rows). For each granule, it stores the minimum value of the `ORDERED BY` key. When you execute a query with a filter, ClickHouse uses the index to skip granules that don't match. This is called MergeTree indexing, and it's why ClickHouse can query across months or years of telemetry, logs, or CI/CD events in milliseconds." Controlled by `index_granularity` — smaller granules = finer skipping but more index overhead. Canonicalised as concepts/clickhouse-index-granule-skipping. (Source: this ingest)
- Append-only storage model provides tamper resistance for audit logs: "Since ClickHouse doesn't support row-level updates or deletes (except via TTL or mutations), it's naturally resistant to silent data changes. For stronger guarantees, you can implement cryptographic hashing on inserted data, store those hashes in a separate immutable store, and monitor for divergence." Canonicalised as concepts/clickhouse-append-only-tamper-resistance. (Source: this ingest)
- Snowpipe Streaming batching guidance (specific numbers): "Snowpipe streams new data in a set of batches (as opposed to individually) to optimize resource usage. Smaller batches (500 to 1,000 records) are ideal for low-latency streaming, while larger batches (over 10,000 records) are best suited for bulk processing. If you're using Snowflake for time series data, it's best to set this field to 1,000 (at most)." On `byte_size`: "While Snowflake recommends 100–250 MB for data files, it's best to set this property to 0 to turn off size-based batching to simplify your configuration." On `period`: "The period should range between ten to thirty seconds for real-time ingestion for dashboards and analytics. You can extend it to one to five minutes for less frequent updates." Extends canonicalisation on concepts/snowpipe-streaming-channel. (Source: this ingest)
- Snowflake `MATCH_RECOGNIZE` is the canonical primitive for pattern-based sessionization: clicks with a `<= 10s` inter-event gap from the same IP form a session. Verbatim worked example with `PARTITION BY ip` / `ORDER BY click_timestamp` / `MEASURES MATCH_NUMBER() AS session_id, COUNT(*) AS no_of_events, FIRST(click_timestamp) AS start_time, LAST - FIRST AS duration` / `PATTERN (b s*)` / `DEFINE s AS (s.click_timestamp - PREV(s.click_timestamp)) <= INTERVAL '10' SECOND`. Canonicalised as concepts/snowflake-match-recognize-sessionization. (Source: this ingest)
- Snowflake `schema_evolution` has a performance trade-off — "if your data's structure is unlikely to change, turning this feature off may serve you better. It forces schema validation, helping you catch errors or inconsistencies in the incoming data before they are stored. Because continuous schema changes can add overhead during ingestion or query processing (especially with large-scale data pipelines), schema evolution may not be ideal in time-series contexts, where performance and retrieval speeds are critical." Contradicts the naive "always turn on schema evolution" framing. (Source: this ingest; see concepts/schema-evolution)
- No dedicated ClickHouse output connector in Redpanda Connect — use `sql_raw`/`sql_insert`: "While there isn't a dedicated ClickHouse connector for Redpanda Connect yet, the `sql_raw` and `sql_insert` components allow you to stream execute commands or stream data from Redpanda into ClickHouse. They're available as input, processor, and output types, so you've got flexibility in how you wire things up." Contrasts with the first-class `snowflake_streaming` Snowflake connector. Means the ClickHouse leg of the pipeline gets the generic-SQL-sink hop cost rather than a protocol-native path. (Source: this ingest)
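The ClickHouse-side takeaways can be sketched as one DDL example. A minimal sketch, assuming hypothetical table and column names (only `telemetry_events` and the partition/TTL expressions come from the post); the codec split, TTL, and hash column follow the quoted guidance:

```sql
-- Hypothetical telemetry table combining the post's idioms:
-- monthly partitioning, an ORDER BY key for granule skipping,
-- per-column codec tiering, a 12-month TTL, and a hash column
-- as a tamper-evidence hook.
CREATE TABLE telemetry_events
(
    device_id    String    CODEC(LZ4),
    timestamp    DateTime  CODEC(DoubleDelta, LZ4),
    metric       String    CODEC(LZ4),
    value        Float64   CODEC(LZ4),
    -- cold, rarely-queried payload under aggressive compression
    raw_payload  String    CODEC(ZSTD(9)),
    -- hash of row content, to be mirrored into a separate
    -- immutable store and monitored for divergence
    row_hash     UInt64    MATERIALIZED cityHash64(device_id, timestamp, metric, value)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)       -- or toDate(timestamp) for daily partitions
ORDER BY (device_id, timestamp)        -- min-value-per-granule skip index
TTL timestamp + INTERVAL 12 MONTH DELETE
SETTINGS index_granularity = 8192;     -- ClickHouse default; smaller = finer skipping

-- Whole-partition cold-tier offload, per the post:
ALTER TABLE telemetry_events DETACH PARTITION '202501';
```

The `DoubleDelta` codec on the timestamp and the `ZSTD` level are illustrative choices, not recommendations from the post.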
Systems / concepts / patterns¶
Systems¶
- Redpanda — streaming broker; IoT source aggregator.
- Redpanda Connect — pipeline orchestrator; `kafka_franz` input + `snowflake_streaming` output + `sql_raw`/`sql_insert` for ClickHouse.
- Snowflake — one of two destination storage tiers; `snowflake_streaming` connector for Snowpipe Streaming ingest; `schema_evolution` trade-off; `MATCH_RECOGNIZE` for sessionization.
- ClickHouse — second destination storage tier; MergeTree time-partitioning; column-level codecs; `DETACH PARTITION` archival; native TTL.
- AWS S3 — named as ClickHouse's cold-tier backing store for hybrid storage.
Concepts (new)¶
- concepts/clickhouse-mergetree-partition-by-time — time-partitioned MergeTree idiom for time-series ingest.
- concepts/clickhouse-ttl-policy — native TTL deletion / movement of old rows.
- concepts/clickhouse-detached-partition-archival — physically detach whole partitions to external cold storage.
- concepts/clickhouse-index-granule-skipping — `ORDER BY`-key min-value storage per granule; `index_granularity` tunes skip granularity.
- concepts/clickhouse-append-only-tamper-resistance — no-row-update model + external immutable hash ledger for audit-log tamper resistance.
- concepts/snowflake-match-recognize-sessionization — pattern-matching session boundaries via `MATCH_RECOGNIZE`.
- concepts/hot-cold-tier-compression-codec-split — per-column compression-codec choice by access pattern (`LZ4` hot, `ZSTD` cold).
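The sessionization fragments quoted in the takeaways assemble into a complete Snowflake query roughly as follows. A reconstruction, not verbatim from the post: the `clicks` table name and the `ONE ROW PER MATCH` clause are assumed; the measures, pattern, and define clauses follow the quoted fragments (`b` is left undefined, so any row can open a session):

```sql
-- Sessionize clickstream rows: consecutive clicks from the same IP
-- with <= 10 s between them belong to one session.
SELECT ip, session_id, no_of_events, start_time, duration
FROM clicks
MATCH_RECOGNIZE (
    PARTITION BY ip
    ORDER BY click_timestamp
    MEASURES
        MATCH_NUMBER()                                 AS session_id,
        COUNT(*)                                       AS no_of_events,
        FIRST(click_timestamp)                         AS start_time,
        LAST(click_timestamp) - FIRST(click_timestamp) AS duration
    ONE ROW PER MATCH
    PATTERN (b s*)
    DEFINE
        s AS (s.click_timestamp - PREV(s.click_timestamp)) <= INTERVAL '10' SECOND
);
```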
Patterns (new)¶
- patterns/time-partitioned-mergetree-for-time-series — `ENGINE = MergeTree() PARTITION BY toYYYYMM(timestamp)` as the canonical time-series schema shape in ClickHouse.
- patterns/clickhouse-plus-snowflake-dual-storage-tier — simultaneous write to both storage systems via Redpanda Connect output multiplexing, each covering its trade-off rather than chained ETL.
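The dual-storage-tier pattern can be sketched as a Redpanda Connect config. A minimal sketch, assuming placeholder brokers, topics, credentials, and DSN (none are from the post); the `broker` fan-out and the generic `sql_insert` fields are standard Redpanda Connect component shapes:

```yaml
input:
  kafka_franz:
    seed_brokers: ["localhost:9092"]   # placeholder broker
    topics: ["telemetry"]              # placeholder topic
    consumer_group: "dual-sink"

output:
  broker:
    pattern: fan_out                   # simultaneous write to both tiers
    outputs:
      - snowflake_streaming:           # first-class Snowflake connector
          account: "MYACCOUNT"         # placeholder credentials below
          user: "INGEST_USER"
          role: "INGEST_ROLE"
          database: "IOT"
          schema: "PUBLIC"
          table: "TELEMETRY_EVENTS"
          private_key_file: "rsa_key.p8"
      - sql_insert:                    # generic SQL sink for ClickHouse
          driver: clickhouse
          dsn: "clickhouse://localhost:9000/default"   # placeholder DSN
          table: telemetry_events
          columns: [device_id, timestamp, metric, value]
          args_mapping: |
            root = [ this.device_id, this.timestamp, this.metric, this.value ]
```

The `fan_out` pattern writes every message to both outputs, matching the simultaneous-write framing rather than a chained ETL hop.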
Operational numbers / disclosures¶
| Knob | Recommendation (verbatim) |
|---|---|
| Snowpipe Streaming batch count (low-latency) | 500–1,000 records |
| Snowpipe Streaming batch count (bulk) | 10,000+ records |
| Snowpipe Streaming batch count (time-series) | "1,000 (at most)" |
| Snowpipe Streaming `byte_size` | 0 (disable — simpler config) |
| Snowflake documented file-size recommendation | 100–250 MB |
| Snowpipe Streaming `period` (real-time) | 10–30 s |
| Snowpipe Streaming `period` (less frequent) | 1–5 min |
| ClickHouse partition granularity (monthly) | `toYYYYMM(timestamp)` |
| ClickHouse partition granularity (daily) | `toDate(timestamp)` |
| Hot compression codec | LZ4 |
| Cold compression codec | ZSTD |
| Example session gap threshold | <= 10 seconds, same-IP clicks |
| Example retention TTL | `INTERVAL 12 MONTH DELETE` |
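Applied to the `snowflake_streaming` output, the Snowpipe Streaming numbers above map onto Redpanda Connect's standard batching-policy fields. A sketch using the post's time-series recommendations:

```yaml
# Batching policy for a low-latency time-series channel,
# per the post's guidance: count-based only, size-based off.
batching:
  count: 1000      # "1,000 (at most)" for time series
  byte_size: 0     # disable size-based batching
  period: "30s"    # 10-30 s for real-time dashboards/analytics
```

For less frequent updates the post extends `period` to one to five minutes; for bulk processing, `count` rises past 10,000.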
Caveats¶
- Vendor-tutorial voice, not a production retrospective. No fleet numbers, no customer references, no measured latency/throughput data for the ClickHouse leg.
- ClickHouse leg is a generic SQL sink (`sql_raw`/`sql_insert`) — not an optimised columnar-native ingest path. The post notes "there isn't a dedicated ClickHouse connector for Redpanda Connect yet". No numbers on what this costs at scale.
- `schema_evolution` off-vs-on trade-off is named but not quantified — no numbers on how much ingest overhead schema evolution adds, or how often schema validation catches real errors.
- `MATCH_RECOGNIZE` is specifically a Snowflake feature at this syntax — the post uses it to illustrate sessionization in Snowflake, but does not discuss cost (Snowflake warehouse compute) or alternatives (ClickHouse has its own `groupArray` + window-function shape).
- `index_granularity` tuning guidance is qualitative only — "Smaller granules mean finer skipping but more overhead" without quantifying the overhead or naming the default.
- Tamper-resistance framing is vendor-marketing-adjacent. The cryptographic-hash + external-immutable-store recommendation is correct but unelaborated; real implementations need Merkle trees or signed log receipts, which the post does not walk through.
- IoT framing mixes use cases: healthcare wearables, manufacturing predictive maintenance, smart-city traffic, logistics cold-chain. Architecture discussion is generic across them; no domain-specific latency/freshness requirements broken out.
- No competitor engagement: Kafka + Debezium + Snowflake Kafka Connector + Materialize + StarRocks + Apache Druid all unaddressed.
- Unsigned — Redpanda blog default attribution (no byline); marketing-driven publication cadence implied.
Source¶
- Original: https://www.redpanda.com/blog/stream-iot-snowflake-clickhouse
- Raw markdown:
raw/redpanda/2025-12-09-streaming-iot-and-event-data-into-snowflake-and-clickhouse-eedf22f4.md