REDPANDA 2025-12-09

Redpanda — Streaming IoT and event data into Snowflake and ClickHouse

Summary

Unsigned Redpanda Blog how-to / vendor-tutorial post (2025-12-09) framing a reference pipeline for IoT + event data streaming: Redpanda → Redpanda Connect → both Snowflake (short-term real-time) and ClickHouse (long-term columnar archive) simultaneously. Product-marketing voice — heavy on Redpanda promotion and how-to config examples — but with a substantial architectural core:

  • the inverted storage-layer trade-off (Snowflake for compliance-governed short-lived events, ClickHouse for long-term columnar retention);
  • the time-partitioned MergeTree idiom for time-series data;
  • DETACH PARTITION archival as a cold-tier offload mechanism;
  • native TTL policies for automatic deletion/movement of old rows;
  • LZ4-on-hot / ZSTD-on-cold compression tiering via column-level CODEC();
  • MergeTree granule skipping via per-granule storage of the ORDER BY key's minimum value;
  • Snowflake MATCH_RECOGNIZE pattern-based sessionization;
  • the absence of a dedicated ClickHouse output connector in Redpanda Connect (forcing the use of the generic sql_raw / sql_insert components).

Also discloses specific Snowpipe Streaming batching recommendations: 500–1,000 records for low-latency time series; 10,000+ for bulk; byte_size: 0 to disable size-based batching; period 10–30 s for real-time dashboards.

Key takeaways

  1. Inverted storage-tier trade-off for compliance-sensitive IoT workloads: the default pairing (Snowflake = long-term archive, ClickHouse = real-time) "may be better off reversed" for compliance-sensitive workloads. Snowflake's "enterprise-level data governance makes it well-suited for streaming access logs and financial triggers, while ClickHouse's compression capabilities are great for long-term storage and analysis." The wiki canonicalises this as the pattern: simultaneous write to both stores, each chosen for its trade-off. (Source: this ingest)

  2. ClickHouse PARTITION BY toYYYYMM(timestamp) is the canonical time-series idiom. Verbatim: "PARTITION BY toYYYYMM(timestamp) or toDate(timestamp) will allow you to segment data monthly or daily". Enables fast partition pruning + whole-partition archival via ALTER TABLE telemetry_events DETACH PARTITION '202501' which moves partitions "physically detached from the active dataset and moved to external storage (like S3 or NAS)". (Source: this ingest)
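
The partition-plus-detach idiom can be sketched as follows; telemetry_events and the partition id come from the post, while the column list is hypothetical.

```sql
-- Monthly partitions: each calendar month becomes an independently
-- manageable set of parts, enabling pruning and whole-partition ops.
CREATE TABLE telemetry_events
(
    device_id String,
    timestamp DateTime,
    metric    Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (device_id, timestamp);

-- Cold-tier offload: the partition is physically detached from the
-- active dataset and its files can then be moved to S3 or NAS.
ALTER TABLE telemetry_events DETACH PARTITION '202501';
```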

  3. Native TTL policies automate deletion/movement — canonical statement: ALTER TABLE telemetry_events MODIFY TTL timestamp + INTERVAL 12 MONTH DELETE;. TTL composes with partition-level archival — whole partitions can either be detached manually to cold storage or dropped automatically by TTL after a retention window. (Source: this ingest)
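
The DELETE form above is the post's verbatim statement; as a hedged sketch of the "movement" side, ClickHouse TTL clauses can also relocate rows to a storage-policy volume (the 'cold' volume name here is hypothetical):

```sql
-- Move rows to an S3-backed 'cold' volume after 1 month,
-- then delete them entirely after 12 months.
ALTER TABLE telemetry_events
    MODIFY TTL
        timestamp + INTERVAL 1 MONTH TO VOLUME 'cold',
        timestamp + INTERVAL 12 MONTH DELETE;
```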

  4. Column-level CODEC() enables compression tiering: hot data under a lightweight codec (LZ4) on local SSD; cold data under aggressive codec (ZSTD) on S3-backed storage. "Keep in mind that higher compression levels reduce storage footprint but increase data access latency. For this reason, they're best reserved for long-term archival tiers where data is infrequently queried." Canonicalised as concepts/hot-cold-tier-compression-codec-split. (Source: this ingest)
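
A minimal sketch of the codec split, assuming a hypothetical table; the ZSTD level is illustrative (higher levels shrink storage but raise access latency, per the post's own caveat):

```sql
-- Hot, frequently-filtered columns under lightweight LZ4;
-- bulky archival payload under aggressive ZSTD.
CREATE TABLE telemetry_events_tiered
(
    timestamp DateTime CODEC(LZ4),
    device_id String   CODEC(LZ4),
    payload   String   CODEC(ZSTD(9))
)
ENGINE = MergeTree
ORDER BY (device_id, timestamp);
```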

  5. MergeTree indexing is implicit min-value-per-granule skipping on ORDER BY key: "ClickHouse stores data in parts, each containing granules (blocks of rows). For each granule, it stores the minimum value of the ORDERED BY key. When you execute a query with a filter, ClickHouse uses the index to skip granules that don't match. This is called MergeTree indexing, and it's why ClickHouse can query across months or years of telemetry, logs, or CI/CD events in milliseconds." Controlled by index_granularity — smaller granules = finer skipping but more index overhead. Canonicalised as concepts/clickhouse-index-granule-skipping. (Source: this ingest)
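
The skipping behaviour hangs on one knob; a hedged sketch with hypothetical table and columns (8192 is ClickHouse's default; 1024 is shown only to illustrate finer skipping at the cost of a larger index):

```sql
-- For each granule, ClickHouse records the minimum ORDER BY key value
-- and uses it to skip granules whose range cannot match the filter.
CREATE TABLE ci_events
(
    timestamp DateTime,
    pipeline  String
)
ENGINE = MergeTree
ORDER BY timestamp
SETTINGS index_granularity = 1024;
```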

  6. Append-only storage model provides tamper resistance for audit logs: "Since ClickHouse doesn't support row-level updates or deletes (except via TTL or mutations), it's naturally resistant to silent data changes. For stronger guarantees, you can implement cryptographic hashing on inserted data, store those hashes in a separate immutable store, and monitor for divergence." Canonicalised as concepts/clickhouse-append-only-tamper-resistance. (Source: this ingest)
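
A hypothetical sketch of the hashing half of that recommendation, using ClickHouse's built-in SHA256(); the separate immutable store and divergence monitor are out of scope here:

```sql
-- Derive a per-row digest at insert time; mirror row_hash values to a
-- separate immutable store and periodically compare for divergence.
CREATE TABLE audit_log
(
    timestamp DateTime,
    actor     String,
    action    String,
    row_hash  String MATERIALIZED
        hex(SHA256(concat(actor, action, toString(timestamp))))
)
ENGINE = MergeTree
ORDER BY timestamp;
```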

  7. Snowpipe Streaming batching guidance (specific numbers): "Snowpipe streams new data in a set of batches (as opposed to individually) to optimize resource usage. Smaller batches (500 to 1,000 records) are ideal for low-latency streaming, while larger batches (over 10,000 records) are best suited for bulk processing. If you're using Snowflake for time series data, it's best to set this field to 1,000 (at most)." On byte_size: "While Snowflake recommends 100–250 MB for data files, it's best to set this property to 0 to turn off size-based batching to simplify your configuration." On period: "The period should range between ten to thirty seconds for real-time ingestion for dashboards and analytics. You can extend it to one to five minutes for less frequent updates." Extends canonicalisation on concepts/snowpipe-streaming-channel. (Source: this ingest)
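
The numbers map onto the connector's batching policy roughly as below; this is a hedged sketch, and every connection field is a placeholder rather than a value from the post:

```yaml
output:
  snowflake_streaming:
    account: MYACCOUNT        # placeholder
    user: INGEST_USER         # placeholder
    database: TELEMETRY       # placeholder
    schema: PUBLIC
    table: telemetry_events
    batching:
      count: 1000             # "1,000 (at most)" for time-series
      byte_size: 0            # 0 disables size-based batching
      period: 10s             # 10-30 s for real-time dashboards
```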

  8. Snowflake MATCH_RECOGNIZE is the canonical primitive for pattern-based sessionization: clicks with <= 10s inter-event gap from the same IP form a session. Verbatim worked example with PARTITION BY ip / ORDER BY click_timestamp / MEASURES MATCH_NUMBER() AS session_id, COUNT(*) AS no_of_events, FIRST(click_timestamp) AS start_time, LAST - FIRST AS duration / PATTERN (b s*) / DEFINE s AS (s.click_timestamp - PREV(s.click_timestamp)) <= INTERVAL '10' SECOND. Canonicalised as concepts/snowflake-match-recognize-sessionization. (Source: this ingest)
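
Assembled from the post's verbatim clauses into one query shape; the clicks table name is assumed, the DEFINE clause is kept verbatim from the post, and duration is written with DATEDIFF where the post abbreviates it as LAST - FIRST:

```sql
SELECT *
FROM clicks
MATCH_RECOGNIZE (
    PARTITION BY ip
    ORDER BY click_timestamp
    MEASURES
        MATCH_NUMBER()         AS session_id,
        COUNT(*)               AS no_of_events,
        FIRST(click_timestamp) AS start_time,
        DATEDIFF('second', FIRST(click_timestamp),
                 LAST(click_timestamp)) AS duration
    ONE ROW PER MATCH
    PATTERN (b s*)   -- b is left undefined, so it matches any row
    DEFINE
        s AS (s.click_timestamp - PREV(s.click_timestamp))
             <= INTERVAL '10' SECOND
);
```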

  9. Snowflake schema_evolution has a performance trade-off: "if your data's structure is unlikely to change, turning this feature off may serve you better. It forces schema validation, helping you catch errors or inconsistencies in the incoming data before they are stored. Because continuous schema changes can add overhead during ingestion or query processing (especially with large-scale data pipelines), schema evolution may not be ideal in time-series contexts, where performance and retrieval speeds are critical." Contradicts the naive "always turn on schema evolution" framing. (Source: this ingest; see concepts/schema-evolution)

  10. No dedicated ClickHouse output connector in Redpanda Connect — use sql_raw / sql_insert: "While there isn't a dedicated ClickHouse connector for Redpanda Connect yet, the sql_raw and sql_insert components allow you to stream execute commands or stream data from Redpanda into ClickHouse. They're available as input, processor, and output types, so you've got flexibility in how you wire things up." Contrasts with the first-class snowflake_streaming Snowflake connector. Means the ClickHouse leg of the pipeline gets the generic-SQL-sink hop cost rather than a protocol-native path. (Source: this ingest)
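
A hedged sketch of the generic-SQL leg via sql_insert; dsn, table, and columns are placeholders, and the driver name plus Bloblang args_mapping follow the component's general shape rather than anything shown in the post:

```yaml
output:
  sql_insert:
    driver: clickhouse
    dsn: clickhouse://clickhouse:9000/default   # placeholder
    table: telemetry_events
    columns: [device_id, timestamp, metric]
    args_mapping: root = [this.device_id, this.timestamp, this.metric]
```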

Systems / concepts / patterns

Systems

  • Redpanda — streaming broker; IoT source aggregator.
  • Redpanda Connect — pipeline orchestrator; kafka_franz input + snowflake_streaming output + sql_raw/sql_insert for ClickHouse.
  • Snowflake — one of two destination storage tiers; snowflake_streaming connector for Snowpipe Streaming ingest; schema_evolution trade-off; MATCH_RECOGNIZE for sessionization.
  • ClickHouse — second destination storage tier; MergeTree time-partitioning; column-level codecs; DETACH PARTITION archival; native TTL.
  • AWS S3 — named as ClickHouse's cold-tier backing store for hybrid storage.

Concepts (new)

  • concepts/hot-cold-tier-compression-codec-split
  • concepts/clickhouse-index-granule-skipping
  • concepts/clickhouse-append-only-tamper-resistance
  • concepts/snowflake-match-recognize-sessionization

Patterns (new)

  • simultaneous write to both stores, each chosen for its trade-off

Operational numbers / disclosures

Knob                                            Recommendation (verbatim)
Snowpipe Streaming batch count (low-latency)    500–1,000 records
Snowpipe Streaming batch count (bulk)           10,000+ records
Snowpipe Streaming batch count (time-series)    "1,000 (at most)"
Snowpipe Streaming byte_size                    0 (disable — simpler config)
Snowflake documented file-size recommendation   100–250 MB
Snowpipe Streaming period (real-time)           10–30 s
Snowpipe Streaming period (less frequent)       1–5 min
ClickHouse partition granularity (monthly)      toYYYYMM(timestamp)
ClickHouse partition granularity (daily)        toDate(timestamp)
Hot compression codec                           LZ4
Cold compression codec                          ZSTD
Example session gap threshold                   <= 10 s between same-IP clicks
Example retention TTL                           INTERVAL 12 MONTH DELETE

Caveats

  • Vendor-tutorial voice, not a production retrospective. No fleet numbers, no customer references, no measured latency/throughput data for the ClickHouse leg.
  • ClickHouse leg is generic-SQL-sink (sql_raw / sql_insert) — not an optimised columnar-native ingest path. The post notes "there isn't a dedicated ClickHouse connector for Redpanda Connect yet". No numbers on what this costs at scale.
  • schema_evolution off-vs-on trade-off is named but not quantified — no numbers on how much ingest overhead schema evolution adds, or how often schema validation catches real errors.
  • MATCH_RECOGNIZE is specifically a Snowflake feature at this syntax — the post uses it to illustrate sessionization in Snowflake, but does not discuss cost (Snowflake warehouse compute) or alternatives (ClickHouse has its own groupArray + window-function shape).
  • index_granularity tuning guidance is qualitative only: "Smaller granules mean finer skipping but more overhead", without quantifying the overhead or naming the default.
  • Tamper-resistance framing is vendor-marketing-adjacent. The cryptographic-hash + external-immutable-store recommendation is correct but unelaborated; real implementations need Merkle trees or signed log receipts, which the post does not walk.
  • IoT framing mixes use cases: healthcare wearables, manufacturing predictive maintenance, smart-city traffic, logistics cold-chain. Architecture discussion is generic across them; no domain-specific latency/freshness requirements broken out.
  • No competitor engagement: Kafka + Debezium + Snowflake Kafka Connector + Materialize + StarRocks + Apache Druid all unaddressed.
  • Unsigned — Redpanda blog default attribution (no byline); marketing-driven publication cadence implied.

Source

Redpanda Blog, 2025-12-09 (unsigned).