Skip to content

SYSTEM Cited by 7 sources

Snowflake

Snowflake is a cloud OLAP data warehouse. Its defining property for system-design purposes is compute–storage separation: table data lives in the cloud object layer and query compute runs in ephemeral per-tenant clusters ("virtual warehouses") that can be sized and scaled independently. That separation is what lets workloads aggregate billions of rows of source data in minutes where an equivalent OLTP query plan would need hours.

Why it shows up in architectures

  • End-to-end recompute pipelines. With elastic compute, redoing the whole month's aggregation on each run is cheap enough to replace live-maintained incremental counters. Canva's Creators-payment pipeline uses Snowflake for exactly this: billions of usage events per month, aggregated in a few minutes per run. (Source: sources/2024-04-29-canva-scaling-to-count-billions; patterns/end-to-end-recompute)
  • ELT substrate. Snowflake is a common target for DBT model DAGs — transformations expressed as SQL, intermediate stages as views, the warehouse's compute doing the heavy lift. See systems/dbt, concepts/elt-vs-etl.
  • Primary data-warehousing tool for a company. Canva chose Snowflake for most regions because it was already their primary warehouse with "reliable infrastructure support" — architectural choice driven by existing data-platform operability.

Usage notes from sources

  • Aggregation runtime at Canva scale: billions of records in a few minutes, "several orders of magnitude faster" than the MySQL round-trip approach it replaced.
  • Used with an upstream replication pipeline that lands source data into Snowflake; the E + L are provided by Canva's data platform, Snowflake does the T.
  • Unload is a first-class piece of the story: results must be exported back into OLTP-friendly stores via S3 + SQS for serving; Snowflake is not a low-latency serving engine. See patterns/warehouse-unload-bridge.

Caveats

  • Not a serving tier. Query latency is seconds at best; bolt a warehouse-unload bridge in front of anything user-facing.
  • DBT codebase runs on its own release cadence as a separate deployable — schema evolution in upstream source tables couples to model releases. (See systems/dbt.)
  • Observability tooling for Snowflake + ELT is different from the services that feed / consume it; integration cost is real.

Snowpipe Streaming batching guidance (2025-12)

The 2025-12-09 Redpanda IoT-pipeline post discloses specific batching recommendations for the snowflake_streaming output connector that compose with Snowpipe Streaming channels:

Parameter Recommendation
count (low-latency) 500–1,000 records
count (bulk) 10,000+ records
count (time-series) "1,000 (at most)"
byte_size 0 (disable — simpler config)
Snowflake documented file-size target 100–250 MB
period (real-time dashboards) 10–30 s
period (less frequent) 1–5 min

"Smaller batches (500 to 1,000 records) are ideal for low- latency streaming, while larger batches (over 10,000 records) are best suited for bulk processing. ... byte_size allows you to define the batch size. While Snowflake recommends 100–250 MB for data files, it's best to set this property to 0 to turn off size-based batching to simplify your configuration." (Source: sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse)

Also notable: schema evolution has a performance cost, not just operational convenience. "If your data's structure is unlikely to change, turning this feature off may serve you better. ... schema evolution may not be ideal in time- series contexts, where performance and retrieval speeds are critical." Cross-ref: concepts/schema-evolution.

MATCH_RECOGNIZE for sessionization

Snowflake supports row-pattern matching via the ANSI SQL:2016 MATCH_RECOGNIZE clause — partitioning a sorted event stream into matching sub-sequences and emitting one row per match with aggregate measures. The canonical use case is sessionization (group click events into sessions by inter-event gap threshold). See concepts/snowflake-match-recognize-sessionization for the mechanism and canonical worked example.

Composes with Snowpipe Streaming: events land via channels, then sessionise in SQL without leaving Snowflake.

Seen in

  • sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhousecanonical wiki disclosure of Snowpipe Streaming batch tuning parameters and MATCH_RECOGNIZE sessionization. Redpanda IoT-pipeline tutorial (2025-12-09) names specific count / byte_size / period recommendations for the snowflake_streaming connector, frames the Snowflake leg of the dual-tier ClickHouse + Snowflake pattern, and discloses a MATCH_RECOGNIZE worked example for ≤ 10-second same-IP click sessionization. Also surfaces the schema_evolution performance trade-off ("turning this feature off may serve you better" for stable schemas).
  • sources/2025-10-02-redpanda-real-time-analytics-redpanda-snowflake-streamingSnowpipe Streaming as the low-latency row-level ingest surface benchmarked at 14.5 GB/s into a single table (3.8 billion 1 KB AVRO messages via Redpanda Connect's snowflake_streaming output connector). Exceeds Snowflake's documented 10 GB/s per-table ceiling by 45% via channel-count scaling (channel_prefix × max_in_flight, hard limit 10,000 channels per table). 86% of the 7.49 s P99 end-to-end latency lives in the Snowflake upload/register/commit steps — the analytical-sink commit path is the dominant latency contributor, not broker read or transport. Commit- path parallelism tuned via build_paralellism (set to 40 on 48-core nodes). Key operational finding: Snowpipe Streaming is a low-latency analytical-sink ingest primitive with per-channel exactly-once offset tokens and schema-evolution support; latency is seconds-level (P50 ~2 s, P99 ~7 s) rather than milliseconds, reaffirming the "not a serving tier" caveat.
  • sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-availableSnowflake Open Catalog (based on Apache Polaris) canonicalised as a named Iceberg REST catalog endpoint for streaming- broker-written Iceberg tables. Redpanda 25.1 Iceberg Topics GA release names Snowflake Open Catalog as one of three canonical catalog implementations (with Databricks Unity and AWS Glue) that brokers register tables against via OIDC+TLS sync. The framing positions Snowflake as not just a downstream analytics target but also a catalog in the federated-catalog sense — the same table identity is the control-plane anchor whether the reader is Snowflake, Databricks, ClickHouse, or any other Iceberg-compatible engine.
  • sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda — Snowflake named as an Iceberg-compatible query engine reading Redpanda Iceberg topics as tables, and Snowflake Polaris named as a supported Iceberg REST catalog for topic-to-table registration. The pedagogy ingest that the 2025-04-07 GA post elaborates.
  • sources/2025-07-17-instacart-introducing-pixel-instacarts-unified-image-generation-platform — Instacart PIXEL uses Snowflake as the addressable-metadata layer for generated images: images land in S3; URLs are stored in Snowflake and retrievable by unique ID. Non-serving workload — Snowflake is used as an ML-platform metadata store, not an analytical query target.
  • sources/2024-04-29-canva-scaling-to-count-billions — Canva Creators-payment pipeline: Snowflake + DBT do dedup + aggregation in SQL, over DynamoDB-sourced raw events extracted to typed columns; outer-join overwrite enables end-to-end recompute.
  • sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws — Snowflake as the structured-data sink for an LLM-grounded document-intelligence pipeline. AArete's Doczy.ai writes structured contract-extraction output (from a Bedrock-LLM extraction stage upstream) into Snowflake at ~250 000 contracts/week peak; Snowflake powers the "intelligent dashboards with actionable insights and visualizations" and feeds downstream automation systems (claims-system reimbursement-term configuration, vendor-invoice verification). Sibling to Instacart PIXEL's addressable-metadata-layer face — both use Snowflake as a centralised structured repository for ML-pipeline output that downstream consumers query, rather than as the live analytical warehouse end-users hit directly.
Last updated · 542 distilled / 1,571 read