SYSTEM Cited by 7 sources
Snowflake¶
Snowflake is a cloud OLAP data warehouse. Its defining property for system-design purposes is compute–storage separation: table data lives in the cloud object layer and query compute runs in ephemeral per-tenant clusters ("virtual warehouses") that can be sized and scaled independently. That separation is what lets workloads aggregate billions of rows of source data in minutes where an equivalent OLTP query plan would need hours.
Why it shows up in architectures¶
- End-to-end recompute pipelines. With elastic compute, redoing the whole month's aggregation on each run is cheap enough to replace live-maintained incremental counters. Canva's Creators-payment pipeline uses Snowflake for exactly this: billions of usage events per month, aggregated in a few minutes per run. (Source: sources/2024-04-29-canva-scaling-to-count-billions; patterns/end-to-end-recompute)
- ELT substrate. Snowflake is a common target for DBT model DAGs — transformations expressed as SQL, intermediate stages as views, the warehouse's compute doing the heavy lift. See systems/dbt, concepts/elt-vs-etl.
- Primary data-warehousing tool for a company. Canva chose Snowflake for most regions because it was already their primary warehouse with "reliable infrastructure support" — architectural choice driven by existing data-platform operability.
Usage notes from sources¶
- Aggregation runtime at Canva scale: billions of records in a few minutes, "several orders of magnitude faster" than the MySQL round-trip approach it replaced.
- Used with an upstream replication pipeline that lands source data into Snowflake; the E + L are provided by Canva's data platform, Snowflake does the T.
- Unload is a first-class piece of the story: results must be exported back into OLTP-friendly stores via S3 + SQS for serving; Snowflake is not a low-latency serving engine. See patterns/warehouse-unload-bridge.
Caveats¶
- Not a serving tier. Query latency is seconds at best; bolt a warehouse-unload bridge in front of anything user-facing.
- DBT codebase runs on its own release cadence as a separate deployable — schema evolution in upstream source tables couples to model releases. (See systems/dbt.)
- Observability tooling for Snowflake + ELT is different from the services that feed / consume it; integration cost is real.
Snowpipe Streaming batching guidance (2025-12)¶
The 2025-12-09 Redpanda IoT-pipeline post discloses specific
batching recommendations for the snowflake_streaming output
connector that compose with
Snowpipe Streaming
channels:
| Parameter | Recommendation |
|---|---|
count (low-latency) |
500–1,000 records |
count (bulk) |
10,000+ records |
count (time-series) |
"1,000 (at most)" |
byte_size |
0 (disable — simpler config) |
| Snowflake documented file-size target | 100–250 MB |
period (real-time dashboards) |
10–30 s |
period (less frequent) |
1–5 min |
"Smaller batches (500 to 1,000 records) are ideal for low-
latency streaming, while larger batches (over 10,000 records)
are best suited for bulk processing. ... byte_size allows
you to define the batch size. While Snowflake recommends
100–250 MB for data files, it's best to set this property to
0 to turn off size-based batching to simplify your
configuration."
(Source:
sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse)
Also notable: schema evolution has a performance cost, not just operational convenience. "If your data's structure is unlikely to change, turning this feature off may serve you better. ... schema evolution may not be ideal in time- series contexts, where performance and retrieval speeds are critical." Cross-ref: concepts/schema-evolution.
MATCH_RECOGNIZE for sessionization¶
Snowflake supports row-pattern matching via the ANSI
SQL:2016 MATCH_RECOGNIZE clause — partitioning a sorted
event stream into matching sub-sequences and emitting one
row per match with aggregate measures. The canonical use
case is sessionization (group click events into sessions
by inter-event gap threshold). See
concepts/snowflake-match-recognize-sessionization for
the mechanism and canonical worked example.
Composes with Snowpipe Streaming: events land via channels, then sessionise in SQL without leaving Snowflake.
Seen in¶
- sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse
— canonical wiki disclosure of Snowpipe Streaming batch
tuning parameters and
MATCH_RECOGNIZEsessionization. Redpanda IoT-pipeline tutorial (2025-12-09) names specificcount/byte_size/periodrecommendations for thesnowflake_streamingconnector, frames the Snowflake leg of the dual-tier ClickHouse + Snowflake pattern, and discloses aMATCH_RECOGNIZEworked example for ≤ 10-second same-IP click sessionization. Also surfaces theschema_evolutionperformance trade-off ("turning this feature off may serve you better" for stable schemas). - sources/2025-10-02-redpanda-real-time-analytics-redpanda-snowflake-streaming
— Snowpipe Streaming as the low-latency row-level ingest
surface benchmarked at 14.5 GB/s into a single table
(3.8 billion 1 KB AVRO messages via Redpanda Connect's
snowflake_streamingoutput connector). Exceeds Snowflake's documented 10 GB/s per-table ceiling by 45% via channel-count scaling (channel_prefix×max_in_flight, hard limit 10,000 channels per table). 86% of the 7.49 s P99 end-to-end latency lives in the Snowflake upload/register/commit steps — the analytical-sink commit path is the dominant latency contributor, not broker read or transport. Commit- path parallelism tuned viabuild_paralellism(set to 40 on 48-core nodes). Key operational finding: Snowpipe Streaming is a low-latency analytical-sink ingest primitive with per-channel exactly-once offset tokens and schema-evolution support; latency is seconds-level (P50 ~2 s, P99 ~7 s) rather than milliseconds, reaffirming the "not a serving tier" caveat. - sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available — Snowflake Open Catalog (based on Apache Polaris) canonicalised as a named Iceberg REST catalog endpoint for streaming- broker-written Iceberg tables. Redpanda 25.1 Iceberg Topics GA release names Snowflake Open Catalog as one of three canonical catalog implementations (with Databricks Unity and AWS Glue) that brokers register tables against via OIDC+TLS sync. The framing positions Snowflake as not just a downstream analytics target but also a catalog in the federated-catalog sense — the same table identity is the control-plane anchor whether the reader is Snowflake, Databricks, ClickHouse, or any other Iceberg-compatible engine.
- sources/2025-01-21-redpanda-implementing-the-medallion-architecture-with-redpanda — Snowflake named as an Iceberg-compatible query engine reading Redpanda Iceberg topics as tables, and Snowflake Polaris named as a supported Iceberg REST catalog for topic-to-table registration. The pedagogy ingest that the 2025-04-07 GA post elaborates.
- sources/2025-07-17-instacart-introducing-pixel-instacarts-unified-image-generation-platform — Instacart PIXEL uses Snowflake as the addressable-metadata layer for generated images: images land in S3; URLs are stored in Snowflake and retrievable by unique ID. Non-serving workload — Snowflake is used as an ML-platform metadata store, not an analytical query target.
- sources/2024-04-29-canva-scaling-to-count-billions — Canva Creators-payment pipeline: Snowflake + DBT do dedup + aggregation in SQL, over DynamoDB-sourced raw events extracted to typed columns; outer-join overwrite enables end-to-end recompute.
- sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws — Snowflake as the structured-data sink for an LLM-grounded document-intelligence pipeline. AArete's Doczy.ai writes structured contract-extraction output (from a Bedrock-LLM extraction stage upstream) into Snowflake at ~250 000 contracts/week peak; Snowflake powers the "intelligent dashboards with actionable insights and visualizations" and feeds downstream automation systems (claims-system reimbursement-term configuration, vendor-invoice verification). Sibling to Instacart PIXEL's addressable-metadata-layer face — both use Snowflake as a centralised structured repository for ML-pipeline output that downstream consumers query, rather than as the live analytical warehouse end-users hit directly.
Related¶
- systems/dbt, systems/aws-s3
- systems/clickhouse — sibling storage tier in the dual-tier streaming pattern.
- concepts/oltp-vs-olap, concepts/elt-vs-etl, concepts/compute-storage-separation
- concepts/snowpipe-streaming-channel — low-latency row-level ingest surface.
- concepts/schema-evolution — performance trade-off named explicitly on time-series workloads.
- concepts/snowflake-match-recognize-sessionization — SQL primitive for session-boundary detection.
- patterns/end-to-end-recompute, patterns/warehouse-unload-bridge
- patterns/clickhouse-plus-snowflake-dual-storage-tier — the dual-tier pattern.
- patterns/managed-ai-document-intelligence-pipeline-on-aws — Snowflake as the analytical sink in this pipeline shape.
- systems/doczy-ai