
CONCEPT

Snowflake MATCH_RECOGNIZE sessionization

Definition

MATCH_RECOGNIZE is Snowflake's SQL clause for row-pattern matching over an ordered event stream. Given a partitioning key (e.g. ip), an ordering key (e.g. click_timestamp), and a regex-like pattern over row categories, Snowflake identifies matching row sequences and emits one row per match with aggregate measures.

The canonical use case is sessionization — grouping consecutive events into sessions based on an inter-event gap threshold (e.g. "clicks are the same session if the gap is ≤ 10 seconds and they come from the same IP").

Verbatim example from the 2025-12-09 Redpanda IoT pipeline post:

SELECT
    ip,
    session_id,
    start_time,
    no_of_events,
    duration
FROM clicks
MATCH_RECOGNIZE (
    PARTITION BY ip
    ORDER BY click_timestamp
    MEASURES
        MATCH_NUMBER() AS session_id,
        COUNT(*) AS no_of_events,
        FIRST(click_timestamp) AS start_time,
        LAST(click_timestamp) - FIRST(click_timestamp) AS duration
    ONE ROW PER MATCH
    PATTERN (b s*)
    DEFINE
        s AS (s.click_timestamp - PREV(s.click_timestamp)) <= INTERVAL '10' SECOND
);

Framing: "you're writing an implementation to track clicks on a website. Each click triggers a Redpanda data stream or message exchange that's then sent and stored in a Snowflake table. Your goal is to track how stimulating the website is, so you use sessionization to group consecutive clicks. In this scenario, clicks are considered a part of the same session if the time difference between them is less than or equal to ten seconds and they're performed from the same IP address." (Source: sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse)
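The DEFINE in the verbatim example compares a timestamp subtraction against an INTERVAL literal. A variant that sidesteps interval arithmetic by expressing the gap test with DATEDIFF — a sketch, not from the source post, assuming the same table and columns:

```sql
-- Drop-in replacement for the DEFINE clause above: the 10-second gap test
-- expressed with DATEDIFF rather than timestamp subtraction.
DEFINE
    s AS DATEDIFF(second, PREV(s.click_timestamp), s.click_timestamp) <= 10
```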

Anatomy

  • PARTITION BY — scope the pattern match within each partition key. In the example, sessions are per-IP.
  • ORDER BY — the ordering over which the pattern is evaluated. Usually a timestamp.
  • PATTERN — a regex-like expression over symbolic row categories defined in DEFINE. b s* = "one boundary row, then zero-or-more same-session rows".
  • DEFINE — defines the categories. s = "this row is within 10 seconds of the previous row". b is left undefined, and an undefined symbol matches any row — it simply anchors the start of each session.
  • MEASURES — what to emit per match. MATCH_NUMBER() gives a monotonic session id; aggregates over the matched rows give session stats.
  • ONE ROW PER MATCH — one output row per session, not one per input event. Alternative: ALL ROWS PER MATCH.
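As a sketch of the ALL ROWS PER MATCH alternative (same hypothetical clicks table as above), the query can emit every input row tagged with its session id and the pattern symbol it matched, via CLASSIFIER():

```sql
SELECT ip, click_timestamp, session_id, role
FROM clicks
MATCH_RECOGNIZE (
    PARTITION BY ip
    ORDER BY click_timestamp
    MEASURES
        MATCH_NUMBER() AS session_id,  -- which session this row belongs to
        CLASSIFIER()   AS role         -- 'B' for the boundary row, 'S' otherwise
    ALL ROWS PER MATCH
    PATTERN (b s*)
    DEFINE
        s AS (s.click_timestamp - PREV(s.click_timestamp)) <= INTERVAL '10' SECOND
);
```

This produces one output row per click rather than one per session, which is useful when downstream queries need per-event session labels.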

Why it matters for streaming analytics

  • Sessionization is a common analytics primitive and historically required either stream processors (Flink, Kafka Streams) or application-level logic over window functions.
  • MATCH_RECOGNIZE keeps it in SQL — the same engine that stores the event data also sessionizes it. No separate stream-processing layer needed if Snowflake is already the destination.
  • Composes with Snowpipe Streaming — events land in Snowflake via Snowpipe Streaming channels and are sessionized in SQL without leaving the warehouse.
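For comparison, the window-function formulation that MATCH_RECOGNIZE replaces — a sketch against the same hypothetical clicks table: flag each click that opens a new session, then running-sum the flags into a session id.

```sql
WITH gaps AS (
    SELECT
        ip,
        click_timestamp,
        CASE
            WHEN DATEDIFF(
                     second,
                     LAG(click_timestamp)
                         OVER (PARTITION BY ip ORDER BY click_timestamp),
                     click_timestamp
                 ) <= 10
            THEN 0
            ELSE 1  -- NULL gap (first click per IP) also starts a session
        END AS new_session
    FROM clicks
)
SELECT
    ip,
    click_timestamp,
    -- running sum of new-session flags = session id within the IP
    SUM(new_session) OVER (PARTITION BY ip ORDER BY click_timestamp) AS session_id
FROM gaps;
```

Two window passes plus a CTE to express what PATTERN (b s*) says in one line — which is the ergonomic argument for MATCH_RECOGNIZE.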

Caveats

  • Snowflake-specific syntax. MATCH_RECOGNIZE is in ANSI SQL:2016, but adoption is uneven. Oracle supports it; PostgreSQL doesn't; ClickHouse has its own groupArray + window-function approach for similar problems.
  • Warehouse compute cost. MATCH_RECOGNIZE queries run on a virtual warehouse and consume credits like any other query; sessionization isn't "free" the way it can be inside an already-running stream processor.
  • Real-time gap. Snowflake is an OLAP warehouse — sessionization via MATCH_RECOGNIZE is seconds-to-minutes latency via Snowpipe Streaming + query, not millisecond real-time. Use a stream processor if you need sub-second session closure.
  • Pattern complexity limits. Simple gap-based sessionization is easy; multi-state business-logic sessions (e.g. "cart → checkout → payment within 5 minutes") push the pattern-DEFINE complexity fast.
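A hedged sketch of what a multi-state session pattern might look like — the table funnel_events, the column event_type, and the symbol names are all hypothetical, and the cross-symbol time constraint in DEFINE should be checked against Snowflake's MATCH_RECOGNIZE docs before use:

```sql
SELECT ip, funnel_id, cart_time, paid_time
FROM funnel_events
MATCH_RECOGNIZE (
    PARTITION BY ip
    ORDER BY event_timestamp
    MEASURES
        MATCH_NUMBER()         AS funnel_id,
        FIRST(event_timestamp) AS cart_time,  -- first matched row: the cart event
        LAST(event_timestamp)  AS paid_time   -- last matched row: the payment
    ONE ROW PER MATCH
    -- cart, then any number of other events, then checkout, then payment
    PATTERN (cart misc* checkout pay)
    DEFINE
        cart     AS event_type = 'cart',
        misc     AS event_type NOT IN ('checkout', 'payment'),
        checkout AS event_type = 'checkout',
        pay      AS event_type = 'payment'
                    AND DATEDIFF(minute, FIRST(event_timestamp),
                                 pay.event_timestamp) <= 5
);
```

Each added business state means another symbol, another DEFINE predicate, and more cross-symbol constraints — this is where the "pattern-DEFINE complexity" grows fast.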
