CONCEPT Cited by 1 source

Parallel snapshot (intra-table CDC)

Definition

Parallel snapshot is a CDC-connector optimisation that splits a single large table or collection into chunks and reads those chunks concurrently during the snapshot phase of a snapshot-plus-catch-up pipeline. Parallelism is intra-table (chunks of one table read in parallel), not inter-table (one-stream-per-table).

Canonical framing verbatim from the 2025-03-18 Redpanda post:

"Parallel snapshots allow multiple tables (PostgreSQL) or collections (MongoDB) to be read concurrently, radically reducing the time needed to complete the snapshot phase. Now here's where we're different: Redpanda's PostgreSQL and MongoDB CDC connectors can also parallelise reads for large tables. That means tables or collections with millions of records can be split into smaller chunks and read in parallel."

Why it matters

The snapshot phase of a CDC pipeline copies the source table's current state to the destination before the pipeline switches to streaming the change log. For large tables (tens or hundreds of millions of rows, terabytes of data), this phase dominates end-to-end migration wall-clock time. When a single large table is copied serially, the whole pipeline's completion time is bounded by the throughput of one stream.

Two parallelism strategies:

  • Inter-table parallelism. One stream per table or per destination shard. Scales with number of tables; does not help when one table dominates total volume. Debezium and most CDC stacks support this.
  • Intra-table parallelism. One stream per chunk within a single table. Scales with chunk count; accelerates the single-large-table case that inter-table parallelism can't help. The 2025-03 Redpanda post names this as the Redpanda Connect differentiator vs Debezium: "Debezium (Kafka Connect) does not do this today."
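The difference between the two strategies can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: the table sizes, the per-stream throughput figure, and the assumption that every worker sustains that throughput independently are all hypothetical, and the greedy scheduler is a stand-in, not anything the Redpanda post describes.

```python
def snapshot_seconds(table_rows, rows_per_sec, workers, chunks_per_table=1):
    """Estimate snapshot wall-clock seconds under a toy scheduling model.

    Assumption: each worker reads at rows_per_sec regardless of how many
    other workers are active (the source database is never the bottleneck).
    """
    # One task per chunk; chunks_per_table=1 models inter-table parallelism only.
    tasks = []
    for rows in table_rows.values():
        tasks.extend([rows / chunks_per_table] * chunks_per_table)

    # Greedy assignment: give the largest remaining task to the least-loaded worker.
    loads = [0.0] * workers
    for task in sorted(tasks, reverse=True):
        idx = loads.index(min(loads))
        loads[idx] += task

    # The phase ends when the most heavily loaded worker finishes.
    return max(loads) / rows_per_sec


# Hypothetical workload: one table dominates total volume.
tables = {"events": 400_000_000, "orders": 20_000_000, "users": 5_000_000}

inter_only = snapshot_seconds(tables, rows_per_sec=100_000, workers=4)
intra = snapshot_seconds(tables, rows_per_sec=100_000, workers=4, chunks_per_table=8)
```

Under these assumptions the inter-table run takes 4000 seconds, because one worker is stuck with the whole events table while the others sit idle. Splitting each table into 8 chunks brings the estimate down to 1062.5 seconds with the same 4 workers.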

Mechanism questions left open

The post asserts the capability without disclosing the splitting mechanism. Open design questions the operator needs answered:

  • Chunk boundary selection. Primary-key ranges? Internal page / extent ranges? Sampled-value quantiles? Uniformly-spaced modular arithmetic? Each choice has different skew properties under non-uniform key distributions.
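  • Consistency across parallel readers. Do all chunks share a single snapshot transaction (one consistent point in time), or does each chunk read in its own transaction (chunks come from different points in time, requiring reconciliation at the streaming-phase entry point)? Postgres supports pg_export_snapshot() + SET TRANSACTION SNAPSHOT for the former; MongoDB has no exported-snapshot handle, though from 5.0 onward reads using read concern "snapshot" with a shared atClusterTime can pin several readers to one point in time, within the server's snapshot history window.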
  • Boundary ordering. How is the transition from snapshot phase to streaming phase coordinated when chunks finish at different times? Does the connector start streaming before all chunks complete, or wait for the slowest chunk?
  • Failure and resumption. Per-chunk checkpointing? Restart any individual chunk, or restart the whole snapshot?
  • Hot-chunk handling. If one chunk covers a range with sustained writes during snapshot, does it fall behind and gate the whole phase?
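To make the skew concern in the chunk-boundary question concrete, here is a minimal sketch comparing two of the candidate strategies: equal-width primary-key ranges and sampled-value quantiles. The key distribution, the function names, and the every-tenth-key sampling are all hypothetical; nothing here reflects how the Redpanda connectors actually pick boundaries.

```python
import bisect


def uniform_range_bounds(lo, hi, n_chunks):
    """Equal-width split of the key space [lo, hi]; ignores where rows actually are."""
    width = (hi - lo + 1) / n_chunks
    return [lo + round(width * i) for i in range(1, n_chunks)]


def quantile_bounds(sample, n_chunks):
    """Split points taken from quantiles of a sample of existing keys."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // n_chunks] for i in range(1, n_chunks)]


def chunk_sizes(keys, bounds):
    """Count rows per chunk; chunk i holds keys k with bounds[i-1] <= k < bounds[i]."""
    sizes = [0] * (len(bounds) + 1)
    for k in keys:
        sizes[bisect.bisect_right(bounds, k)] += 1
    return sizes


# Hypothetical skewed table: 9,000 dense keys, then a sparse tail up to 1,000,000.
keys = list(range(1, 9001)) + list(range(10_000, 1_000_001, 1000))

by_range = chunk_sizes(keys, uniform_range_bounds(1, 1_000_000, 4))
by_quantile = chunk_sizes(keys, quantile_bounds(keys[::10], 4))
```

With these made-up keys, the equal-width split puts over 90% of the rows in the first chunk, so one reader does nearly all the work. The quantile split produces four nearly equal chunks, at the cost of an extra sampling pass over the key column before the snapshot starts.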

The post doesn't engage with any of these; it claims that the capability exists, not how it works.
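For the Postgres side of the consistency question, the shared-snapshot pattern named above can be sketched as the statement sequences a coordinator session and each chunk reader would issue. This is an assumption about how a connector could do it, not a description of what Redpanda Connect does. The helper names, table name, and key column are hypothetical, and identifiers are interpolated directly only because this is an illustration.

```python
def coordinator_statements():
    """Statements for the session that pins the snapshot.

    This transaction has to stay open until every reader has imported
    the snapshot; committing earlier invalidates the identifier.
    """
    return [
        "BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ",
        "SELECT pg_export_snapshot()",  # returns a snapshot identifier string
    ]


def reader_statements(snapshot_id, table, key, lo, hi):
    """Statements for one chunk reader covering the key range [lo, hi).

    SET TRANSACTION SNAPSHOT must run before any query in the transaction,
    and the transaction must be REPEATABLE READ or SERIALIZABLE.
    """
    return [
        "BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ",
        f"SET TRANSACTION SNAPSHOT '{snapshot_id}'",
        f"SELECT * FROM {table} WHERE {key} >= {lo} AND {key} < {hi} ORDER BY {key}",
        "COMMIT",
    ]


# Example plan for two readers; the snapshot id is a placeholder value.
plan = [
    reader_statements("00000003-0000001B-1", "events", "id", 0, 1000),
    reader_statements("00000003-0000001B-1", "events", "id", 1000, 2000),
]
```

Because every reader imports the same snapshot, all chunks see the table as of one point in time, and the streaming phase can start from a single log position instead of reconciling per-chunk positions.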

Competitive claim against Debezium

The 2025-03 post's load-bearing claim is that stock upstream Debezium does not ship intra-table parallel snapshot. Debezium 2.x supports blocking-snapshot signalling via the signal table and Debezium Server ships custom sink options, but parallel chunked reads of a single table during initial load are not a default-on configuration in the reference Debezium distribution as of 2025-03.

(Users who need it in the Debezium ecosystem typically bolt on custom snapshot orchestration via snapshot-plus-catch-up harness code.)

Scope

As of 2025-03, the Redpanda-connector parallel-snapshot capability is present in the Postgres and MongoDB connectors only; the MySQL and Spanner connectors don't advertise it. It is gated behind an enterprise license.
