CONCEPT

Unified change stream across shards

Definition

A unified change stream across shards is a single consumer-facing ordered stream of change events that papers over the fact that the source database is physically sharded across N MySQL (or Postgres) instances. The consumer sees a keyspace-level stream; the sharding layer fans out to every shard, interleaves the per-shard streams, attaches a single keyspace-wide progress token (VGTID in Vitess's case), and presents it as one API.

Canonical wiki instance: Vitess VStream at the VTGate level, which fans VStream RPCs across every shard in a keyspace and emits a single unified stream.

The consumer-side problem this solves

Without a unified stream, every CDC consumer of a sharded fleet has to:

1. Enumerate the shards (discover the topology).
2. Open a separate change-stream connection per shard.
3. Track a GTID position per shard independently.
4. Glue the per-shard event streams into a single logical order (or decide per-shard order is fine).
5. Handle shard-topology changes (shard split / merge / add / remove) by re-enumerating and re-establishing.
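
The bookkeeping these steps impose on every consumer can be sketched as follows. This is a minimal illustration in Python, not any real CDC client API: `ShardEvent`, `ManualConsumer`, and the `pos-N` position strings are hypothetical, and dropping a removed shard's position immediately is a simplification (a real consumer would first drain it).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ShardEvent:
    shard: str  # originating shard, e.g. "-80"
    gtid: str   # opaque per-shard position (placeholder values in this sketch)


@dataclass
class ManualConsumer:
    """Hypothetical consumer that tracks one position per shard by hand."""
    positions: dict = field(default_factory=dict)  # shard -> last position seen

    def sync_topology(self, shards):
        # Steps 1 and 5: (re-)enumerate shards after every topology change.
        current = set(shards)
        for gone in set(self.positions) - current:
            del self.positions[gone]  # simplification: no draining
        for shard in current:
            self.positions.setdefault(shard, None)  # None = nothing consumed yet

    def record(self, event):
        # Step 3: advance the position of exactly one shard.
        if event.shard not in self.positions:
            raise KeyError(f"unknown shard {event.shard!r}; topology is stale")
        self.positions[event.shard] = event.gtid


consumer = ManualConsumer()
consumer.sync_topology(["-80", "80-"])
consumer.record(ShardEvent(shard="-80", gtid="pos-1"))
consumer.sync_topology(["-40", "40-80", "80-"])  # "-80" was split in two
```

Every consumer of the fleet would have to carry its own copy of this state machine, and a consumer that never calls `sync_topology` again simply stops seeing the new shards.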

Engine-native CDC tools (like the Debezium MySQL connector run against a single member of a sharded Vitess cluster) can't do step 1 at all — they see only the one shard their binlog connection terminates on — and will silently miss rows from every other shard.

The sharding layer is the only place that has the full topology view, so the sharding layer is the right place to do the fan-out.

The sharding-layer-side contract

A unified change stream implementation at the sharding layer must:

1. Fan out to every shard's per-shard change stream in parallel.
2. Interleave events in a consumer-observable way. The ordering guarantee is typically "per-shard ordered + no cross-shard ordering" (Vitess: each ROW event carries its originating shard; order within a shard is transaction order; no cross-shard ordering is claimed).
3. Checkpoint the full per-shard position set as one token so the consumer's reconnect contract stays simple (see concepts/vgtid).
4. Handle topology evolution (shard splits, merges, additions, removals) under the same token without requiring the consumer to re-enumerate. (Vitess: this is the payoff of passing the VGTID as a shard_gtids set rather than a single-shard position: new shards show up with absent GTIDs, old shards drain out when their GTIDs have been fully delivered.)
5. Expose the shard identity on each event so consumers that do care about per-shard order can process that way.
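
The contract above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Vitess's implementation: `unified_stream` is a hypothetical function, per-shard streams are plain lists, integer sequence numbers stand in for real GTIDs, and the round-robin interleave is just one possible schedule (the contract claims no cross-shard order at all).

```python
from itertools import zip_longest

_DONE = object()  # sentinel marking an exhausted shard stream


def unified_stream(shard_streams, token=None):
    """Interleave per-shard streams; yield (shard, row, token) triples.

    shard_streams: dict of shard name -> list of (seq, row) in shard order.
    token: dict of shard name -> last delivered seq. A shard missing from
           the token is streamed from its beginning.
    """
    token = dict(token or {})
    # Resume each shard strictly after its checkpointed position.
    pending = {
        shard: [(seq, row) for seq, row in events if seq > token.get(shard, 0)]
        for shard, events in shard_streams.items()
    }
    shards = sorted(pending)
    # Round-robin across shards; order within each shard is preserved.
    for batch in zip_longest(*(pending[s] for s in shards), fillvalue=_DONE):
        for shard, item in zip(shards, batch):
            if item is _DONE:
                continue
            seq, row = item
            token[shard] = seq
            # Each event carries its shard and a snapshot of the full token.
            yield shard, row, dict(token)


streams = {"-80": [(1, "a"), (2, "b")], "80-": [(1, "x")]}
events = list(unified_stream(streams))
last_token = events[-1][2]  # one token covers every shard
resumed = list(unified_stream(streams, token={"-80": 1, "80-": 1}))
```

Because the token is a set of per-shard positions, a shard that is absent from the consumer's saved token is treated as new and delivered from its start, which mirrors the topology-evolution point in the list.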

Why the structural posture matters

The 2024-07-29 post's lesson is explicit: "use a Vitess variant of the connector/driver rather than the MySQL one." The underlying point generalises: wherever a sharding layer sits between a consumer and an engine-native CDC stream, the CDC consumer must use the sharding-layer-native driver, not the engine-native one. Engine-native CDC drivers are by definition single-shard-blind.

This frames the wiki's understanding of how to expose change streams from sharded systems: the sharding layer owns the change stream. It cannot be delegated to individual shard endpoints without introducing silent data loss.

Seen in

— canonical wiki disclosure of the unified-change-stream shape via Vitess's VTGate VStream RPC. Matt Lord frames the VStream API as the Vitess-specific entrypoint that a CDC consumer must use instead of the engine-native MySQL binlog tooling: "This low-level VStream primitive … leverages this low-level component to stream data from the Shards within a Vitess Keyspace, providing a single unified change stream spanning the logical database which may consist of hundreds or even thousands of shards." The output walkthrough shows events from both shards (shard:"-80" + shard:"80-") interleaved into a single consumer stream with per-shard VGTID checkpoints.
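
A consumer that does care about per-shard order can demultiplex an interleaved stream like the one in that walkthrough by the shard label on each event. The sketch below assumes events are plain dicts with a `shard` key; that shape and the `split_by_shard` helper are hypothetical and do not reproduce the VStream wire format.

```python
from collections import defaultdict


def split_by_shard(events):
    """Group an interleaved event stream by shard, preserving arrival order."""
    per_shard = defaultdict(list)
    for event in events:
        per_shard[event["shard"]].append(event)
    return dict(per_shard)


interleaved = [
    {"shard": "-80", "row": "a"},
    {"shard": "80-", "row": "x"},
    {"shard": "-80", "row": "b"},
]
by_shard = split_by_shard(interleaved)
```

Within each resulting list the events keep the order in which the stream delivered them, so per-shard transaction order survives the interleaving.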