PlanetScale — Building data pipelines with Vitess¶
Summary¶
Matt Lord (Vitess core maintainer, PlanetScale) canonicalises VStream, the low-level gRPC change stream primitive in every VTGate that exposes a unified Vitess keyspace's committed changes as a single ordered stream regardless of how many shards back the keyspace. VStream is the substrate under Vitess's own internal VReplication workflows (MoveTables, Reshard, Materialize) and the load-bearing public API under every third-party CDC driver for Vitess: Debezium's Vitess connector, the Airbyte Vitess source, the Fivetran Vitess source, and PlanetScale's own Connect feature. Canonical wiki framing: Vitess's sharding layer exposes a single-keyspace change stream to downstream ETL / data-warehouse / data-lake consumers, papering over the shard-count-may-be-hundreds-or-thousands reality beneath.
Key takeaways¶
-
OLTP vs OLAP framing load-bearing. Matt opens with the standard taxonomy: Vitess + MySQL are tuned for OLTP — direct-user interaction, fast single-row response times, critical business records (orders, user profiles) — but "are not optimized for OLAP workloads and other use cases and needs that you will encounter as your product, company, and data needs grow." Canonical wiki statement of the workload-archetype split as the motivation for CDC pipelines: keep the OLTP system the source of truth, use CDC to maintain in-sync copies in systems tuned for analytics / reporting / integration. (Source: the post body.)
-
VStream is the Vitess CDC primitive. "Vitess has a number of primitives or building blocks that make it easy to build your data pipelines. These are features of VReplication, a powerful system that allows for various types of data replication and transformation. For CDC and similar use cases, VReplication provides the VStream API in VTGates (Vitess Gateways) that allows you to stream changes from a Vitess cluster in real-time." VStream is the low-level gRPC surface (Vitess source links: queryservice.proto L103-L113 for the internal tablet RPC, vtgateservice.proto L55-L56 for the public VTGate RPC) that VReplication itself uses for internal tablet-to-tablet data motion; the VTGate VStream RPC fans out across all shards in a keyspace to produce a single unified change stream spanning hundreds or thousands of shards. (Source: the post body.)
-
VGTID is the unified change-stream position. The VStream output stream shows explicit VGTID messages whose payload is a set of per-shard GTID positions — one entry per shard in the keyspace. Worked example from the post's output:
The VGTID is the single durable restart-checkpoint for a VStream consumer; present it back to VStream on reconnect and the stream resumes from exactly-where-it-left-off across all shards. Canonical new wiki concept. (Source: the post body output logs.)
-
Copy phase then replication phase, per-shard in parallel. The VStream output follows the familiar VReplication shape at a keyspace-wide altitude: each shard first snapshots the current state of the tracked tables (rows arrive as FIELD type-schema events + ROW row-data events + COPY_COMPLETED per-shard sentinels), then transitions into continuous binlog-replication per shard. The output interleaves events across shards (shard:"-80" rows and shard:"80-" rows appear together), but each row carries its originating shard and each VGTID advances only that shard's GTID entry, so consumers can either process per-shard or treat the keyspace as a single logical stream. Canonical wiki application of patterns/snapshot-plus-catchup-replication at the public CDC API altitude, distinct from the internal VReplication altitude. (Source: the post body output logs.)
-
Debezium + Airbyte + Fivetran compose on the VStream API. "This low-level VStream primitive is then used by popular CDC tools like Debezium to capture changes in Vitess and propagate them to other systems. PlanetScale also uses the VStream API to build the Connect feature, using additional open source drivers for popular CDC/ETL services such as Airbyte (source) and Fivetran (source)." Four distinct consumer ecosystems ride on one Vitess-vendor API — canonical wiki instance of the new patterns/cdc-driver-ecosystem pattern: publish one canonical change-stream API and let each ETL-service ecosystem write its own driver rather than fork-per-vendor. (Source: the post body.)
-
Use the Vitess-specific driver, not the MySQL one. "This also demonstrates the general rule that in setting these kinds of systems up you would use a Vitess variant of the connector/driver rather than the MySQL one — with things otherwise being the same." Structural point: binlog-per-MySQL-instance tools (Debezium MySQL connector) see a per-shard view of a sharded Vitess cluster and can't reconstruct the keyspace-level stream; the Vitess connector talks to VTGate's VStream and gets the unified stream by construction. Canonical wiki framing of the sharding-layer's opacity to engine-native CDC tools. (Source: the post body.)
-
Default target examples named: the post links a Debezium walkthrough that streams Postgres-to-Postgres, and suggests an analogous Vitess → AWS RedShift pipeline (RedShift being based on PostgreSQL). Canonical wiki datum of the two-DB-engines + Vitess-connector + per-target-DBMS-connector pattern shape for moving OLTP data into an analytical warehouse. (Source: the post body.)
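The copy-then-replication shape described in the takeaways above can be sketched as a small consumer loop. This is a minimal illustration, not the Vitess client API: the `Event` class and `track_phases` function are hypothetical names, and real VStream events are protobuf messages with more fields. Only the event kinds named in the text (FIELD, ROW, COPY_COMPLETED) are modelled, and each shard is assumed to start in its copy phase.

```python
from dataclasses import dataclass, field


# Hypothetical, simplified event model (an assumption for illustration).
@dataclass
class Event:
    kind: str    # "FIELD", "ROW", or "COPY_COMPLETED"
    shard: str   # originating shard, e.g. "-80" or "80-"
    payload: dict = field(default_factory=dict)


def track_phases(events):
    """Consume interleaved events and track each shard's phase independently.

    Returns (phase_by_shard, rows_by_shard), where every recorded row is
    tagged with the phase its shard was in when the row arrived.
    """
    phase = {}
    rows = {}
    for ev in events:
        # A shard is in its copy phase until its own sentinel arrives.
        phase.setdefault(ev.shard, "copy")
        if ev.kind == "ROW":
            rows.setdefault(ev.shard, []).append((phase[ev.shard], ev.payload))
        elif ev.kind == "COPY_COMPLETED":
            phase[ev.shard] = "replication"
        # FIELD events carry schema only; this sketch ignores them.
    return phase, rows
```

Because the phase is keyed by shard, one shard can already be replicating binlog changes while another is still copying, which matches the interleaved output the post shows.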
Architectural framings¶
VStream: per-shard tablet RPC + keyspace-wide VTGate RPC¶
The two VStream RPCs are a tidy layering:
- Internal (tablet-to-tablet) VStream: used by VReplication's own workflows (MoveTables, Reshard) to move rows between two specific tablets; per-source-shard basis; the substrate under every snapshot + catch-up replication internal to Vitess.
- Public (VTGate) VStream: fans out the tablet RPCs across every shard of the target keyspace + interleaves the results + attaches VGTID checkpoints. This is the API every external CDC driver calls.
Same underlying primitive, two consumer altitudes — which is why VStream doubles as both the internal data-motion substrate and the external CDC API without API duplication.
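A toy model of the layering may help: per-shard streams go in, one interleaved stream with VGTID checkpoints comes out. The sketch below assumes simple round-robin interleaving and in-memory lists as stand-ins for the tablet RPCs; `fan_out` is a hypothetical name, and the real VTGate ordering and event framing are not specified in the post.

```python
from itertools import zip_longest


def fan_out(shard_streams):
    """Merge per-shard streams into one keyspace-wide stream.

    shard_streams maps shard name -> iterable of (gtid, row) pairs.
    Yields ("ROW", shard, row) events, each followed by a
    ("VGTID", positions) event holding the full per-shard position map.
    Round-robin order is an assumption made for illustration only.
    """
    shards = list(shard_streams)
    positions = {shard: None for shard in shards}
    exhausted = object()  # marks shards whose stream has ended
    batches = zip_longest(*(shard_streams[s] for s in shards), fillvalue=exhausted)
    for batch in batches:
        for shard, item in zip(shards, batch):
            if item is exhausted:
                continue
            gtid, row = item
            yield ("ROW", shard, row)
            # Only the originating shard's entry advances.
            positions[shard] = gtid
            yield ("VGTID", dict(positions))
```

Each emitted VGTID is a snapshot of every shard's position, so a consumer never has to assemble the map itself.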
VGTID as the consumer's single progress token¶
Before VStream, a CDC consumer of a sharded MySQL fleet had to track a GTID position per shard and glue events into a reasonable order. VGTID moves that responsibility server-side: VTGate emits one VGTID message periodically whose payload is the full per-shard GTID map; consumers persist that one token and resubmit it on reconnect. Canonical wiki extension of concepts/gtid-position from the per-server altitude to the keyspace altitude.
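The persist-and-resubmit loop can be sketched with a local JSON file standing in for the consumer's checkpoint store. This is an assumption-laden illustration: `save_checkpoint` and `load_checkpoint` are hypothetical helpers, the file-based store is a stand-in, and the dictionary only mirrors the `shard_gtids` shape visible in the post's output logs rather than the full protobuf message.

```python
import json
import os
import tempfile


def save_checkpoint(path, vgtid):
    """Atomically persist the latest VGTID as JSON.

    Writing to a temp file and renaming means a crash mid-write
    leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(vgtid, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved VGTID, or None if nothing has been checkpointed yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

On reconnect, a consumer would pass the loaded value back as its starting position; a `None` result would mean starting from the copy phase. The keyspace name and GTID strings a caller stores here are whatever the stream last delivered, so the consumer treats the token as opaque.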
Driver ecosystem as the payoff of a stable API¶
By publishing the VStream API as the one Vitess-side CDC entrypoint, PlanetScale + the Vitess community unlocked three distinct driver-authoring efforts (Debezium, Airbyte, Fivetran) without having to build every downstream connector themselves. Canonical new patterns/cdc-driver-ecosystem pattern — single-vendor-API + multi-ecosystem-driver — applied wherever an infrastructure vendor wants broad ETL integration without owning every target-system connector.
Caveats¶
- Short post with output logs, not an architecture retrospective. Dense on the "here's how to use it" axis + output-log walkthrough; light on per-primitive performance disclosure (no events/sec, no per-shard backpressure behaviour under lag, no VGTID-checkpoint commit cadence, no consumer-failover semantics).
- OLTP/OLAP framing is elementary. The post opens with explanatory links to Confluent's CDC explainer, IBM's data-pipeline explainer, etc.; it's written for a Postgres / MySQL DBA persona who hasn't composed a CDC pipeline before, not the Vitess-internals-expert reader.
- No VGTID-format specification beyond the worked example. The VGTID payload shape is clear from the output logs (shard_gtids array) but no formal protobuf schema, cardinality bounds, serialisation cost, or size-scaling-with-shard-count analysis.
- Connector matrix not described. Debezium + Airbyte + Fivetran are named with links, but differences between them (row-event format, schema-change handling, DDL propagation, filter semantics, throughput posture) not compared.
- No failure-mode analysis. What happens when a shard falls behind the binlog retention horizon mid-stream? When VTGate fails over? When a Reshard runs concurrently with the VStream? All out of scope here (some implicit in the prior 2026-02-16 zero-downtime-migrations post via the underlying VReplication substrate).
- Republished 2024 post. The post is bylined 2024-07-29 in the body but was re-fetched 2026-04-21 alongside the broader PlanetScale re-publication wave; architectural content is stable relative to its original publication.
- Vendor-adjacent framing: PlanetScale builds Connect on top of the VStream API; the narrative positions VStream as the capability that enables the PlanetScale Connect feature, though the API itself is upstream open-source Vitess.
Source¶
- Original: https://planetscale.com/blog/building-data-pipelines-with-vitess
- Raw markdown: raw/planetscale/2026-04-21-building-data-pipelines-with-vitess-7376eda9.md
Related¶
- systems/vitess
- systems/vitess-vstream
- systems/vitess-vreplication
- systems/vitess-movetables
- systems/planetscale
- systems/planetscale-connect
- systems/debezium
- systems/mysql
- concepts/change-data-capture
- concepts/vgtid
- concepts/gtid-position
- concepts/binlog-replication
- concepts/oltp-vs-olap
- concepts/unified-change-stream-across-shards
- patterns/snapshot-plus-catchup-replication
- patterns/cdc-driver-ecosystem
- companies/planetscale