Redshift Connector latency

Definition

Redshift Connector latency is the multi-hour delay between the moment a data pipeline produces output and the moment that output is queryable in Redshift, when the data path runs through a streaming-to-Redshift data connector (e.g. Yelp's Redshift Connector).

At Yelp, the observed figure is roughly 10 hours on the Revenue Data Pipeline path (2025-05-27): "This introduced a latency of approximately 10 hours before the data was available in the data warehouse for verification."

Why it happens

Redshift connectors typically move data via:

  1. Pipeline writes output to a staging area.
  2. Connector polls the staging area on a batch schedule.
  3. Connector executes COPY into Redshift.
  4. Redshift vacuum / compaction / visibility steps.

Each step adds latency: the batch schedule, the connector's own queue, and Redshift's vacuum/compaction delay compound into multi-hour figures. For periodic jobs (e.g. daily data pipelines), this is acceptable for production reporting consumers, but not for verification of the pipeline itself.
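The compounding can be made concrete with a back-of-the-envelope model. The per-stage numbers below are illustrative assumptions chosen to total ~10 hours, not Yelp's actual breakdown:

```python
# Back-of-the-envelope model of how connector stages compound into
# multi-hour latency. All stage durations are illustrative assumptions.
STAGE_LATENCY_HOURS = {
    "staging_write": 0.5,       # pipeline flushes output to the staging area
    "batch_poll_wait": 6.0,     # worst case: output lands just after a poll
    "connector_queue": 1.5,     # connector's own backlog before COPY runs
    "copy_into_redshift": 1.0,  # bulk COPY of the staged files
    "vacuum_visibility": 1.0,   # vacuum / compaction before rows are queryable
}

def end_to_end_latency(stages: dict[str, float]) -> float:
    """Worst-case delay from 'pipeline done' to 'queryable in Redshift'."""
    return sum(stages.values())

print(end_to_end_latency(STAGE_LATENCY_HOURS))  # 10.0 with these numbers
```

The dominant term is usually the batch poll interval, which is why tightening the connector schedule alone rarely gets the total under an hour.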

The impact on verification loops

If a data pipeline's primary consumers are Redshift queries (BI dashboards, monthly reports), the ~10-hour latency is invisible: reports run against the previous day's (N-1) snapshot. But if the verification loop for pipeline correctness also runs on Redshift, then:

  • Bug discovery lags pipeline output by 10 hours.
  • A fix-and-rerun cycle takes at least a day.
  • Same-day iteration on pipeline changes is infeasible.
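The cadence cost follows directly from arithmetic. A hypothetical sketch (the one-hour fix time is an assumption for illustration):

```python
import math

def iterations_per_day(verify_latency_h: float, fix_time_h: float) -> int:
    """How many fix-verify cycles fit in a day, given that each cycle
    must wait out the warehouse latency before results are visible."""
    cycle_h = verify_latency_h + fix_time_h
    return math.floor(24 / cycle_h)

print(iterations_per_day(10.0, 1.0))  # 2 cycles/day via the connector
print(iterations_per_day(0.1, 1.0))   # 21 cycles/day via a low-latency path
```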

This is the motivating constraint behind Yelp's staging pipeline, which routes around the connector by publishing to AWS Glue catalog tables on S3, immediately queryable via Redshift Spectrum.

The bypass pattern

The general shape is:

Pipeline output → Glue data catalog (metadata on S3 files)
                         │ immediate
                    Redshift Spectrum
                  (ad-hoc SQL over S3)

No connector, no COPY, no vacuum. Verification data lands immediately. Production data can still flow through the connector on its normal cadence.

This pattern works because verification queries don't need Redshift-resident data; they just need queryable data. Redshift Spectrum provides a query engine without the ingest-latency tax.
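Registering pipeline output in the Glue catalog is what makes it "immediately queryable." A minimal sketch of the payload that boto3's `glue.create_table` expects for an external Parquet table; the table name, bucket, and columns are hypothetical, not the real pipeline's:

```python
def parquet_table_input(name: str, s3_location: str, columns: dict[str, str]) -> dict:
    """Build a TableInput payload for an external Parquet table in the
    Glue data catalog. Once registered, Redshift Spectrum can query the
    S3 files directly -- no connector, no COPY, no vacuum."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": c, "Type": t} for c, t in columns.items()],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            },
        },
    }

# Hypothetical names for illustration only.
table = parquet_table_input(
    "revenue_staging",
    "s3://example-bucket/revenue/2025-05-27/",
    {"ad_id": "bigint", "revenue_cents": "bigint", "day": "date"},
)
# glue_client.create_table(DatabaseName="staging_db", TableInput=table)
```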

Generalisation

The pattern generalises beyond Redshift. Any batch-loading warehouse connector (Snowflake bulk ingest, BigQuery load jobs, etc.) has an analogous latency + bypass option:

  • Snowflake → Snowpipe streaming or direct external tables on S3 / GCS.
  • BigQuery → BigLake external tables on GCS.
  • ClickHouse → S3 table function for ad-hoc queries.

In each case, the discipline is: production uses the warehouse path; verification uses the data-lake-direct path. Two query paths, one data source of truth.
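The discipline can be kept explicit in code with a tiny routing table; the path names below are shorthand for the options listed above, not real connection identifiers:

```python
# Warehouse -> (production ingest path, verification bypass path).
# The bypass reads the same files the warehouse path would ingest:
# two query paths, one source of truth.
PATHS = {
    "redshift":   ("connector + COPY", "Spectrum over S3"),
    "snowflake":  ("bulk ingest",      "external tables on S3/GCS"),
    "bigquery":   ("load jobs",        "BigLake external tables on GCS"),
    "clickhouse": ("native insert",    "s3() table function"),
}

def query_path(warehouse: str, verifying: bool) -> str:
    """Production consumers take the warehouse path; verification takes
    the data-lake-direct path."""
    production, bypass = PATHS[warehouse]
    return bypass if verifying else production

print(query_path("redshift", verifying=True))   # Spectrum over S3
print(query_path("redshift", verifying=False))  # connector + COPY
```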

Caveats

  • Glue + Spectrum is not a drop-in Redshift replacement. Query performance is lower for complex joins; there's no materialised-view / indexed substrate. For verification SQL (typically aggregate + count queries over a day or month of data) it's fine; for production BI it often isn't.
  • Not every pipeline has this problem. If your Redshift connector is built for low-latency (e.g. Kinesis Data Firehose with small buffer intervals), the latency may be minutes, not hours. The Yelp case is a specific Data Connector with periodic batching.
