Skip the intermediate representation¶
When a multi-stage pipeline stages its data through a format that is structurally lossy for the pipeline's end goal, and that lossy format is both the only thing downstream stages see and a commit point for upstream mistakes, the architectural response is to collapse the boundary: have the upstream stage produce the downstream stage's consumption shape directly, removing the intermediate from the pipeline.
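Schematically, the collapse looks like the following toy sketch (all function names and the `*`-marks-emphasis encoding are hypothetical illustrations, not from the source):

```python
# Toy illustration of the two pipeline shapes. The '*' characters stand in
# for information (emphasis/prosody) that the lossy intermediate discards.

def asr(audio: str) -> str:
    # Lossy intermediate: a toy "transcription" that keeps only the words,
    # dropping the emphasis markers.
    return audio.replace("*", "")

def retrieve_from_text(transcript: str) -> str:
    return f"results for: {transcript}"

def retrieve_from_audio(audio: str) -> str:
    # End-to-end stage: consumes the raw input, so emphasis survives.
    return f"results for: {audio}"

def cascade(audio: str) -> str:
    return retrieve_from_text(asr(audio))   # audio -> text -> results

def direct(audio: str) -> str:
    return retrieve_from_audio(audio)       # audio -> results

query = "play *wicked*"
print(cascade(query))   # emphasis is gone by retrieval time
print(direct(query))    # emphasis reaches the retriever
```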
Canonical instance: Google S2R¶
The wiki-level canonical instance is Google Research's 2025-10-07 Speech-to-Retrieval (S2R) post. The production-standard Cascade ASR architecture for voice search treats the text transcript as the intermediate representation. Google identifies this intermediate as the source of two structural problems:
- Information loss — prosody, emphasis, speaker acoustics, and homophone-disambiguating context live in the audio but not in the text.
- Error propagation — an ASR mistake commits because the retriever never sees the audio.
S2R's architectural move is to skip the transcript entirely: produce retrieval results directly from audio, without materialising a text string in between. The mechanism is not specified in the raw capture (likely an audio encoder into a shared embedding space with the document corpus, but not confirmed), but the structural move is the pattern this page names (Source: sources/2025-10-07-google-speech-to-retrieval-s2r-voice-search).
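Since the source leaves the mechanism unspecified, the following is a minimal numpy sketch of the *likely* dual-encoder shape (audio and documents co-embedded in a shared space, ranked by cosine similarity). The encoders are stubbed as fixed random projections; every name and dimension here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
AUDIO_DIM, DOC_DIM, EMB_DIM = 16, 12, 8

# Stand-in "encoders": fixed random projections into a shared embedding
# space. A real system would use trained networks for each modality.
W_audio = rng.normal(size=(AUDIO_DIM, EMB_DIM))
W_doc = rng.normal(size=(DOC_DIM, EMB_DIM))

def embed(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    z = x @ W
    return z / np.linalg.norm(z)   # unit-normalise so dot product = cosine

def retrieve(audio_features: np.ndarray, doc_features: np.ndarray,
             k: int = 3) -> np.ndarray:
    q = embed(audio_features, W_audio)                        # (EMB_DIM,)
    docs = np.stack([embed(d, W_doc) for d in doc_features])  # (N, EMB_DIM)
    scores = docs @ q                                         # cosine sims
    return np.argsort(-scores)[:k]                            # top-k indices

audio = rng.normal(size=AUDIO_DIM)          # raw query features
corpus = rng.normal(size=(10, DOC_DIM))     # 10 toy documents
top = retrieve(audio, corpus)
```

Note there is no text string anywhere in the query path: the only artefact between audio and results is the embedding.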
When to apply¶
The pattern is a response to a specific diagnostic, not a default. Apply when all of:
- The intermediate representation is strictly less expressive than the upstream input for the end goal. Audio → text discards prosody; image → OCR text discards layout; raw log line → extracted metric discards context — all qualify. Plain-text HTTP body → parsed JSON for a JSON-consuming downstream does not (nothing useful is in the extra bytes).
- The downstream stage has no access to the original upstream input. If it did, it could cross-check and recover — the intermediate becomes a hint rather than a bottleneck.
- The intermediate is not a standalone product surface users require. If a human needs to read the transcript (live captioning, legal dictation), you can't simply remove it — you need a parallel path. The S2R framing is specifically for voice search where the transcript is an internal implementation detail.
- The quantitative upper bound (measured via a groundtruth-upper-bound benchmark) shows that the ceiling of keeping the cascade is below the current quality target. If a perfect stage A still doesn't meet the target, the structural argument for collapsing is strong.
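The four conditions can be encoded as a small diagnostic helper; the field names and the shape of this check are illustrative, not from the source.

```python
from dataclasses import dataclass

@dataclass
class CascadeDiagnostic:
    intermediate_loses_information: bool   # e.g. audio -> text drops prosody
    downstream_sees_only_intermediate: bool
    intermediate_is_user_facing: bool      # e.g. live captioning needs it
    groundtruth_upper_bound: float         # quality with an oracle stage A
    quality_target: float

def should_skip_intermediate(d: CascadeDiagnostic) -> bool:
    # Apply the pattern only when ALL four conditions hold.
    return (d.intermediate_loses_information
            and d.downstream_sees_only_intermediate
            and not d.intermediate_is_user_facing
            and d.groundtruth_upper_bound < d.quality_target)
```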
How to apply¶
- Quantify the ceiling — run the groundtruth-upper-bound benchmark (replace the upstream stage with a human / oracle version) to measure how much of the gap between real-world quality and the ideal is attributable to upstream-stage mistakes (closable by better stage A) vs. to the lossy intermediate itself (only closable by skipping).
- Design an end-to-end stage B — re-architect the downstream stage to consume the raw upstream input directly. In S2R's case, this likely means an audio encoder producing retrieval embeddings in the same space as the document corpus.
- Train against the end-goal metric — previously stage A was trained to minimise transcription error (WER) and stage B independently to maximise retrieval (MRR). The end-to-end architecture can train the single combined model directly against the retrieval metric (or against MRR + auxiliary objectives).
- Validate against both the cascade and the cascade-groundtruth benchmark — the new architecture must beat Cascade ASR (real-world baseline), and can justify its architectural complexity by approaching or exceeding Cascade groundtruth (the structural ceiling of the cascade shape).
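The steps above can be sketched as small functions. Everything here is illustrative: the metric is assumed higher-is-better (e.g. MRR), and the InfoNCE-style loss is a common contrastive choice for training against a retrieval objective, not one the source confirms for S2R.

```python
import numpy as np

def attribute_gap(real_cascade: float, oracle_cascade: float,
                  ideal: float) -> tuple[float, float]:
    """Step 1: split the quality gap into the part closable by a better
    stage A (real -> oracle transcripts) and the structural part only
    closable by skipping the intermediate (oracle -> ideal)."""
    return oracle_cascade - real_cascade, ideal - oracle_cascade

def info_nce_loss(query_emb: np.ndarray, doc_embs: np.ndarray,
                  positive_idx: int, temperature: float = 0.1) -> float:
    """Step 3: train directly against retrieval by making the query's
    positive document out-score every other document in the batch
    (cross-entropy over similarity scores)."""
    scores = doc_embs @ query_emb / temperature
    scores = scores - scores.max()                 # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return float(-log_probs[positive_idx])

def validate(e2e: float, cascade_real: float,
             cascade_groundtruth: float) -> tuple[bool, bool]:
    """Step 4: the new model must beat the real-world cascade, and earns
    its complexity by nearing or exceeding the cascade's ceiling."""
    return e2e > cascade_real, e2e >= cascade_groundtruth

# e.g. real-world MRR 0.62, with oracle transcripts 0.70, ideal 0.85:
asr_gap, structural_gap = attribute_gap(0.62, 0.70, 0.85)
```

In this toy reading of step 1, most of the shortfall (0.15 of 0.23) is structural: even a perfect stage A leaves the cascade short of the ideal, which is the case for collapsing the boundary.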
Variants and cousins¶
- Multimodal embedding retrieval — systems like CLIP skip the OCR-cascade by co-embedding image pixels and text into a shared space; direct image → retrieval at the embedding boundary. Same shape, different modalities.
- End-to-end ML for pipelines generally — the "just learn the whole thing end-to-end" impulse in modern ML is a generalisation of this pattern. Historically distinct stages (feature extraction, classification, post-processing) get collapsed into a single model trained against the final objective. Works when you can afford the data and when the intermediate was lossy; doesn't work when the intermediate was load-bearing for interpretability, debuggability, or product use.
- Speech-to-speech translation — the direct-audio analogue for machine translation (bypassing audio → transcript → translated transcript → audio), same structural argument.
Distinguishing from adjacent patterns¶
- patterns/cheap-approximator-with-expensive-fallback — different axis. That pattern keeps a multi-stage shape but adds a confidence-gated escape hatch between the fast/approximate and slow/exact paths. Skip-the-intermediate is about collapsing the stage boundary altogether, not about adding branching at it.
- patterns/draft-verify-inference / speculative decoding — operates at token granularity within a single LLM; skip-the-intermediate operates at pipeline-stage granularity across model boundaries. Different scales.
- patterns/teacher-student-model-compression — teacher and student are in a training relationship with a production handoff; the student runs standalone in production. Skip-the-intermediate is about removing a stage from the production path entirely.
Seen in¶
- sources/2025-10-07-google-speech-to-retrieval-s2r-voice-search — canonical wiki instance; Google's Speech-to-Retrieval collapses the Cascade ASR → text retrieval pipeline for voice search by going directly from audio to retrieval results.