PATTERN Cited by 1 source

Grain-aligned stream split¶

The grain-aligned stream split is the architectural pattern of replacing a monolithic finest-grain data pipeline with N independent streams, one per natural consumer grain, all sitting on a shared multi-grain source-of-truth layer. It is the structural remedy for grain misalignment; the canonical instance on this wiki is the Octopus Energy rebuild for Market-wide Half-Hourly Settlement (MHHS). (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)

Shape¶

                 Inputs (at the finest available grain)
                                 │
                                 ▼
            ┌───────────────────────────────────────┐
            │  Unified multi-grain source-of-truth  │
            │  (shared substrate, finest grain)     │
            │  reconciliation bridge across grains  │
            └────────────────┬──────────────────────┘
                             │
            ┌────────────────┼────────────────┐
            ▼                ▼                ▼
       ┌─────────┐    ┌──────────┐     ┌──────────┐
       │Stream A │    │Stream B  │     │Stream C  │
       │(grain α)│    │(grain β) │     │(grain γ) │
       │tuning   │    │tuning    │     │tuning    │
       │profile A│    │profile B │     │profile C │
       └─────────┘    └──────────┘     └──────────┘
            └───── Job of Jobs orchestration ─────┘
            (dependency mgmt + parallel execution)

When to apply¶

Apply when at least two of the following are true:

- Different consumers of the pipeline have different natural grains (e.g., regulatory settlement at half-hourly (HH) grain versus billing at monthly grain).
- The pipeline currently runs every consumer's workload at the single finest grain, so even coarse-grain consumers pay the fine-grain volume tax.
- Per-stream optimisation profiles conflict: what helps one grain hurts another.
- A regulatory or business event has multiplied data volume at a finer grain than the pipeline assumes. This was the Octopus MHHS trigger, and it generalises to monthly→daily, daily→real-time, and aggregate→transactional shifts.
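As a rough illustration of the volume tax in the second condition, the sketch below counts the rows one meter produces in a month at half-hourly and at monthly grain. The 30-day month and the helper name `rows_per_meter` are assumptions made for illustration; the figures are not taken from the Octopus source.

```python
# Rows one meter produces in a month at two grains.
# Assumption: a 30-day month. Half-hourly grain means 48 intervals per day.
HALF_HOURS_PER_DAY = 48


def rows_per_meter(days_in_month: int, grain: str) -> int:
    """Return how many rows one meter produces in a month at the given grain."""
    if grain == "half_hourly":
        return days_in_month * HALF_HOURS_PER_DAY
    if grain == "monthly":
        return 1
    raise ValueError(f"unknown grain: {grain}")


hh_rows = rows_per_meter(30, "half_hourly")
monthly_rows = rows_per_meter(30, "monthly")

# How many times more rows a monthly consumer reads if it is forced
# through a half-hourly pipeline.
volume_tax = hh_rows // monthly_rows
print(hh_rows, monthly_rows, volume_tax)  # 1440 1 1440
```

Under these assumptions a monthly-grain consumer that is processed at half-hourly grain reads about three orders of magnitude more rows than it needs.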

Forces¶

Force	What it favours	What it argues against
Cost	Stream split — coarse-grain consumers stop paying fine-grain volume tax	Monolithic — only one pipeline to operate
Tuning expressiveness	Stream split — each stream's tuning profile chosen independently	Monolithic — one set of choices applies to all
Operational complexity	Monolithic — one job, one schedule, one alert path	Stream split — N jobs, dependencies, orchestration
Reconciliation	Shared source-of-truth — coarse and fine outputs queryably consistent	Monolithic — same dataset, no reconciliation gap

The pattern resolves these forces by splitting at the processing layer while keeping the substrate shared. Reconciliation lives in the substrate.

Steps¶

1. Enumerate consumers and their natural grains. Don't assume the finest grain is the right one for everyone. (See concepts/data-pipeline-grain for the checklist.)
2. Choose the finest grain available across all inputs; that is the grain of the shared substrate.
3. Build the unified multi-grain source-of-truth layer. It reconciles between grains, so coarse-grain outputs must be derivable from finer-grain ones.
4. Split processing into N streams, one per natural grain. Give each stream its own tuning profile.
5. Orchestrate with a "Job of Jobs" pattern. A higher-level scheduler manages dependencies and parallelism across streams; each stream's internal tuning is isolated.
6. Apply CDF-based (Change Data Feed) incremental processing to the substrate. A full overwrite of the source-of-truth layer would reintroduce the volume tax.
7. Audit existing optimisations before adding new ones (see concepts/remove-before-add-optimization). Removing optimisations on the basis of measurement matters as much as adding new ones.
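The orchestration step can be sketched with the Python standard library. This is a minimal in-process model of the "Job of Jobs" idea (dependency management plus parallel execution), not the Databricks scheduler used in the source. The function name `run_job_of_jobs`, the job callables, and the assumption that every stream depends only on the substrate refresh are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_job_of_jobs(jobs, dependencies, max_workers=3):
    """Run each job after its dependencies; run ready jobs in parallel.

    jobs: dict mapping job name -> zero-argument callable.
    dependencies: dict mapping job name -> set of job names it depends on.
    Returns the job names grouped into the waves in which they ran.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            # Every job in `ready` has all dependencies done, so submit them together.
            futures = {name: pool.submit(jobs[name]) for name in ready}
            for name, future in futures.items():
                future.result()  # re-raises any exception from the job
                sorter.done(name)
            waves.append(ready)
    return waves


# Hypothetical jobs: each one only records that it ran.
completed = []
names = ("substrate", "settlement", "half_hourly", "monthly")
jobs = {name: (lambda n=name: completed.append(n)) for name in names}

# Assumption: each stream depends only on the shared substrate refresh.
dependencies = {
    "substrate": set(),
    "settlement": {"substrate"},
    "half_hourly": {"substrate"},
    "monthly": {"substrate"},
}

waves = run_job_of_jobs(jobs, dependencies)
print(waves)  # [['substrate'], ['half_hourly', 'monthly', 'settlement']]
```

The orchestrator knows only job names and dependencies. Whatever tuning a stream needs stays inside its own callable, which is the isolation that step 5 describes.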

The Octopus three-stream split (canonical instance)¶

Stream	Grain	Consumer
Settlement	Half-hourly	Industry settlement / cost allocation
Half-Hourly	Half-hourly	Smart-tariff customers (EVs, heat pumps, time-of-use)
Monthly	Monthly	Standard-tariff customers

The same grain (half-hourly) appears in two streams because the purpose differs — settlement and revenue have different downstream consumers, different optimisation requirements, and different reconciliation paths. Stream count is set by consumer-purpose × grain, not just grain.
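The counting rule can be made concrete with a small sketch. The `Stream` dataclass and the purpose labels are hypothetical paraphrases of the table above, chosen only to show that the three streams cover two distinct grains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stream:
    name: str
    purpose: str  # hypothetical label paraphrasing the Consumer column
    grain: str


streams = [
    Stream("Settlement", "industry settlement", "half_hourly"),
    Stream("Half-Hourly", "smart-tariff revenue", "half_hourly"),
    Stream("Monthly", "standard-tariff revenue", "monthly"),
]

# Counting by grain alone would suggest two streams.
distinct_grains = {s.grain for s in streams}

# Counting by (purpose, grain) gives the three streams Octopus built.
distinct_streams = {(s.purpose, s.grain) for s in streams}

print(len(distinct_grains), len(distinct_streams))  # 2 3
```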

Trade-offs¶

- Operational complexity goes up. N streams, an orchestrator, and a shared substrate are a bigger surface than one monolithic job. The trade-off is justified when the cost or tuning asymmetry across grains is large enough: the Octopus rebuild produced a 2× improvement against the legacy comparison and a roughly 50× improvement against the MHHS projection, which paid for the complexity many times over.
- The substrate becomes a critical bottleneck. If the source-of-truth layer is wrong, every stream is wrong. The Octopus rebuild named the substrate "the site of the single highest-leverage optimisation", because it has the largest blast radius for both performance and correctness.
- Reconciliation becomes a hard requirement, not a side effect. Monthly billing and HH settlement must agree at the customer level even though they are computed at different grains. The shared substrate is the bridge, but the architecture has to use it: cross-stream reconciliation queries become first-class outputs.
- Per-stream tuning has to be measured. "What works as a Spark optimisation for Settlement is not necessarily right for NHH." (NHH is the industry abbreviation for non-half-hourly.) Without per-stream measurement, the tuning isolation is theoretical.
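A cross-stream reconciliation check might look like the following sketch. It rolls half-hourly rows up to customer-month totals and compares them with the output of a monthly stream. The tuple layout `(customer, month, kwh)`, the function names, and the tolerance are assumptions for illustration; the source does not describe its reconciliation queries at this level of detail.

```python
from collections import defaultdict
from math import isclose


def rollup_to_customer_month(hh_rows):
    """Sum half-hourly kWh rows to (customer, month) totals."""
    totals = defaultdict(float)
    for customer, month, kwh in hh_rows:
        totals[(customer, month)] += kwh
    return dict(totals)


def reconcile(hh_rows, monthly_totals, abs_tol=1e-6):
    """Return the keys where the fine-grain rollup and the coarse-grain output disagree."""
    rolled = rollup_to_customer_month(hh_rows)
    mismatches = []
    for key in sorted(set(rolled) | set(monthly_totals)):
        fine = rolled.get(key)
        coarse = monthly_totals.get(key)
        # A key missing on either side counts as a mismatch.
        if fine is None or coarse is None or not isclose(fine, coarse, abs_tol=abs_tol):
            mismatches.append(key)
    return mismatches


# Hypothetical data: customer c1 agrees across grains, customer c2 does not.
hh_rows = [
    ("c1", "2026-01", 0.5),
    ("c1", "2026-01", 0.25),
    ("c2", "2026-01", 1.0),
]
monthly_totals = {("c1", "2026-01"): 0.75, ("c2", "2026-01"): 1.5}

mismatches = reconcile(hh_rows, monthly_totals)
print(mismatches)  # [('c2', '2026-01')]
```

Treating the mismatch list as an output in its own right, rather than as an ad hoc debugging query, is one way to make reconciliation a first-class result.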

Related patterns¶

- patterns/cdf-incremental-replacing-full-rescan: applied to the shared substrate, this is what stops the substrate from reintroducing the monolithic volume tax. In the Octopus rebuild this single move cut the work from 25 billion to 300 million rows per run.
- patterns/job-of-jobs-orchestration: the orchestration primitive that lets each stream carry its own tuning while a higher-level scheduler enforces dependencies and parallelism.
- patterns/broadcast-join-for-small-reference-tables: a Spark-tuning pattern applied within streams; the per-stream independence in this split is what lets one stream apply it while another doesn't.
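The idea behind CDF-based incremental processing can be sketched in memory: each run applies only the changes committed after the last processed version instead of rescanning everything. The `_change_type` values and the `_commit_version` field mirror the metadata columns of Delta Lake's Change Data Feed; the function name `apply_change_feed`, the `key` and `kwh` fields, and the sample data are hypothetical, and a real pipeline would read the feed through Spark rather than from a Python list.

```python
def apply_change_feed(target, changes, last_version):
    """Apply only changes committed after last_version.

    target: dict of key -> value, updated in place.
    changes: iterable of dicts with _commit_version, _change_type, key, kwh.
    Returns (new_watermark, rows_read_after_watermark).
    """
    newest = last_version
    processed = 0
    for change in sorted(changes, key=lambda c: c["_commit_version"]):
        if change["_commit_version"] <= last_version:
            continue  # already handled by an earlier run
        kind = change["_change_type"]
        if kind in ("insert", "update_postimage"):
            target[change["key"]] = change["kwh"]
        elif kind == "delete":
            target.pop(change["key"], None)
        # "update_preimage" rows carry the old value; nothing to apply.
        newest = max(newest, change["_commit_version"])
        processed += 1
    return newest, processed


# State after a previous run that processed everything up to version 1.
target = {"m1": 1.0, "m2": 2.0}
changes = [
    {"_commit_version": 1, "_change_type": "insert", "key": "m1", "kwh": 1.0},
    {"_commit_version": 1, "_change_type": "insert", "key": "m2", "kwh": 2.0},
    {"_commit_version": 2, "_change_type": "update_preimage", "key": "m1", "kwh": 1.0},
    {"_commit_version": 2, "_change_type": "update_postimage", "key": "m1", "kwh": 1.5},
    {"_commit_version": 3, "_change_type": "delete", "key": "m2", "kwh": 2.0},
]

watermark, processed = apply_change_feed(target, changes, last_version=1)
print(target, watermark, processed)  # {'m1': 1.5} 3 3
```

The run touches three change rows rather than the full table. The same effect at scale is what the source reports as a drop from 25 billion to 300 million rows per run, roughly an 83-fold reduction.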

Seen in¶

sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction: canonical disclosure. Octopus Energy's MHHS-driven margin pipeline rebuild: three streams (Settlement / Half-Hourly / Monthly) on a shared multi-terabyte HH-grain source-of-truth layer, orchestrated by Job-of-Jobs. Reported figures: $0.48 per settlement date and roughly $1M per year in cost avoidance, over 3 months with a team of three.

Patterns: patterns/cdf-incremental-replacing-full-rescan · patterns/job-of-jobs-orchestration · patterns/broadcast-join-for-small-reference-tables
Concepts: concepts/grain-misalignment · concepts/data-pipeline-grain · concepts/remove-before-add-optimization
Systems: systems/octopus-margin-data-pipeline
Companies: companies/octopus-energy · companies/databricks