PATTERN Cited by 1 source

Grain-aligned stream split

The grain-aligned stream split is the architectural pattern of replacing a monolithic finest-grain data pipeline with N independent streams, one per natural consumer grain, all sitting on a shared multi-grain source-of-truth layer. It is the structural remedy for grain misalignment, canonicalised on the wiki from the Octopus Energy MHHS (Market-wide Half-Hourly Settlement) rebuild. (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)

Shape

                 Inputs (at the finest available grain)
            ┌───────────────────────────────────────┐
            │  Unified multi-grain source-of-truth  │
            │  (shared substrate, finest grain)     │
            │  reconciliation bridge across grains  │
            └────────────────┬──────────────────────┘
            ┌────────────────┼────────────────┐
            ▼                ▼                ▼
       ┌─────────┐    ┌──────────┐     ┌──────────┐
       │Stream A │    │Stream B  │     │Stream C  │
       │(grain α)│    │(grain β) │     │(grain γ) │
       │tuning   │    │tuning    │     │tuning    │
       │profile A│    │profile B │     │profile C │
       └─────────┘    └──────────┘     └──────────┘
            └───── Job of Jobs orchestration ─────┘
            (dependency mgmt + parallel execution)

When to apply

Apply when at least two of the following are true:

  • Different consumers of the pipeline have different natural grains (e.g., regulatory settlement at half-hourly (HH) grain vs billing at monthly grain).
  • The pipeline currently runs at the finest grain shared across all consumers, paying that volume tax for every consumer.
  • Per-stream optimisation profiles conflict — what helps one grain hurts another.
  • A regulatory or business event has multiplied data volume at a finer grain than the pipeline assumes (the Octopus MHHS trigger; generalises to "monthly→daily, daily→real-time, aggregate→transactional").
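To make the "volume tax" in the second bullet concrete, here is a rough back-of-envelope count of rows per customer per month at each grain. It is only a sketch: the 30-day month and one reading per interval are assumptions for illustration, not figures from the source.

```python
# Rows per customer per month at each grain.
# Assumptions (illustrative only): a 30-day month, one reading per interval.
HALF_HOURS_PER_DAY = 48
DAYS_PER_MONTH = 30

rows_half_hourly = HALF_HOURS_PER_DAY * DAYS_PER_MONTH  # 1440 rows
rows_monthly = 1                                        # one row per billing month

# How many times more rows a monthly consumer processes
# when it is forced through a half-hourly pipeline.
volume_ratio = rows_half_hourly // rows_monthly
print(volume_ratio)  # 1440
```

Under these assumptions, a monthly-grain consumer routed through a half-hourly pipeline touches about three orders of magnitude more rows than its output needs.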

Forces

Force                  | What it favours                                                                  | What it argues against
-----------------------|----------------------------------------------------------------------------------|--------------------------------------------------
Cost                   | Stream split: coarse-grain consumers stop paying the fine-grain volume tax       | Monolithic: only one pipeline to operate
Tuning expressiveness  | Stream split: each stream's tuning profile is chosen independently               | Monolithic: one set of choices applies to all
Operational complexity | Monolithic: one job, one schedule, one alert path                                | Stream split: N jobs, dependencies, orchestration
Reconciliation         | Shared source-of-truth: coarse and fine outputs are queryably consistent         | Monolithic: same dataset, no reconciliation gap

The pattern resolves these forces by splitting at the processing layer while keeping the substrate shared. Reconciliation lives in the substrate.
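One way to picture "split at the processing layer, shared substrate" is as a small set of stream descriptors that all point at the same source table but carry their own tuning profile. The sketch below is purely illustrative: `StreamSpec`, the table name, and every tuning key and value are hypothetical and do not come from the source.

```python
from dataclasses import dataclass, field


@dataclass
class StreamSpec:
    """Describes one processing stream (hypothetical structure)."""
    name: str
    grain: str
    source_table: str                           # every stream reads the shared substrate
    tuning: dict = field(default_factory=dict)  # per-stream, chosen independently


SUBSTRATE = "readings_source_of_truth"  # hypothetical table name

STREAMS = [
    StreamSpec("settlement", "half-hourly", SUBSTRATE, {"shuffle_partitions": 800}),
    StreamSpec("half_hourly", "half-hourly", SUBSTRATE, {"shuffle_partitions": 400}),
    StreamSpec("monthly", "monthly", SUBSTRATE, {"shuffle_partitions": 50}),
]

# The split happens in processing only: all streams share one substrate.
shared_sources = {s.source_table for s in STREAMS}
print(shared_sources)  # {'readings_source_of_truth'}
```

Keeping the substrate reference identical across descriptors is what lets reconciliation live in one place while tuning varies per stream.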

Steps

  1. Enumerate consumers and their natural grains. Don't assume the finest grain is the right one for everyone. (See concepts/data-pipeline-grain for the checklist.)
  2. Choose the finest grain available across all inputs — that is the grain of the shared substrate.
  3. Build the unified multi-grain source-of-truth layer. It reconciles between grains; coarse-grain outputs must be derivable from finer-grain ones.
  4. Split processing into N streams, one per natural grain. Give each stream its own tuning profile.
  5. Orchestrate with a "Job of Jobs" pattern. A higher-level scheduler manages dependencies and parallelism across streams; each stream's internal tuning is isolated.
  6. Apply CDF-based (Change Data Feed) incremental processing to the substrate; a full overwrite of the source-of-truth layer would reintroduce the volume tax.
  7. Audit existing optimisations before adding new ones (see concepts/remove-before-add-optimization). Measurement-driven removal is co-equal with addition.
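The "Job of Jobs" orchestration in step 5 can be sketched with the Python standard library: a dependency graph decides what may run, and a thread pool runs every ready job in parallel. This is a minimal sketch under stated assumptions, not the source's implementation: `run_job` is a hypothetical placeholder for triggering a real job, and the job names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # Python 3.9+


def run_job(name: str) -> str:
    """Hypothetical placeholder: a real version would trigger a scheduled job."""
    return f"{name}: done"


# Each job maps to the set of jobs it depends on.
# The substrate must finish before any stream starts.
DEPENDENCIES = {
    "substrate": set(),
    "settlement": {"substrate"},
    "half_hourly": {"substrate"},
    "monthly": {"substrate"},
}


def run_job_of_jobs(dependencies):
    """Run jobs in dependency order; jobs that are ready together run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    completed = []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorter.get_ready()  # jobs whose dependencies are all done
            for name, _result in zip(ready, pool.map(run_job, ready)):
                completed.append(name)
                sorter.done(name)       # unblocks dependants
    return completed


order = run_job_of_jobs(DEPENDENCIES)
print(order[0])  # substrate
```

The orchestrator only knows about dependencies and completion; each stream's internal tuning stays inside its own job, which is the isolation step 5 asks for.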

The Octopus three-stream split (canonical instance)

Stream      | Grain       | Consumer
------------|-------------|------------------------------------------------------
Settlement  | Half-hourly | Industry settlement / cost allocation
Half-Hourly | Half-hourly | Smart-tariff customers (EVs, heat pumps, time-of-use)
Monthly     | Monthly     | Standard-tariff customers

The same grain (half-hourly) appears in two streams because the purpose differs — settlement and revenue have different downstream consumers, different optimisation requirements, and different reconciliation paths. Stream count is set by consumer-purpose × grain, not just grain.
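The "consumer-purpose × grain" rule can be checked mechanically: group consumers by the (purpose, grain) pair rather than by grain alone. In the sketch below the consumer rows come from the table above, while the `purpose` labels ("settlement", "revenue") are an assumed encoding chosen for illustration.

```python
# Consumers from the Octopus table; the "purpose" labels are assumed for illustration.
consumers = [
    {"consumer": "Industry settlement / cost allocation",
     "purpose": "settlement", "grain": "half-hourly"},
    {"consumer": "Smart-tariff customers",
     "purpose": "revenue", "grain": "half-hourly"},
    {"consumer": "Standard-tariff customers",
     "purpose": "revenue", "grain": "monthly"},
]

# Grouping by grain alone undercounts the streams.
grains = {c["grain"] for c in consumers}

# Grouping by (purpose, grain) gives one stream per distinct pair.
streams = {(c["purpose"], c["grain"]) for c in consumers}

print(len(grains), len(streams))  # 2 3
```

Two grains yield three streams here because the half-hourly grain serves two different purposes.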

Trade-offs

  • Operational complexity goes up. N streams, an orchestrator, and a shared substrate are a bigger surface than one monolithic job. The trade-off is justified when the cost or tuning asymmetry across grains is large enough: the Octopus rebuild reported a 2× improvement against the legacy pipeline and a ~50× improvement against the MHHS projection, which paid for the complexity many times over.
  • The substrate becomes a critical shared dependency. If the source-of-truth layer is wrong, every stream is wrong. The Octopus rebuild named the substrate "the site of the single highest-leverage optimisation" because it has the largest blast radius for both performance and correctness.
  • Reconciliation becomes a hard requirement, not a side effect. Monthly billing and HH settlement must agree at the customer level even though they're computed at different grains. The shared substrate is the bridge, but the architecture has to use it — cross-stream reconciliation queries become first-class outputs.
  • Per-stream tuning has to be measured. "What works as a Spark optimisation for Settlement is not necessarily right for NHH" (NHH: non-half-hourly). Without per-stream measurement, the tuning isolation is theoretical.
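The reconciliation requirement in the third bullet can be expressed as a check that half-hourly values, summed per customer and month, equal the monthly figure. The following is a minimal sketch with made-up sample data; the row layout, customer IDs, and the `reconcile` helper are all hypothetical, and a production check would run as a query against the shared substrate.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical sample data: (customer, month, kWh) at half-hourly grain.
half_hourly_rows = [
    ("cust-1", "2026-01", Decimal("0.25")),
    ("cust-1", "2026-01", Decimal("0.50")),
    ("cust-2", "2026-01", Decimal("1.00")),
    ("cust-2", "2026-01", Decimal("1.00")),
]

# Hypothetical monthly-stream output: (customer, month) -> kWh.
monthly_totals = {
    ("cust-1", "2026-01"): Decimal("0.75"),
    ("cust-2", "2026-01"): Decimal("2.50"),  # deliberately inconsistent
}


def reconcile(fine_rows, coarse_totals):
    """Return the (customer, month) keys where the two grains disagree."""
    sums = defaultdict(Decimal)  # Decimal() is zero
    for customer, month, kwh in fine_rows:
        sums[(customer, month)] += kwh
    keys = set(sums) | set(coarse_totals)
    # A key missing on either side also counts as a mismatch.
    return sorted(k for k in keys if sums.get(k) != coarse_totals.get(k))


mismatches = reconcile(half_hourly_rows, monthly_totals)
print(mismatches)  # [('cust-2', '2026-01')]
```

`Decimal` is used instead of floats so that equality comparison between the two grains is exact; treating the mismatch list as a published output is one way to make reconciliation first-class.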

Seen in
