CONCEPT Cited by 1 source

Grain misalignment¶

Grain misalignment is the data-engineering antipattern of running a single pipeline at the finest grain that any consumer needs, when different consumers actually have different natural grains. The cost penalty is structural: every run pays the finest-grain price even for the coarsest-grain consumer. The concept is canonicalised on the wiki from the Octopus Energy MHHS rebuild, the first source to name the failure mode explicitly. (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)

Definition¶

A pipeline exhibits grain misalignment when all three of the following hold:

The pipeline's processing grain is set by the finest-grain consumer (e.g., half-hourly settlement requires HH data).
The pipeline is monolithic — all outputs are computed in one pass over a shared dataset.
Coarser-grain consumers (e.g., monthly billing) inherit the finest-grain processing cost on every run despite their own grain being unchanged.

The Octopus Energy diagnosis, verbatim:

"The legacy pipeline had been built around a single grain: monthly. Billing ran monthly. Settlement ran monthly. The entire pipeline was monolithic by design.

MHHS introduced a fundamental split. Industry cost data now arrives at half-hourly granularity — 48 data points per customer per day. Smart tariff customers with EVs and heat pumps need half-hourly revenue calculations. Standard tariff customers still settle monthly. Running all three through a single monolithic pipeline meant processing the entire dataset on every run, regardless of what had actually changed."

Why it happens¶

Grain misalignment usually emerges because the pipeline was correct at the time it was built. The Octopus legacy was designed when billing and settlement both ran monthly, so a single monthly grain was the natural grain — there was no misalignment to see. The misalignment appears when the business signal splits:

Trigger	Effect on grain
Regulatory change (e.g., MHHS forcing HH settlement)	New finest-grain consumer added; old pipeline grain inherited inappropriately
New product line (e.g., smart tariff with HH price signal)	Coarser-grain consumers still served, but now share infra with a finer-grain consumer
New analytical requirement (e.g., daily margin instead of monthly)	Same shape — finer-grain analytical pull on a coarser-grain pipeline

The general form: any time a system moves from monthly to daily, daily to real-time, or aggregate to transactional, the same dynamics apply. The article makes this generalisation itself.

Why it's the hidden cost driver¶

Three compounding effects:

Volume tax on every consumer. The pipeline processes the entire dataset on every run, regardless of which consumer triggered the run. The coarse-grain consumer pays for the fine-grain consumer's data even when nothing relevant changed for it.
Optimisation lever asymmetry. Optimisations that would help the finest-grain consumer (e.g., specialised compaction) may actively hurt the coarse-grain consumer's pattern, and vice versa. "What works as a Spark optimisation for Settlement is not necessarily right for NHH." A monolithic pipeline can't apply both.
Freshness handicap. The finest-grain consumer's freshness requirement — daily, hourly, real-time — pushes the entire pipeline's run cadence to that frequency, multiplying the cost for everyone.
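
The volume tax and the freshness handicap can be put into rough numbers with a toy model. The sketch below is illustrative only: the `StreamLoad` type, the function names, and every figure are assumptions made for this page, not values from the source. It counts rows processed per month when one shared pipeline rescans everything at the fastest cadence, versus one stream per grain that processes only its own changed rows at its own cadence. Optimisation lever asymmetry is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamLoad:
    """Hypothetical description of one consumer's real workload."""
    name: str
    changed_rows_per_run: int  # rows that actually changed at this consumer's grain
    runs_per_month: int        # cadence this consumer needs on its own


def monolithic_rows_per_month(loads, total_rows):
    """One shared pipeline: every run rescans the full dataset, and the
    fastest consumer's cadence is imposed on everyone."""
    cadence = max(load.runs_per_month for load in loads)
    return total_rows * cadence


def split_rows_per_month(loads):
    """One stream per grain: each stream pays only for its own changed rows
    at its own cadence."""
    return sum(load.changed_rows_per_run * load.runs_per_month for load in loads)


# Illustrative numbers only (assumptions, not figures from the source).
loads = [
    StreamLoad("settlement", changed_rows_per_run=48_000, runs_per_month=30),
    StreamLoad("half_hourly", changed_rows_per_run=12_000, runs_per_month=30),
    StreamLoad("monthly", changed_rows_per_run=1_000, runs_per_month=1),
]
total_rows = 1_000_000

monolithic = monolithic_rows_per_month(loads, total_rows)
split = split_rows_per_month(loads)
print(f"monolithic: {monolithic:,} rows/month, split: {split:,} rows/month")
```

In this toy setup the monthly consumer contributes 1,000 rows a month to the split total, while under the monolithic model it is carried through 30 full rescans it never asked for.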

The resolution: stream-per-grain split¶

The remedy is a grain-aligned stream split: replace the monolithic pipeline with N streams, one per natural grain, each independently tunable, all built on a unified multi-grain source of truth that serves as the reconciliation bridge.

The Octopus rebuild split into three streams:

Settlement — half-hourly (industry settlement / cost allocation; "matches that grain exactly").
Half-Hourly — half-hourly (smart-tariff revenue; "the half-hourly price signal is the entire commercial proposition").
Monthly — monthly (standard-tariff revenue; "unchanged in grain but now reconcilable against the half-hourly data").

A "Job of Jobs" orchestration manages dependencies and parallel execution; each stream carries its own tuning profile.
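
The source does not show how the "Job of Jobs" is implemented, so the following is only a minimal sketch of the general idea: a parent job that resolves dependencies between child jobs and runs independent ones in parallel. The dependency graph, the job names, the `TUNING` profiles, and the `run_stream` placeholder are all hypothetical. Treating the unified source of truth as a job of its own is likewise an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each stream waits for the shared source of truth.
DEPENDENCIES = {
    "unified_source": set(),
    "settlement": {"unified_source"},
    "half_hourly": {"unified_source"},
    "monthly": {"unified_source"},
}

# Hypothetical per-stream tuning profiles; keys and values are placeholders.
TUNING = {
    "unified_source": {"shuffle_partitions": 200},
    "settlement": {"shuffle_partitions": 400},
    "half_hourly": {"shuffle_partitions": 400},
    "monthly": {"shuffle_partitions": 50},
}


def run_stream(name):
    """Placeholder for submitting a real job; it only reports its own profile."""
    return f"{name} ran with {TUNING[name]}"


def job_of_jobs(dependencies, run):
    """Run jobs wave by wave: every job whose dependencies are finished
    is executed in parallel with the other ready jobs."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            list(pool.map(run, ready))  # blocks until the whole wave has finished
            sorter.done(*ready)
            waves.append(ready)
    return waves


waves = job_of_jobs(DEPENDENCIES, run_stream)
print(waves)
```

Because each stream has its own entry in `TUNING`, a setting chosen for one grain never has to be a compromise for another, which is the point of carrying a separate tuning profile per stream.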

Diagnostic questions¶

To detect grain misalignment in an existing pipeline:

What is the natural grain of each consumer? List them.
What is the pipeline's processing grain? Usually the finest-grain consumer's requirement.
For each consumer coarser than the pipeline's grain — would processing only the data that changed at that consumer's grain produce the same result? If yes, the consumer is paying the fine-grain tax unnecessarily.
Are different consumers' performance optimisations in conflict? If yes, the monolithic pipeline can't satisfy both.

The first two questions establish the grains; if either of the last two surfaces a yes, the architecture has grain misalignment.
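
The first three questions can be turned into a small checklist helper. This is a sketch under stated assumptions: the `Consumer` type, the `FINENESS` ranking, the consumer names, and the `incremental_gives_same_result` flag (the answer to the third question, which someone has to supply by hand) are all hypothetical. The fourth question, on conflicting optimisations, is left out.

```python
from dataclasses import dataclass

# Hypothetical ranking of grains from coarsest (0) to finest.
FINENESS = {"monthly": 0, "daily": 1, "half_hourly": 2}


@dataclass(frozen=True)
class Consumer:
    name: str
    natural_grain: str                   # question 1: this consumer's own grain
    incremental_gives_same_result: bool  # question 3: answered by hand


def pipeline_grain(consumers):
    """Question 2: a monolithic pipeline runs at the finest grain any consumer needs."""
    return max((c.natural_grain for c in consumers), key=FINENESS.__getitem__)


def overpaying_consumers(consumers):
    """Consumers coarser than the pipeline grain that could be served
    incrementally at their own grain, i.e. those paying the fine-grain tax."""
    finest = FINENESS[pipeline_grain(consumers)]
    return [
        c.name
        for c in consumers
        if FINENESS[c.natural_grain] < finest and c.incremental_gives_same_result
    ]


# Consumer names are illustrative; the flag values are assumptions.
consumers = [
    Consumer("industry_settlement", "half_hourly", True),
    Consumer("smart_tariff_revenue", "half_hourly", True),
    Consumer("standard_tariff_billing", "monthly", True),
]

print(pipeline_grain(consumers))
print(overpaying_consumers(consumers))
```

A non-empty result from `overpaying_consumers` is the signal to look at a stream-per-grain split before adding compute.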

Generalised takeaway¶

The Octopus rebuild gives this concept its first canonicalised form on the wiki, together with a disclosed financial signature: ~50× cost reduction per settlement date, ~$1M annualised cost avoidance, and freshness improved from weekly to daily, all from re-architecture rather than additional compute. The article's generalisation is explicit:

"Grain misalignment is the hidden cost driver. When a pipeline processes everything at the finest grain regardless of business need, you pay for it in compute, freshness, and maintenance complexity. Identify the natural grains in your data and align processing to them."

The takeaway pairs with concepts/remove-before-add-optimization as the two architectural principles the Octopus team named as transferable beyond UK energy: don't add compute to fix grain misalignment; restructure first.

Seen in¶

sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction: first canonical disclosure. Octopus Energy's legacy monolithic monthly-grain pipeline processed 25 B rows / run because MHHS forced half-hourly grain on every consumer, even monthly-billing consumers. Resolution: three-stream rebuild + Delta CDF + unified source of truth → 300 M rows / run, $0.48 / settlement date.

Concepts: concepts/data-pipeline-grain · concepts/remove-before-add-optimization
Patterns: patterns/grain-aligned-stream-split · patterns/cdf-incremental-replacing-full-rescan · patterns/job-of-jobs-orchestration
Systems: systems/octopus-margin-data-pipeline · systems/delta-lake · systems/apache-spark
Companies: companies/octopus-energy