CONCEPT Cited by 1 source
Data pipeline grain¶
Data pipeline grain is the time (or other dimensional) resolution at which a pipeline processes its inputs and emits its outputs. Every consumer of the pipeline has a natural grain — the resolution at which decisions are made or obligations are enforced — and the architectural question is whether one shared pipeline grain serves every consumer or whether the pipeline must split into multiple grain-specialised streams. (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)
What "grain" means here¶
Grain is a resolution / cardinality property of pipeline inputs and outputs:
- Input grain — how frequently or finely data points arrive (e.g., 2 meter reads / customer / month → 48 reads / customer / day).
- Output grain — the resolution at which the pipeline emits results to a consumer (e.g., a monthly bill row vs a half-hourly settlement row).
- Processing grain — the resolution the pipeline operates at internally. In a monolithic pipeline this is set by the finest-grain consumer; in a grain-aligned split it varies per stream.
Grain isn't always temporal — for hierarchical entities it can be "per-customer" vs "per-meter" vs "per-circuit" — but the canonical case in the Octopus rebuild is time-grain: half-hourly vs monthly.
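The difference between input grain and output grain can be illustrated with a minimal sketch. It assumes a hypothetical customer whose reads arrive at half-hourly grain and are rolled up to one monthly row; the function names, the customer id, and the flat 0.25 kWh per slot are illustrative and do not come from the Octopus source.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def half_hourly_reads(customer_id, start, days, kwh_per_slot):
    """Yield (customer_id, timestamp, kwh) rows at half-hourly input grain."""
    for slot in range(days * 48):  # 48 half-hour slots per day
        yield customer_id, start + timedelta(minutes=30 * slot), kwh_per_slot

def roll_up_to_monthly(reads):
    """Aggregate half-hourly rows to one row per customer per calendar month."""
    totals = defaultdict(float)
    for customer_id, ts, kwh in reads:
        totals[(customer_id, ts.year, ts.month)] += kwh
    return dict(totals)

# Hypothetical customer: 31 days of January at a flat 0.25 kWh per slot.
reads = list(half_hourly_reads("cust-1", datetime(2026, 1, 1), days=31, kwh_per_slot=0.25))
monthly = roll_up_to_monthly(reads)
```

Here the input grain yields 1,488 rows for the month, while the monthly output grain yields a single row for the same customer.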
The Octopus three grains (canonical instance)¶
The Octopus margin pipeline is the wiki's canonical instance. It runs three streams, each at the natural grain of its consumers:
| Stream | Grain | Why this is the natural grain |
|---|---|---|
| Settlement | Half-hourly | "Industry charges at 48 data points per day; this stream matches that grain exactly." |
| Half-Hourly | Half-hourly | "Smart tariff customers: EV drivers, heat pump users, and time-of-use products where the half-hourly price signal is the entire commercial proposition." |
| Monthly | Monthly | "Standard tariff customers, unchanged in grain but now reconcilable against the half-hourly data." |
Two streams share a grain (HH) but not a purpose; the third runs at a fundamentally coarser grain (monthly). All three draw on a unified multi-grain source-of-truth layer that holds inputs at the finest available grain and serves each stream at that stream's chosen processing grain.
Why the choice of grain is load-bearing¶
Two named consequences in the Octopus source:
- Cost is set by grain. A half-hourly pipeline processes 48 data points per customer per day, where a monthly pipeline handles only one or two per customer per month. If the grain is finer than the consumer needs, you pay the volume tax for nothing. (See concepts/grain-misalignment for the failure mode.)
- Tuning is set by grain. "Each stream is independently tunable — what works as a Spark optimisation for Settlement is not necessarily right for NHH." (NHH is the industry abbreviation for non-half-hourly, which presumably corresponds to the Monthly stream here.) Optimisations that suit the finest-grain stream (e.g., aggressive incremental CDF, broadcast joins on small ref tables, AQE-driven dynamic partition coalescing) can be wrong for the coarsest-grain stream. A monolithic pipeline can't apply both profiles.
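The volume side of the cost argument is simple arithmetic, sketched below. The fleet size, the 30-day month, and the figure of 2 monthly reads per customer (borrowed from the input-grain example earlier on this page) are illustrative assumptions, not numbers reported for the Octopus pipeline.

```python
SLOTS_PER_DAY = 48  # half-hourly periods in one day

def rows_per_month(customers, grain, days_in_month=30, monthly_reads=2):
    """Rows processed per month for a fleet at the given grain.

    days_in_month and monthly_reads are illustrative assumptions.
    """
    if grain == "half_hourly":
        return customers * SLOTS_PER_DAY * days_in_month
    if grain == "monthly":
        return customers * monthly_reads
    raise ValueError(f"unknown grain: {grain}")

customers = 1_000_000  # hypothetical fleet size
hh_rows = rows_per_month(customers, "half_hourly")
monthly_rows = rows_per_month(customers, "monthly")
volume_ratio = hh_rows // monthly_rows
```

Under these assumptions the half-hourly grain carries 1.44 billion rows a month against 2 million at monthly grain, a ratio of 720. That is the volume a monthly consumer would be paying for if it were forced through a half-hourly pipeline.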
The shared upstream consumption layer¶
The architectural complement to "each stream at its own grain" is that the underlying source-of-truth layer lives at the finest grain available — half-hourly in the Octopus case — and each stream selects the grain it needs from that shared substrate.
"Underpinning all three is the upstream consumption layer: a unified, multi-grain source of truth consolidating meter reads, smart meter data, and industry flows at multi-terabyte scale. This layer is the reconciliation bridge between monthly billing and half-hourly settlement."
The reconciliation bridge framing matters: monthly and half-hourly outputs must agree at the customer level, even though they're computed at different grains. The shared layer makes that agreement queryable.
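A cross-grain consistency check of this kind might look like the sketch below. It assumes half-hourly rows shaped as `(customer_id, month, kwh)` and monthly figures keyed by `(customer_id, month)`; the `reconcile` function, the row shapes, and the tolerance are hypothetical and are not taken from the Octopus source.

```python
import math

def reconcile(hh_rows, monthly_rows, tolerance_kwh=0.001):
    """Return the customer-months whose summed half-hourly kWh disagrees with the monthly kWh."""
    hh_totals = {}
    for customer_id, month, kwh in hh_rows:
        key = (customer_id, month)
        hh_totals[key] = hh_totals.get(key, 0.0) + kwh

    mismatches = {}
    for key in hh_totals.keys() | monthly_rows.keys():
        hh = hh_totals.get(key)
        monthly = monthly_rows.get(key)
        # A key missing on either side counts as a mismatch.
        if hh is None or monthly is None or not math.isclose(
            hh, monthly, rel_tol=0.0, abs_tol=tolerance_kwh
        ):
            mismatches[key] = (hh, monthly)
    return mismatches

# Hypothetical data: customer "a" reconciles, customer "b" is 0.5 kWh out.
hh_rows = [("a", "2026-01", 0.5)] * 4 + [("b", "2026-01", 1.0)] * 2
monthly_rows = {("a", "2026-01"): 2.0, ("b", "2026-01"): 2.5}
mismatches = reconcile(hh_rows, monthly_rows)
```

Because both sides are keyed by customer and month, any disagreement surfaces as a specific customer-month rather than as an unexplained difference in totals.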
Choosing a grain — checklist¶
When designing a pipeline (or refactoring a misaligned one):
- Enumerate consumers. What downstream system reads each output, and at what frequency / resolution?
- Map each consumer to its natural grain. Don't assume the finest grain is the right one for everyone.
- Identify the finest grain among all consumers. That's what the source-of-truth layer must hold.
- Split processing per consumer-grain. Each grain becomes a stream; each stream is independently tunable.
- Pick a reconciliation bridge. A shared upstream layer that serves all grains and lets cross-grain consistency queries run.
- Orchestrate. A "Job of Jobs" (or equivalent higher-level scheduler) manages cross-stream dependencies and parallelism (see patterns/job-of-jobs-orchestration).
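The first few steps of the checklist can be sketched as a small planning function. This is a sketch under stated assumptions: the grain names, the `GRAIN_PERIOD` table, and `plan_pipeline` are hypothetical, and the example keeps one stream per consumer because the Octopus case shows two consumers sharing a grain while still warranting separate streams.

```python
from datetime import timedelta

# Nominal period of each grain, used only to order grains from fine to coarse.
GRAIN_PERIOD = {
    "half_hourly": timedelta(minutes=30),
    "daily": timedelta(days=1),
    "monthly": timedelta(days=30),
}

def plan_pipeline(consumers):
    """Map each consumer to a stream at its natural grain and pick the source-layer grain.

    consumers: dict of consumer name -> natural grain name.
    """
    unknown = sorted(set(consumers.values()) - set(GRAIN_PERIOD))
    if unknown:
        raise ValueError(f"unknown grains: {unknown}")
    # The source-of-truth layer must hold the finest grain any consumer needs.
    finest = min(set(consumers.values()), key=lambda g: GRAIN_PERIOD[g])
    streams = [{"stream": name, "grain": grain} for name, grain in sorted(consumers.items())]
    return {"source_layer_grain": finest, "streams": streams}

plan = plan_pipeline({
    "settlement": "half_hourly",
    "smart_tariff": "half_hourly",
    "standard_tariff": "monthly",
})
```

The resulting plan holds the source layer at half-hourly grain and lists three streams, two half-hourly and one monthly. Scheduling those streams is left to the orchestrator.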
Generalised beyond energy¶
The Octopus source explicitly generalises:
"MHHS is a UK energy regulation. However, the pattern it represents — a regulatory or business event that multiplies data volume at a finer grain — is not unique to energy. Any time a system moves from monthly to daily, daily to real-time, or aggregate to transactional, the same dynamics apply."
Likely instances of the same shape elsewhere on the wiki (not yet canonicalised under this concept):
- Real-time pricing / inventory pushes that supplement nightly batch catalogue pipelines.
- Streaming feature stores supplementing daily-batch model training pipelines.
- Per-event observability supplementing aggregated daily metrics.
Seen in¶
- sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction: first canonical disclosure. Three streams at their natural grains (settlement-HH, smart-tariff-HH, standard-tariff-monthly), one unified multi-grain source-of-truth layer at HH grain.