
CONCEPT Cited by 1 source

Data pipeline grain

Data pipeline grain is the time (or other dimensional) resolution at which a pipeline processes its inputs and emits its outputs. Every consumer of the pipeline has a natural grain — the resolution at which decisions are made or obligations are enforced — and the architectural question is whether one shared pipeline grain serves every consumer or whether the pipeline must split into multiple grain-specialised streams. (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)

What "grain" means here

Grain is a resolution / cardinality property of pipeline inputs and outputs:

  • Input grain — how frequently or finely data points arrive (e.g., a move from 2 meter reads per customer per month to 48 reads per customer per day).
  • Output grain — the resolution at which the pipeline emits results to a consumer (e.g., a monthly bill row vs a half-hourly settlement row).
  • Processing grain — the resolution the pipeline operates at internally. In a monolithic pipeline this is set by the finest-grain consumer; in a grain-aligned split it varies per stream.
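The difference between the monolithic and grain-aligned cases can be sketched in a few lines. This is a minimal illustration, not the Octopus implementation: the `Grain` enum, the consumer names, and `processing_grain` are all hypothetical, and the mapping of consumers to grains is assumed from the examples on this page.

```python
from enum import IntEnum


class Grain(IntEnum):
    """Time grains ordered so that a smaller value means a finer grain."""
    HALF_HOURLY = 1
    DAILY = 2
    MONTHLY = 3


# Hypothetical consumers and their natural grains (illustrative only).
CONSUMERS = {
    "settlement": Grain.HALF_HOURLY,
    "smart_tariff_billing": Grain.HALF_HOURLY,
    "standard_tariff_billing": Grain.MONTHLY,
}


def processing_grain(stream: str, consumers: dict[str, Grain], monolithic: bool) -> Grain:
    """Return the grain at which the given stream's data is processed.

    In a monolithic pipeline every consumer is processed at the finest
    grain any consumer needs; in a grain-aligned split each stream runs
    at its own natural grain.
    """
    if monolithic:
        return min(consumers.values())
    return consumers[stream]


# The monthly consumer is dragged down to half-hourly in the monolith,
# but keeps its own grain once the pipeline is split.
in_monolith = processing_grain("standard_tariff_billing", CONSUMERS, monolithic=True)
in_split = processing_grain("standard_tariff_billing", CONSUMERS, monolithic=False)
```

Here `in_monolith` is `Grain.HALF_HOURLY` and `in_split` is `Grain.MONTHLY`, which is the whole argument for splitting in miniature.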

Grain isn't always temporal — for hierarchical entities it can be "per-customer" vs "per-meter" vs "per-circuit" — but the canonical case in the Octopus rebuild is time-grain: half-hourly vs monthly.

The Octopus three grains (canonical instance)

The Octopus margin pipeline is the wiki's canonical instance: three streams, each at its natural grain (two distinct grains in total).

Stream | Grain | Why this is the natural grain
Settlement | Half-hourly | "Industry charges at 48 data points per day; this stream matches that grain exactly."
Half-Hourly | Half-hourly | "Smart tariff customers: EV drivers, heat pump users, and time-of-use products where the half-hourly price signal is the entire commercial proposition."
Monthly | Monthly | "Standard tariff customers, unchanged in grain but now reconcilable against the half-hourly data."

Two streams share a grain (HH) but not a purpose; the third is at a fundamentally coarser grain (monthly). All three converge on a unified multi-grain source-of-truth layer that holds inputs at the finest available grain and serves them to each stream at the stream's chosen processing grain.
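A layer that "holds inputs at the finest available grain and serves them at the stream's chosen grain" can be sketched as a roll-up over half-hourly reads. This is an assumption-laden toy: the `Read` tuple shape, the `serve` function, and the two grain labels are hypothetical, and a real layer at multi-terabyte scale would not be an in-memory list.

```python
from collections import defaultdict
from datetime import datetime

# (customer_id, interval_start, kWh): the finest grain the layer holds.
Read = tuple[str, datetime, float]

# Period label format per grain; coarser grains truncate the timestamp.
PERIOD_FORMATS = {
    "half_hourly": "%Y-%m-%dT%H:%M",
    "monthly": "%Y-%m",
}


def serve(reads: list[Read], grain: str) -> dict[tuple[str, str], float]:
    """Serve finest-grain reads at the requested grain.

    Keys are (customer_id, period_label); values are summed kWh.
    """
    if grain not in PERIOD_FORMATS:
        raise ValueError(f"unsupported grain: {grain}")
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for customer, start, kwh in reads:
        totals[(customer, start.strftime(PERIOD_FORMATS[grain]))] += kwh
    return dict(totals)


# Three hypothetical half-hourly reads for one customer.
READS = [
    ("c1", datetime(2026, 1, 1, 0, 0), 0.5),
    ("c1", datetime(2026, 1, 1, 0, 30), 0.25),
    ("c1", datetime(2026, 1, 2, 0, 0), 0.25),
]

monthly_view = serve(READS, "monthly")          # one row per customer-month
half_hourly_view = serve(READS, "half_hourly")  # one row per read
```

Both views come from the same stored rows, so the monthly stream never has to touch half-hourly volume in its own processing; it asks the shared layer for a monthly view.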

Why the choice of grain is load-bearing

Two named consequences in the Octopus source:

  1. Cost is set by grain. A pipeline running at half-hourly grain processes 48 data points per customer per day, roughly 1,440 in a 30-day month, where a monthly pipeline handles only a handful per customer per month. If the grain is finer than the consumer needs, you pay the volume tax for nothing. (See concepts/grain-misalignment for the failure mode.)
  2. Tuning is set by grain. "Each stream is independently tunable — what works as a Spark optimisation for Settlement is not necessarily right for NHH." (NHH is non-half-hourly, i.e. the Monthly stream.) Optimisations that suit the finest-grain stream, such as aggressive incremental processing via Change Data Feed (CDF), broadcast joins on small reference tables, and partition coalescing driven by Adaptive Query Execution (AQE), can be wrong for the coarsest-grain stream. A monolithic pipeline can't apply both profiles.
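The cost consequence is simple arithmetic, sketched below. The figures are assumptions for illustration: a 30-day month, one row per customer per month at monthly grain, and a made-up customer mix. The function names are hypothetical and nothing here comes from the Octopus source beyond the 48-points-per-day figure.

```python
HALF_HOURS_PER_DAY = 48


def points_per_customer_month(grain: str, days_in_month: int = 30) -> int:
    """Data points one customer generates in a month at the given grain."""
    if grain == "half_hourly":
        return HALF_HOURS_PER_DAY * days_in_month
    if grain == "monthly":
        return 1  # assumption: one row per customer per month
    raise ValueError(f"unsupported grain: {grain}")


def volume_tax(customers_by_grain: dict[str, int], processing_grain: str,
               days_in_month: int = 30) -> int:
    """Data points processed beyond what each consumer's natural grain needs."""
    processed = points_per_customer_month(processing_grain, days_in_month)
    tax = 0
    for natural_grain, customers in customers_by_grain.items():
        needed = points_per_customer_month(natural_grain, days_in_month)
        tax += customers * max(0, processed - needed)
    return tax


# Hypothetical customer mix, for illustration only.
MIX = {"half_hourly": 100, "monthly": 1_000}

# Running everything at half-hourly grain: the 1,000 monthly customers
# each cost 1,440 points instead of 1, so the tax is 1,000 * 1,439.
MONOLITH_TAX = volume_tax(MIX, "half_hourly")  # 1_439_000
```

The half-hourly customers contribute nothing to the tax because they need that grain; the entire overhead comes from consumers processed at a finer grain than they use.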

The shared upstream consumption layer

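The architectural complement to "each stream at its own grain" is that the underlying source-of-truth layer lives at the finest grain available (half-hourly in the Octopus case), and each stream selects the grain it needs from that shared substrate. The source names it the "downstream consumption layer"; relative to the three streams it sits upstream, which is the sense used here.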

"Underpinning all three is the downstream consumption layer: a unified, multi-grain source of truth consolidating meter reads, smart meter data, and industry flows at multi-terabyte scale. This layer is the reconciliation bridge between monthly billing and half-hourly settlement."

The reconciliation bridge framing matters: monthly and half-hourly outputs must agree at the customer level, even though they're computed at different grains. The shared layer makes that agreement queryable.
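A cross-grain consistency query of this kind might look like the following sketch: roll the half-hourly rows up to customer-month and compare against the monthly figures. The row shapes, the `reconcile` function, and the tolerance are hypothetical; the source does not describe how Octopus runs this check.

```python
from collections import defaultdict


def reconcile(hh_rows, monthly_rows, tolerance_kwh=1e-6):
    """Return customer-months where the two grains disagree.

    hh_rows:      iterable of (customer_id, "YYYY-MM-DDTHH:MM", kwh)
    monthly_rows: iterable of (customer_id, "YYYY-MM", kwh)
    Result maps (customer_id, month) to (half_hourly_total, monthly_total).
    A key missing on one side is treated as 0.0 on that side.
    """
    rolled = defaultdict(float)
    for customer, timestamp, kwh in hh_rows:
        rolled[(customer, timestamp[:7])] += kwh  # first 7 chars are YYYY-MM

    billed = {(customer, month): kwh for customer, month, kwh in monthly_rows}

    mismatches = {}
    for key in sorted(rolled.keys() | billed.keys()):
        hh_total = rolled.get(key, 0.0)
        monthly_total = billed.get(key, 0.0)
        if abs(hh_total - monthly_total) > tolerance_kwh:
            mismatches[key] = (hh_total, monthly_total)
    return mismatches


# Hypothetical data: c1 agrees across grains, c2 does not.
HH_ROWS = [
    ("c1", "2026-01-01T00:00", 0.5),
    ("c1", "2026-01-01T00:30", 0.25),
    ("c2", "2026-01-01T00:00", 1.0),
]
MONTHLY_ROWS = [
    ("c1", "2026-01", 0.75),
    ("c2", "2026-01", 1.5),
]

DISAGREEMENTS = reconcile(HH_ROWS, MONTHLY_ROWS)
```

With this data only `("c2", "2026-01")` is reported, carrying both totals so the discrepancy can be investigated at either grain.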

Choosing a grain — checklist

When designing a pipeline (or refactoring a misaligned one):

  • Enumerate consumers. What downstream system reads each output, and at what frequency / resolution?
  • Map each consumer to its natural grain. Don't assume the finest grain is the right one for everyone.
  • Identify the finest grain among all consumers. That's what the source-of-truth layer must hold.
  • Split processing per consumer-grain. Each grain becomes a stream; each stream is independently tunable.
  • Pick a reconciliation bridge. A shared upstream layer that serves all grains and lets cross-grain consistency queries run.
  • Orchestrate. A "Job of Jobs" (or equivalent higher-level scheduler) manages cross-stream dependencies and parallelism (see patterns/job-of-jobs-orchestration).
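The checklist can be walked through mechanically, as in this sketch. Everything in it is hypothetical: the consumer names, the `GRAIN_ORDER` list, the `plan_pipeline` function, and the node names in the dependency graph. The ordering step uses the standard-library `graphlib.TopologicalSorter` as a stand-in for a real "Job of Jobs" scheduler.

```python
from graphlib import TopologicalSorter

# Grains listed finest first; the index doubles as a fineness rank.
GRAIN_ORDER = ["half_hourly", "daily", "monthly"]

# Steps 1-2: enumerate consumers and map each to its natural grain.
CONSUMERS = {
    "settlement": "half_hourly",
    "smart_tariff_billing": "half_hourly",
    "standard_tariff_billing": "monthly",
}


def plan_pipeline(consumers: dict[str, str]) -> dict:
    """Derive a grain-aligned plan from a consumer-to-grain mapping."""
    # Step 3: the source-of-truth layer must hold the finest grain needed.
    finest = min(consumers.values(), key=GRAIN_ORDER.index)

    # Step 4: one stream per consumer, each at that consumer's grain.
    streams = {f"{name}_stream": grain for name, grain in consumers.items()}

    # Steps 5-6: every stream depends on the shared layer; a topological
    # sort gives an order in which a parent job could launch them.
    dependencies = {"source_of_truth": set()}
    for stream in streams:
        dependencies[stream] = {"source_of_truth"}
    run_order = list(TopologicalSorter(dependencies).static_order())

    return {
        "source_of_truth_grain": finest,
        "streams": streams,
        "run_order": run_order,
    }


PLAN = plan_pipeline(CONSUMERS)
```

The shared layer always comes first in `run_order`; the streams that follow have no dependencies on each other in this sketch, so a scheduler would be free to run them in parallel.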

Generalised beyond energy

The Octopus source explicitly generalises:

"MHHS is a UK energy regulation. However, the pattern it represents — a regulatory or business event that multiplies data volume at a finer grain — is not unique to energy. Any time a system moves from monthly to daily, daily to real-time, or aggregate to transactional, the same dynamics apply."

Likely instances of the same shape elsewhere on the wiki (not yet canonicalised under this concept):

  • Real-time pricing / inventory pushes that supplement nightly batch catalogue pipelines.
  • Streaming feature stores supplementing daily-batch model training pipelines.
  • Per-event observability supplementing aggregated daily metrics.

Seen in

  • sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction
