SYSTEM Cited by 1 source
Octopus Energy Margin Data Pipeline¶
The Octopus Energy Margin Data Pipeline is the data engineering system that calculates margin, settlement obligations, and commercial KPIs for Octopus Energy's 8M+ UK customers. It is the first wiki-canonicalised instance of a grain-aligned three-stream rebuild of a previously monolithic finest-grain pipeline, triggered by the UK's regulatory move to Market-wide Half-Hourly Settlement (MHHS). Built on Delta Lake / Apache Spark / Databricks Serverless by a team of three over three months. (Source: sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction)
Why it exists¶
The UK's MHHS regulation moves every household from 2 meter reads / month → 48 reads / day — a 48× increase in data points per customer driving every margin calculation, settlement obligation, and commercial decision. Without re-architecture, the projected legacy infrastructure cost was +$1M / yr, with a per-settlement-date cost of $23.63 (a 33× increase from the legacy norm of $0.71).
The instinct to throw compute at the volume increase was rejected as untenable. "You can't just throw more compute at a problem like this. You have to rebuild and rethink your logic from the ground up." (Saad Ali, Lead of the Margin Data Team)
The deeper diagnosis was grain misalignment: the legacy pipeline ran everything at a single (monthly) grain because billing was monthly. MHHS introduced a fundamental split — industry settlement at half-hourly, smart-tariff revenue at half-hourly, standard-tariff revenue still monthly — and "running all three through a single monolithic pipeline meant processing the entire dataset on every run, regardless of what had actually changed".
Architecture¶
Three streams, one source of truth¶
Industry data flows (HH grain, 48/day/customer)
│
▼
┌──────────────────────────────────────┐
│ Unified multi-grain source-of-truth │
│ (multi-terabyte) — meter reads + │ ← incremental via Delta CDF
│ smart meter data + industry flows │ (25 B → 300 M rows/run)
│ reconciliation bridge between │
│ monthly billing and HH settlement │
└─────────────────┬────────────────────┘
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌──────────┐
│Settlement│ │ Half-Hourly │ │ Monthly │
│ (HH grain│ │ (smart │ │(standard │
│ for │ │ tariff: │ │ tariff) │
│ cost │ │ EV, heat │ │ │
│ alloc.) │ │ pump, ToU) │ │ │
└──────────┘ └──────────────┘ └──────────┘
└─── Job of Jobs orchestration ───┘
(dependencies + parallel execution,
each stream independently tunable)
The three streams (verbatim grains)¶
| Stream | Grain | Purpose |
|---|---|---|
| Settlement | Half-hourly | Regulatory settlement and cost allocation; "Industry charges at 48 data points per day; this stream matches that grain exactly." |
| Half-Hourly | Half-hourly | "Smart tariff customers: EV drivers, heat pump users, and time-of-use products where the half-hourly price signal is the entire commercial proposition." |
| Monthly | Monthly | "Standard tariff customers, unchanged in grain but now reconcilable against the half-hourly data." |
Each stream is independently tunable — "what works as a Spark optimisation for Settlement is not necessarily right for NHH" (NHH = Non-Half-Hourly, the monthly stream).
The article generalises this stream-per-grain split as patterns/grain-aligned-stream-split — the three-streams pattern applies any time a regulatory or business event multiplies data volume at a finer grain than the existing pipeline assumes.
"Job of Jobs" orchestration¶
A "Job of Jobs" orchestration pattern manages dependencies and parallel execution across all three streams. Canonicalised as patterns/job-of-jobs-orchestration: the higher-level scheduler preserves each constituent pipeline's independent tuning profile while enforcing inter-stream order. The article does not name the orchestration runtime explicitly — likely Databricks Workflows / Lakeflow Jobs.
Unified multi-grain source-of-truth layer¶
"Underpinning all three is the downstream consumption layer: a unified, multi-grain source of truth consolidating meter reads, smart meter data, and industry flows at multi-terabyte scale. This layer is the reconciliation bridge between monthly billing and half-hourly settlement — and it became the site of the single highest-leverage optimisation in the project."
The reconciliation bridge framing is load-bearing: monthly billing and half-hourly settlement must agree at the customer level, even though they're computed at different grains. The shared layer is what makes that agreement queryable.
The CDF incremental win (single highest-leverage optimisation)¶
The naive shape — reprocessing the entire multi-terabyte upstream-consumption dataset on every run — was the original cost sink. Delta Lake's Change Data Feed made true incremental processing viable at multi-terabyte grain.
| Metric | Before (full overwrite) | After (CDF) | Change |
|---|---|---|---|
| Rows processed / run | 25 billion | 300 million | 98.8% reduction |
| Data freshness | Weekly | Daily | 7× improvement |
"the pipeline now reads only records that have actually changed since the last run"
Important: the headline $1M / yr annualised cost avoidance figure excludes the upstream-incremental savings. The full efficiency gain is materially larger. Canonicalised as patterns/cdf-incremental-replacing-full-rescan.
Spark / Delta optimisation taxonomy (four named categories)¶
1. Lineage and I/O reduction¶
- "Simplified lineage by consolidating data early in the pipeline, reducing downstream joins and shuffle operations".
- "Data pruning: selected only the columns strictly necessary for settlement and pruned rows at the earliest possible stage, reducing I/O overhead before expensive transformations."
2. Join and partition tuning¶
- Broadcast joins for reference tables under 500 MB, "eliminating expensive shuffle operations on complex multi-key joins with date ranges". Canonicalised as patterns/broadcast-join-for-small-reference-tables — the 500 MB threshold is disclosed verbatim.
- Liquid clustering "enabled across multiple tables for columns frequently used in filters and joins". The verbatim rationale: "Liquid clustering dynamically co-locates related records on the specified clustering keys without requiring fixed partition boundaries. Liquid clustering avoids the small-file problem, higher memory consumption, and I/O overhead that come from over-partitioning."
3. Trusted the optimiser¶
- "In several cases, Spark's Adaptive Query Execution (AQE) outperformed hand-tuned logic. The team removed custom optimisation code and let AQE do its job." Canonicalised as systems/spark-aqe — the optimiser as a system actor; the action item is to delete code, not write more.
4. Remove before you add¶
- "removing unjustified compute operations was as impactful as adding new optimisations. If you are running Z-ordering or ANALYZE without measuring their effect, they may be costing you more than they are saving." Canonicalised as concepts/remove-before-add-optimization — the measurement-required principle that turns existing optimisations from sacred-cows into hypotheses.
Development-velocity disclosure¶
Databricks Serverless is named as the development-loop enabler — not just a runtime. The team's verbatim framing:
"The testing and development process could not have been done without serverless. Using the serverless UI helped us to identify bottlenecks and make easy comparisons between different runs."
Two specific affordances:
- Zero cluster startup time — "the team could iterate rapidly — write, run, measure, adjust — without waiting for infrastructure to provision".
- Side-by-side run comparisons in the Serverless UI — "making it practical to isolate the effect of individual optimisations".
This is a new face on Databricks Serverless distinct from the stability-as-system-property framing from the 2026-05-06 source — the iteration-loop accelerator face. Three engineers / three months is partly attributed to this property.
Operational results (verbatim from the article's table)¶
| Metric | Before | After | Change |
|---|---|---|---|
| Rows processed per run | 25 billion | 300 million | 98.8% reduction |
| Cost per settlement date (projected MHHS) | $23.63 | $0.48 | ~50× reduction |
| Cost per settlement date (vs legacy) | $0.71 | $0.48 | 2× more efficient |
| Savings per month-end run | — | ~$83,000 | vs unoptimised projection |
| Annualised cost avoidance | — | ~$1,000,000 | excludes upstream savings |
| Data freshness | Weekly | Daily | 7× improvement |
| Build time | — | 3 months | Team of three |
"The $0.48 per settlement date is not just a 50x reduction from the MHHS projected cost — it is 2x cheaper than the legacy system had ever been, despite processing 48x more data points. Re-architecture delivered regulatory compliance and made the system materially more efficient than the one it replaced."
What this case canonicalises on the wiki¶
- concepts/grain-misalignment — first canonical instance.
- concepts/data-pipeline-grain — first canonical instance.
- concepts/remove-before-add-optimization — first canonical instance.
- patterns/grain-aligned-stream-split — first canonical instance.
- patterns/cdf-incremental-replacing-full-rescan — first canonical instance (vs medallion Bronze→Silver use of CDF).
- patterns/broadcast-join-for-small-reference-tables — first canonical instance with explicit size threshold.
- patterns/job-of-jobs-orchestration — first canonical instance.
- systems/spark-aqe — first dedicated wiki page (previously only mentioned in concept tags).
- systems/liquid-clustering — first dedicated wiki page (previously only mentioned in tags across half a dozen pages).
Seen in¶
- sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction — first and only wiki disclosure (as of 2026-05-23). Grain-aligned three-stream rebuild + Delta CDF upstream incremental + Spark/AQE
- serverless development velocity. $0.48 / settlement date, ~$1M / yr cost avoidance, 3 months, team of three.
Related¶
- Companies: companies/octopus-energy · companies/databricks
- Systems: systems/delta-lake · systems/apache-spark · systems/spark-aqe · systems/liquid-clustering · systems/databricks-serverless-compute
- Concepts: concepts/grain-misalignment · concepts/data-pipeline-grain · concepts/delta-change-data-feed · concepts/remove-before-add-optimization
- Patterns: patterns/grain-aligned-stream-split · patterns/cdf-incremental-replacing-full-rescan · patterns/broadcast-join-for-small-reference-tables · patterns/job-of-jobs-orchestration