CONCEPT Cited by 1 source

Heterogeneous code migration

Heterogeneous code migration is the failure mode where a body of code that needs to be migrated from one platform to another is so non-uniform across instances that no fixed set of deterministic rewrite rules can correctly handle it. The migration's input space is structurally diverse — different code bases reference different data sources, use different custom libraries, embody different business logic, and were authored by different people over different time horizons. Adding more rules cannot close the gap because the gap is not in rule coverage; it is in the non-uniformity of the input itself.

This concept is canonicalised in the 2026-05-19 Deutsche Börse / Databricks customer-blog post as the design constraint that ruled out a rule-based migration engine for Deutsche Börse's Zeppelin notebook estate, which spans roughly 95% of its Clearing and Trading data.

The canonical articulation

The diversity across the entire notebook landscape made a rule-based rewriting engine impractical, since the logic was simply too heterogeneous and too business-specific for automated rules to handle reliably. Each one reflected institutional knowledge from the business teams who relied on it.

(— Deutsche Börse / Databricks, 2026-05-19)

Why heterogeneity defeats rule engines, structurally

A rule engine works by enumerating input shapes it knows how to transform. It scales gracefully when:

  • The input space is bounded. A finite enumeration of shapes covers the cases.
  • The shapes are recurrent. Most inputs match known patterns; the long tail is small.
  • The shapes are externally specified. Documentation defines them; new shapes don't appear ad hoc.
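
The three conditions above can be made concrete with a minimal sketch of a rule engine as a finite list of (pattern, rewrite) pairs. The rules shown, including the `z.show` to `display` and `%pyspark` to `%python` mappings and the `team_helpers.load_eod` call in the usage note, are illustrative assumptions and are not taken from the source post.

```python
import re
from typing import Callable, Optional

# Hypothetical rules: each pairs a pattern for one known input shape
# with the rewrite to apply when that shape is seen.
Rule = tuple[re.Pattern[str], Callable[[re.Match[str]], str]]

RULES: list[Rule] = [
    # Assumed mapping of a Zeppelin display call to a target-platform call.
    (re.compile(r"^z\.show\((?P<arg>\w+)\)$"), lambda m: f"display({m['arg']})"),
    # Assumed interpreter-prefix mapping.
    (re.compile(r"^%pyspark$"), lambda m: "%python"),
]


def rewrite_line(line: str) -> Optional[str]:
    """Return the rewritten line, or None when no known shape matches."""
    for pattern, rewrite in RULES:
        match = pattern.match(line)
        if match:
            return rewrite(match)
    return None


def coverage(lines: list[str]) -> float:
    """Fraction of lines that at least one rule recognises."""
    if not lines:
        return 1.0
    return sum(rewrite_line(line) is not None for line in lines) / len(lines)
```

A line such as `team_helpers.load_eod('XETR')` matches no rule and comes back as `None`. The engine can report that it does not recognise the shape, but it cannot say what the line means for the business case that depends on it.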

A heterogeneous migration violates all three. Each notebook (or service, or query) looks different from every other notebook because the variability is not in the migration tool's coverage — it is in the customer data:

  • Different teams chose different conventions over years.
  • Custom interpreters / extensions / helper imports embody team-specific knowledge.
  • Data-source references reflect operational decisions made at notebook-authoring time.
  • The "code" includes institutional knowledge: why a particular SQL aggregation is correct for this business case, not just what the SQL looks like.

The number of distinct input shapes scales with number of authors × time × business surface, not with platform size. No finite ruleset matches.

How it differs from related problems

  • vs deterministic format conversion — converting the container of code (paragraph→cell, JSON-schema reformat, interpreter prefix mapping) is deterministic and rule-tractable. The problem is converting the content that lives inside the container. See concepts/notebook-format-migration for the structural-vs-logical split.
  • vs schema migration — schema migrations (online DDL, catalog updates) operate on a finite set of types and constraints with externally-specified semantics. Heterogeneous code migration operates on an open-ended set of business-logic shapes.
  • vs code translation between languages — machine-driven Python→Java or COBOL→Java translation is heterogeneous within a language but bounded by language semantics. Notebook-content migration adds the orthogonal axis of environment-specific references (custom interpreters, internal helper modules, in-house data conventions) that have no analog on the destination platform.
  • vs simple refactoring — refactorings are local, mechanical, and target-shape-known. Heterogeneous migration target shapes are unknown until the input is inspected.
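
The container-versus-content distinction in the first comparison can be sketched as follows. This is a simplified illustration, assuming a Zeppelin note serialised as JSON with a `paragraphs` list whose `text` field begins with an interpreter prefix on its own line. The `PREFIX_MAP` entries and the output cell layout are hypothetical, not the converter described in the source.

```python
import json

# Assumed interpreter-prefix mapping; a real estate would need its own table.
PREFIX_MAP = {
    "%pyspark": "%python",
    "%spark.pyspark": "%python",
    "%sql": "%sql",
    "%md": "%md",
}


def paragraph_to_cell(paragraph: dict) -> dict:
    """Convert one paragraph's container; the body is copied through untouched."""
    text = paragraph.get("text") or ""
    first_line, _, body = text.partition("\n")
    prefix = first_line.strip()
    if prefix in PREFIX_MAP:
        return {"magic": PREFIX_MAP[prefix], "source": body, "mapped": True}
    # Unknown or custom interpreter: keep the text as-is and flag it.
    return {"magic": None, "source": text, "mapped": False}


def convert_note(note_json: str) -> list[dict]:
    """Convert every paragraph of a note into a target-style cell."""
    note = json.loads(note_json)
    return [paragraph_to_cell(p) for p in note.get("paragraphs", [])]
```

The function never inspects `source`. Everything it does is a table lookup on the container, which is why this layer is rule-tractable while the content it carries is not.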

What works instead: split the problem

The architectural fix is to separate the deterministic and heterogeneous sub-problems and apply different mechanisms to each. The deterministic part (structure, container, format) goes to a rule engine; the heterogeneous part (business logic, references, custom dependencies) goes to a context-grounded LLM that can interpret per-instance content. The seam between the two is the design surface.

This is canonicalised at patterns/structural-deterministic-logical-llm-split. The pattern's load-bearing claim is that only the LLM stage scales with non-uniformity, but only the rule stage is cheap and deterministic on the recurring structural transforms. Forcing either to do the other's job is the failure mode.
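
A minimal sketch of the seam, assuming the LLM stage is injected as a plain callable. The names `migrate_cell`, `MigratedCell`, and `interpret_logic` are hypothetical, and no real Genie or Databricks API is used here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MigratedCell:
    source: str
    handled_by: str  # "rules" or "llm"


def migrate_cell(
    source: str,
    structural_rules: dict[str, str],
    interpret_logic: Callable[[str, str], str],
    context: str,
) -> MigratedCell:
    """Route a cell to the cheap rule stage first, else to the LLM stage."""
    stripped = source.strip()
    if stripped in structural_rules:
        # Recurring structural shape: deterministic lookup, no model call.
        return MigratedCell(structural_rules[stripped], "rules")
    # Anything else is treated as business logic and interpreted per instance,
    # with the grounding context passed alongside the cell.
    return MigratedCell(interpret_logic(source, context), "llm")
```

Injecting `interpret_logic` keeps the routing decision testable with a stub in place of a model, and makes the seam an explicit function boundary rather than an implicit convention.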

Why this concept is timely in 2026

Three converging factors make heterogeneous-code-migration a frequent enterprise problem in 2026:

  • EOL-driven platform consolidation. Examples include Cloudera Zeppelin (2027), mainframe-language deprecations, and vendor-platform exits. Migration is no longer optional.
  • LLM capability has crossed the threshold. Per-instance context interpretation (the missing piece in 2018-era rule-based migrations) is now operationally tractable with grounded prompts and clarifying-question loops. See concepts/context-encoded-llm-prompt and systems/databricks-genie.
  • Self-service-for-business-users is now the deployment target. Enterprise migrations are no longer scoped to engineering teams of dozens; they target thousands of business users who cannot follow engineering-grade migration playbooks. The hybrid rule + LLM design specifically enables business-user self-service that pure rule-based or pure manual migration cannot.

Seen in

  • 2026-05-19: Deutsche Börse Zeppelin migration. (Source: sources/2026-05-19-databricks-deutsche-borse-zeppelin-to-databricks-notebook-migration.) Canonical first-wiki appearance. The team explicitly named heterogeneity as the design constraint that ruled out a rule-based engine for the body of Zeppelin notebooks at StatistiX. They quote the structural insight: "the logic was simply too heterogeneous and too business-specific for automated rules to handle reliably." The fix: apply rules to the deterministic structural conversion, and delegate the heterogeneous logic-reconstruction stage to Genie via a context-encoded prompt. Reported result: per-notebook migration time went from hours to minutes on a 2,000-user migration.