PATTERN Cited by 1 source
Structural-deterministic, logical-LLM split¶
A migration pattern where a heterogeneous code-migration problem is decomposed into two sub-problems with very different cost and correctness profiles, and a different mechanism is applied to each:
- Structural conversion — deterministic, rule-driven. Maps one container format to another (paragraph → cell, schema reformat, interpreter prefix translation, metadata rewrite). Original payload content is preserved verbatim. Compiles cleanly to rules because the input space of structural transforms is finite, recurrent, and externally specified.
- Logical reconstruction — heterogeneous, per-instance. Translates business-specific code, references, custom dependencies, and embedded knowledge against the destination platform's primitives. Resists rules because the input space scales with author × time × business surface. Handed off to an LLM agent that can interpret per-instance content and ask clarifying questions.
The architectural insight is that what looks like one migration is actually two — and forcing either mechanism (rules or LLM) to do the other's job produces well-known failure modes. The handoff between the two stages is the load-bearing design surface.
Canonicalised in the 2026-05-19 Deutsche Börse / Databricks customer-blog post for a 2,000-user Zeppelin-on-Cloudera → Databricks migration.
The thesis (verbatim)¶
Structural conversion (mapping Zeppelin's paragraph format to Databricks cells, translating interpreter syntax, reformatting metadata) is deterministic and automatable, while logic reconstruction is not. Thankfully, LLMs are great at this [logic reconstruction] part… This hybrid approach of automating the deterministic part and delegating the variable part allows us to avoid the brittleness of rule-based systems and leverage AI where it actually performs well.
(— Deutsche Börse / Databricks, 2026-05-19)
Architecture¶
Source artifact (heterogeneous content in source-format container)
        |
        v
+--------------------------------+
| STAGE 1: Structural converter  |  -- deterministic, rule-driven
|                                |
|  - container format remap      |
|  - prefix / magic mapping      |
|  - metadata schema rewrite     |
|  - LOGIC PRESERVED VERBATIM    |
+--------------------------------+
        |
        +--> Destination-format container with original logic intact
        |
        +--> Context-encoded prompt template (handoff)
        |
        v
+--------------------------------+
| STAGE 2: LLM reconstruction    |  -- heterogeneous, per-instance
|                                |
|  - interprets context          |
|  - asks clarifying questions   |
|  - rewrites logic against      |
|    destination primitives      |
+--------------------------------+
        |
        v
Reconstructed artifact (destination-format container,
                        destination-native logic)
The seam between Stage 1 and Stage 2 is intentionally a clean handoff. In the canonical instance, the handoff is a prompt string the user pastes from the converter app into Genie — the human-in-the-loop step is the user's chance to inspect the structurally-converted artifact before logic reconstruction begins. See patterns/context-encoded-prompt-handoff for the prompt-handoff specifics.
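As a rough illustration of the handoff, the sketch below renders a prompt string that a user could paste into the Stage 2 agent. The `OperatorContext` structure, the `build_handoff_prompt` helper, and the prompt wording are all hypothetical; the source post does not publish its actual template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OperatorContext:
    """Hypothetical container for environment facts the LLM cannot know from training."""
    source_platform: str
    destination_platform: str
    notes: List[str] = field(default_factory=list)


def build_handoff_prompt(notebook_name: str, untouched: List[str], ctx: OperatorContext) -> str:
    """Render the prompt string that crosses the seam from Stage 1 to Stage 2."""
    lines = [
        f"You are helping migrate the notebook '{notebook_name}' "
        f"from {ctx.source_platform} to {ctx.destination_platform}.",
        "The container format has already been converted deterministically.",
        "The following content was preserved verbatim and still needs reconstruction:",
    ]
    lines += [f"- {item}" for item in untouched]
    lines.append("Environment context:")
    lines += [f"- {note}" for note in ctx.notes]
    lines.append("If anything is ambiguous, ask a clarifying question instead of guessing.")
    return "\n".join(lines)
```

Because the prompt is a plain string, the user can read it before pasting, which keeps the seam visible.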
What Stage 1 must do¶
- Translate the container format completely. Every artifact submitted should produce a destination-format container that opens correctly in the destination tool, even if the internal logic is unmodified.
- Apply only finite, well-specified mappings. Every rule must have an externally-documented basis (source-format-spec → destination-format-spec). Rules that match on heuristics or partial-string patterns are out of scope; they belong to Stage 2.
- Preserve all logic content verbatim. SQL strings, code bodies, references, configuration values, comments, visualisation specs are copied byte-for-byte to the destination container.
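The three requirements above can be sketched as a small converter. This is a minimal illustration, not the Deutsche Börse implementation: the `PREFIX_MAP` entries are assumed for the example, the input is assumed to be a Zeppelin-style dict with a `paragraphs` list whose items carry a `text` field, and the interpreter prefix is assumed to sit alone on the first line of each paragraph.

```python
from typing import Optional, Tuple

# Hypothetical prefix table for illustration; a real one would be derived
# from the source and destination format specifications.
PREFIX_MAP = {
    "%pyspark": "%python",
    "%spark": "%scala",
    "%sql": "%sql",
    "%sh": "%sh",
    "%md": "%md",
}


def split_prefix(text: str) -> Tuple[Optional[str], str]:
    """Split a paragraph into (prefix, body); the body is returned unchanged."""
    if not text.startswith("%"):
        return None, text
    head, _, body = text.partition("\n")
    return head.strip(), body


def convert_paragraph(text: str) -> dict:
    """Map one source paragraph to one nbformat-4-style cell, body untouched."""
    prefix, body = split_prefix(text)
    if prefix is not None and prefix not in PREFIX_MAP:
        # No heuristics: an unmapped prefix is an error, not a guess.
        raise ValueError(f"no documented mapping for prefix {prefix!r}")
    if prefix == "%md":
        return {"cell_type": "markdown", "metadata": {}, "source": body}
    source = body if prefix is None else f"{PREFIX_MAP[prefix]}\n{body}"
    return {"cell_type": "code", "metadata": {}, "execution_count": None,
            "outputs": [], "source": source}


def convert_note(note: dict) -> dict:
    """Convert a Zeppelin-style note dict into a notebook dict."""
    cells = [convert_paragraph(p.get("text") or "") for p in note.get("paragraphs", [])]
    return {"nbformat": 4, "nbformat_minor": 4, "metadata": {}, "cells": cells}
```

Raising on an unmapped prefix, rather than falling back to a best guess, keeps every rule tied to a documented mapping and leaves anything else for Stage 2.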
What Stage 1 must NOT do¶
The Deutsche Börse post is unusually explicit about this:
The converter does not rewrite SQL logic, Python logic, visualizations, widgets, Oracle and HDFS references, scheduling logic or business-specific custom code. All of that content is preserved in the converted notebook, untouched, because rewriting it automatically would introduce errors and undermine trust in the output.
The deliberate decision not to rewrite is the negative-space discipline that makes the pattern work. Once the rule engine starts trying to rewrite logic, it accumulates the same long-tail failure modes that doomed pure-rule-based migration in the first place. The structural converter must hold the line.
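One way to hold that line in practice is an automated invariant check on the converter's output. The sketch below is an assumption about how such a check could look, not something described in the source: `find_modified_cells` is a hypothetical helper that compares each original paragraph body with the converted cell once an optional leading magic line is removed.

```python
from typing import List


def strip_magic(source: str) -> str:
    """Drop a leading '%...' magic line, if present, and return the rest unchanged."""
    if source.startswith("%"):
        return source.partition("\n")[2]
    return source


def find_modified_cells(original_bodies: List[str], converted_sources: List[str]) -> List[int]:
    """Return indices of cells whose logic differs from the original body."""
    if len(original_bodies) != len(converted_sources):
        raise ValueError("cell count changed during conversion")
    return [
        index
        for index, (body, source) in enumerate(zip(original_bodies, converted_sources))
        if strip_magic(source) != body
    ]
```

An empty result means the converter changed only the container; any index in the list marks a cell where a rule rewrote logic and should be treated as a converter bug.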
What Stage 2 must do¶
- Receive the structurally-converted artifact and the context-encoded prompt. See concepts/context-encoded-llm-prompt.
- Interpret the per-instance logic against the destination platform's primitives. The LLM has destination-platform knowledge from training; the prompt provides operator-environment knowledge.
- Ask clarifying questions. When the input is ambiguous (e.g. a reference to a custom interpreter not covered in the prompt context), Stage 2 asks rather than guesses. The clarifying-question loop is what handles the inevitable tail of unencoded environment knowledge.
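The clarifying-question loop can be sketched independently of any particular LLM product. In the example below, `agent` and `ask_user` are hypothetical callables supplied by the caller, and the reply shape (a dict with `kind` and `text`) is an assumption made for illustration; it does not describe Genie's actual interface.

```python
from typing import Callable, Dict, List, Tuple

Transcript = List[Tuple[str, str]]


def run_reconstruction(
    prompt: str,
    agent: Callable[[Transcript], Dict[str, str]],
    ask_user: Callable[[str], str],
    max_rounds: int = 3,
) -> str:
    """Drive the agent until it answers, relaying its questions to the user."""
    transcript: Transcript = [("user", prompt)]
    for _ in range(max_rounds):
        reply = agent(transcript)  # assumed shape: {"kind": "question"|"answer", "text": ...}
        transcript.append(("agent", reply["text"]))
        if reply["kind"] == "answer":
            return reply["text"]
        # The agent asked rather than guessed; hand the question to the human.
        transcript.append(("user", ask_user(reply["text"])))
    raise RuntimeError("agent did not converge; escalate to manual review")
```

Bounding the loop with `max_rounds` is a design choice of this sketch: when the agent keeps asking, the notebook probably depends on environment knowledge that belongs in the context block.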
Why this beats both pure approaches¶
| Approach | Structural side | Logical side |
|---|---|---|
| Pure rule-based | ✓ Cheap, deterministic, fast | ✗ Long tail of edge cases; silent miscompilations of business logic; engineering effort consumed indefinitely |
| Pure LLM rewrite | ✗ Non-deterministic where determinism was achievable; opaque diff against original; unnecessary token cost | ✓ Handles heterogeneous logic |
| Hybrid (this pattern) | ✓ Rules handle deterministic conversion cheaply | ✓ LLM handles heterogeneous logic with operator context |
The pattern preserves the strengths of each mechanism on its own side of the seam.
Trade-offs¶
- Two-stage UX has friction. The user runs the converter, downloads the output, opens the destination platform, pastes the prompt, and drives the clarifying-question loop. This friction is the cost of the human-in-the-loop seam, but it doubles as the inspection point that lets the user verify Stage 1's output before Stage 2 modifies it.
- The seam must be explicit and visible. The user must know what Stage 1 did and didn't do, and what Stage 2 will do. The Deutsche Börse post invests heavily in user-facing documentation of the seam.
- The context block is operator-specific. The prompt template must be maintained against the operator's actual environment; drift produces output-quality regression. See failure modes under concepts/context-encoded-llm-prompt.
- Stage 2 is non-deterministic. Even with a good context block, two runs of Stage 2 may produce different reconstructions. The pattern accepts this in exchange for handling heterogeneity.
- Trust calibration. Users may over-trust LLM output. The two-stage UX with a visible seam is the partial mitigation; rigorous output review is the full mitigation.
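Output review is easier when the reviewer sees exactly what Stage 2 changed relative to the verbatim logic that Stage 1 preserved. A minimal sketch using Python's standard `difflib` module follows; `review_diff` is a hypothetical helper name, and the source does not describe how the Deutsche Börse team reviews output.

```python
import difflib


def review_diff(stage1_source: str, stage2_source: str, name: str = "cell") -> str:
    """Return a unified diff of one cell before and after LLM reconstruction."""
    diff = difflib.unified_diff(
        stage1_source.splitlines(keepends=True),
        stage2_source.splitlines(keepends=True),
        fromfile=f"{name} (after Stage 1)",
        tofile=f"{name} (after Stage 2)",
    )
    return "".join(diff)
```

Because Stage 1 leaves logic untouched, this diff is also a diff against the original author's code, so two differing Stage 2 runs can be compared against the same baseline.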
When to use this pattern¶
- The migration source contains both a deterministic structural component AND heterogeneous, business-specific logic.
- The body of artifacts is large enough that pure manual migration is infeasible (years of engineering).
- The body of artifacts is heterogeneous enough that pure rule-based migration is infeasible (long tail of edge cases).
- The destination platform has an LLM agent (or one is available) that can be grounded with operator-specific context.
- The migration is a one-time motion, not an ongoing pipeline (per-run cost of the LLM stage is amortised against the value of the migration, not against ongoing operations).
When NOT to use this pattern¶
- When the migration corpus is small enough that manual migration is cheaper than building the tool.
- When the corpus is uniform enough that pure rule-based migration suffices (e.g. format-version upgrade within the same product).
- When the migration must be deterministic and reproducible (compliance / audit requirements that preclude an LLM in the path).
- When operator-specific context cannot be enumerated (the prompt template would be empty or generic).
Sibling patterns in the wiki¶
- patterns/hybrid-classical-er-plus-genai — same shape at entity-resolution altitude. Classical ER handles the deterministic blocking + comparison part; GenAI handles the heterogeneous semantic-judgement part. Same insight: split the problem and apply the right mechanism.
- patterns/two-pass-classify-then-deep-extract — Databricks 2026-05-11 Unlocking the Archives. Cheap classifier handles the bounded routing decision; expensive multimodal LLM handles the heterogeneous extraction. Same insight at document-extraction altitude.
- patterns/llm-judge-as-inline-pipeline-stage — same shape but with the LLM as a quality gate rather than a transformation stage.
Sibling instance: agentic architecture rejected for this shape¶
The Deutsche Börse team explicitly rejected an agentic architecture in favour of this two-stage linear pipeline. Their first attempt was "a more complex agentic architecture that added overhead without solving the core problem"; they discarded it for "a simple UI and a clean backend". This pattern's correctness depends on the migration task being well-bounded enough that a linear two-stage pipeline (Stage 1 → handoff → Stage 2) suffices. For unbounded tasks where the per-instance reconstruction is itself recursive (e.g. agent-driven repository-wide refactoring), an agent loop is the right shape; for bounded one-shot conversions, the linear pipeline wins. (Source: sources/2026-05-19-databricks-deutsche-borse-zeppelin-to-databricks-notebook-migration.)
Seen in¶
- 2026-05-19: Deutsche Börse Zeppelin → Databricks migration. (Source: sources/2026-05-19-databricks-deutsche-borse-zeppelin-to-databricks-notebook-migration.) Canonical first-wiki appearance. Stage 1 = Zeppelin to Databricks Notebook Converter (Databricks App, shadcn UI frontend, deterministic paragraph→cell + interpreter-prefix mapping + .ipynb JSON reformat). Stage 2 = Genie (LLM agent receiving the context-encoded prompt + clarifying-question loop). The split reduces per-notebook redevelopment from hours to 15–20 minutes and enables business-user self-service migration of 2,000+ users from Cloudera Zeppelin (EOL 2027) to Databricks. The pattern's load-bearing architectural insight: "the diversity of logic across our notebooks made rules impractical; LLMs are essential for handling that variability and the key is designing the handoff between automation and AI thoughtfully."
Related¶
- patterns/context-encoded-prompt-handoff — the handoff mechanism between Stage 1 and Stage 2.
- patterns/hybrid-classical-er-plus-genai — sibling pattern at ER altitude.
- patterns/two-pass-classify-then-deep-extract — sibling pattern at document-extraction altitude.
- concepts/heterogeneous-code-migration — the failure mode that motivates the split.
- concepts/notebook-format-migration — the application context.
- concepts/context-encoded-llm-prompt — what flows across the seam.
- systems/apache-zeppelin — source format in the canonical instance.
- systems/databricks-genie — Stage 2 mechanism in the canonical instance.
- systems/deutsche-borse-zeppelin-converter — Stage 1 implementation in the canonical instance.