Skip to content

CONCEPT Cited by 1 source

Notebook format migration

Notebook format migration is the act of moving an interactive analytics notebook (and its embedded code, visualisations, widgets, and scheduling) from one vendor's notebook format to another. Common shapes include:

  • Apache Zeppelin → Databricks (.ipynb JSON variant)
  • Apache Zeppelin → Jupyter (.ipynb)
  • Jupyter → Databricks
  • Databricks → Jupyter
  • Notebook → executable script (e.g. .py extraction)

The migration looks superficially trivial — both sides are "just notebooks" — but in practice it has two structurally different sub-problems with different cost profiles. Conflating them is the first-order failure mode.

The two sub-problems

1. Structural translation (deterministic)

The container format differs across notebook systems even when the underlying code is the same:

  • Execution unit naming. Zeppelin's paragraph and Databricks' cell play the same role; the rename is mechanical.
  • Per-paragraph language directives. Zeppelin uses %python / %sql / %pyspark magics; Databricks uses cell-level language tags. Mapping is finite.
  • Notebook-level metadata. Kernel info, layout, scheduling configuration are serialised differently per format. Mapping is finite.
  • JSON schema differences. Zeppelin notebook JSON is not Jupyter .ipynb JSON; the wrapping schema must be rewritten even when paragraph contents are identical.

All of this is deterministic and rule-driven. A correctly-implemented structural converter handles it with no human input.

2. Logical translation (heterogeneous)

What lives inside the cells / paragraphs is unique per notebook and per environment:

  • SQL bodies that reference tables in the source environment (HDFS paths, schema names, custom catalogs) and need to be rewritten against the destination environment (Unity Catalog tables, Delta paths, S3 buckets).
  • Python code that imports source-environment-specific libraries, helper modules, or pre-bound clients.
  • Custom interpreter dependencies. Zeppelin's pluggable-interpreter ecosystem means a single notebook can depend on a customer-specific magic that has no analog on the destination platform.
  • Visualisations and widgets built on platform-specific APIs.
  • Scheduling logic referencing platform-specific job runners.

This part is heterogeneous, business-specific, and resists rules — it varies per notebook because each notebook "reflects institutional knowledge from the business teams who relied on it." (See concepts/heterogeneous-code-migration for why this defeats rule-based migration engines.)

Why splitting the two sub-problems matters

The temptation is to attack the whole migration with one mechanism. Two pure approaches and their failure modes:

  • Pure rule-based engine — handles structural conversion well, fails on heterogeneous logic. Adding more rules cannot close the gap because the input space is non-uniform. Failure mode: silent miscompilations of business-critical logic, or a long tail of edge cases that consume engineering effort indefinitely.
  • Pure LLM rewrite — handles heterogeneous logic in principle, but rewrites the deterministic structural parts from scratch every time. Failure modes: non-determinism in output where determinism was achievable, hallucination on parts that should not have been rewritten, opaque diff against the original (the user can't easily verify the LLM didn't subtly change the SQL), unnecessary token cost on parts a rule could have handled deterministically.

The architectural fix is to separate the two sub-problems, apply rules to the deterministic side, apply an LLM to the heterogeneous side, and design the handoff between them carefully. See patterns/structural-deterministic-logical-llm-split.

Why notebook migration is now timely

A wave of vendor-end-of-life events is forcing notebook migrations at large enterprises:

  • Cloudera Zeppelin 2027 EOL — the canonical 2026-disclosed example, Deutsche Börse is one of presumably many Cloudera tenants whose interactive-analytics surface will disappear and must move.
  • Cloud-platform consolidation. Customers moving from on-premise Hadoop / Spark stacks to cloud-native lakehouses encounter a notebook-format mismatch even when the code is portable.
  • Self-service analytics scope. Modern enterprise data platforms target thousands of business users, not dozens of analysts. The migration tool has to be usable without engineering supervision (see business-user self-service in systems/deutsche-borse-zeppelin-converter).

Seen in

Last updated · 542 distilled / 1,571 read