CONCEPT Cited by 1 source
Notebook Experimentation Platform
Definition
A notebook experimentation platform is a hosted, multi-user, pre-wired notebook environment offered by an internal ML or data-platform team as the default first-contact surface for data scientists and ML engineers. Its purpose is to eliminate per-laptop setup of Python / R environments, SDKs, and credentials so that "users are ready to start experimenting in less than a minute" (sources/2022-04-18-zalando-zalandos-machine-learning-platform).
Distinguishing properties
- Hosted, not per-laptop. Runs on platform-team-operated infrastructure; users access via a browser.
- Multi-user, multi-project. Typically built on systems/jupyterhub (which spawns per-user systems/jupyterlab instances).
- Multi-tool, not just Jupyter. Commonly bundles Jupyter + R Studio + a web-based shell, sometimes with additional domain-specific tools.
- Pre-wired data-source credentials. The central team pre-configures access to the org's data lake (S3), warehouse (BigQuery / Redshift / Snowflake), BI tool (MicroStrategy, Tableau), and feature store, so users never install an SDK or wire credentials.
- Scoped to prototyping and interactive analysis. Explicitly not the production pipeline authoring surface, and not a big-data distributed-compute substrate — those are separate platforms (see the canonical Zalando three-substrate split below).
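The "pre-wired credentials" property usually amounts to the platform team injecting data-source configuration into every spawned user server. A minimal sketch of that idea, assuming a JupyterHub-style spawner; the bucket, project, and image names are hypothetical, and `SimpleNamespace` stands in for the real config object so the sketch runs standalone (in a real deployment this would live in `jupyterhub_config.py` and start with `c = get_config()`):

```python
# Sketch of pre-wired data-source access in a hosted notebook platform.
# All names (bucket, project, image, region) are illustrative assumptions.
from types import SimpleNamespace

# Stand-in for JupyterHub's `c = get_config()` configuration object.
c = SimpleNamespace(Spawner=SimpleNamespace(), DockerSpawner=SimpleNamespace())

# Every spawned single-user server inherits data-source access, so users
# never install an SDK or paste credentials themselves.
c.Spawner.environment = {
    "AWS_DEFAULT_REGION": "eu-central-1",        # data-lake region (assumed)
    "DATA_LAKE_BUCKET": "s3://org-data-lake",    # hypothetical S3 bucket
    "BIGQUERY_PROJECT": "org-analytics",         # hypothetical GCP project
}

# A pre-built image means Python / R, SDKs, and drivers are installed once
# by the platform team rather than on each laptop.
c.Spawner.default_url = "/lab"                   # open JupyterLab by default
c.DockerSpawner.image = "registry.internal/datalab:latest"  # hypothetical
```

The design choice worth noting: credentials and libraries live in platform-controlled config and images, so a new user's first session is already wired to the lake and warehouse.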
Why it exists — the pain being solved
Before a hosted experimentation platform, a new data scientist or applied scientist joining a team typically burns days on:
- Installing Python / conda / R / JVM locally.
- Installing + configuring cloud SDKs (AWS CLI, gcloud, etc.).
- Getting credentials for S3 / BigQuery / internal BI tools; each with its own policy review.
- Keeping the laptop's library versions aligned with a shifting production stack.
- Losing notebooks to laptop failures; lacking a team-sharing surface.
A hosted notebook platform amortises all of that across the org — done once by the platform team, consumed by hundreds of practitioners. Zalando's explicit goal: "ready to start experimenting in less than a minute."
Canonical three-substrate split (Zalando 2022)
The Zalando Datalab disclosure (sources/2022-04-18-zalando-zalandos-machine-learning-platform) pairs the notebook platform with two complementary substrates for workload shapes notebooks cannot handle:
| Workload | Substrate |
|---|---|
| Prototyping, quick feedback, interactive analysis | Notebook platform (Jupyter + R Studio) |
| Big-data Spark / feature derivation on TB-scale historical data | systems/databricks |
| Computer-vision, large-model training (GPU-bound) | GPU HPC cluster |
This three-substrate split describes how large ML orgs actually partition experimentation workloads: the notebook platform is never sufficient by itself, but it is always the first surface.
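The partitioning logic behind the table can be sketched as a routing function. This is illustrative only: the thresholds and substrate names are assumptions, not part of the Zalando disclosure.

```python
# Illustrative routing over the three-substrate split; thresholds are assumed.
from dataclasses import dataclass

@dataclass
class Workload:
    interactive: bool  # needs quick feedback loops?
    data_tb: float     # historical data volume in terabytes
    gpu_bound: bool    # large-model / computer-vision training?

def pick_substrate(w: Workload) -> str:
    if w.gpu_bound:
        return "gpu-hpc-cluster"    # GPU-bound large-model training
    if w.data_tb >= 1.0:
        return "databricks"         # TB-scale Spark feature derivation
    return "notebook-platform"      # prototyping and interactive analysis

# e.g. pick_substrate(Workload(interactive=True, data_tb=0.01, gpu_bound=False))
#      -> "notebook-platform"
```

The fallthrough ordering mirrors the table: GPU-bound work and TB-scale data each force a heavier substrate, and everything else lands on the notebook platform first.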
Relation to production pipelines
A notebook experimentation platform is always paired with a separate, code-first pipeline authoring tool. At Zalando, that tool is systems/zflow — a Python DSL committed to git, not notebooks committed to git. The gap between notebook prototyping and production pipeline is one of the canonical ML platform design questions:
"One of the most frequently discussed problems in machine learning is crossing the gap between experimentation and production, or in more crude terms: between a notebook and a machine learning pipeline." (Source: sources/2022-04-18-zalando-zalandos-machine-learning-platform)
The notebook platform addresses only the first half. The pipeline tool bridges the gap.
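The source does not show zflow code, so here is a generic sketch of the refactor the quote describes: moving notebook-cell logic, with its globals and hard-coded paths, into a parameterized function that a git-committed pipeline DSL can import and schedule.

```python
# The experimentation-to-production gap, sketched generically.
# The notebook prototype below is the "before"; zflow's actual DSL is not
# public in the source, so the "after" is just an importable pure function.

# --- notebook prototype: globals, hard-coded path, nothing importable ---
# df = read("s3://bucket/clicks.parquet")        # hypothetical helper/path
# df["ctr"] = df.clicks / df.impressions

# --- pipeline version: parameterized, testable, committed to git ---
def compute_ctr(clicks: list[int], impressions: list[int]) -> list[float]:
    """Click-through rate per row, extracted from the notebook cell."""
    return [c / i for c, i in zip(clicks, impressions)]
```

The substance of "bridging the gap" is exactly this extraction step: inputs become parameters, outputs become return values, and the pipeline tool owns scheduling and I/O.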
Named instances on this wiki
- systems/datalab-zalando — Zalando's internal brand for their hosted JupyterHub + R Studio environment with pre-wired S3 / BigQuery / MicroStrategy access. Canonical wiki instance.
- systems/jupyterhub — the vanilla multi-user JupyterHub substrate commonly underlying company-branded platforms.
Related
- systems/jupyterhub · systems/jupyterlab · systems/datalab-zalando
- systems/databricks — the complementary big-data substrate.
- companies/zalando