
CONCEPT

Notebook Experimentation Platform

Definition

A notebook experimentation platform is a hosted, multi-user, pre-wired notebook environment offered by an internal ML or data-platform team as the default first-contact surface for data scientists and ML engineers. Its purpose is to eliminate per-laptop setup of Python / R environments, SDKs, and credentials so that "users are ready to start experimenting in less than a minute" (sources/2022-04-18-zalando-zalandos-machine-learning-platform).

Distinguishing properties

  1. Hosted, not per-laptop. Runs on platform-team-operated infrastructure; users access via a browser.
  2. Multi-user, multi-project. Typically built on systems/jupyterhub (which spawns per-user systems/jupyterlab instances).
  3. Multi-tool, not just Jupyter. Commonly bundles Jupyter + RStudio + a web-based shell, sometimes a few additional domain-specific tools.
  4. Pre-wired data-source credentials. The central team pre-configures access to the org's data lake (S3), warehouse (BigQuery / Redshift / Snowflake), BI tool (MicroStrategy, Tableau), and feature store, so users never install an SDK or wire credentials.
  5. Scoped to prototyping and interactive analysis. Explicitly not the production pipeline authoring surface, and not a big-data distributed-compute substrate — those are separate platforms (see the canonical Zalando three-substrate split below).
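
Property 4 is typically implemented at the spawner layer. Below is a minimal sketch of a `jupyterhub_config.py` fragment; the bucket, project, and helper names are hypothetical illustrations, not Zalando's actual configuration:

```python
# jupyterhub_config.py fragment -- `c` is the config object JupyterHub injects.
# All names below are illustrative; they are not Zalando's real values.

# Every spawned single-user server inherits these variables, so users can
# reach the data lake / warehouse without installing SDKs or credentials locally.
c.Spawner.environment = {
    "AWS_DEFAULT_REGION": "eu-central-1",
    "DATA_LAKE_BUCKET": "org-data-lake",     # hypothetical S3 bucket
    "BIGQUERY_PROJECT": "org-analytics",     # hypothetical GCP project
}

def pre_spawn_hook(spawner):
    # Attach a short-lived, per-user token fetched from an internal secrets
    # service (fetch_token is a hypothetical helper, not a JupyterHub API).
    spawner.environment["FEATURE_STORE_TOKEN"] = fetch_token(spawner.user.name)

c.Spawner.pre_spawn_hook = pre_spawn_hook
```

Static keys are usually avoided in favour of node IAM roles or per-user short-lived tokens, as the hook sketches.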

Why it exists — the pain being solved

Before a hosted experimentation platform, a new data scientist or applied scientist joining a team typically burns days on:

  • Installing Python / conda / R / JVM locally.
  • Installing + configuring cloud SDKs (AWS CLI, gcloud, etc.).
  • Getting credentials for S3 / BigQuery / internal BI tools; each with its own policy review.
  • Keeping the laptop's library versions aligned with a shifting production stack.
  • Losing notebooks to laptop failures; lacking a team-sharing surface.

A hosted notebook platform amortises all of that across the org — done once by the platform team, consumed by hundreds of practitioners. Zalando's explicit goal: "ready to start experimenting in less than a minute."

Canonical three-substrate split (Zalando 2022)

The Zalando Datalab disclosure (sources/2022-04-18-zalando-zalandos-machine-learning-platform) pairs the notebook platform with two complementary substrates for workload shapes notebooks cannot handle:

Workload → Substrate

  • Prototyping, quick feedback, interactive analysis → Notebook platform (Jupyter + RStudio)
  • Big-data Spark / feature derivation on TB-scale historical data → systems/databricks
  • Computer-vision, large-model training (GPU-bound) → GPU HPC cluster

This three-substrate split describes how large ML orgs actually partition experimentation: the notebook platform is never sufficient by itself, but it is always the first surface.
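
The split above can be sketched as a simple routing rule. This is a toy Python mapping paraphrasing the table, not an actual platform API:

```python
# Illustrative only: a toy router mapping workload shape -> substrate,
# paraphrasing the Zalando three-substrate split.
SUBSTRATES = {
    "interactive": "notebook-platform",    # Jupyter + RStudio, sub-minute start
    "big-data": "databricks",              # TB-scale Spark feature derivation
    "gpu-training": "gpu-hpc-cluster",     # computer-vision / large-model training
}

def pick_substrate(workload: str) -> str:
    """Return the substrate for a workload shape; default to the notebook
    platform, since it is always the first surface."""
    return SUBSTRATES.get(workload, "notebook-platform")
```
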

Relation to production pipelines

A notebook experimentation platform is always paired with a separate, code-first pipeline authoring tool. At Zalando, that tool is systems/zflow — a Python DSL committed to git, not notebooks committed to git. The gap between notebook prototyping and production pipeline is one of the canonical ML platform design questions:

"One of the most frequently discussed problems in machine learning is crossing the gap between experimentation and production, or in more crude terms: between a notebook and a machine learning pipeline." (Source: sources/2022-04-18-zalando-zalandos-machine-learning-platform)

The notebook platform addresses only the first half. The pipeline tool bridges the gap.
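
The code-first side of that bridge can be sketched as a toy step-registration DSL: notebook cells become named, ordered functions committed to git. This is a generic illustration of the pattern only, not zflow's actual API:

```python
# Toy sketch of a code-first pipeline DSL (generic pattern, NOT zflow's API):
# exploratory notebook code is refactored into named steps registered in order.
from typing import Callable

PIPELINE: list[tuple[str, Callable]] = []

def step(name: str):
    """Register a function as a pipeline step, in declaration order."""
    def register(fn: Callable) -> Callable:
        PIPELINE.append((name, fn))
        return fn
    return register

@step("load")
def load():
    return list(range(10))    # stand-in for a warehouse query

@step("train")
def train():
    return "model-v1"         # stand-in for a training job

def run():
    """Execute steps in order -- what a scheduler would do in production."""
    return {name: fn() for name, fn in PIPELINE}
```

In a real system the registered graph is handed to a scheduler rather than executed inline; `run()` here just stands in for that.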

Named instances on this wiki

  • systems/datalab-zalando — Zalando's internal brand for their hosted JupyterHub + RStudio environment with pre-wired S3 / BigQuery / MicroStrategy access. Canonical wiki instance.
  • systems/jupyterhub — the vanilla multi-user JupyterHub substrate commonly underlying company-branded platforms.