Architecting Scalable ML Platforms: The Integrated Infrastructure and Acceleration Behind Rovo

Summary

Atlassian describes the architecture of ML Studio, its enterprise-scale ML platform, which standardizes modular development, centralizes workflow orchestration, and embeds governance directly into the execution layer. ML Studio is the backbone for AI systems including Rovo Search and Chat, the Teamwork Graph, and Confluence, running thousands of production workflows daily and serving millions of users globally. The article details three architectural pillars: composable reusable modules (version-controlled ML building blocks), a workflow orchestrator (scheduling, cloning, nested workflows, hot clusters), and embedded compliance controls (user identity, domain-level, and column-level data classification enforcement).

Key Takeaways

Modular ML as versioned artifacts — Every code push produces a module artifact; artifact versioning with tags (latestAlpha, latest) allows rollback without redeploying infrastructure. Modules are self-contained, shareable across teams, and form a growing catalog of 2,000+ reusable blocks with 200k+ monthly iterations.
Local dev loop as productivity lever — Python module builds reduced from minutes to under 30 seconds using local builds + remote developer environments (RDEs). Local builds now represent a large share of daily builds, saving thousands of developer minutes per week.
Workflow Orchestrator as central scheduling substrate — Supports effortless cloning/rerun of prior runs, flexible triggering (portal, CLI, API), composable nested/joined workflows, hot clusters for rapid iteration, and CRON-based scheduling. Manages ~120k monthly workflow runs across 100+ ML teams.
Hot clusters eliminate provisioning latency — Pre-provisioned clusters remain active between runs, eliminating cluster spin-up wait time and enabling rapid iterative experimentation.
Automatic deterministic caching — Detects when a task has already run with identical parameters/inputs and reuses stored results. ~80% of workflows leverage caching daily, saving 1,000+ hours of execution time per month.
Multi-layer compliance embedded in execution — Three-layer governance: user identity-based access control, domain-level access control (experimentation vs. production), and column-level data classification with automatic tag propagation through pipeline stages. Default-deny for unclassified columns.
Experimental workflows as productivity multiplier — Account for over half of all ML Studio runs; direct access to approved datasets without PR review. Saves 100+ hours per day across Atlassian's Central AI org.
Cross-functional integration layer — ML Studio integrates with experiment tracking, central feature store, model registry, monitoring (ML Lens), deployment/serving platform, and other microservices via APIs. This reduces context-switching from experiment to production deployment.
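
The deterministic caching takeaway above can be sketched as a cache keyed on a hash of the task name, its parameters, and digests of its inputs. This is a minimal illustration, not ML Studio's implementation: the source does not specify the hash algorithm or the storage layer, so `cache_key`, `TaskCache`, the SHA-256 choice, and the in-memory dictionary are all assumptions made for the example.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def cache_key(task_name: str, params: Dict[str, Any], input_digests: Dict[str, str]) -> str:
    """Build a stable key from the task name, its parameters, and its input digests.

    Assumes params are JSON-serializable; SHA-256 is an illustrative choice.
    """
    payload = json.dumps(
        {"task": task_name, "params": params, "inputs": input_digests},
        sort_keys=True,          # dictionary ordering must not change the key
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TaskCache:
    """In-memory stand-in (hypothetical) for a persistent result store."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    def run(
        self,
        task_name: str,
        params: Dict[str, Any],
        input_digests: Dict[str, str],
        fn: Callable[..., Any],
    ) -> Any:
        """Return the stored result for an identical invocation, else execute fn."""
        key = cache_key(task_name, params, input_digests)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        self.misses += 1
        result = fn(**params)
        self._results[key] = result
        return result


def featurize(window_days: int, min_events: int) -> dict:
    # Placeholder task body; a real task would read the input datasets.
    return {"window_days": window_days, "min_events": min_events}


cache = TaskCache()
digests = {"events": "sha256:abc"}  # hypothetical digest of an input dataset
first = cache.run("featurize", {"window_days": 7, "min_events": 3}, digests, featurize)
second = cache.run("featurize", {"min_events": 3, "window_days": 7}, digests, featurize)
print(cache.hits, cache.misses)  # prints "1 1"
```

The second call reuses the stored result even though the parameter dictionary is written in a different order, because the key is computed over sorted keys. The sketch only makes sense for tasks without side effects; a task that writes to an external system would need to opt out of caching or be invalidated by some other rule.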

Operational Numbers

Metric	Value
Monthly active Rovo users served	5 million+
Datasets generated (with access control)	900k+
Monthly workflow runs	~120k
ML teams using platform	100+
Monthly model iterations/experiments	~20k
Reusable ML modules	2,000+
Monthly module iterations	200k+
Workflows leveraging caching	~80% daily
Execution time saved by caching	1,000+ hours/month
Experimental workflow share	>50% of all runs
Daily hours saved by experimental workflows	100+
Local build time (Python modules)	<30 seconds
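
A few derived rates help put the table in context. The arithmetic below uses only figures from the table plus one assumption, a 30-day month. Because several inputs are stated as lower bounds ("100+", "2,000+", "200k+"), the results are rough orders of magnitude rather than reported metrics.

```python
# Figures taken from the table above.
MONTHLY_RUNS = 120_000              # "~120k"
TEAMS = 100                         # "100+", a lower bound
MODULES = 2_000                     # "2,000+", a lower bound
MODULE_ITERATIONS = 200_000         # "200k+", a lower bound
CACHE_HOURS_SAVED_MONTHLY = 1_000   # "1,000+", a lower bound
EXPERIMENTAL_SHARE = 0.5            # ">50%", a lower bound

DAYS_PER_MONTH = 30                 # assumption, not from the source

runs_per_day = MONTHLY_RUNS / DAYS_PER_MONTH
runs_per_team_per_month = MONTHLY_RUNS / TEAMS          # upper bound, since TEAMS is a floor
iterations_per_module = MODULE_ITERATIONS / MODULES     # rough ratio of two floors
cache_hours_saved_per_day = CACHE_HOURS_SAVED_MONTHLY / DAYS_PER_MONTH
experimental_runs_per_month = MONTHLY_RUNS * EXPERIMENTAL_SHARE  # at least this many

print(f"runs per day:                 {runs_per_day:.0f}")
print(f"runs per team per month:      {runs_per_team_per_month:.0f}")
print(f"iterations per module:        {iterations_per_module:.0f}")
print(f"cache hours saved per day:    {cache_hours_saved_per_day:.1f}")
print(f"experimental runs per month:  {experimental_runs_per_month:.0f}")
```

The daily figure of about 4,000 runs is consistent with the summary's "thousands of production workflow runs daily".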

Systems Extracted

systems/atlassian-ml-studio — Atlassian's unified ML development platform (the subject of this post)
systems/databricks — External orchestration target; ML Studio orchestrates jobs across Databricks clusters

Concepts Extracted

concepts/ml-platform-architecture — Design principles for enterprise-scale ML infrastructure
concepts/workflow-orchestration — Central scheduling and execution management for ML pipelines
concepts/composable-ml-modules — Self-contained, versioned ML building blocks combined into pipelines
concepts/column-level-access-control — Data classification and enforcement at individual column granularity
concepts/automatic-task-caching — Deterministic cache keyed on task parameters and inputs to skip redundant computation
concepts/hot-cluster-reuse — Pre-provisioned compute clusters that stay active across runs to eliminate startup latency
concepts/ml-governance-at-scale — Multi-layer compliance framework embedded in ML execution
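
Column-level access control with tag propagation and default-deny can be illustrated with a small sketch. The source states only that classification tags propagate from input to output columns and that unclassified columns are denied by default. The label names in `LEVELS`, the "most restrictive source wins" rule, and the functions `propagate` and `readable_columns` are assumptions introduced for the example.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical classification labels, ordered from least to most restrictive.
LEVELS = ["public", "internal", "restricted"]


def propagate(
    input_tags: Dict[str, Optional[str]],
    lineage: Dict[str, Iterable[str]],
) -> Dict[str, Optional[str]]:
    """Derive output-column tags from the tags of the columns they were computed from.

    Assumed rule: an output inherits the most restrictive tag among its sources.
    If any source is unclassified or carries an unknown label, the output stays
    unclassified (None) so that default-deny applies downstream.
    """
    output_tags: Dict[str, Optional[str]] = {}
    for out_col, sources in lineage.items():
        source_tags = [input_tags.get(src) for src in sources]
        if not source_tags or any(tag not in LEVELS for tag in source_tags):
            output_tags[out_col] = None
        else:
            output_tags[out_col] = max(source_tags, key=LEVELS.index)
    return output_tags


def readable_columns(tags: Dict[str, Optional[str]], clearance: Set[str]) -> Set[str]:
    """Default-deny: only columns whose tag is explicitly in the clearance are readable."""
    return {col for col, tag in tags.items() if tag is not None and tag in clearance}


input_tags = {"user_id": "restricted", "page_views": "internal", "notes": None}
lineage = {
    "views_per_user": ["user_id", "page_views"],
    "view_bucket": ["page_views"],
    "note_length": ["notes"],
}
output_tags = propagate(input_tags, lineage)
print(output_tags)
print(readable_columns(output_tags, {"public", "internal"}))
```

In this example `views_per_user` becomes restricted because one of its sources is, and `note_length` remains unclassified, so a caller cleared for public and internal data can read only `view_bucket`.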

Patterns Extracted

patterns/module-as-versioned-artifact — Every code push produces a versioned artifact; tags enable rollback without infrastructure changes
patterns/hot-cluster-for-iterative-ml — Keep clusters warm between runs to eliminate provisioning delay during rapid experimentation
patterns/deterministic-task-caching — Cache task outputs keyed on (parameters + inputs); reuse on identical invocations
patterns/column-level-classification-propagation — Automatically propagate data classification tags from input columns to output columns through pipeline stages
patterns/nested-composable-workflows — Build complex ML pipelines from smaller reusable sub-workflows via nesting and joining
patterns/local-dev-loop-with-remote-parity — Local builds + remote developer environments that mirror production, with repository reserved for peer review
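
The module-as-versioned-artifact pattern can be sketched as a registry of immutable artifacts with movable tags, where rollback means repointing a tag rather than rebuilding or redeploying anything. The source names the tags `latestAlpha` and `latest` but does not describe their semantics. The `ModuleRegistry` class, and the convention that `latestAlpha` follows every push while `latest` is promoted explicitly, are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModuleRegistry:
    """Hypothetical registry for one module: immutable artifacts plus movable tags."""

    versions: List[str] = field(default_factory=list)   # artifacts in publish order
    tags: Dict[str, str] = field(default_factory=dict)  # tag name -> artifact version

    def publish(self, version: str) -> None:
        """Record the artifact produced by a code push; latestAlpha tracks it (assumed)."""
        if version in self.versions:
            raise ValueError(f"artifact {version} already exists")
        self.versions.append(version)
        self.tags["latestAlpha"] = version

    def promote(self, version: str) -> None:
        """Point the latest tag at an existing artifact (assumed promotion step)."""
        if version not in self.versions:
            raise KeyError(version)
        self.tags["latest"] = version

    def rollback(self, tag: str) -> str:
        """Repoint a tag at the artifact published just before its current target."""
        index = self.versions.index(self.tags[tag])
        if index == 0:
            raise ValueError(f"no earlier artifact to roll {tag} back to")
        self.tags[tag] = self.versions[index - 1]
        return self.tags[tag]

    def resolve(self, ref: str) -> str:
        """Resolve a tag or an explicit version to a concrete artifact version."""
        if ref in self.tags:
            return self.tags[ref]
        if ref in self.versions:
            return ref
        raise KeyError(ref)


registry = ModuleRegistry()
registry.publish("1.0.0")
registry.promote("1.0.0")
registry.publish("1.1.0")
registry.promote("1.1.0")
registry.rollback("latest")
print(registry.resolve("latest"), registry.resolve("latestAlpha"))  # prints "1.0.0 1.1.0"
```

Workflows that reference the tag pick up the earlier artifact on their next run, while the newer artifact stays in the registry for later inspection or re-promotion.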

Caveats

Tier-3 source — Atlassian's blog mixes product marketing with architecture; this post is more architectural than typical.
No details on failure handling, retry semantics, or cluster autoscaling strategy.
No queue or backpressure design is described for the ~120k monthly workflow runs.
No specifics on GPU cluster topology, distributed training frameworks, or training job scheduling algorithms.
"Hot clusters" mentioned without details on idle-timeout policy, cost management, or multi-tenant sharing.
Caching correctness guarantees not specified (hash algorithm, invalidation policy for side-effecting tasks).
Integration layer described at feature level, not at wire-protocol or API-contract level.
Fortune 500 adoption claims are marketing metrics, not architecture metrics.