Architecting Scalable ML Platforms: The Integrated Infrastructure and Acceleration Behind Rovo
Summary
Atlassian describes the architecture of ML Studio, its enterprise-scale ML platform, which standardizes modular development, centralizes workflow orchestration, and embeds governance directly into the execution layer. ML Studio is the backbone for AI systems including Rovo Search and Chat, the Teamwork Graph, and Confluence. It runs thousands of production workflows daily and serves millions of users globally. The article details three architectural pillars: composable reusable modules (version-controlled ML building blocks), a workflow orchestrator (scheduling, cloning, nested workflows, hot clusters), and embedded compliance controls (enforcement by user identity, by domain, and by column-level data classification).
Key Takeaways
- Modular ML as versioned artifacts: Every code push produces a module artifact; artifact versioning with tags (`latestAlpha`, `latest`) allows rollback without redeploying infrastructure. Modules are self-contained, shareable across teams, and form a growing catalog of 2,000+ reusable blocks with 200k+ monthly iterations.
- Local dev loop as productivity lever: Python module build times dropped from minutes to under 30 seconds using local builds plus remote developer environments (RDEs). Local builds now represent a large share of daily builds, saving thousands of developer minutes per week.
- Workflow Orchestrator as central scheduling substrate: Supports cloning and rerunning prior runs, flexible triggering (portal, CLI, API), composable nested/joined workflows, hot clusters for rapid iteration, and cron-based scheduling. It manages ~120k monthly workflow runs across 100+ ML teams.
- Hot clusters eliminate provisioning latency: Pre-provisioned clusters remain active between runs, eliminating cluster spin-up wait time and enabling rapid iterative experimentation.
- Automatic deterministic caching: The platform detects when a task has already run with identical parameters and inputs and reuses the stored results. ~80% of workflows use caching daily, saving 1,000+ hours of execution time per month.
- Multi-layer compliance embedded in execution: Governance is enforced in three layers, namely user identity-based access control, domain-level access control (experimentation vs. production), and column-level data classification with automatic tag propagation through pipeline stages. Unclassified columns are denied by default.
- Experimental workflows as productivity multiplier: Experimental workflows account for over half of all ML Studio runs and give direct access to approved datasets without PR review, saving 100+ hours per day across Atlassian's Central AI org.
- Cross-functional integration layer: ML Studio integrates with experiment tracking, the central feature store, the model registry, monitoring (ML Lens), the deployment/serving platform, and other microservices via APIs. This reduces context-switching between experimentation and production deployment.
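The tag-based rollback described in the first takeaway can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `ModuleRegistry` class, its method names, the artifact URIs, and the reading of `latestAlpha` as "most recent push" and `latest` as "promoted stable version" are not taken from the source, which names only the two tags.

```python
from dataclasses import dataclass, field


@dataclass
class ModuleRegistry:
    """Hypothetical in-memory registry: immutable versions plus movable tags."""

    versions: dict[str, str] = field(default_factory=dict)  # version -> artifact URI
    tags: dict[str, str] = field(default_factory=dict)      # tag -> version

    def publish(self, version: str, artifact_uri: str) -> None:
        # Each push adds a new immutable version and moves the pre-release tag
        # (assumed meaning of latestAlpha).
        if version in self.versions:
            raise ValueError(f"version {version} already published")
        self.versions[version] = artifact_uri
        self.tags["latestAlpha"] = version

    def promote(self, version: str) -> None:
        # Promotion and rollback are the same operation: repoint the stable tag.
        if version not in self.versions:
            raise KeyError(version)
        self.tags["latest"] = version

    def resolve(self, tag: str) -> str:
        # Consumers look up the tag when they run, so moving a tag changes
        # what they load without touching any deployed infrastructure.
        return self.versions[self.tags[tag]]


registry = ModuleRegistry()
registry.publish("1.0.0", "s3://modules/tokenizer/1.0.0")  # hypothetical URIs
registry.promote("1.0.0")
registry.publish("1.1.0", "s3://modules/tokenizer/1.1.0")
registry.promote("1.1.0")
registry.promote("1.0.0")  # roll back by moving the tag to the older version
```

Because published versions are never mutated, a rollback is a single tag update rather than a rebuild or redeploy.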
Operational Numbers
| Metric | Value |
|---|---|
| Monthly active Rovo users served | 5 million+ |
| Datasets generated (with access control) | 900k+ |
| Monthly workflow runs | ~120k |
| ML teams using platform | 100+ |
| Monthly model iterations/experiments | ~20k |
| Reusable ML modules | 2,000+ |
| Monthly module iterations | 200k+ |
| Workflows leveraging caching | ~80% daily |
| Execution time saved by caching | 1,000+ hours/month |
| Experimental workflow share | >50% of all runs |
| Daily hours saved by experimental workflows | 100+ |
| Local build time (Python modules) | <30 seconds |
Systems Extracted
- systems/atlassian-ml-studio — Atlassian's unified ML development platform (the subject of this post)
- systems/databricks — External orchestration target; ML Studio orchestrates jobs across Databricks clusters
Concepts Extracted
- concepts/ml-platform-architecture — Design principles for enterprise-scale ML infrastructure
- concepts/workflow-orchestration — Central scheduling and execution management for ML pipelines
- concepts/composable-ml-modules — Self-contained, versioned ML building blocks combined into pipelines
- concepts/column-level-access-control — Data classification and enforcement at individual column granularity
- concepts/automatic-task-caching — Deterministic cache keyed on task parameters and inputs to skip redundant computation
- concepts/hot-cluster-reuse — Pre-provisioned compute clusters that stay active across runs to eliminate startup latency
- concepts/ml-governance-at-scale — Multi-layer compliance framework embedded in ML execution
Patterns Extracted
- patterns/module-as-versioned-artifact — Every code push produces a versioned artifact; tags enable rollback without infrastructure changes
- patterns/hot-cluster-for-iterative-ml — Keep clusters warm between runs to eliminate provisioning delay during rapid experimentation
- patterns/deterministic-task-caching — Cache task outputs keyed on (parameters + inputs); reuse on identical invocations
- patterns/column-level-classification-propagation — Automatically propagate data classification tags from input columns to output columns through pipeline stages
- patterns/nested-composable-workflows — Build complex ML pipelines from smaller reusable sub-workflows via nesting and joining
- patterns/local-dev-loop-with-remote-parity — Local builds + remote developer environments that mirror production, with repository reserved for peer review
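The column-level classification propagation pattern, combined with default-deny for unclassified columns, might look like the sketch below. The source gives no implementation details, so the tag names (`pii`, `usage`), the lineage mapping, the union rule for combining source tags, and the `propagate` and `readable_columns` helpers are hypothetical.

```python
from typing import Optional

Tags = Optional[frozenset[str]]  # None means "unclassified"


def propagate(input_tags: dict[str, Tags], lineage: dict[str, list[str]]) -> dict[str, Tags]:
    """Derive each output column's tags as the union of its source columns' tags."""
    output_tags: dict[str, Tags] = {}
    for out_col, sources in lineage.items():
        source_tags = [input_tags.get(src) for src in sources]
        if not sources or any(tags is None for tags in source_tags):
            # An unclassified or unknown source leaves the output unclassified.
            output_tags[out_col] = None
        else:
            output_tags[out_col] = frozenset().union(*source_tags)
    return output_tags


def readable_columns(column_tags: dict[str, Tags], granted: frozenset[str]) -> list[str]:
    # Default-deny: unclassified columns are never readable, and a classified
    # column is readable only if every one of its tags has been granted.
    return sorted(
        col for col, tags in column_tags.items()
        if tags is not None and tags <= granted
    )


input_tags = {
    "user_id": frozenset({"pii"}),
    "page_views": frozenset({"usage"}),
    "notes": None,  # never classified
}
lineage = {  # output column -> input columns it is computed from
    "views_per_user": ["user_id", "page_views"],
    "total_views": ["page_views"],
    "note_length": ["notes"],
}

stage_output = propagate(input_tags, lineage)
readable = readable_columns(stage_output, granted=frozenset({"usage"}))
```

Feeding one stage's output tags into the next stage's `propagate` call carries classifications through a multi-stage pipeline without manual re-tagging.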
Caveats
- Tier-3 source — Atlassian's blog mixes product marketing with architecture; this post is more architectural than typical.
- No details on failure handling, retry semantics, or cluster autoscaling strategy.
- No queue/backpressure design for the 120k monthly workflow runs.
- No specifics on GPU cluster topology, distributed training frameworks, or training job scheduling algorithms.
- "Hot clusters" mentioned without details on idle-timeout policy, cost management, or multi-tenant sharing.
- Caching correctness guarantees not specified (hash algorithm, invalidation policy for side-effecting tasks).
- Integration layer described at feature level, not at wire-protocol or API-contract level.
- Fortune 500 adoption claims are marketing metrics, not architecture metrics.
Source
- Original: https://www.atlassian.com/blog/how-we-build/architecting-scalable-ml-platforms
- Raw markdown: raw/atlassian/2026-06-10-architecting-scalable-ml-platforms-the-integrated-infrastruc-6b120abe.md