Architecting Scalable ML Platforms: The Integrated Infrastructure and Acceleration Behind Rovo

Summary

Atlassian describes the architecture of ML Studio, its enterprise-scale ML platform, which standardizes modular development, centralizes workflow orchestration, and embeds governance directly into the execution layer. ML Studio serves as the mission-critical backbone for AI systems including Rovo Search and Chat, the Teamwork Graph, and Confluence, enabling thousands of production workflow runs daily and serving millions of users globally. The article details three architectural pillars: composable, reusable modules (version-controlled ML building blocks), a workflow orchestrator (scheduling, cloning, nested workflows, hot clusters), and embedded compliance controls (user-identity, domain-level, and column-level data classification enforcement).
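
The compliance pillar can be pictured as three checks applied before a run reads data, plus tag propagation so that derived columns stay classified. The following is a minimal sketch under stated assumptions, not Atlassian's implementation: the classification labels, the `USER_CLEARANCE` table, the "domains must match" rule, and both function names are hypothetical. Only the three layers, the automatic propagation of tags through pipeline stages, and default-deny for unclassified columns come from the source.

```python
# Hypothetical classification labels, ordered from least to most sensitive.
LEVELS = {"public": 0, "internal": 1, "restricted": 2}

# Hypothetical identity store: the highest classification each user may read.
USER_CLEARANCE = {"alice": "restricted", "bob": "internal"}


def readable_columns(user, run_domain, dataset_domain, column_tags, columns):
    """Return the subset of `columns` the user may read in this run."""
    # Layer 1: user identity. Unknown users are rejected outright.
    clearance = USER_CLEARANCE.get(user)
    if clearance is None:
        raise PermissionError(f"unknown user: {user}")

    # Layer 2: domain-level control. Assumed rule: an experimentation run
    # may only read experimentation datasets, and likewise for production.
    if run_domain != dataset_domain:
        raise PermissionError(f"{run_domain} run cannot read {dataset_domain} data")

    # Layer 3: column-level classification with default-deny.
    allowed = []
    for col in columns:
        tag = column_tags.get(col)
        if tag is None:
            continue  # unclassified column: denied by default
        if LEVELS[tag] <= LEVELS[clearance]:
            allowed.append(col)
    return allowed


def propagate_tags(input_tags, lineage):
    """Derive tags for output columns from the columns they were built from.

    `lineage` maps each output column to its source columns. An output
    inherits the most sensitive tag among its sources. If any source is
    unclassified, the output stays unclassified and is therefore denied.
    """
    output_tags = {}
    for out_col, sources in lineage.items():
        tags = [input_tags.get(src) for src in sources]
        if not tags or any(tag is None for tag in tags):
            continue
        output_tags[out_col] = max(tags, key=LEVELS.__getitem__)
    return output_tags
```

Under these assumptions, a user with `internal` clearance who requests a `restricted` column, an `internal` column, and an untagged column receives only the `internal` one, and a feature derived from both tagged columns is itself tagged `restricted`.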

Key Takeaways

  1. Modular ML as versioned artifacts — Every code push produces a module artifact; artifact versioning with tags (latestAlpha, latest) allows rollback without redeploying infrastructure. Modules are self-contained, shareable across teams, and form a growing catalog of 2,000+ reusable blocks with 200k+ monthly iterations.

  2. Local dev loop as productivity lever — Python module builds reduced from minutes to under 30 seconds using local builds + remote developer environments (RDEs). Local builds now represent a large share of daily builds, saving thousands of developer minutes per week.

  3. Workflow Orchestrator as central scheduling substrate — Supports effortless cloning/rerun of prior runs, flexible triggering (portal, CLI, API), composable nested/joined workflows, hot clusters for rapid iteration, and CRON-based scheduling. Manages ~120k monthly workflow runs across 100+ ML teams.

  4. Hot clusters eliminate provisioning latency — Pre-provisioned clusters remain active between runs, eliminating cluster spin-up wait time and enabling rapid iterative experimentation.

  5. Automatic deterministic caching — Detects when a task has already run with identical parameters/inputs and reuses stored results. ~80% of workflows leverage caching daily, saving 1,000+ hours of execution time per month.

  6. Multi-layer compliance embedded in execution — Three-layer governance: user identity-based access control, domain-level access control (experimentation vs. production), and column-level data classification with automatic tag propagation through pipeline stages. Default-deny for unclassified columns.

  7. Experimental workflows as productivity multiplier — Account for over half of all ML Studio runs; direct access to approved datasets without PR review. Saves 100+ hours per day across Atlassian's Central AI org.

  8. Cross-functional integration layer — ML Studio integrates with experiment tracking, central feature store, model registry, monitoring (ML Lens), deployment/serving platform, and other microservices via APIs. This reduces context-switching from experiment to production deployment.
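
The tag-based rollback in takeaway 1 can be illustrated with a small in-memory registry. This is a sketch under assumptions, not ML Studio's API: the `ModuleRegistry` class and its methods are hypothetical, and so is the reading of the two tags (`latestAlpha` follows the newest push, `latest` is moved explicitly). The source confirms only that each push yields an artifact and that tags allow rollback without redeploying infrastructure.

```python
class ModuleRegistry:
    """In-memory stand-in for a module artifact store (hypothetical)."""

    def __init__(self):
        self._artifacts = {}  # (module, version) -> artifact payload
        self._tags = {}       # (module, tag) -> version

    def publish(self, module, version, payload):
        """Store an immutable artifact for a push and advance latestAlpha."""
        if (module, version) in self._artifacts:
            raise ValueError(f"{module}@{version} already exists")
        self._artifacts[(module, version)] = payload
        self._tags[(module, "latestAlpha")] = version

    def set_tag(self, module, tag, version):
        """Point a tag at an existing version; used to promote or roll back."""
        if (module, version) not in self._artifacts:
            raise KeyError(f"{module}@{version} was never published")
        self._tags[(module, tag)] = version

    def resolve(self, module, ref):
        """Resolve a tag or an explicit version to (version, payload)."""
        version = self._tags.get((module, ref), ref)
        return version, self._artifacts[(module, version)]


registry = ModuleRegistry()
registry.publish("tokenizer", "1.0.0", "build-a")
registry.set_tag("tokenizer", "latest", "1.0.0")
registry.publish("tokenizer", "1.1.0", "build-b")
registry.set_tag("tokenizer", "latest", "1.1.0")

# Rollback: repoint the tag. No artifact is rebuilt and nothing is redeployed.
registry.set_tag("tokenizer", "latest", "1.0.0")
```

Because workflows in this sketch reference a module by tag, moving the tag is the whole rollback: the next run that resolves `latest` picks up the older artifact, while `latestAlpha` still points at the newest push.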

Operational Numbers

| Metric | Value |
| --- | --- |
| Monthly active Rovo users served | 5 million+ |
| Datasets generated (with access control) | 900k+ |
| Monthly workflow runs | ~120k |
| ML teams using platform | 100+ |
| Monthly model iterations/experiments | ~20k |
| Reusable ML modules | 2,000+ |
| Monthly module iterations | 200k+ |
| Workflows leveraging caching | ~80% daily |
| Execution time saved by caching | 1,000+ hours/month |
| Experimental workflow share | >50% of all runs |
| Daily hours saved by experimental workflows | 100+ |
| Local build time (Python modules) | <30 seconds |

Caveats

  • Tier-3 source — Atlassian's blog mixes product marketing with architecture; this post is more architectural than typical.
  • No details on failure handling, retry semantics, or cluster autoscaling strategy.
  • No queue/backpressure design for the 120k monthly workflow runs.
  • No specifics on GPU cluster topology, distributed training frameworks, or training job scheduling algorithms.
  • "Hot clusters" mentioned without details on idle-timeout policy, cost management, or multi-tenant sharing.
  • Caching correctness guarantees not specified (hash algorithm, invalidation policy for side-effecting tasks).
  • Integration layer described at feature level, not at wire-protocol or API-contract level.
  • Fortune 500 adoption claims are marketing metrics, not architecture metrics.
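
To make the caching caveat concrete, here is one way deterministic caching keyed on identical parameters and inputs could work. Everything specific in it is an assumption, since the source does not describe the mechanism: SHA-256 over canonical JSON as the key, input content digests as the stand-in for "inputs", the module version as part of the key, and a `cacheable` flag as the opt-out for side-effecting tasks. `TaskCache` is a hypothetical name.

```python
import hashlib
import json


class TaskCache:
    """Reuse a task's stored result when its key matches a prior run."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def key(task_name, module_version, params, input_digests):
        # Canonical JSON (sorted keys, fixed separators) so that the same
        # parameters always hash the same way, whatever their dict order.
        # Parameters must be JSON-serializable in this sketch.
        payload = json.dumps(
            {
                "task": task_name,
                "version": module_version,
                "params": params,
                "inputs": list(input_digests),
            },
            sort_keys=True,
            separators=(",", ":"),
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, task_name, module_version, params, input_digests, fn, cacheable=True):
        # Side-effecting tasks (assumed to be marked by the caller) always run.
        if not cacheable:
            return fn(**params)
        cache_key = self.key(task_name, module_version, params, input_digests)
        if cache_key in self._store:
            self.hits += 1
            return self._store[cache_key]
        result = fn(**params)
        self._store[cache_key] = result
        return result
```

The open questions in the caveat correspond to the assumptions above: what goes into the key determines when a stale result can be served, and some form of opt-out or invalidation is needed for tasks whose value lies in their side effects rather than their return value.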
