SLACK 2025-11-06

Slack — Build better software to build software better¶

Summary¶

Slack's Quip/Canvas team took their monorepo build from 60 minutes to as low as 10 minutes (cached & parallelised) — a ~6× speed-up — by (1) adopting Bazel and (2) doing the unglamorous engineering work required to actually benefit from it. The core lesson is that throwing Bazel at a tangled build gives you nothing: a build graph with cycles, giant non-hermetic action nodes, and cache keys containing hundreds of parameters will have a zero cache hit rate no matter what build tool wraps it. The real levers were classical software-engineering principles — separation of concerns and layering — applied to the build code itself, which was authored in Starlark with hard isolation from application code. Two concrete wins: severing the dependency edge where every Python change invalidated the frontend cache (saved ~35 min / build) and deleting in-house parallelization inside the frontend bundler so that Bazel's own scheduler could parallelise at a finer grain.

Key takeaways¶

The pre-existing build was 60 minutes because it had the three properties that defeat any build system's caching and parallelism: cycles in the dependency graph, huge non-hermetic action nodes, and cache keys with hundreds of changing parameters. Bazel's magic is contingent on the declared graph actually being a DAG of hermetic, idempotent actions — it gives nothing to a build that doesn't meet those preconditions. (Source: sources/2025-11-06-slack-build-better-software-to-build-software-better)
~35 min / build was wasted on one bad edge: the frontend build transitively depended on the built Python backend, so every Python change invalidated the frontend cache key and forced a full frontend rebuild. Severing that coupling — via rewriting Python build-orchestration code in Starlark, colocating build logic in BUILD.bazel files next to the units it builds, and removing any Python-standard-library-plus-app-code dependencies from retained scripts — was the single biggest win. "The complexity of the original build code made it challenging to define 'correct' behavior" because there were no tests; the team wrote a Rust diff tool that compared artifacts from the old and new build systems byte-for-byte to guide iteration. (Source: sources/2025-11-06-slack-build-better-software-to-build-software-better)
Cache granularity is the lever, not cache size. The original frontend builder took "all the sources" and produced "all the bundles" in one action, so any TypeScript file change invalidated the whole bundle set. Refactoring to one action per bundle, with TypeScript and CSS compiled independently and combined at the end, made every bundle cacheable on its direct inputs and massively raised the hit rate. This matches the code-level principle: smaller cache keys over finer units of work give higher hit rates. (Source: sources/2025-11-06-slack-build-better-software-to-build-software-better)
Delete the inner parallelizer when you wrap it in an outer one. The legacy frontend builder had its own worker-process pool. Once Bazel was managing parallelism across the whole graph, the inner pool (a) contended with Bazel for the same CPU/RAM budget and (b) could only parallelise within a single action, whereas Bazel can schedule all bundle builds at once across CPU cores or a build cluster. The team simplified the builder to one-bundle-in / one-bundle-out with no parallelization code, a win for both maintainability and performance. This resolves a layering violation: business logic had been fused with orchestration and parallelization, and the boundary was redrawn so that each layer has a single concern. (Source: sources/2025-11-06-slack-build-better-software-to-build-software-better)
Separation of concerns applies to build code, release code, and setup code — not just application code. Slack had three couplings to undo: backend ↔ frontend (Python artifacts being an input to TypeScript builds), Python ↔ TypeScript toolchains (one Python process orchestrating tsc + webpack), and application code ↔ build code (in-process Python calling application modules to do build work). Each coupling created blast-radius surprises for engineers and degraded build performance. The team's pitch: "the whole system is more than our application code. It's also our build code, our release pipeline, the setup strategies for our developer and production environments, and the interrelations between those components." (Source: sources/2025-11-06-slack-build-better-software-to-build-software-better)
Starlark is deliberately constrained to make Bazel's preconditions enforceable. The language's limitations (no arbitrary I/O, global values frozen once a file has been loaded, no recursion or unbounded loops) exist so that build definitions evaluate deterministically, which in turn lets build actions be cached, hermetic, and reproducible. Slack explicitly leaned on this: when rewriting Python build-orchestration code into Starlark, the language forced them into shapes compatible with Bazel's caching model. (Source: sources/2025-11-06-slack-build-better-software-to-build-software-better)
A byte-for-byte artifact diff tool is the validation harness for a build-system migration. When application-code complexity means "correct" is defined as "whatever the existing build produces," you need a mechanical oracle to guide iteration. Slack wrote theirs in Rust; it compared artifacts produced by the old and new build systems and highlighted mismatches as the team iterated. This is a reusable pattern for build-refactor projects — see patterns/diff-artifact-validator-for-build-refactor. (Source: sources/2025-11-06-slack-build-better-software-to-build-software-better)
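The artifact-diff idea can be sketched in a few lines. Slack's tool was written in Rust and the post does not show its code, so what follows is only a minimal Python illustration under stated assumptions: both build systems write their outputs into directory trees with the same relative layout, and the function names (`digest_tree`, `diff_artifacts`) are hypothetical.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_artifacts(old_root: Path, new_root: Path) -> dict[str, list[str]]:
    """Compare two artifact trees; equal digests stand in for byte equality."""
    old = digest_tree(old_root)
    new = digest_tree(new_root)
    shared = old.keys() & new.keys()
    return {
        # Produced by the old build but absent from the new one.
        "missing_in_new": sorted(old.keys() - new.keys()),
        # Produced only by the new build.
        "extra_in_new": sorted(new.keys() - old.keys()),
        # Present in both, but the bytes differ.
        "content_mismatch": sorted(p for p in shared if old[p] != new[p]),
    }
```

An empty report (all three lists empty) would mean the new build reproduces the old one exactly. A non-empty report points at the artifacts to investigate next, which is the "mechanical oracle" role described above.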

Operational numbers¶

Before: every build took 60 minutes.
After:
Best case (cached + parallelised): 10 minutes — ~6× speed-up.
Average case (mostly cached + parallelised): 12 minutes — ~5× speed-up.
Worst case (cache miss): 30 minutes — ~2× speed-up even on full rebuild.
Frontend coupling cost: the one Python↔TypeScript dependency edge was costing ~35 min per build (more than half of the original 60 min).
Intermediate milestone: after severing backend↔frontend coupling but before refactoring the frontend builder internals, the whole application could build in 25 minutes with a cached frontend — a ~2.4× speed-up on its own.
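The speed-up figures above follow directly from the reported build times. As a quick check of the arithmetic, using only the numbers listed in this section (the variable names are illustrative):

```python
BASELINE_MIN = 60  # original build time in minutes

# Reported build times after the migration, in minutes.
SCENARIOS = {
    "best case": 10,
    "average case": 12,
    "worst case": 30,
    "intermediate milestone": 25,
}


def speedup(before_min: float, after_min: float) -> float:
    """Speed-up factor: how many times faster the new build is."""
    return before_min / after_min


SPEEDUPS = {name: speedup(BASELINE_MIN, minutes) for name, minutes in SCENARIOS.items()}

for name, factor in SPEEDUPS.items():
    print(f"{name}: {factor:.1f}x")

# Share of the original build attributed to the Python/TypeScript edge.
COUPLING_COST_MIN = 35
COUPLING_SHARE = COUPLING_COST_MIN / BASELINE_MIN
print(f"coupling share: {COUPLING_SHARE:.0%}")
```

The factors come out at 6.0, 5.0, 2.0, and 2.4 respectively, and the coupling share at roughly 58%. The numbers are also consistent with each other: 60 minutes minus the ~35-minute coupling cost gives the 25-minute intermediate milestone.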

Systems & concepts extracted¶

Systems

systems/bazel — the wrapper build system; hermetic, content-addressed, distributable.
systems/starlark — the constrained language Bazel uses for BUILD file definitions; Slack rewrote Python build-orchestration into Starlark.
systems/slack-quip — the shared document system whose backend pipeline Slack was rebuilding.
systems/slack-canvas — the in-Slack collaborative canvas surface whose frontend bundle shares the same build pipeline.

Concepts

concepts/build-graph — the declared DAG of actions whose properties determine what caching and parallelism you actually get.
concepts/hermetic-build — inputs fully declared, sandbox enforced; the precondition for sound caching.
concepts/idempotent-build-action — same inputs → same outputs, unconditionally; the second precondition for caching.
concepts/cache-hit-rate — the load-bearing metric for any build's cacheability; Slack's was zero when cache keys had hundreds of transitively-inherited parameters.
concepts/cache-granularity — the size of the unit whose inputs form a cache key; fine-grained caches hit more often because fewer inputs change per request.
concepts/separation-of-concerns — the classical principle Slack applied to build code, not just application code.
concepts/layering-violation — when one layer does work that belongs in another (business logic doing orchestration and parallelization); the structural diagnosis for Slack's frontend builder.
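The cache-granularity and cache-hit-rate concepts can be made concrete with a toy model. This is not how Bazel computes action keys internally; it is a simplified sketch that hashes declared inputs, and the bundle and file names are invented for illustration. It compares one action over all sources with one action per bundle when a single file changes.

```python
import hashlib


def cache_key(inputs: dict[str, str]) -> str:
    """Toy cache key: a hash over the sorted (name, content) pairs of the inputs."""
    h = hashlib.sha256()
    for name in sorted(inputs):
        h.update(name.encode())
        h.update(b"\0")
        h.update(inputs[name].encode())
        h.update(b"\0")
    return h.hexdigest()


# Hypothetical bundles and the source files each one directly depends on.
BUNDLES = {
    "editor": ["editor.ts", "shared.ts"],
    "viewer": ["viewer.ts", "shared.ts"],
    "admin": ["admin.ts"],
}


def coarse_keys(sources: dict[str, str]) -> dict[str, str]:
    """One action: all sources in, all bundles out, so every bundle shares one key."""
    key = cache_key(sources)
    return {bundle: key for bundle in BUNDLES}


def fine_keys(sources: dict[str, str]) -> dict[str, str]:
    """One action per bundle: each key covers only that bundle's direct inputs."""
    return {
        bundle: cache_key({f: sources[f] for f in files})
        for bundle, files in BUNDLES.items()
    }


def invalidated(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Bundles whose cache key changed and therefore must be rebuilt."""
    return sorted(b for b in before if before[b] != after[b])


before = {"editor.ts": "v1", "viewer.ts": "v1", "admin.ts": "v1", "shared.ts": "v1"}
after = {**before, "admin.ts": "v2"}  # a single file changes

COARSE_REBUILT = invalidated(coarse_keys(before), coarse_keys(after))
FINE_REBUILT = invalidated(fine_keys(before), fine_keys(after))
print(COARSE_REBUILT)  # ['admin', 'editor', 'viewer']
print(FINE_REBUILT)    # ['admin']
```

With the coarse key, one edit misses the cache for every bundle; with per-bundle keys, only the bundle that actually consumes the changed file is rebuilt. A change to `shared.ts` would still invalidate both `editor` and `viewer`, which is the correct behaviour rather than a caching failure.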

Patterns

patterns/decouple-frontend-build-from-backend-artifacts — cut transitive dependency edges between runtime-language build pipelines so one language's file change doesn't invalidate another language's cache.
patterns/delete-inner-parallelization-inside-outer-orchestrator — when wrapping an ad-hoc parallelized tool inside a work orchestrator (Bazel, Kubernetes, Ray, etc.), remove the inner parallelization so the outer layer can schedule optimally.
patterns/diff-artifact-validator-for-build-refactor — when build code has no tests and "correct" is defined by the incumbent system's outputs, write a byte-diff harness that compares old- vs new-system artifacts as the validation oracle during migration.
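The decoupling pattern can be illustrated by modelling the build as a dependency graph and asking which targets a change invalidates. The target names below are hypothetical stand-ins for the structure the post describes (a frontend build that transitively depended on the built Python backend), not Slack's actual Bazel targets.

```python
def affected_by(changed: str, deps: dict[str, set[str]]) -> set[str]:
    """Return `changed` plus every target that depends on it, directly or transitively."""
    affected = {changed}
    grew = True
    while grew:
        grew = False
        for target, direct in deps.items():
            # A target is affected once any of its direct deps is affected.
            if target not in affected and direct & affected:
                affected.add(target)
                grew = True
    return affected


# Before: the frontend bundles take the built Python backend as an input.
COUPLED = {
    "python_sources": set(),
    "typescript_sources": set(),
    "python_backend": {"python_sources"},
    "frontend_bundles": {"typescript_sources", "python_backend"},
    "app": {"python_backend", "frontend_bundles"},
}

# After: the backend -> frontend edge is severed.
DECOUPLED = {**COUPLED, "frontend_bundles": {"typescript_sources"}}

print(sorted(affected_by("python_sources", COUPLED)))
# ['app', 'frontend_bundles', 'python_backend', 'python_sources']
print(sorted(affected_by("python_sources", DECOUPLED)))
# ['app', 'python_backend', 'python_sources']
```

In the coupled graph a Python change invalidates the frontend bundles as well; in the decoupled graph the frontend stays cached and only the backend and the final assembly are rebuilt. That single removed edge corresponds to the ~35 minutes per build reported above.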

Caveats¶

The post doesn't publish absolute fleet sizes, number of Starlark rules written, or the size of the Rust diff tool; it reports only the end-to-end build time delta.
"~6×" is the best-case (cached + parallelised) speed-up; the worst-case speed-up (cache-miss full rebuild, 30 min vs 60 min) is ~2×. The program-level gain therefore depends on the actual cache hit rate distribution in day-to-day builds.
The post doesn't describe the Bazel remote cache/execution setup — only that Bazel can distribute actions across CPU cores or a build cluster. It's unclear whether Slack is using concepts/remote-build-execution or only local parallelism.
The advice to "delete the inner parallelizer" depends on having an outer one that can parallelise at a finer grain. Without Bazel (or an equivalent), in-process parallelism is still the right lever.

Source¶

companies/slack — the company page.
systems/bazel — the build system.
systems/starlark — the constrained configuration language.
concepts/hermetic-build — the precondition for sound caching.
concepts/build-graph — the first-class DAG build systems operate on.
concepts/cache-hit-rate — the load-bearing metric whose zero value was the smoking gun.
concepts/cache-granularity — the size-of-the-cached-unit lever that Slack pulled.
concepts/separation-of-concerns — the classical principle applied to build code.
concepts/layering-violation — the structural diagnosis.
patterns/decouple-frontend-build-from-backend-artifacts — the ~35-min-saving refactor pattern.
patterns/delete-inner-parallelization-inside-outer-orchestrator — the layering-fix pattern.
patterns/diff-artifact-validator-for-build-refactor — the validation harness pattern for build-refactor migrations.