
Slack

Slack Engineering blog (slack.engineering). Tier-2 source on the sysdesign-wiki. Slack is a workplace-messaging platform (acquisition by Salesforce announced in 2020 and completed in 2021) with substantial engineering output across backend (Flannel, Vitess-for-Slack), mobile (cross-platform client architecture), frontend (TypeScript/React at large scale, shared-edit collaboration), and developer infrastructure (CI, test frameworks, migration tooling).

This wiki's coverage of Slack spans eight axes so far:

  1. Developer-productivity tooling at scale — Slack's public retrospective on using LLMs to automate a 15,000-test Enzyme → RTL migration, which canonicalised a reusable AST + LLM hybrid conversion pattern.
  2. Reliability engineering at scale — Slack's 18-month Deploy Safety Program (mid-2023 → Jan 2025) that reduced customer impact hours from change-triggered incidents by 90%, canonicalised in the 2025-10-07 retrospective.
  3. Test-framework integration at scale — Slack's 2022-launched integration of Axe Core accessibility checks into the existing Playwright E2E suite as a custom-fixture extension, canonicalising several reusable patterns (fixture-extension as integration surface, two-axis exclusion-list, severity-gated reporting, tri-mode opt-in execution, alert-to-Jira workflow).
  4. Mobile accessibility engineering — Slack's 2024 third-party VPAT audit of the IA4-redesigned Android client; 8 recurring themes (7 resolved, 1 deferred) with fixes concentrated at the Slack Kit component-library layer. Canonicalised the patterns/accessibility-delegate-override-for-semantic-fix pattern (Slack's new SKListAccessibilityDelegate fixing CollectionInfo for decorative dividers), the patterns/custom-talkback-actions-as-gesture-alternative pattern (workspace-switcher drag-reorder via TalkBack "Move before" / "Move after" actions + six-dot drag handle + Edit mode), and the patterns/vpat-driven-a11y-triage four-bucket workflow. Complements the 2025-01-07 automated-Axe ingest as the manual / third-party / periodic layer of the same broader a11y strategy (see concepts/automated-vs-manual-testing-complementarity).
  5. Fleet-configuration-management at scale — Slack's 2025-10-23 Chef phase-2 post canonicalises the AZ-bucketed environment split, the signal-driven fleet-config-apply pipeline (Chef Librarian → S3 → Chef Summoner), the release-train-with-canary rollout, and the self-update-with-independent-fallback-cron pattern — one level below the Deploy Safety Program's altitude, in the EC2 / Chef substrate specifically.
  6. Build-system engineering at scale — Slack Quip/Canvas team's 60 min → 10 min (~6×) build speed-up via Bazel + Starlark adoption, canonicalised in the 2025-11-06 retrospective. The load-bearing insight: Bazel gives nothing to a build whose graph has cycles / non-hermetic actions / coarse cache keys; the real wins came from applying classical separation-of-concerns and layering principles to build code itself. Canonicalised three new patterns — patterns/decouple-frontend-build-from-backend-artifacts, patterns/delete-inner-parallelization-inside-outer-orchestrator, patterns/diff-artifact-validator-for-build-refactor — plus the concepts/cache-granularity and concepts/idempotent-build-action concepts.
  7. Edge-networking and synthetic monitoring — Slack's 2026-03-31 retrospective on rolling out HTTP/3 on the public edge. Closed the HTTP/3 probing gap — neither SaaS observability tools nor Slack's Prometheus Blackbox Exporter fleet ("a cornerstone of our monitoring") natively spoke QUIC/UDP before the intern project. Intern Sebastian Feliciano scoped, implemented, and open-sourced QUIC support into Prometheus Blackbox Exporter upstream using systems/quic-go as the client library, then built an in-house integration on the same branch so Slack could ship HTTP/3 probing to production before upstream merge — canonical instance of patterns/upstream-contribution-parallel-to-in-house-integration. Extends the patterns/upstream-fixes-to-community pattern with a new altitude: upstream-a-whole-new-feature, distinct from the Shopify × Reanimated fix-existing-feature-at-scale altitude. Canonicalised Slack's explicit "monitor first, migrate second" takeaway as the concepts/observability-before-migration concept. Final payoff: HTTP/1.1 + HTTP/2 + HTTP/3 metrics in one Grafana "single pane of glass".
  8. Security-engineering + AI-agent operations — Slack Security Engineering team's 2025-12-01 retrospective on Spear, their multi-agent security-investigation service that triages detection-system alerts during on-call shifts. The first post in a promised series. Canonical first wiki instance of the patterns/director-expert-critic-investigation-loop pattern (three-persona agent team: Director plans / progresses phases / writes final report, four Experts — Access / Cloud / Code / Threat — produce domain-specific findings, Critic audits + condenses). Canonical first wiki instance of the concepts/knowledge-pyramid-model-tiering concept (Experts on cheap models, Critic on mid-tier, Director on top-tier). Canonical first wiki instance of the patterns/hub-worker-dashboard-agent-service pattern (Hub + Worker + Dashboard productisation shape). Load-bearing architectural lesson verbatim: "prompts are just guidelines; they're not an effective method for achieving fine-grained control" — canonical statement of concepts/prompt-is-not-control. Slack's canonical emergent-behaviour worked example: Critic caught a credential exposure the Expert missed, Director pivoted the investigation. Canonicalised four concepts (concepts/knowledge-pyramid-model-tiering, concepts/investigation-phase-progression, concepts/weakly-adversarial-critic, concepts/prompt-is-not-control) + four patterns (patterns/director-expert-critic-investigation-loop, patterns/one-model-invocation-per-task, patterns/hub-worker-dashboard-agent-service, patterns/phase-gated-investigation-progression) + one system (systems/slack-spear).
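
The Director / Expert / Critic loop and the model tiering described in axis 8 can be sketched roughly as follows. This is an illustrative sketch only, not Slack's implementation: the tier names in MODEL_TIERS, the prompt strings, the "advance" reply convention, the per-phase round budget, and the call_model callable are all assumptions made for the example. The phase names (discovery / trace / conclude) and the four Expert domains come from the wiki entries on this page. Phase order is enforced by the loop in code rather than by prompt wording, in the spirit of concepts/prompt-is-not-control.

```python
from typing import Callable

# Hypothetical tier labels; the source names no specific models.
MODEL_TIERS = {
    "expert": "cheap-model",
    "critic": "mid-tier-model",
    "director": "top-tier-model",
}
PHASES = ["discovery", "trace", "conclude"]
EXPERT_DOMAINS = ["access", "cloud", "code", "threat"]

# Stand-in for a model client: (model_name, prompt) -> reply text.
ModelFn = Callable[[str, str], str]


def run_investigation(
    alert: str, call_model: ModelFn, max_rounds_per_phase: int = 3
) -> tuple[str, list[tuple[str, str]]]:
    """Run one alert through all phases; return (report, condensed notes)."""
    notes: list[tuple[str, str]] = []
    for phase in PHASES:  # phase order is fixed in code, not in a prompt
        for _ in range(max_rounds_per_phase):
            # One model invocation per task: each Expert gets one narrow prompt.
            findings = [
                call_model(
                    MODEL_TIERS["expert"],
                    f"EXPERT domain={domain} phase={phase} alert={alert}",
                )
                for domain in EXPERT_DOMAINS
            ]
            # The Critic audits and condenses the Experts' findings.
            summary = call_model(
                MODEL_TIERS["critic"], "CRITIC\n" + "\n".join(findings)
            )
            notes.append((phase, summary))
            # The Director decides whether this phase is finished.
            decision = call_model(
                MODEL_TIERS["director"], f"DECIDE phase={phase}\n{summary}"
            )
            if decision.strip().lower() == "advance":
                break
    # The Director writes the final report from the condensed notes only.
    report = call_model(
        MODEL_TIERS["director"],
        "REPORT\n" + "\n".join(f"[{p}] {s}" for p, s in notes),
    )
    return report, notes
```

With a stub call_model that always lets the Director advance, one alert costs twelve cheap Expert calls, three mid-tier Critic calls, and four top-tier Director calls, which is the cost shape the knowledge-pyramid tiering aims for.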

Key systems

  • systems/slack-deploy-safety-program — the 18-month reliability program; 90% reduction in customer impact hours; 10-min automated MTTR / 20-min manual MTTR / detect-before-10%-fleet North Stars; "imperfect analog of customer sentiment" program metric; exec-sponsored; OKR-weighted.
  • systems/slack-releasebot — Slack's 2018-era metrics-based-deploy + automatic-rollback orchestrator for Webapp backend; the blueprint the 2023-2025 centralised orchestration system generalises across substrates.
  • systems/slack-bedrock — Slack's internal compute platform over Kubernetes; the substrate on which the first fully-automated metrics-based-deploy-with-auto-rollback regime was built.
  • systems/slack-spear — Slack's multi-agent security-investigation service that triages detection-system alerts during on-call shifts. Three-persona agent team (Director / 4 Experts / Critic) running across three phases (discovery / trace / conclude) on a Hub + Worker + Dashboard service architecture. Name inferred from image-asset URL slugs; post uses "our service".
  • systems/slack-shipyard — Slack's upcoming successor to the legacy Chef-based EC2 platform. Preview-only in the 2025-10-23 post; service-level deployments, metric-driven rollouts, fully-automated rollbacks; soft launch Q4 2025 with two pilot teams. For teams that can't yet move to Bedrock.
  • systems/chef / systems/chef-librarian / systems/chef-summoner / systems/poptart-bootstrap / systems/chef-policyfiles — Slack's legacy EC2 fleet-configuration substrate and its phase-2 components (the Policyfiles alternative was explicitly rejected on blast-radius-of-change grounds).
  • systems/enzyme-to-rtl-codemod — Slack's open-sourced (@slack/enzyme-to-rtl-codemod) AI-powered test-conversion pipeline: AST codemod handles deterministic Enzyme→RTL rewrites + writes in-code annotation comments for hard cases, rendered DOM is captured per-test-case by instrumenting Enzyme's mount/shallow, and an LLM (Anthropic Claude 2.1 at time of post) consumes the annotated file + DOM + structured prompt to finish the conversion. ~80% quality on evaluation files; ~64% adoption across Slack's RTL migration.
  • systems/enzyme — testing framework Slack was migrating away from (no React 18 adapter).
  • systems/react-testing-library — target framework.
  • systems/claude-2-1 — LLM backend used in the original pipeline (via Slack's internal DevXP AI team integration).
  • systems/jest — underlying test runner.
  • systems/playwright — Slack's E2E framework. Used as the integration substrate for Axe accessibility checks via the custom-fixture extension pattern.
  • systems/axe-core / systems/axe-core-playwright — Deque Systems' accessibility rule engine and its Playwright binding. Slack's 2022-launched integration into the Playwright E2E suite.
  • systems/buildkite — Slack's CI orchestrator; hosts the nightly a11y regression run (one leg of Slack's tri-mode opt-in execution pattern).
  • systems/jira — Slack's ticket-tracking tool; receives auto-created tickets from the Slack a11y alert-channel workflow (patterns/alert-channel-to-jira-auto-ticket-workflow).
  • systems/slack-android — Slack's native Android client; anchors the 2024 VPAT retrospective's mobile-a11y disclosures. Phone-only (no large-form-factor support yet).
  • systems/slack-kit — Slack's shared mobile component library (SK). Components surfaced on the wiki so far: OutlinedTextField, SKBanner, SKList / SKListAdapter, SK divider, and the SKListAccessibilityDelegate introduced by the VPAT resolution.
  • systems/talkback — Android's built-in screen reader; the primary assistive-tech target for Slack Android's a11y work.
  • systems/android-accessibility-framework — Android platform a11y APIs (AccessibilityDelegate, AccessibilityNodeInfoCompat, CollectionInfo, custom-action API) that the Slack Kit fixes plug into.
  • systems/prometheus-blackbox-exporter — "a cornerstone of our monitoring" at Slack; canonical client-side black-box probing substrate for edge endpoints. Slack intern Sebastian Feliciano open-sourced the http3 module into this upstream project to close the HTTP/3 probing gap, built on systems/quic-go.
  • systems/quic-go — Go QUIC library Slack's BBE http3 contribution is built on; "wide adoption across other open source technologies, as well as the first-class support it provides in creating http clients in go" was the selection rationale.
  • systems/prometheus — the TSDB backing Slack's edge-probing metrics pipeline.
  • systems/grafana — Slack's "single pane of glass" dashboarding layer; canonically unifies HTTP/1.1 + HTTP/2 + HTTP/3 probe metrics side-by-side post-BBE-contribution.
  • systems/enterprise-grid / systems/unified-grid / systems/slack-rtm — Slack's tenancy substrate before / after / pre-requisite for the 2024 Unified Grid re-architecture.
  • systems/slack-quip / systems/slack-canvas — the collaborative-document + in-Slack canvas surfaces whose shared Python-backend + TypeScript-frontend build pipeline is the subject of Slack's 2025-11-06 build-system retrospective (60 min → 10 min).
  • systems/bazel / systems/starlark — the build system and its constrained DSL adopted by the Quip/Canvas team; central to the 60 min → 10 min story. Slack's framing: "Bazel's magic is contingent on the declared graph actually being a DAG of hermetic, idempotent actions" — adopting Bazel alone achieves nothing; the engineering work to meet its preconditions is where the speed-up lives.
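
The cache-granularity point behind the Bazel entry above can be illustrated with a toy action cache. This is a minimal sketch under stated assumptions, not Bazel's actual implementation: ActionCache, action_key, bundle, frontend_inputs, and the file paths are all hypothetical names invented for the example. The idea it shows is that a cached result is reused only when the command and every declared input are unchanged, so an action that declares the whole tree as input is re-run by an unrelated backend edit, while one that declares only its frontend inputs is not.

```python
import hashlib
from typing import Callable


def action_key(command: str, inputs: dict[str, bytes]) -> str:
    """Hash the command plus the path and content digest of each declared input."""
    h = hashlib.sha256()
    h.update(command.encode())
    for path in sorted(inputs):  # sorted so the key does not depend on dict order
        h.update(b"\0" + path.encode() + b"\0")
        h.update(hashlib.sha256(inputs[path]).digest())
    return h.hexdigest()


class ActionCache:
    """Toy content-addressed cache; assumes actions are hermetic and idempotent."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}
        self.executions = 0

    def run(
        self,
        command: str,
        inputs: dict[str, bytes],
        action: Callable[[dict[str, bytes]], bytes],
    ) -> bytes:
        key = action_key(command, inputs)
        if key not in self._store:
            self.executions += 1  # cache miss: the action really runs
            self._store[key] = action(inputs)
        return self._store[key]


def bundle(inputs: dict[str, bytes]) -> bytes:
    """Stand-in frontend build step: output depends only on frontend files."""
    return b"".join(inputs[p] for p in sorted(inputs) if p.startswith("frontend/"))


def frontend_inputs(tree: dict[str, bytes]) -> dict[str, bytes]:
    """Fine-grained input declaration: only what the bundle step reads."""
    return {p: c for p, c in tree.items() if p.startswith("frontend/")}


def demo() -> tuple[int, int]:
    tree_v1 = {"backend/app.py": b"print('v1')", "frontend/index.ts": b"export {}"}
    tree_v2 = {**tree_v1, "backend/app.py": b"print('v2')"}  # backend-only edit

    coarse = ActionCache()  # declares the whole tree as input
    coarse.run("bundle", tree_v1, bundle)
    coarse.run("bundle", tree_v2, bundle)

    fine = ActionCache()  # declares only frontend files as input
    fine.run("bundle", frontend_inputs(tree_v1), bundle)
    fine.run("bundle", frontend_inputs(tree_v2), bundle)

    return coarse.executions, fine.executions
```

Calling demo() returns (2, 1): the coarse declaration rebuilds the bundle after a backend-only edit even though the output is byte-identical, while the fine-grained declaration gets a cache hit.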

Key patterns / concepts

Recent articles

Last updated · 470 distilled / 1,213 read