Slack

Slack Engineering blog (slack.engineering). Tier-2 source on the sysdesign-wiki. Slack is a workplace-messaging platform (acquired by Salesforce; announced in 2020, completed in 2021) with substantial engineering output across backend (Flannel, Vitess-for-Slack), mobile (cross-platform client architecture), frontend (TypeScript/React at large scale, shared-edit collaboration), and developer infrastructure (CI, test frameworks, migration tooling).

This wiki's coverage of Slack spans the following axes so far:

  1. Developer-productivity tooling at scale — Slack's public retrospective on using LLMs to automate a 15,000-test Enzyme → RTL migration, which canonicalised a reusable AST + LLM hybrid conversion pattern.
  2. Reliability engineering at scale — Slack's 18-month Deploy Safety Program (mid-2023 → Jan 2025) that reduced customer impact hours from change-triggered incidents by 90%, canonicalised in the 2025-10-07 retrospective.
  3. Test-framework integration at scale — Slack's 2022-launched integration of Axe Core accessibility checks into the existing Playwright E2E suite as a custom-fixture extension, canonicalising several reusable patterns (fixture-extension as integration surface, two-axis exclusion-list, severity-gated reporting, tri-mode opt-in execution, alert-to-Jira workflow).
  4. Mobile accessibility engineering — Slack's 2024 third-party VPAT audit of the IA4-redesigned Android client; 8 recurring themes (7 resolved, 1 deferred) with fixes concentrated at the Slack Kit component-library layer. Canonicalised the patterns/accessibility-delegate-override-for-semantic-fix pattern (Slack's new SKListAccessibilityDelegate fixing CollectionInfo for decorative dividers), the patterns/custom-talkback-actions-as-gesture-alternative pattern (workspace-switcher drag-reorder via TalkBack "Move before" / "Move after" actions + six-dot drag handle + Edit mode), and the patterns/vpat-driven-a11y-triage four-bucket workflow. Complements the 2025-01-07 automated-Axe ingest as the manual / third-party / periodic layer of the same broader a11y strategy (see concepts/automated-vs-manual-testing-complementarity).
  5. Fleet-configuration-management at scale — Slack's 2025-10-23 Chef phase-2 post canonicalises the AZ-bucketed environment split, the signal-driven fleet-config-apply pipeline (Chef Librarian → S3 → Chef Summoner), the release-train-with-canary rollout, and the self-update-with-independent-fallback-cron pattern — one level below the Deploy Safety Program's altitude, in the EC2 / Chef substrate specifically.
  6. Build-system engineering at scale — Slack Quip/Canvas team's 60 min → 10 min (~6×) build speed-up via Bazel + Starlark adoption, canonicalised in the 2025-11-06 retrospective. The load-bearing insight: Bazel gives nothing to a build whose graph has cycles / non-hermetic actions / coarse cache keys; the real wins came from applying classical separation-of-concerns and layering principles to build code itself. Canonicalised three new patterns — patterns/decouple-frontend-build-from-backend-artifacts, patterns/delete-inner-parallelization-inside-outer-orchestrator, patterns/diff-artifact-validator-for-build-refactor — plus the concepts/cache-granularity and concepts/idempotent-build-action concepts.
  7. Edge-networking and synthetic monitoring — Slack's 2026-03-31 retrospective on rolling out HTTP/3 on the public edge. Closed the HTTP/3 probing gap — neither SaaS observability tools nor Slack's Prometheus Blackbox Exporter fleet ("a cornerstone of our monitoring") natively spoke QUIC/UDP before the intern project. Intern Sebastian Feliciano scoped, implemented, and open-sourced QUIC support into Prometheus Blackbox Exporter upstream using systems/quic-go as the client library, then built an in-house integration on the same branch so Slack could ship HTTP/3 probing to production before upstream merge — canonical instance of patterns/upstream-contribution-parallel-to-in-house-integration. Extends the patterns/upstream-fixes-to-community pattern with a new altitude: upstream-a-whole-new-feature, distinct from the Shopify × Reanimated fix-existing-feature-at-scale altitude. Canonicalised Slack's explicit "monitor first, migrate second" takeaway as the concepts/observability-before-migration concept. Final payoff: HTTP/1.1 + HTTP/2 + HTTP/3 metrics in one Grafana "single pane of glass".
  8. Security-engineering + AI-agent operations — Slack Security Engineering team's 2025-12-01 retrospective on Spear, their multi-agent security-investigation service that triages detection-system alerts during on-call shifts. The first post in a promised series. Canonical first wiki instance of the patterns/director-expert-critic-investigation-loop pattern (three-persona agent team: Director plans / progresses phases / writes final report, four Experts — Access / Cloud / Code / Threat — produce domain-specific findings, Critic audits + condenses). Canonical first wiki instance of the concepts/knowledge-pyramid-model-tiering concept (Experts on cheap models, Critic on mid-tier, Director on top-tier). Canonical first wiki instance of the patterns/hub-worker-dashboard-agent-service pattern (Hub + Worker + Dashboard productisation shape). Load-bearing architectural lesson verbatim: "prompts are just guidelines; they're not an effective method for achieving fine-grained control" — canonical statement of concepts/prompt-is-not-control. Slack's canonical emergent-behaviour worked example: Critic caught a credential exposure the Expert missed, Director pivoted the investigation. Canonicalised four concepts (concepts/knowledge-pyramid-model-tiering, concepts/investigation-phase-progression, concepts/weakly-adversarial-critic, concepts/prompt-is-not-control) + four patterns (patterns/director-expert-critic-investigation-loop, patterns/one-model-invocation-per-task, patterns/hub-worker-dashboard-agent-service, patterns/phase-gated-investigation-progression) + one system (systems/slack-spear).
  9. Preference-architecture / notification engineering at scale — Slack's 2026-03-19 retrospective on the Notifications 2.0 rebuild that migrated millions of users from four conflicting preference models (desktop / mobile × content-selection / delivery-channel conflated into one enum each) to a single unified hierarchy (All new messages / Mentions and DMs / Mute × independent desktop + mobile push toggles × explicit Advanced override dimension) — without a database-level backfill. The load-bearing architectural move is patterns/read-time-schema-translation: a new desktop_push_enabled boolean was added and backfilled from the legacy "off" value; at every read site, legacy desktop: 'off' is translated to desktop: 'mentions' + desktop_push_enabled: false so behaviour is byte-identical but expressible in the new decoupled schema. Rollback is automatic because storage bytes never changed. Canonicalises three patterns (patterns/read-time-schema-translation, patterns/decouple-what-from-how-in-preferences, patterns/unified-preference-model-for-cross-client-state) + seven concepts (concepts/read-time-preference-translation, concepts/preference-schema-decoupling, concepts/cross-platform-preference-parity, concepts/mental-model-preference-coherence, concepts/support-burden-as-architecture-signal, concepts/explicit-state-over-implicit-sync, concepts/auto-save-modal-ux-coherence) + one system (systems/slack-notifications-2-0). Notifications were "one of the top three drivers of Customer Experience tickets" pre-project — support burden as architecture signal canonicalised. Post-launch: 5× increase in settings engagement sustained for weeks; majority chose "Mentions and DMs" default; fewer per-channel overrides needed. One disclosed data-integrity incident: "A malformed field once reset preferences to Mentions until we cleaned data and flushed memcache" — surfaces the thin-validation + aggressive-caching + default-fallback compound failure shape of heavily-cached preference stores.
  10. Data-platform security & substrate modernisation — Slack's 2026-05-05 retrospective on eliminating SSH entirely from its EMR data pipelines. 700+ production jobs across 8 independent data regions, 7 operator types, 5 teams, 3 quarters, zero downtime. The architectural unblocker was a single REST gateway — Quarry — that fronts YARN / Trino / Snowflake from one auth surface and one log schema. The breakthrough enabler for shell-class jobs was discovering that YARN Distributed Shell was already a REST-submittable shell runner that could replace 300+ CLI-based SSH jobs without a custom remote-execution service. SSH's costs surfaced as structural blockers — the article verbatim: "We were blocked: Couldn't start the path for Spark on Kubernetes nor EMR on EKS […] Couldn't complete our Whitecastle initiative because we needed to move the last main-account EMR clusters to child accounts." Two named challenges illuminate the latent failure modes SSH was hiding: vmem-check failures (jobs that ran fine via SSH had been silently exceeding YARN's resource limits — fix: AWS-recommended yarn.nodemanager.vmem-check-enabled: false) and an EKM connectivity timeout during cross-cluster migration that "surfaced a hidden dependency on network topology that wasn't captured in the job's configuration." The disciplined rollout used patterns/incremental-operator-by-operator-migration (CrunchExecOperator, S3SyncOperator, etc., one at a time as mini-projects) over 5 phases (POC → Security Review → OKR-Driven Execution → Bulk Migration → Final Cleanup) tracked via an analytics dashboard backed by Airflow metadata-DB queries — "Build monitoring before you migrate."
Canonicalises three patterns (patterns/rest-gateway-for-compute-engine-job-submission, patterns/yarn-distributed-shell-as-universal-shell-executor, patterns/incremental-operator-by-operator-migration), four concepts (concepts/rest-based-job-submission, concepts/ssh-job-execution-anti-pattern, concepts/resource-enforcement-bypass-via-ssh, concepts/master-node-resource-contention), and three systems (systems/slack-quarry, systems/yarn-distributed-shell, systems/apache-yarn). Extends concepts/observability-before-migration (the project-progress altitude — Airflow-DB-backed dashboard tracks remaining SSH jobs per region per team — sibling to the prior HTTP/3 transport-probe altitude), concepts/attack-surface-minimization (the structural-blocker framing of "can't modernise anything until SSH dies"), concepts/long-lived-key-risk (eliminates the long-lived-SSH-key class at industrial scale via service-token replacement), and concepts/audit-trail (Quarry's per-submission structured logs are the new audit substrate — "No more 'who ran that command?' mysteries").
  11. Multi-cloud LLM serving (Slack AI infrastructure) — Slack's 2026-05-28 retrospective on the three-year evolution of the Slack AI LLM serving substrate from single-region AWS SageMaker (early 2023, with escrow VPC for Anthropic) → fully managed Amazon Bedrock Provisioned Throughput (mid-2024, zero-incident migration — canonical patterns/zero-incident-llm-migration) → Hybrid PT+OD with spillover (mid-2025, canonical patterns/provisioned-throughput-with-on-demand-spillover) → multi-cloud Bedrock + GCP Vertex AI (early 2026). The architectural endpoint is Slack's Intelligent Routing Layer — five subsystems hidden behind a unified internal API: metric-driven model selection (per-feature primary + designated backup), A/B experimental routing (feedback loop "weeks → days"), automated circuit breaker with partial-open recovery state (TTFT + p90 latency + 5xx error rate + customer-feedback trends as triggers), API normalisation layer, secretless cross-cloud authentication. Disclosed Phase 4 outcomes: ~10% quality lift on complex reasoning + ~67% latency reduction on high-velocity / low-token workloads via per-feature model-to-feature binding. Three-year arc canonicalises four structural failure modes that drove the evolution: model feature lag (Phase 1 → 2), over-provisioning cycle + commitment lock-in (Phase 2 → 3), concentration risk on single-cloud LLM serving (Phase 3 → 4). Slack adopts the MU primitive on the customer side (sibling to Databricks' platform-side coining of the same primitive). Five reflections verbatim: "Scaling safely requires XFN parity / The abstraction layer is a core requirement / Treat architecture as a living document / Reliability requires provider agnosticism / Redefining the meaning of 'Failure'". Canonicalises 3 systems (systems/slack-ai, systems/slack-intelligent-routing-layer, systems/gcp-vertex-ai), 10 concepts (concepts/multi-cloud-llm-serving, concepts/escrow-vpc-llm-serving, concepts/llm-model-feature-lag, concepts/provisioned-throughput-vs-on-demand-llm, concepts/llm-over-provisioning-cycle, concepts/llm-provider-commitment-lock-in, concepts/api-normalization-multi-cloud-llm, concepts/model-to-feature-binding, concepts/concentration-risk-single-cloud-llm, concepts/automated-circuit-breaker-with-partial-open-state), and 5 patterns (patterns/multi-cloud-llm-serving, patterns/provisioned-throughput-with-on-demand-spillover, patterns/api-normalization-layer-cross-provider, patterns/model-fallback-hierarchy-with-circuit-breaker, patterns/zero-incident-llm-migration).
Extends systems/aws-sagemaker-ai (Phase 1 escrow-VPC face), systems/amazon-bedrock (Phase 2/3 substrate face + PT+OD primitive disclosure + Slack's customer-side MU adoption), concepts/model-units (customer-side adoption sibling to Databricks platform-side coining), patterns/circuit-breaker (partial-open ramp refinement). Tenth Slack coverage axis on the wiki — structurally one altitude above all prior Slack AI work (the developer-productivity LLM-codemod axis from 2024-06-19, the security-investigation Spear axis from 2025-12-01, the agent-context-engineering axis from 2026-04-13) — this is the infrastructure substrate on which all consumer-facing Slack AI features ride.
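
The read-time schema translation described in the Notifications 2.0 axis above can be illustrated with a minimal sketch. The field names desktop and desktop_push_enabled and the 'off' → 'mentions' mapping come from the summary; the function name, the dict-based record shape, and the default of True for non-legacy records are assumptions made for illustration, not Slack's implementation.

```python
def read_desktop_prefs(stored: dict) -> dict:
    """Return the decoupled view of a stored preference record.

    Hypothetical helper: the stored record is never rewritten, so rolling
    back only means removing this translation step.
    """
    prefs = dict(stored)  # copy, so the stored bytes stay untouched
    if prefs.get("desktop") == "off":
        # Legacy 'off' conflated "what to notify about" with "how to deliver".
        # Split it into a content level plus an explicit push toggle.
        prefs["desktop"] = "mentions"
        prefs["desktop_push_enabled"] = False
    else:
        # Assumption: records without the new field behave as push-enabled.
        prefs.setdefault("desktop_push_enabled", True)
    return prefs


legacy = {"desktop": "off"}
print(read_desktop_prefs(legacy))
print(legacy)  # unchanged: translation happens only on the read path
```

Because the translation lives on the read path, every consumer sees the new shape while storage keeps the legacy value, which is what makes the rollback property in the summary possible.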

Key systems

  • systems/slack-ai — the Slack AI feature suite (channel summaries, Recap, AI Search, related generative / extractive surfaces). Production substrate at "millions of users" scale by early 2026. Built starting early 2023 on AWS SageMaker; migrated to Amazon Bedrock mid-2024 with zero customer-facing incidents; expanded to multi-cloud Bedrock + GCP Vertex AI in early 2026. FedRAMP Moderate maintained across all phases. Sits on top of Slack's internal Kubernetes substrate (Slack Bedrock — not to be confused with Amazon Bedrock, the LLM service).
  • systems/slack-intelligent-routing-layer — Slack AI's internal LLM abstraction layer that fronts every model provider and exposes a unified internal contract. Five subsystems: metric-driven model selection (per-feature primary + designated backup), A/B experimental routing (feedback loop weeks → days), automated circuit breaker with partial-open recovery state (TTFT + p90 latency + 5xx error rate + customer-feedback trends as triggers), API normalisation layer (unifies error codes / rate-limits / telemetry / auth across providers), secretless cross-cloud authentication. Architectural endpoint of Slack's three-year Slack AI evolution. Production substrate behind ~10% quality lift on complex reasoning + ~67% latency reduction on high-velocity / low-token workloads.
  • systems/gcp-vertex-ai — Google Cloud's managed LLM platform; the second cloud Slack added alongside AWS Bedrock in early 2026 to break single-cloud concentration risk and access vendor-exclusive frontier models. Wiki's first canonical disclosure of Vertex AI as an enterprise multi-cloud LLM endpoint.
  • systems/amazon-bedrock — AWS's managed LLM platform; Slack's Phase 2 (mid-2024) substrate after migrating from self-managed SageMaker. PT for latency-sensitive features (channel summaries, AI Search), OD for bursty workloads (nightly Recap), spillover from PT to OD when reserved capacity ceiling is hit. Capacity primitive: Model Units (Slack adopts MUs on the customer side; sibling to Databricks' platform-side coining).
  • systems/aws-sagemaker-ai — AWS's managed ML platform; Slack's Phase 1 (early 2023) substrate. Hosted Anthropic models inside an escrow VPC establishing zero-knowledge between Slack and the provider. Multi-region with cross-region IAM, ODCR + cron scaling. Operational taxes (scaling latency, GPU scarcity, over-provisioning, model feature lag) drove Phase 2 migration to Bedrock.
  • systems/slack-deploy-safety-program — the 18-month reliability program; 90% reduction in customer impact hours; 10-min automated MTTR / 20-min manual MTTR / detect-before-10%-fleet North Stars; "imperfect analog of customer sentiment" program metric; exec-sponsored; OKR-weighted.
  • systems/slack-releasebot — Slack's 2018-era metrics-based-deploy + automatic-rollback orchestrator for Webapp backend; the blueprint the 2023-2025 centralised orchestration system generalises across substrates.
  • systems/slack-bedrock — Slack's internal compute platform over Kubernetes; the substrate on which the first fully-automated metrics-based-deploy-with-auto-rollback regime was built.
  • systems/slack-spear — Slack's multi-agent security-investigation service that triages detection-system alerts during on-call shifts. Three-persona agent team (Director / 4 Experts / Critic) running across three phases (discovery / trace / conclude) on a Hub + Worker + Dashboard service architecture. Name inferred from image-asset URL slugs; post uses "our service".
  • systems/slack-notifications-2-0 — Slack's rebuilt notification-preference system; migrated millions of users from four per-client preference enums (desktop / mobile × content-selection / delivery-channel conflated) to one unified hierarchy (All new messages / Mentions and DMs / Mute × independent desktop + mobile push toggles × Advanced override dimension) via read-time schema translation — no database backfill, byte-identical rollback. Post-launch: 5× settings engagement, majority picked "Mentions and DMs" default, fewer per-channel overrides. One disclosed data-integrity incident (malformed field + memcache served default values → silent preferences-reset).
  • systems/slack-shipyard — Slack's upcoming successor to the legacy Chef-based EC2 platform. Preview-only in the 2025-10-23 post; service-level deployments, metric-driven rollouts, fully-automated rollbacks; soft launch Q4 2025 with two pilot teams. For teams that can't yet move to Bedrock.
  • systems/slack-quarry — Slack's REST-based job-submission gateway between Airflow and multiple compute engines (YARN on EMR, Trino, Snowflake). Replaced 700+ SSH-based jobs across 8 data regions over 3 quarters with zero downtime — service-to-service tokens at the gateway edge, server-side job-state tracking surviving Kubernetes pod restarts, structured per-submission logs as the audit substrate. The breakthrough that made it viable for all job types was discovering YARN Distributed Shell was already a REST-submittable shell runner. Canonical wiki instance of patterns/rest-gateway-for-compute-engine-job-submission.
  • systems/yarn-distributed-shell — the YARN ApplicationMaster that runs arbitrary shell scripts in YARN containers via the standard YARN REST API. "A little-known feature […] already part of YARN, used the same REST APIs, and required no custom security layer." The breakthrough enabler for Quarry's universal job-submission promise.
  • systems/apache-yarn — the resource manager Quarry/Slack submits to; surfaced the latent resource-violations that SSH-to-master-node had been silently bypassing.
  • systems/chef / systems/chef-librarian / systems/chef-summoner / systems/poptart-bootstrap / systems/chef-policyfiles — Slack's legacy EC2 fleet-configuration substrate and its phase-2 components (the Policyfiles alternative was explicitly rejected on blast-radius-of-change grounds).
  • systems/enzyme-to-rtl-codemod — Slack's open-sourced (@slack/enzyme-to-rtl-codemod) AI-powered test-conversion pipeline: AST codemod handles deterministic Enzyme→RTL rewrites + writes in-code annotation comments for hard cases, rendered DOM is captured per-test-case by instrumenting Enzyme's mount/shallow, and an LLM (Anthropic Claude 2.1 at time of post) consumes the annotated file + DOM + structured prompt to finish the conversion. ~80% quality on evaluation files; ~64% adoption across Slack's RTL migration.
  • systems/enzyme — testing framework Slack was migrating away from (no React 18 adapter).
  • systems/react-testing-library — target framework.
  • systems/claude-2-1 — LLM backend used in the original pipeline (via Slack's internal DevXP AI team integration).
  • systems/jest — underlying test runner.
  • systems/playwright — Slack's E2E framework. Used as the integration substrate for Axe accessibility checks via the custom-fixture extension pattern.
  • systems/axe-core / systems/axe-core-playwright — Deque Systems' accessibility rule engine and its Playwright binding. Slack's 2022-launched integration into the Playwright E2E suite.
  • systems/buildkite — Slack's CI orchestrator; hosts the nightly a11y regression run (one leg of Slack's tri-mode opt-in execution pattern).
  • systems/jira — Slack's ticket-tracking tool; receives auto-created tickets from the Slack a11y alert-channel workflow (patterns/alert-channel-to-jira-auto-ticket-workflow).
  • systems/slack-android — Slack's native Android client; anchors the 2024 VPAT retrospective's mobile-a11y disclosures. Phone-only (no large-form-factor support yet).
  • systems/slack-kit — Slack's shared mobile component library (SK). Components surfaced on the wiki so far: OutlinedTextField, SKBanner, SKList / SKListAdapter, SK divider, and the SKListAccessibilityDelegate introduced by the VPAT resolution.
  • systems/talkback — Android's built-in screen reader; the primary assistive-tech target for Slack Android's a11y work.
  • systems/android-accessibility-framework — Android platform a11y APIs (AccessibilityDelegate, AccessibilityNodeInfoCompat, CollectionInfo, custom-action API) that the Slack Kit fixes plug into.
  • systems/prometheus-blackbox-exporter — "a cornerstone of our monitoring" at Slack; canonical client-side black-box probing substrate for edge endpoints. Slack intern Sebastian Feliciano open-sourced the http3 module into this upstream project to close the HTTP/3 probing gap, built on systems/quic-go.
  • systems/quic-go — Go QUIC library Slack's BBE http3 contribution is built on; "wide adoption across other open source technologies, as well as the first-class support it provides in creating http clients in go" was the selection rationale.
  • systems/prometheus — the TSDB backing Slack's edge-probing metrics pipeline.
  • systems/grafana — Slack's "single pane of glass" dashboarding layer; canonically unifies HTTP/1.1 + HTTP/2 + HTTP/3 probe metrics side-by-side post-BBE-contribution.
  • systems/enterprise-grid / systems/unified-grid / systems/slack-rtm — Slack's tenancy substrate before / after / pre-requisite for the 2024 Unified Grid re-architecture.
  • systems/slack-quip / systems/slack-canvas — the collaborative-document + in-Slack canvas surfaces whose shared Python-backend + TypeScript-frontend build pipeline is the subject of Slack's 2025-11-06 build-system retrospective (60 min → 10 min).
  • systems/bazel / systems/starlark — the build system and its constrained DSL adopted by the Quip/Canvas team; central to the 60 min → 10 min story. Slack's framing: "Bazel's magic is contingent on the declared graph actually being a DAG of hermetic, idempotent actions" — adopting Bazel alone achieves nothing; the engineering work to meet its preconditions is where the speed-up lives.
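
The partial-open recovery state listed under systems/slack-intelligent-routing-layer can be pictured with a small state machine. This is a sketch under stated assumptions, not Slack's implementation: the class name, every threshold, the doubling ramp, and the collapse of the health signals (TTFT, p90 latency, 5xx rate) into a single boolean are all hypothetical, since the source characterises breaker thresholds only qualitatively.

```python
import random
import time


class PartialOpenBreaker:
    """Hypothetical breaker: closed -> open -> partial_open -> closed."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0,
                 initial_fraction=0.05, successes_to_ramp=5,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.initial_fraction = initial_fraction
        self.successes_to_ramp = successes_to_ramp
        self.clock = clock
        self.state = "closed"
        self.fraction = 1.0  # share of traffic allowed through
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def allow(self, draw=None):
        """Decide whether one request may reach the endpoint."""
        if self.state == "open":
            if self.clock() - self._opened_at < self.cooldown_s:
                return False
            # Cooldown elapsed: start a small, controlled trickle.
            self.state = "partial_open"
            self.fraction = self.initial_fraction
            self._successes = 0
        if self.state == "partial_open":
            draw = random.random() if draw is None else draw
            return draw < self.fraction
        return True

    def record(self, healthy):
        """Report whether the last request met the health criteria."""
        if not healthy:
            self._successes = 0
            self._failures += 1
            # Any unhealthy result during the trickle reopens the breaker.
            if self.state == "partial_open" or self._failures >= self.failure_threshold:
                self._trip()
            return
        self._failures = 0
        if self.state == "partial_open":
            self._successes += 1
            if self._successes >= self.successes_to_ramp:
                # Sustained health: expand the trickle (doubling is an assumption).
                self._successes = 0
                self.fraction = min(1.0, self.fraction * 2)
                if self.fraction >= 1.0:
                    self.state = "closed"

    def _trip(self):
        self.state = "open"
        self.fraction = 0.0
        self._failures = 0
        self._opened_at = self.clock()
```

A caller would wrap each provider request in allow() and record(), routing to the designated backup model whenever allow() returns False. Passing healthy=False for a slow but successful response is how the "up but slow is effectively broken" reflection would be expressed in this sketch.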

Key patterns / concepts

Recent articles

  • 2026-05-28 — sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud (Slack AI infrastructure team retrospective on the three-year evolution of the Slack AI LLM serving substrate from single-region AWS SageMaker (early 2023) → fully managed Amazon Bedrock PT (mid-2024) → Hybrid PT+OD with spillover (mid-2025) → multi-cloud Bedrock + GCP Vertex AI (early 2026). Architectural endpoint: Slack's Intelligent Routing Layer with five subsystems (metric-driven model selection / A-B experimental routing / automated circuit breaker with partial-open recovery / API normalisation / secretless cross-cloud auth) hidden behind a unified internal API. Disclosed Phase 4 outcomes: ~10% quality lift on complex reasoning + ~67% latency reduction on high-velocity / low-token workloads via per-feature model-to-feature binding. Three-year arc canonicalises four structural failure modes that drove the evolution: model feature lag (Phase 1 → 2; verbatim "weeks or months before SageMaker availability" for Anthropic models on Bedrock vs SageMaker), over-provisioning cycle (Phase 2 → 3; "persistent efficiency gap" between US-coast peak and APAC/EU mornings; 10× peak/off-peak variance for OD-suitable workloads), commitment lock-in (Phase 2 → 3; verbatim "a state-of-the-art model can be superseded in weeks" but PT contracts run 1–6 months; post-OD migration cadence months → 1 day), concentration risk on single-cloud LLM serving (Phase 3 → 4; "no matter how many failovers we engineered within a single cloud, we remained susceptible to any potential provider-wide outage"). Phase 1 substrate: escrow VPC for Anthropic models on SageMaker, "a strict zero-knowledge environment: our data remained private to Slack, and the provider's proprietary model weights remained inaccessible to us". Phase 2 migration: zero-incident playbook — Compliance → Capacity → Quality → Rollout. Phase 3 hybrid: PT-with-OD-spillover verbatim — "if a sudden surge pushed us beyond our reserved limits, excess requests automatically 'spilled over' to on-demand endpoints, ensuring we never dropped a request due to capacity ceilings".
Phase 4 routing: per-feature model bindings + automated circuit breaker with partial-open recovery state verbatim — "the breaker enters a partial-open state, allowing a small, controlled trickle of requests to reach the degraded endpoint. As the endpoint demonstrates sustained health, the system dynamically expands this trickle, incrementally ramping traffic back up". Three named breaker triggers: TTFT, p90 latency, 5xx error rate (plus "downward trend in customer feedback" as a soft-signal trigger). Slack adopts the MU primitive on the customer side — sibling to Databricks' platform-side coining of the same primitive in the 2026-05-27 Reliable LLM Inference at Scale post; verbatim "Each MU provides a deterministic amount of throughput, measured in tokens per minute. Shifting from GPU instances to MUs allowed us to abstract away the hardware and focus entirely on raw throughput." Five reflections verbatim: "Scaling safely requires XFN parity / The abstraction layer is a core requirement / Treat architecture as a living document / Reliability requires provider agnosticism / Redefining the meaning of 'Failure'" — the last canonicalising soft-failures ("An LLM service that is 'up' but slow is effectively broken") as first-class routing-layer triggers. Four operational taxes of multi-cloud disclosed verbatim: API + behavioural friction, operational monitoring complexity, attribution challenge, on-call knowledge gap.
Canonicalises 3 systems (systems/slack-ai, systems/slack-intelligent-routing-layer, systems/gcp-vertex-ai), 10 concepts (concepts/multi-cloud-llm-serving, concepts/escrow-vpc-llm-serving, concepts/llm-model-feature-lag, concepts/provisioned-throughput-vs-on-demand-llm, concepts/llm-over-provisioning-cycle, concepts/llm-provider-commitment-lock-in, concepts/api-normalization-multi-cloud-llm, concepts/model-to-feature-binding, concepts/concentration-risk-single-cloud-llm, concepts/automated-circuit-breaker-with-partial-open-state), and 5 patterns (patterns/multi-cloud-llm-serving, patterns/provisioned-throughput-with-on-demand-spillover, patterns/api-normalization-layer-cross-provider, patterns/model-fallback-hierarchy-with-circuit-breaker, patterns/zero-incident-llm-migration). Extends 4 pages: systems/aws-sagemaker-ai (Phase 1 escrow-VPC face), systems/amazon-bedrock (Phase 2/3 substrate face + PT/OD primitive disclosure + Slack customer-side MU adoption), concepts/model-units (customer-side adoption sibling to Databricks' platform-side coining), patterns/circuit-breaker (partial-open ramp refinement for LLM serving). Tenth Slack coverage axis on the wiki — the LLM serving infrastructure substrate one altitude above the consumer-facing Slack AI features (developer-productivity LLM codemod, security-investigation Spear, agent-context-engineering, etc.), all of which ride on this substrate. Tier-2 on-scope decisively: multi-cloud LLM serving architecture is genuine distributed-systems / scaling-trade-offs / production-capacity-management content. Sibling at the LLM-serving-substrate altitude to Databricks' Reliable LLM Inference retrospective (sources/2026-05-27-databricks-reliable-llm-inference-at-scale, same-week ingest) — same primitives (MUs, cost-based routing, fallback hierarchy), different altitude (Slack is the customer; Databricks is the platform). Caveats: no specific provider model SKUs / no dollar figures / PT-vs-OD savings characterised as "substantial" without concrete numbers / circuit-breaker thresholds qualitative / per-feature primary/backup bindings not enumerated / cross-cloud traffic share not disclosed / specific secretless-auth federation shape not specified / retrospective framing on a clean four-phase progression with no acknowledgement of false starts.)

  • 2026-05-05 — sources/2026-05-05-slack-from-ssh-to-rest-a-security-driven-modernization-of-slacks-emr-data-pipelines (Slack data-platform team retrospective on eliminating SSH entirely from EMR data pipelines — 700+ production jobs across 8 independent data regions, 7 operator types, 5 teams, 3 quarters, zero downtime. The architectural unblocker: a single REST gateway — Quarry — fronting YARN / Trino / Snowflake from one auth surface and one log schema; the architecture shift verbatim: from "Airflow → SSH Connection → EMR Master Node → Execute Command" to "Airflow → Quarry REST API → YARN ResourceManager → EMR Container". The breakthrough enabler for the 300+ shell-class jobs (aws s3 sync, hadoop distcp, custom Python scripts) was discovering that YARN Distributed Shell — "a little-known feature […] already part of YARN" — could run arbitrary shell scripts in YARN containers via the standard YARN REST API, with no custom remote-execution service to build. Canonicalised as patterns/yarn-distributed-shell-as-universal-shell-executor. SSH's costs surfaced as structural blockers: "Couldn't start the path for Spark on Kubernetes nor EMR on EKS […] Couldn't complete our Whitecastle initiative because we needed to move the last main-account EMR clusters to child accounts." Two named challenges illuminate the latent failure modes SSH was hiding: vmem-check failures (jobs that ran fine via SSH had been silently exceeding YARN's virtual-memory limits — fix: AWS-recommended yarn.nodemanager.vmem-check-enabled: false because "virtual memory accounting in Linux can be unreliable, and physical memory limits are sufficient" — canonicalised as concepts/resource-enforcement-bypass-via-ssh) and an EKM connectivity timeout during cross-cluster migration that "surfaced a hidden dependency on network topology that wasn't captured in the job's configuration" — fix: move jobs to clusters with the right routing rather than punching holes through the network.
Honest disclosure on multi-region overhead: "effectively 8 parallel migrations, each with its own special set of challenges. […] The effort isn't just N times harder. It's N times harder with unique failure modes for each region." Disciplined rollout used patterns/incremental-operator-by-operator-migration (CrunchExecOperator, S3SyncOperator, etc., one at a time as mini-projects) over 5 phases — Proof of Concept → Security Review → OKR-Driven Execution (Phase 3 made the migration a Key Result with executive visibility; hit 80% milestone) → Bulk Migration → Final Cleanup — tracked via an analytics dashboard backed by Airflow metadata-DB queries identifying remaining SSH-based tasks per team / per DAG / per region. Verbatim best-practice: "Build monitoring before you migrate" — extends concepts/observability-before-migration to the project-progress altitude (Airflow-DB-backed burndown), sibling to the 2026-03-31 HTTP/3 transport-probe altitude. Operational improvements post-migration verbatim: "Master node resource contention: eliminated. […] Job reliability: dramatically improved. Jobs survive client Kubernetes pod restarts because Quarry maintains server-side job tracking. No more zombie processes." Security improvements verbatim: "replaced SSH key distribution with service-to-service token authentication, and gained proper audit trails through REST API logging. […] No more 'who ran that command?' mysteries." Two-year retrospective verdict: "With two years of production experience since completion, the architectural decisions have proven sound. […] No regrets." 
Canonicalises 3 systems (systems/slack-quarry, systems/yarn-distributed-shell, systems/apache-yarn), 4 concepts (concepts/rest-based-job-submission — the paradigm shift; concepts/ssh-job-execution-anti-pattern — what's being replaced; the two failure-mode concepts named above), and 3 patterns (patterns/rest-gateway-for-compute-engine-job-submission, patterns/yarn-distributed-shell-as-universal-shell-executor, patterns/incremental-operator-by-operator-migration). Extends concepts/attack-surface-minimization (with the "can't modernise anything until SSH dies" structural-blocker framing — complement to Meta's "feature gating" WhatsApp framing), concepts/long-lived-key-risk (industrial-scale elimination of the long-lived-SSH-key class via service-token replacement), and concepts/audit-trail (Quarry's per-submission structured logs as the new audit substrate). Caveats: no exact $-savings figure, no public Quarry open-source release, no Trino/Snowflake adapter internals disclosed, no quantified incident-rate delta, no token-rotation cadence at 700-jobs × 8-regions scale.)

  • 2026-04-13 — sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications (Slack Security Engineering team, part 2 of the series opened by the 2025-12-01 Spear post. Canonicalises how Spear manages context across long-running multi-agent investigations that can "span hundreds of inference requests and generate megabytes of output". Core architectural claim: three complementary context channels replace raw message history — Director's Journal (typed working memory: decision / observation / finding / question / action / hypothesis + phase + round + timestamp; read by all agents, written only by Director), Critic's Review with the four-tool introspection suite (get_tool_call, get_tool_result, get_toolset_info, list_toolsets) and a 5-level credibility rubric with a disclosed score distribution over 170,000 findings (37.7% Trustworthy / 25.4% Highly-plausible / 11.1% Plausible / 10.4% Speculative / 15.4% Misguided — 25.8% sub-plausibility rate), and the Critic's Timeline that implements narrative-coherence assembly with four explicit consolidation rules, a second 5-level rubric, and top-3 gap identification across three gap types (evidential / temporal / logical). Canonical load-bearing claim on narrative coherence as hallucination filter: "A hallucination can only survive this process if it is more coherent with the body of evidence than any real observation it competes with." Also: no message history carry-forward between invocations — "Besides these resources, we do not pass any message history forward between agent invocations" — justified both by token-budget and by cognitive-load arguments (over-sharing "stifles creativity and encourages confirmation bias" even with infinite context). Specimen investigation disclosed: 6,046-event false-positive kernel-module-loading alert with 0.83 Timeline confidence and 3 identified gaps.)
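The typed Journal entry and the rubric arithmetic described in this entry can be sketched as follows. This is a minimal Python illustration, not Spear's implementation: the names `JournalEntry`, `kind`, `text`, `round_no`, and `sub_plausible_share` are hypothetical, while the six entry types and the five rubric percentages are the ones quoted above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The six entry types named in the post; the class layout itself is assumed.
ENTRY_KINDS = {"decision", "observation", "finding", "question", "action", "hypothesis"}


@dataclass(frozen=True)
class JournalEntry:
    kind: str       # one of ENTRY_KINDS
    text: str       # free-form content written by the Director
    phase: str      # investigation phase the entry belongs to
    round_no: int   # round within the investigation
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self):
        if self.kind not in ENTRY_KINDS:
            raise ValueError(f"unknown journal entry kind: {self.kind!r}")


# Disclosed share of the 170,000 findings at each rubric level, in percent.
CREDIBILITY_SHARE = {
    "Trustworthy": 37.7,
    "Highly-plausible": 25.4,
    "Plausible": 11.1,
    "Speculative": 10.4,
    "Misguided": 15.4,
}


def sub_plausible_share(shares):
    """Share of findings scored below 'Plausible' (Speculative + Misguided)."""
    return round(shares["Speculative"] + shares["Misguided"], 1)
```

The two lowest levels add up to the 25.8% sub-plausibility rate quoted in the entry, and the five levels together account for all findings.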
  • 2025-12-01 — sources/2025-12-01-slack-streamlining-security-investigations-with-agents (Slack Security Engineering team retrospective on building Spear, their internal multi-agent security-investigation service that triages detection-system alerts during on-call shifts. First post in a promised series. Load-bearing architectural lesson verbatim: "prompts are just guidelines; they're not an effective method for achieving fine-grained control" — canonical statement of the new concepts/prompt-is-not-control concept after Slack's 300-word single-prompt prototype produced "wildly variable" quality. Prototype rewrite moved control out of the prompt into per-task model invocations with task-specific structured-output schemas (patterns/one-model-invocation-per-task) orchestrated by application code. Three-persona agent team (patterns/director-expert-critic-investigation-loop): Director (plans + progresses phases + writes final report; uses a journaling tool), 4 Experts (Access / Cloud / Code / Threat — each with unique tools + data sources), Critic ("meta-expert" auditing Experts against a rubric, scoring findings, condensing into a timeline). Three phases (patterns/phase-gated-investigation-progression): Discovery (broadcast to all Experts) → Trace (question one Expert, may vary model parameters) → Conclude (final report), with a Director-Decision meta-phase for transition decisions. Knowledge pyramid model tiering (concepts/knowledge-pyramid-model-tiering): verbatim "low, medium, and high-cost models for the expert, critic, and director functions, respectively." Tool-call-heavy Expert work runs on cheap models; rubric-application + condensation runs on mid-tier; strategic decisions on top-tier. 
Service architecture (patterns/hub-worker-dashboard-agent-service): Hub (API + persistent storage + metrics endpoint) + Worker (queue consumer, event-stream emitter, scalable) + Dashboard (real-time observe + replay + per-invocation debugging) replacing the prototype's coding-agent-CLI harness. Prototype exposed data sources via an MCP stdio server; production Worker's MCP persistence not explicitly disclosed. Critic's weakly-adversarial stance (concepts/weakly-adversarial-critic) verbatim: "The weakly adversarial relationship between the Critic and the expert group helps to mitigate against hallucinations and variability in the interpretation of evidence." Canonical worked example: the Critic caught a credential exposure the Expert missed during a process-ancestry review, the Director then pivoted the investigation to focus on the credential issue, final report surfaced both the security finding and the Expert's "analysis blind spots that require attention." Verbatim: "What is notable about this result is that the expert did not raise the credential exposure in its findings; the Critic noticed it as part of its meta-analysis of the expert's work." On-call shift mode-shift: "we're switching to supervising investigation teams, rather than doing the laborious work of gathering evidence." 9 canonical wiki primitives: source + 1 system (systems/slack-spear) + 4 concepts (concepts/knowledge-pyramid-model-tiering, concepts/investigation-phase-progression, concepts/weakly-adversarial-critic, concepts/prompt-is-not-control) + 4 patterns (patterns/director-expert-critic-investigation-loop, patterns/one-model-invocation-per-task, patterns/hub-worker-dashboard-agent-service, patterns/phase-gated-investigation-progression). 
Extends 5 pages: systems/model-context-protocol (new Seen-in — first wiki MCP instance inside an internal security-investigation pipeline), concepts/structured-output-reliability (new Seen-in — structured output as multi-agent orchestration-boundary contract), patterns/specialized-agent-decomposition (new Seen-in — canonical security-operations instance at the peer-Expert layer, with supra-agent Director + meta-agent Critic on top), patterns/multi-round-critic-quality-gate (new Seen-in — live-investigation altitude variant, distinguished from Meta's artifact-production rounds shape), patterns/drafter-evaluator-refinement-loop (new Seen-in — investigation-loop-with-third-layer variant, distinguished from Lyft's retry-only shape by the Director who decides progress/pivot/conclude). Scope disposition: Tier-2 on-scope decisively on multi-agent-architecture- canonicalisation grounds. Opens the Slack security- engineering axis on the wiki; first Slack security- operations ingest; seventh Slack coverage axis after developer-productivity, deploy-safety, test-framework- integration, mobile-a11y, fleet-config-management, build-systems, and edge-networking. Zero production numbers disclosed (no throughput / latency / cost / FP rate / token-usage); model families not disclosed; Critic's rubric opaque; Spear name inferred from image slugs not stated in post body. URL verbatim: https://slack.engineering/streamlining-security-investigations-with-agents/. Sibling to Cloudflare AI Code Review (patterns/coordinator-sub-reviewer-orchestration) at the code-review altitude; sibling to Redpanda Openclaw (patterns/four-component-agent-production-stack) at the enterprise-agent-substrate altitude; sibling to Lyft localization (patterns/drafter-evaluator-refinement-loop) at the structured-retry-loop altitude.)
  • 2026-03-19 — sources/2026-03-19-slack-how-slack-rebuilt-notifications (Slack Engineering retrospective on the Notifications 2.0 project — a ground-up redesign of Slack's notification preference system that migrated millions of users from four conflicting preference models to one unified hierarchy without a database-level backfill. The load-bearing mechanism is read-time schema translation: a new desktop_push_enabled boolean was added and backfilled from the legacy "off" value; at every read site, legacy desktop: 'off' is translated to desktop: 'mentions' + desktop_push_enabled: false — behaviour stays byte-identical, but now expressible in the new decoupled schema. Stored bytes never changed, so rollback is byte-identical safe. Canonical verbatim on why the database-backfill path was rejected: "With backwards compatibility and the possibility of rollback in mind, we thought it too risky to move people from 'off' to 'mentions' at the database level." The schema itself canonicalises decoupling what from how — the legacy enum conflated content selection (everything vs mentions vs nothing) with delivery channel (push on/off) into one axis; new schema splits into desktop: everything | mentions (what) + desktop_push_enabled: bool (how). Canonicalises one canonical preference hierarchy across clients with the Advanced section as the named override dimension for mobile-specific badge controls. Top-3 CX ticket driver pre-project canonicalises support burden as architecture signal: sustained high ticket volume concentrated on one feature is a diagnostic for structural architecture failure, not UX polish. Modal redesign switched from explicit Save button to auto-save — canonicalises auto-save as UX coherence for fine-grained independently-toggleable settings. Legacy "sync" parameter explicitly removed in favour of independent explicit values — canonical instance of concepts/explicit-state-over-implicit-sync: "Clarity beats cleverness. 
Removing the sync parameter and storing explicit desktop and mobile values made behavior predictable." Post-launch data: 5× settings-page engagement sustained for weeks (not one-time curiosity, active ongoing preference refinement); majority chose "Mentions and DMs" default; per-channel overrides decreased; push-toggle adoption "higher than expected". One disclosed production incident during migration: "A malformed field once reset preferences to Mentions until we cleaned data and flushed memcache" — surfaces the thin-validation + aggressive-caching + default-value-fallback compound failure shape of heavily-cached preference stores. Opens Slack's 8th coverage axis on the wiki: preference-architecture / notification engineering at scale.)
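The read-time translation described in this entry can be sketched in a few lines. This is a hedged Python illustration, not Slack's code: the function name `translate_legacy_prefs` and the dictionary representation are hypothetical, and treating a missing `desktop_push_enabled` as `True` for non-legacy rows is an assumption made only for the example.

```python
def translate_legacy_prefs(stored):
    """Return preferences in the new schema without touching the stored row."""
    prefs = dict(stored)  # copy, so the stored bytes never change
    if prefs.get("desktop") == "off":
        # Legacy "off" conflated "what" and "how"; express it in the split schema.
        prefs["desktop"] = "mentions"
        prefs["desktop_push_enabled"] = False
    else:
        # Assumption: rows that never used "off" keep push delivery enabled.
        prefs.setdefault("desktop_push_enabled", True)
    return prefs


legacy_row = {"desktop": "off"}
translated = translate_legacy_prefs(legacy_row)
```

Because the translation happens on a copy at each read site, rolling back only means removing the translation step; the stored row is still the legacy one.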
  • 2026-03-31 — sources/2026-03-31-slack-from-custom-to-open-scalable-network-probing-and-http3-readiness (Slack edge-networking team retrospective on rolling out HTTP/3 on the public edge and closing the HTTP/3 probing gap first. Existing SaaS observability vendors had zero native HTTP/3 probe support; Slack's own Prometheus Blackbox Exporter fleet ("a cornerstone of our monitoring") was TCP-shaped and couldn't speak QUIC/UDP. Intern Sebastian Feliciano scoped, implemented, and open-sourced an http3 module into BBE upstream built on systems/quic-go — selected for "wide adoption across other open source technologies, as well as the first-class support it provides in creating http clients in go." Integration snippet: http3Transport := &http3.Transport{TLSClientConfig: tlsConfig, QUICConfig: &quic.Config{}} wrapped in http.Client. Architectural discipline that earned the upstream merge: "had to add this new logic while following the Blackbox Exporter's existing architecture, ensuring the new features maintained the tool's configuration patterns." Because internship timelines ≠ OSS merge timelines, Sebastian "took matters into his own hands and architected an in- house system that utilized the new upstream features for probing out HTTP/3 endpoints" — canonical instance of patterns/upstream-contribution-parallel-to-in-house-integration. Final payoff: HTTP/1.1 + HTTP/2 + HTTP/3 metrics unified in Grafana ("single pane of glass")
plus reliable HTTP/3 alerts and easier telemetry correlation. New canonical pages (5): 1 system (systems/prometheus-blackbox-exporter), 3 concepts (concepts/client-side-black-box-probe, concepts/http-3-probing-gap, concepts/observability-before-migration), 1 pattern (patterns/upstream-contribution-parallel-to-in-house-integration). Extended: systems/quic-go (second wiki instance at upstream-tooling altitude, distinct from the PlanetScale HTTP/3 driver benchmark instance), systems/prometheus (BBE axis added to the Airbnb-observability-ingestion-dominated page), systems/grafana (single-pane-of-glass unified multi-HTTP-version view), concepts/http-3 (probing-gap framing added alongside the existing Cloudflare latency framing), concepts/observability (observability-as-migration-gate altitude added), patterns/upstream-fixes-to-community (upstream-a-whole-new-feature altitude added alongside the existing fix-at-scale altitude). Scope takeaways verbatim: "Monitor first, and migrate second. ... getting observability right as a precursor to migration makes everything faster"; "Contributing open source pays dividends"; "Bet on your interns." Opens the Slack edge-networking-and-synthetic-monitoring axis on the wiki; sixth Slack coverage axis (after developer-productivity, deploy-safety, test-framework-integration, mobile-a11y, fleet-config-management, and build-systems). Tier-2 on-scope decisively: real engineering retrospective with architecture diagram, code snippet, upstream-PR reference, specific library selection rationale, and explicit migration-gate framing. Operational numbers thin (article is about monitoring, not HTTP/3 edge perf): hundreds of thousands of HTTP/3 endpoints to probe; zero SaaS vendors supporting HTTP/3 out-of-box at investigation time; code merged to BBE at pinned commit bee8e9102a106bff63281ee9c64c7b1275ef21d0. URL verbatim: https://slack.engineering/from-custom-to-open-scalable-network-probing-and-http-3-readiness-with-prometheus/.)
  • 2025-11-06 — sources/2025-11-06-slack-build-better-software-to-build-software-better (Slack Quip/Canvas team retrospective on taking their build from 60 minutes to as low as 10 minutes (~6× speed-up) by adopting Bazel + Starlark and doing the unglamorous engineering work to benefit from them. Load-bearing insight: "Bazel's magic is contingent on the declared graph actually being a DAG of hermetic, idempotent actions" — the pre-existing build had cycles, non-hermetic action nodes, and cache keys with hundreds of parameters, giving a zero cache hit rate that no build tool could fix. Two concrete wins: severing the Python↔TypeScript dependency edge (saved ~35 min/build on its own; canonical patterns/decouple-frontend-build-from-backend-artifacts) and deleting in-process parallelization inside the frontend bundler so Bazel could schedule across bundles (canonical patterns/delete-inner-parallelization-inside-outer-orchestrator). Correctness oracle during refactor was a Rust byte-diff tool comparing old- and new-build artifacts (canonical patterns/diff-artifact-validator-for-build-refactor). Key new concepts: concepts/cache-granularity ("100 parameters, 2-3 always change" failure mode), concepts/idempotent-build-action (pre-refactor build mutated the working directory), concepts/layering-violation (frontend bundler fused business logic + orchestration + parallelization), concepts/separation-of-concerns applied across backend/frontend, Python/TypeScript, and application/build-code axes. Extends systems/bazel, concepts/hermetic-build, concepts/build-graph, concepts/cache-hit-rate with the zero-hit-rate structural-failure story. Operational numbers: 60 min → 10 min (best-case, cached+parallelised) / 12 min (average) / 30 min (cache miss). Opens the Slack build-system- engineering axis on the wiki; first Slack build-tooling ingest.)
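The "100 parameters, 2-3 always change" failure mode from this entry can be reproduced with a toy model. The sketch below is illustrative Python, not Bazel's key derivation: `cache_key`, `hit_rate`, `build_inputs`, and the parameter names are all hypothetical, and SHA-256 over sorted JSON stands in for whatever digest a real build tool uses.

```python
import hashlib
import json


def cache_key(params):
    """Hash every declared input of one action into a single key."""
    blob = json.dumps(params, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def hit_rate(previous_actions, current_actions):
    """Fraction of current actions whose key was already produced last build."""
    seen = {cache_key(p) for p in previous_actions}
    hits = sum(cache_key(p) in seen for p in current_actions)
    return hits / len(current_actions)


def build_inputs(build_number):
    """98 stable parameters plus 2 that change on every build (illustrative)."""
    params = {f"flag_{i}": i for i in range(98)}
    params["build_number"] = build_number
    params["timestamp"] = 1_700_000_000 + build_number
    return params


def as_one_action(params):
    return [params]  # one cache key covering all 100 inputs


def as_fine_actions(params):
    return [{k: v} for k, v in params.items()]  # one cache key per input


coarse = hit_rate(as_one_action(build_inputs(1)), as_one_action(build_inputs(2)))
fine = hit_rate(as_fine_actions(build_inputs(1)), as_fine_actions(build_inputs(2)))
```

With one coarse key, the two volatile inputs invalidate everything and `coarse` is 0.0; with one key per input, only the two volatile actions miss and `fine` is 0.98.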
  • 2025-11-19 — sources/2025-11-19-slack-android-vpat-journey (Slack Android team retrospective on triaging a 2024 third- party VPAT audit conducted after the IA4 UI redesign; 8 recurring accessibility themes identified (7 resolved, 1 deferred as future work). Representative-of-all-four- buckets worked example of [[patterns/vpat-driven-a11y- triage]]: shovel-ready fixes assigned immediately; recurring themes resolved at Slack Kit component- library layer (OutlinedTextField error-announcement, SKBanner error auto-announce, SKListAccessibilityDelegate overrides CollectionInfo to exclude decorative SK divider items from TalkBack's "N items in a list" count — canonical instance of patterns/accessibility-delegate-override-for-semantic-fix; workspace-switcher drag-and-drop via Edit-mode + six-dot handle + TalkBack custom actions "Move before" / "Move after" invoked by three-finger tap or L/r drawing gestures — canonical instance of patterns/custom-talkback-actions-as-gesture-alternative); platform-convention closures for top-app-bar-as-heading (WCAG 1.3.1; verified via native Google apps) and strikethrough-announcement (WCAG 1.3.1; verified via blind- community consultation) — canonical instances of concepts/wcag-platform-applicability-gap; error-icon redundancy resolving the colour-alone P3 theme — concepts/redundant-error-signalling (WCAG 1.4.1); scope-bounded deferral of WCAG 2.1.1 keyboard-nav because Slack Android has no tablet support. New canonical pages: source + 4 systems (systems/slack-android, systems/slack-kit, systems/talkback, systems/android-accessibility-framework) + 3 concepts (concepts/vpat-voluntary-product-accessibility-template, concepts/wcag-platform-applicability-gap, concepts/redundant-error-signalling) + 3 patterns (patterns/accessibility-delegate-override-for-semantic-fix, patterns/custom-talkback-actions-as-gesture-alternative, patterns/vpat-driven-a11y-triage). 
Extended: concepts/automated-vs-manual-testing-complementarity (new Seen-in: this post is the manual / third-party / periodic layer complementing the 2025-01-07 automated Axe-in-Playwright ingest). Opens the fourth axis of Slack coverage on the wiki: mobile accessibility engineering. Tier-2 borderline include on mobile-a11y-pattern-canonicalisation grounds; user explicit full-ingest override of prior batch-skip; same disposition as the 2025-01-07 and 2024-06-19 Slack developer-productivity ingests. No incident / latency / scale disclosures; pure engineering-process retrospective. URL verbatim: https://slack.engineering/android-vpat-journey/.)
  • 2025-10-23 — sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption (Archie Gunasekara's follow-up to Slack's 2024 Advancing Our Chef Infrastructure post describes phase two of Slack's EC2 / Chef deploy-safety work: instead of migrating to Chef Policyfiles (which would have required every service team to rewrite roles, environments, and cookbooks), Slack extended the existing EC2 framework in two load-bearing changes. (1) Splitting one production Chef environment into six AZ-bucketed environments (prod-1 through prod-6) rolled out via a release train with prod-1 as canary: prod-1 receives new versions every hour (hot canary); prod-2 through prod-6 advance via release train, with the next version gated on the previous version completing the train. Why the canary is parallel rather than head-of-train: "updating prod-1 frequently with the latest version allows us to detect issues closer to when they were introduced" rather than testing cumulative-change artifacts at the canary. Boot-time mapping from AZ to environment via Poptart Bootstrap (Slack's cloud-init-phase AMI tool) ensures newly provisioned nodes inherit the AZ-bucket boundary from instance 0 — the explicit fix for the scale-out-picks-up-bad-config failure mode that per-node cron staggering didn't address. (2) Replacing cron-driven Chef runs with a signal-driven pull model via a new service called Chef Summoner that watches an S3 bucket populated by the existing Chef Librarian artifact-promotion service at chef-run-triggers/<stack>/<env>. Signal payload carries Splay (randomised jitter), Timestamp, and a full ManifestRecord (artifact version, cookbook-versions map, S3 artifact pointer, upload_complete ordering flag). Summoner deduplicates against local state (last-run-time + artifact-version), applies Splay, and triggers chef-client. 
Plus a fallback cron baked into every AMI that independently triggers chef-client if Summoner hasn't run Chef in the last 12 hours — the recovery path for broken-Summoner deployments. Also enforces the 12-hour compliance SLA. Closes by marking the legacy EC2 platform feature-complete + maintenance-mode and previews a brand-new EC2 successor called Shipyard (service-level deployments + metric-driven rollouts + automated rollbacks) for teams that can't yet move to Bedrock. Canonical wiki primitives: 6 new systems (systems/chef + systems/chef-policyfiles + systems/chef-librarian + systems/chef-summoner + systems/poptart-bootstrap + systems/slack-shipyard), 6 new concepts (concepts/az-bucketed-environment-split + concepts/splay-randomised-run-jitter + concepts/signal-driven-chef-trigger + concepts/s3-signal-bucket-as-config-fanout + concepts/fallback-cron-for-self-update-safety + concepts/cookbook-artifact-versioning), 4 new patterns (patterns/split-environment-per-az-for-blast-radius + patterns/release-train-rollout-with-canary + patterns/signal-triggered-fleet-config-apply + patterns/self-update-with-independent-fallback-cron). Extends concepts/blast-radius with the fleet- configuration substrate altitude and systems/aws-s3 with the S3-as-config-fanout-bus altitude (previously canonicalised as object-store + CDC-log-store + tiered- cold-tier). Natural companion to the 2025-10-07 Deploy Safety retrospective — one level below the program-altitude framing, drilling into the EC2 / Chef substrate specifically. Tier-2 on-scope decisively: real distributed-systems internals, scaling-trade-off rationale, concrete operational disclosures (6 production environments / 12-hour compliance SLA / hourly promotions). Opens the Slack fleet-configuration-management axis.)
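The Summoner decision flow in this entry (deduplicate, apply Splay, fall back to cron after 12 hours) can be sketched as below. This is a hedged Python illustration, not Chef Summoner's code: `Signal`, `LocalState`, `plan_run`, and `fallback_due` are hypothetical names, the exact dedup rule is an assumption, and so is skipping signals whose `upload_complete` flag is false.

```python
import random
from dataclasses import dataclass
from typing import Optional

FALLBACK_WINDOW_S = 12 * 60 * 60  # the 12-hour window quoted in the entry


@dataclass
class Signal:
    splay_s: int            # upper bound for the randomised jitter, in seconds
    timestamp: float        # when the signal was written
    artifact_version: str   # taken from the ManifestRecord
    upload_complete: bool   # ordering flag from the ManifestRecord


@dataclass
class LocalState:
    last_run_time: float = 0.0
    artifact_version: Optional[str] = None


def plan_run(signal, state, rng=random):
    """Return the delay before running chef-client, or None to skip the signal."""
    if not signal.upload_complete:
        return None  # assumption: artifact not fully uploaded yet
    already_applied = (
        signal.artifact_version == state.artifact_version
        and signal.timestamp <= state.last_run_time
    )
    if already_applied:
        return None  # dedup against last-run-time + artifact-version
    return rng.uniform(0, signal.splay_s)  # spread nodes across the splay window


def fallback_due(state, now):
    """True when the independent cron should run Chef because Summoner has not."""
    return now - state.last_run_time >= FALLBACK_WINDOW_S
```

Keeping `fallback_due` independent of `plan_run` mirrors the design choice in the entry: the recovery path must still work when the signal path itself is broken.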
  • 2025-10-07 — sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change (Retrospective on Slack's 18-month Deploy Safety Program — 90% reduction in customer impact hours from change-triggered incidents (Feb-Apr 2024 peak → Jan 2025). Load-bearing framing statistic: 73% of customer-facing incidents were Slack-induced-change-triggered, particularly code deploys. Three North Star goals across all deployment systems for highest-importance services: 10-min automated MTTR / 20-min manual MTTR / detect before 10% fleet exposure. Canonical program metric: "Hours of customer impact from high-severity and selected medium-severity change-triggered incidents" — explicitly framed as "imperfect analog of customer sentiment" sitting in a three-layer chain Customer sentiment <-> Program Metric <-> Project Metric. Five-axis investment strategy: invest widely + bias for action / known-pain first / invest further based on results / curtail least impactful / flexible roadmap — explicit framing that below-expectation projects are "not failures" but "critical input." Phase-change architectural shift: "Once automatic rollbacks were introduced we observed dramatic improvement in results." Composed Webapp-backend investment sequence (Q1 metric monitoring → Q2 manual rollback via automatic alerts → Q3-Q4 automatic rollback → Q4+ ≤10 min customer impact → pattern copied to Webapp frontend → centralised deployment orchestration system inspired by [ReleaseBot] + AWS Pipelines). Trailing-metric-patience discipline with 3-6 month delivery-to-impact lag; mid-stream sub-signals. Tool fluency discipline — "Use the tooling often, not just for the infrequent worst case scenarios." Direct-per-team- outreach discipline — "Not all teams and systems are the same." Canonical wiki primitives: (1) systems/slack-deploy-safety-program (new) — the program as a named wiki system. 
(2) systems/slack-releasebot + systems/slack-bedrock (new) — the named ReleaseBot inspiration + Bedrock substrate as stub pages. (3) concepts/change-triggered-incident-rate (new) — the justifying statistic. (4) concepts/customer-impact-hours-metric (new) — the program-metric-as-sentiment-analog choice. (5) concepts/pre-10-percent-fleet-detection-goal (new) — blast-radius-cap at fleet-level. (6) concepts/trailing-metric-patience (new) — the patience discipline. (7) patterns/automated-detect-remediate-within-10-minutes (new) — the 10-min/20-min MTTR pair. (8) patterns/centralised-deployment-orchestration-across-systems (new) — the multi-substrate deploy-orchestrator pattern. (9) patterns/invest-widely-then-double-down-on-impact (new) — the five-axis strategy. Extends patterns/fast-rollback with the fully-automated altitude variant, concepts/feedback-control-loop-for-rollouts with the organisational-altitude instance, concepts/blast-radius with the fleet-percentage rollout-gate altitude, concepts/dora-metrics with the "maintain development velocity" co-equal North Star, concepts/observability with the Q1-first-investment framing. Opens the Slack reliability-engineering axis on the wiki; first Slack production-reliability ingest.)
  • 2025-01-07 — sources/2025-01-07-slack-automated-accessibility-testing-at-slack (Slack Frontend Test Frameworks team retrospective on integrating Axe Core into Slack's pre-existing Playwright E2E framework as a custom-fixture extension. Two failed integration attempts first: baking Axe into RTL's render (blocked by Slack's customised Jest setup) and baking Axe into Playwright's Locator interaction methods (blocked by Locator auto-wait semantics). Landing architecture: slack.utils.a11y.runAxeAndSaveViolations() on the pre-existing custom slack fixture, invoked explicitly at page-ready. Canonicalises five reusable patterns: fixture- extension as integration surface for cross-cutting testing concerns, two-axis exclusion list (ticketed-known-issue + out-of-scope-by- design), severity- gated reporting (critical only at launch, serious / moderate / mild as future work), tri-mode opt-in test execution (A11Y_ENABLE flag composes on-demand local + scheduled nightly Buildkite + opt-in CI gate), alert- channel-to-Jira auto-ticket workflow (Slack alert channel spins up pre-populated Jira ticket with canonical label + Epic placement). Four new concepts: concepts/wcag-2-1-a-aa-scope, concepts/automated-vs-manual-testing-complementarity, concepts/playwright-locator-auto-wait, concepts/severity-filtered-violation-reporting. Three new systems: systems/axe-core, systems/axe-core-playwright, systems/jira. 91 tests in initial suite, non-blocking, WCAG 2.1 A+AA scope, critical impact only. Borderline Tier-2 ingest — developer-productivity rather than distributed-systems internals; same disposition as the 2024-06-19 Enzyme→RTL codemod ingest; the canonicalised patterns generalise to any automated-check-integration-into- existing-test-suite problem.)
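The severity gate and the two-axis exclusion list from this entry can be sketched as a post-processing step over Axe results. This is an illustrative Python sketch, not Slack's `runAxeAndSaveViolations()`: the function names are hypothetical, the mapping of the two axes onto rule ids and selectors is an assumption, and so is the truthiness rule for `A11Y_ENABLE`. The `id`, `impact`, `nodes`, and `target` fields follow the shape of Axe's violation results.

```python
import os

# Only "critical" impact was reported at launch, per the entry above.
REPORTED_IMPACTS = {"critical"}


def a11y_enabled(env=os.environ):
    """Assumption: any value other than empty, "0", or "false" enables checks."""
    return env.get("A11Y_ENABLE", "").lower() not in ("", "0", "false")


def reportable(violations, ticketed_rule_ids, out_of_scope_selectors):
    """Apply the severity gate, then both exclusion axes."""
    kept = []
    for v in violations:
        if v["impact"] not in REPORTED_IMPACTS:
            continue  # severity gate
        if v["id"] in ticketed_rule_ids:
            continue  # axis 1: known issue that already has a ticket
        nodes = [
            n for n in v["nodes"]
            if not any(sel in out_of_scope_selectors for sel in n["target"])
        ]  # axis 2: elements excluded by design
        if nodes:
            kept.append({**v, "nodes": nodes})
    return kept
```

Filtering after the scan, rather than configuring the scan itself, keeps the full result available for later widening of the severity gate.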
  • 2024-08-26 — sources/2024-08-26-slack-unified-grid-how-we-re-architected-slack-for-our-largest-customers (Slack's 2024-08-26 retrospective on the Unified Grid project — a decade-after-launch re-architecture of the Slack client and backend's fundamental tenant-scoping assumption, shifting from workspace-centric to org-wide.)
  • 2024-06-19 — sources/2024-06-19-slack-ai-powered-conversion-from-enzyme-to-react-testing-library (Retrospective on migrating 15,000+ Enzyme tests to React Testing Library using an AST + LLM hybrid pipeline. AST-only hit ~45% ceiling; LLM-only (Claude 2.1) was 40-60% with wild variance; hybrid reached ~80% on selected evaluation files. Per-test-case DOM captured by instrumenting Enzyme mount/shallow. In-code annotation comments from AST pass shaped LLM output. Tool later open-sourced as @slack/enzyme-to-rtl-codemod. ~64% adoption across Slack's RTL migration; 338-file CI-nightly run produced ~500 auto-passing test cases (~22% documented developer-time savings, lower-bound).)
Last updated · 542 distilled / 1,571 read