Slack
Slack Engineering blog (slack.engineering). Tier-2 source on the sysdesign-wiki. Slack is a workplace-messaging platform (acquisition by Salesforce announced in 2020 and completed in 2021) with substantial engineering output across backend (Flannel, Vitess-for-Slack), mobile (cross-platform client architecture), frontend (TypeScript/React at large scale, shared-edit collaboration), and developer infrastructure (CI, test frameworks, migration tooling).
This wiki's coverage of Slack spans the following axes so far:
- Developer-productivity tooling at scale — Slack's public retrospective on using LLMs to automate a 15,000-test Enzyme → RTL migration, which canonicalised a reusable AST + LLM hybrid conversion pattern.
- Reliability engineering at scale — Slack's 18-month Deploy Safety Program (mid-2023 → Jan 2025) that reduced customer impact hours from change-triggered incidents by 90%, canonicalised in the 2025-10-07 retrospective.
- Test-framework integration at scale — Slack's 2022-launched integration of Axe Core accessibility checks into the existing Playwright E2E suite as a custom-fixture extension, canonicalising several reusable patterns (fixture-extension as integration surface, two-axis exclusion-list, severity-gated reporting, tri-mode opt-in execution, alert-to-Jira workflow).
- Mobile accessibility engineering — Slack's 2024 third-party VPAT audit of the IA4-redesigned Android client; 8 recurring themes (7 resolved, 1 deferred) with fixes concentrated at Slack Kit component-library layer. Canonicalised the patterns/accessibility-delegate-override-for-semantic-fix pattern (Slack's new `SKListAccessibilityDelegate` fixing `CollectionInfo` for decorative dividers), the patterns/custom-talkback-actions-as-gesture-alternative pattern (workspace-switcher drag-reorder via TalkBack "Move before" / "Move after" actions + six-dot drag handle + Edit mode), and the patterns/vpat-driven-a11y-triage four-bucket workflow. Complements the 2025-01-07 automated-Axe ingest as the manual / third-party / periodic layer of the same broader a11y strategy (see concepts/automated-vs-manual-testing-complementarity).
- Fleet-configuration-management at scale — Slack's 2025-10-23 Chef phase-2 post canonicalises the AZ-bucketed environment split, the signal-driven fleet-config-apply pipeline (Chef Librarian → S3 → Chef Summoner), the release-train-with-canary rollout, and the self-update-with-independent-fallback-cron pattern — one-level-below the Deploy Safety Program's altitude, in the EC2 / Chef substrate specifically.
- Build-system engineering at scale — Slack Quip/Canvas team's 60 min → 10 min (~6×) build speed-up via Bazel + Starlark adoption, canonicalised in the 2025-11-06 retrospective. The load-bearing insight: Bazel gives nothing to a build whose graph has cycles / non-hermetic actions / coarse cache keys; the real wins came from applying classical separation-of-concerns and layering principles to build code itself. Canonicalised three new patterns — patterns/decouple-frontend-build-from-backend-artifacts, patterns/delete-inner-parallelization-inside-outer-orchestrator, patterns/diff-artifact-validator-for-build-refactor — plus the concepts/cache-granularity and concepts/idempotent-build-action concepts.
- Edge-networking and synthetic monitoring — Slack's 2026-03-31 retrospective on rolling out HTTP/3 on the public edge. Closed the HTTP/3 probing gap — neither SaaS observability tools nor Slack's Prometheus Blackbox Exporter fleet ("a cornerstone of our monitoring") natively spoke QUIC/UDP before the intern project. Intern Sebastian Feliciano scoped, implemented, and open-sourced QUIC support into Prometheus Blackbox Exporter upstream using systems/quic-go as the client library, then built an in-house integration on the same branch so Slack could ship HTTP/3 probing to production before upstream merge — canonical instance of patterns/upstream-contribution-parallel-to-in-house-integration. Extends the patterns/upstream-fixes-to-community pattern with a new altitude: upstream-a-whole-new-feature, distinct from the Shopify × Reanimated fix-existing-feature-at-scale altitude. Canonicalised Slack's explicit "monitor first, migrate second" takeaway as the concepts/observability-before-migration concept. Final payoff: HTTP/1.1 + HTTP/2 + HTTP/3 metrics in one Grafana "single pane of glass".
- Security-engineering + AI-agent operations — Slack Security Engineering team's 2025-12-01 retrospective on Spear, their multi-agent security-investigation service that triages detection-system alerts during on-call shifts. It is the first post in a promised series. Canonical first wiki instance of the patterns/director-expert-critic-investigation-loop pattern (three-persona agent team: Director plans / progresses phases / writes final report, four Experts — Access / Cloud / Code / Threat — produce domain-specific findings, Critic audits + condenses). Canonical first wiki instance of the concepts/knowledge-pyramid-model-tiering concept (Experts on cheap models, Critic on mid-tier, Director on top-tier). Canonical first wiki instance of the patterns/hub-worker-dashboard-agent-service pattern (Hub + Worker + Dashboard productisation shape). Load-bearing architectural lesson verbatim: "prompts are just guidelines; they're not an effective method for achieving fine-grained control" — canonical statement of concepts/prompt-is-not-control. Slack's canonical emergent-behaviour worked example: Critic caught a credential exposure the Expert missed, Director pivoted the investigation. Canonicalised four concepts (concepts/knowledge-pyramid-model-tiering, concepts/investigation-phase-progression, concepts/weakly-adversarial-critic, concepts/prompt-is-not-control) + four patterns (patterns/director-expert-critic-investigation-loop, patterns/one-model-invocation-per-task, patterns/hub-worker-dashboard-agent-service, patterns/phase-gated-investigation-progression) + one system (systems/slack-spear).
- Preference-architecture / notification engineering at scale — Slack's 2026-03-19 retrospective on the Notifications 2.0 rebuild that migrated millions of users from four conflicting preference models (desktop / mobile × content-selection / delivery-channel conflated into one enum each) to a single unified hierarchy (All new messages / Mentions and DMs / Mute × independent desktop + mobile push toggles × explicit Advanced override dimension) — without a database-level backfill. The load-bearing architectural move is patterns/read-time-schema-translation: a new `desktop_push_enabled` boolean was added and backfilled from the legacy "off" value; at every read site, legacy `desktop: 'off'` is translated to `desktop: 'mentions'` + `desktop_push_enabled: false` so behaviour is byte-identical but expressible in the new decoupled schema. Rollback is automatic because storage bytes never changed. Canonicalises three patterns (patterns/read-time-schema-translation, patterns/decouple-what-from-how-in-preferences, patterns/unified-preference-model-for-cross-client-state) + seven concepts (concepts/read-time-preference-translation, concepts/preference-schema-decoupling, concepts/cross-platform-preference-parity, concepts/mental-model-preference-coherence, concepts/support-burden-as-architecture-signal, concepts/explicit-state-over-implicit-sync, concepts/auto-save-modal-ux-coherence) + one system (systems/slack-notifications-2-0). Notifications were "one of the top three drivers of Customer Experience tickets" pre-project — support burden as architecture signal canonicalised. Post-launch: 5× increase in settings engagement sustained for weeks; majority chose "Mentions and DMs" default; fewer per-channel overrides needed. One disclosed data-integrity incident: "A malformed field once reset preferences to Mentions until we cleaned data and flushed memcache" — surfaces the thin-validation + aggressive-caching + default-fallback compound failure shape of heavily-cached preference stores.
- Data-platform security & substrate modernisation — Slack's 2026-05-05 retrospective on eliminating SSH entirely from its EMR data pipelines. 700+ production jobs across 8 independent data regions, 7 operator types, 5 teams, 3 quarters, zero downtime. The architectural unblocker was a single REST gateway — Quarry — that fronts YARN / Trino / Snowflake from one auth surface and one log schema. The breakthrough enabler for shell-class jobs was discovering that YARN Distributed Shell was already a REST-submittable shell runner that could replace 300+ CLI-based SSH jobs without a custom remote-execution service. SSH's costs surfaced as structural blockers — the article verbatim: "We were blocked: Couldn't start the path for Spark on Kubernetes nor EMR on EKS […] Couldn't complete our Whitecastle initiative because we needed to move the last main-account EMR clusters to child accounts." Two named challenges illuminate the latent failure modes SSH was hiding: vmem-check failures (jobs that ran fine via SSH had been silently exceeding YARN's resource limits — fix: AWS-recommended `yarn.nodemanager.vmem-check-enabled: false`) and an EKM connectivity timeout during cross-cluster migration that "surfaced a hidden dependency on network topology that wasn't captured in the job's configuration." The disciplined rollout used patterns/incremental-operator-by-operator-migration (`CrunchExecOperator`, `S3SyncOperator`, etc., one at a time as mini-projects) over 5 phases (POC → Security Review → OKR-Driven Execution → Bulk Migration → Final Cleanup) tracked via an analytics dashboard backed by Airflow metadata-DB queries — "Build monitoring before you migrate." Canonicalises three patterns (patterns/rest-gateway-for-compute-engine-job-submission, patterns/yarn-distributed-shell-as-universal-shell-executor, patterns/incremental-operator-by-operator-migration), four concepts (concepts/rest-based-job-submission, concepts/ssh-job-execution-anti-pattern, concepts/resource-enforcement-bypass-via-ssh, concepts/master-node-resource-contention), and three systems (systems/slack-quarry, systems/yarn-distributed-shell, systems/apache-yarn). Extends concepts/observability-before-migration (the project-progress altitude — Airflow-DB-backed dashboard tracks remaining SSH jobs per region per team — sibling to the prior HTTP/3 transport-probe altitude), concepts/attack-surface-minimization (the structural-blocker framing of "can't modernise anything until SSH dies"), concepts/long-lived-key-risk (eliminates the long-lived-SSH-key class at industrial scale via service-token replacement), and concepts/audit-trail (Quarry's per-submission structured logs are the new audit substrate — "No more 'who ran that command?' mysteries").
- Multi-cloud LLM serving (Slack AI infrastructure) — Slack's 2026-05-28 retrospective on the three-year evolution of the Slack AI LLM serving substrate from single-region AWS SageMaker (early 2023, with escrow VPC for Anthropic) → fully managed Amazon Bedrock Provisioned Throughput (mid-2024, zero-incident migration — canonical patterns/zero-incident-llm-migration) → Hybrid PT+OD with spillover (mid-2025, canonical patterns/provisioned-throughput-with-on-demand-spillover) → multi-cloud Bedrock + GCP Vertex AI (early 2026). The architectural endpoint is Slack's Intelligent Routing Layer — five subsystems hidden behind a unified internal API: metric-driven model selection (per-feature primary + designated backup), A/B experimental routing (feedback loop "weeks → days"), automated circuit breaker with partial-open recovery state (TTFT + p90 latency + 5xx error rate + customer-feedback trends as triggers), API normalisation layer, secretless cross-cloud authentication. Disclosed Phase 4 outcomes: ~10% quality lift on complex reasoning + ~67% latency reduction on high-velocity / low-token workloads via per-feature model-to-feature binding. Three-year arc canonicalises four structural failure modes that drove the evolution: model feature lag (Phase 1 → 2), over-provisioning cycle + commitment lock-in (Phase 2 → 3), concentration risk on single-cloud LLM serving (Phase 3 → 4). Slack adopts the Model Unit (MU) primitive on the customer side (sibling to Databricks' platform-side coining of the same primitive). Five reflections verbatim: "Scaling safely requires XFN parity / The abstraction layer is a core requirement / Treat architecture as a living document / Reliability requires provider agnosticism / Redefining the meaning of 'Failure'". Canonicalises 3 systems (systems/slack-ai, systems/slack-intelligent-routing-layer, systems/gcp-vertex-ai), 10 concepts (concepts/multi-cloud-llm-serving, concepts/escrow-vpc-llm-serving, concepts/llm-model-feature-lag, concepts/provisioned-throughput-vs-on-demand-llm, concepts/llm-over-provisioning-cycle, concepts/llm-provider-commitment-lock-in, concepts/api-normalization-multi-cloud-llm, concepts/model-to-feature-binding, concepts/concentration-risk-single-cloud-llm, concepts/automated-circuit-breaker-with-partial-open-state), and 5 patterns (patterns/multi-cloud-llm-serving, patterns/provisioned-throughput-with-on-demand-spillover, patterns/api-normalization-layer-cross-provider, patterns/model-fallback-hierarchy-with-circuit-breaker, patterns/zero-incident-llm-migration). Extends systems/aws-sagemaker-ai (Phase 1 escrow-VPC face), systems/amazon-bedrock (Phase 2/3 substrate face + PT+OD primitive disclosure + Slack's customer-side MU adoption), concepts/model-units (customer-side adoption sibling to Databricks platform-side coining), patterns/circuit-breaker (partial-open ramp refinement). Tenth Slack coverage axis on the wiki — structurally one altitude above all prior Slack AI work (the developer-productivity LLM-codemod axis from 2024-06-19, the security-investigation Spear axis from 2025-12-01, the agent-context-engineering axis from 2026-04-13) — this is the infrastructure substrate on which all consumer-facing Slack AI features ride.
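The read-time schema translation described in the Notifications 2.0 entry above can be sketched as follows. This is a minimal illustration, not Slack's code: the `DesktopPrefs` type, the `translate_on_read` function, and the dictionary-shaped stored row are all hypothetical; only the `desktop: 'off'` to `desktop: 'mentions'` + `desktop_push_enabled: false` mapping comes from the entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesktopPrefs:
    desktop: str                # WHAT to notify about (new decoupled schema)
    desktop_push_enabled: bool  # HOW it is delivered


def translate_on_read(stored: dict) -> DesktopPrefs:
    """Translate a stored row into the new schema without rewriting storage.

    Assumption: `stored` is a plain dict holding the legacy `desktop` enum and,
    for rows written under the new schema, an explicit `desktop_push_enabled`.
    """
    legacy = stored["desktop"]
    if legacy == "off":
        # Legacy 'off' conflated "what" and "how"; split it at read time.
        return DesktopPrefs(desktop="mentions", desktop_push_enabled=False)
    # Other values pass through; push defaults to on unless stored explicitly.
    return DesktopPrefs(
        desktop=legacy,
        desktop_push_enabled=bool(stored.get("desktop_push_enabled", True)),
    )


row = {"desktop": "off"}
prefs = translate_on_read(row)
print(prefs)  # the stored row itself is left untouched
```

Because the function only reads `stored`, turning the translation off leaves the original bytes in place, which is the property the entry credits for automatic rollback.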
Key systems
- systems/slack-ai — the Slack AI feature suite (channel summaries, Recap, AI Search, related generative / extractive surfaces). Production substrate at "millions of users" scale by early 2026. Built starting early 2023 on AWS SageMaker; migrated to Amazon Bedrock mid-2024 with zero customer-facing incidents; expanded to multi-cloud Bedrock + GCP Vertex AI in early 2026. FedRAMP Moderate maintained across all phases. Sits on top of Slack's internal Kubernetes substrate (Slack Bedrock — not to be confused with Amazon Bedrock, the LLM service).
- systems/slack-intelligent-routing-layer — Slack AI's internal LLM abstraction layer that fronts every model provider and exposes a unified internal contract. Five subsystems: metric-driven model selection (per-feature primary + designated backup), A/B experimental routing (feedback loop weeks → days), automated circuit breaker with partial-open recovery state (TTFT + p90 latency + 5xx error rate + customer-feedback trends as triggers), API normalisation layer (unifies error codes / rate-limits / telemetry / auth across providers), secretless cross-cloud authentication. Architectural endpoint of Slack's three-year Slack AI evolution. Production substrate behind ~10% quality lift on complex reasoning + ~67% latency reduction on high-velocity / low-token workloads.
- systems/gcp-vertex-ai — Google Cloud's managed LLM platform; the second cloud Slack added alongside AWS Bedrock in early 2026 to break single-cloud concentration risk and access vendor-exclusive frontier models. Wiki's first canonical disclosure of Vertex AI as an enterprise multi-cloud LLM endpoint.
- systems/amazon-bedrock — AWS's managed LLM platform; Slack's Phase 2 (mid-2024) substrate after migrating from self-managed SageMaker. PT for latency-sensitive features (channel summaries, AI Search), OD for bursty workloads (nightly Recap), spillover from PT to OD when reserved capacity ceiling is hit. Capacity primitive: Model Units (Slack adopts MUs on the customer side; sibling to Databricks' platform-side coining).
- systems/aws-sagemaker-ai — AWS's managed ML platform; Slack's Phase 1 (early 2023) substrate. Hosted Anthropic models inside an escrow VPC establishing zero-knowledge between Slack and the provider. Multi-region with cross-region IAM, ODCR + cron scaling. Operational taxes (scaling latency, GPU scarcity, over-provisioning, model feature lag) drove Phase 2 migration to Bedrock.
- systems/slack-deploy-safety-program — the 18-month reliability program; 90% reduction in customer impact hours; 10-min automated MTTR / 20-min manual MTTR / detect-before-10%-fleet North Stars; "imperfect analog of customer sentiment" program metric; exec-sponsored; OKR-weighted.
- systems/slack-releasebot — Slack's 2018-era metrics-based-deploy + automatic-rollback orchestrator for Webapp backend; the blueprint the 2023-2025 centralised orchestration system generalises across substrates.
- systems/slack-bedrock — Slack's internal compute platform over Kubernetes; the substrate on which the first fully-automated metrics-based-deploy-with-auto-rollback regime was built.
- systems/slack-spear — Slack's multi-agent security-investigation service that triages detection-system alerts during on-call shifts. Three-persona agent team (Director / 4 Experts / Critic) running across three phases (discovery / trace / conclude) on a Hub + Worker + Dashboard service architecture. Name inferred from image-asset URL slugs; post uses "our service".
- systems/slack-notifications-2-0 — Slack's rebuilt notification-preference system; migrated millions of users from four per-client preference enums (desktop / mobile × content-selection / delivery-channel conflated) to one unified hierarchy (All new messages / Mentions and DMs / Mute × independent desktop + mobile push toggles × Advanced override dimension) via read-time schema translation — no database backfill, byte-identical rollback. Post-launch: 5× settings engagement, majority picked "Mentions and DMs" default, fewer per-channel overrides. One disclosed data-integrity incident (malformed field + memcache served default values → silent preferences-reset).
- systems/slack-shipyard — Slack's upcoming successor to the legacy Chef-based EC2 platform. Preview-only in the 2025-10-23 post; service-level deployments, metric-driven rollouts, fully-automated rollbacks; soft launch Q4 2025 with two pilot teams. For teams that can't yet move to Bedrock.
- systems/slack-quarry — Slack's REST-based job-submission gateway between Airflow and multiple compute engines (YARN on EMR, Trino, Snowflake). Replaced 700+ SSH-based jobs across 8 data regions over 3 quarters with zero downtime — service-to-service tokens at the gateway edge, server-side job-state tracking surviving Kubernetes pod restarts, structured per-submission logs as the audit substrate. The breakthrough that made it viable for all job types was discovering YARN Distributed Shell was already a REST-submittable shell runner. Canonical wiki instance of patterns/rest-gateway-for-compute-engine-job-submission.
- systems/yarn-distributed-shell — the YARN ApplicationMaster that runs arbitrary shell scripts in YARN containers via the standard YARN REST API. "A little-known feature […] already part of YARN, used the same REST APIs, and required no custom security layer." The breakthrough enabler for Quarry's universal job-submission promise.
- systems/apache-yarn — the resource manager Quarry/Slack submits to; surfaced the latent resource-violations that SSH-to-master-node had been silently bypassing.
- systems/chef / systems/chef-librarian / systems/chef-summoner / systems/poptart-bootstrap / systems/chef-policyfiles — Slack's legacy EC2 fleet-configuration substrate and its phase-2 components (the Policyfiles alternative was explicitly rejected on blast-radius-of-change grounds).
- systems/enzyme-to-rtl-codemod — Slack's open-sourced (`@slack/enzyme-to-rtl-codemod`) AI-powered test-conversion pipeline: AST codemod handles deterministic Enzyme→RTL rewrites + writes in-code annotation comments for hard cases, rendered DOM is captured per-test-case by instrumenting Enzyme's `mount` / `shallow`, and an LLM (Anthropic Claude 2.1 at time of post) consumes the annotated file + DOM + structured prompt to finish the conversion. ~80% quality on evaluation files; ~64% adoption across Slack's RTL migration.
- systems/enzyme — testing framework Slack was migrating away from (no React 18 adapter).
- systems/react-testing-library — target framework.
- systems/claude-2-1 — LLM backend used in the original pipeline (via Slack's internal DevXP AI team integration).
- systems/jest — underlying test runner.
- systems/playwright — Slack's E2E framework. Used as the integration substrate for Axe accessibility checks via the custom-fixture extension pattern.
- systems/axe-core / systems/axe-core-playwright — Deque Systems' accessibility rule engine and its Playwright binding. Slack's 2022-launched integration into the Playwright E2E suite.
- systems/buildkite — Slack's CI orchestrator; hosts the nightly a11y regression run (one leg of Slack's tri-mode opt-in execution pattern).
- systems/jira — Slack's ticket-tracking tool; receives auto-created tickets from the Slack a11y alert-channel workflow (patterns/alert-channel-to-jira-auto-ticket-workflow).
- systems/slack-android — Slack's native Android client; anchors the 2024 VPAT retrospective's mobile-a11y disclosures. Phone-only (no large-form-factor support yet).
- systems/slack-kit — Slack's shared mobile component library (SK). Components surfaced on the wiki so far: `OutlinedTextField`, `SKBanner`, `SKList` / `SKListAdapter`, SK divider, and the `SKListAccessibilityDelegate` introduced by the VPAT resolution.
- systems/talkback — Android's built-in screen reader; the primary assistive-tech target for Slack Android's a11y work.
- systems/android-accessibility-framework — Android platform a11y APIs (`AccessibilityDelegate`, `AccessibilityNodeInfoCompat`, `CollectionInfo`, custom-action API) that the Slack Kit fixes plug into.
- systems/prometheus-blackbox-exporter — "a cornerstone of our monitoring" at Slack; canonical client-side black-box probing substrate for edge endpoints. Slack intern Sebastian Feliciano open-sourced the `http3` module into this upstream project to close the HTTP/3 probing gap, built on systems/quic-go.
- systems/quic-go — Go QUIC library Slack's BBE `http3` contribution is built on; "wide adoption across other open source technologies, as well as the first-class support it provides in creating http clients in go" was the selection rationale.
- systems/prometheus — the TSDB backing Slack's edge-probing metrics pipeline.
- systems/grafana — Slack's "single pane of glass" dashboarding layer; canonically unifies HTTP/1.1 + HTTP/2 + HTTP/3 probe metrics side-by-side post-BBE-contribution.
- systems/enterprise-grid / systems/unified-grid / systems/slack-rtm — Slack's tenancy substrate before / after / pre-requisite for the 2024 Unified Grid re-architecture.
- systems/slack-quip / systems/slack-canvas — the collaborative-document + in-Slack canvas surfaces whose shared Python-backend + TypeScript-frontend build pipeline is the subject of Slack's 2025-11-06 build-system retrospective (60 min → 10 min).
- systems/bazel / systems/starlark — the build system and its constrained DSL adopted by the Quip/Canvas team; central to the 60 min → 10 min story. Slack's framing: "Bazel's magic is contingent on the declared graph actually being a DAG of hermetic, idempotent actions" — adopting Bazel alone achieves nothing; the engineering work to meet its preconditions is where the speed-up lives.
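The "automated circuit breaker with partial-open recovery state" listed under systems/slack-intelligent-routing-layer can be pictured with a small state machine. The sketch below is an assumption-laden illustration rather than Slack's implementation: the class name, the 5% error threshold, the ramp steps, and the single `error_rate` health signal are hypothetical (the entry names TTFT, p90 latency, 5xx rate, and customer-feedback trends as the real triggers), and it assumes health windows for the primary keep arriving while the breaker is open, for example from probes.

```python
from dataclasses import dataclass


@dataclass
class PartialOpenBreaker:
    primary: str
    backup: str
    error_threshold: float = 0.05    # hypothetical trip point for the 5xx rate
    ramp_steps: tuple = (0.1, 0.5)   # hypothetical primary traffic share while recovering
    state: str = "closed"            # "closed" | "open" | "partial_open"
    step: int = 0

    def primary_share(self) -> float:
        """Fraction of requests that should go to the primary model."""
        if self.state == "closed":
            return 1.0
        if self.state == "open":
            return 0.0
        return self.ramp_steps[self.step]

    def choose(self, roll: float) -> str:
        """Pick a model; `roll` is a uniform sample in [0, 1) from the caller."""
        return self.primary if roll < self.primary_share() else self.backup

    def observe(self, error_rate: float) -> None:
        """Feed one health window measured against the primary."""
        healthy = error_rate < self.error_threshold
        if self.state == "closed":
            if not healthy:
                self.state = "open"          # trip: all traffic to the backup
        elif self.state == "open":
            if healthy:
                self.state, self.step = "partial_open", 0
        elif not healthy:
            self.state = "open"              # relapse during recovery
        else:
            self.step += 1                   # ramp the primary back up
            if self.step == len(self.ramp_steps):
                self.state, self.step = "closed", 0


breaker = PartialOpenBreaker(primary="model-a", backup="model-b")
breaker.observe(0.20)   # unhealthy window trips the breaker
breaker.observe(0.01)   # healthy window starts partial-open recovery
print(breaker.state, breaker.primary_share())
```

Passing `roll` in from the caller keeps the routing decision deterministic and easy to test; a production router would draw it from a random source per request.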
Key patterns / concepts
- patterns/automated-detect-remediate-within-10-minutes — Slack's canonical 10-min-auto / 20-min-manual MTTR pair as deployment-safety North Star; paired with the customer 10-minute-disruption threshold.
- patterns/centralised-deployment-orchestration-across-systems — the architectural generalisation of Slack ReleaseBot's Webapp-backend-only automation to Webapp frontend, infra, EC2, Terraform.
- patterns/invest-widely-then-double-down-on-impact — the five-axis investment strategy for data-scarce trailing-metric reliability programs; explicit framing that below-expectation projects are "not failures".
- concepts/customer-impact-hours-metric — the program-metric choice as "imperfect analog of customer sentiment"; the three-layer sentiment↔program-metric↔project-metric chain.
- concepts/change-triggered-incident-rate — 73% of customer-facing incidents were change-triggered at program start; the statistic that justified the program.
- concepts/pre-10-percent-fleet-detection-goal — canonical "detect before 10% of fleet" blast-radius-cap goal.
- concepts/trailing-metric-patience — 3-6 month lag from project delivery to impact visibility; mid-stream sub-signals discipline.
- patterns/split-environment-per-az-for-blast-radius — Slack's 2025-10-23 Chef phase-2 pattern: split one `prod` Chef environment into six AZ-bucketed environments (`prod-1` … `prod-6`) with boot-time mapping via Poptart Bootstrap. Bounds the scale-out-picks-up-bad-config failure mode.
- patterns/release-train-rollout-with-canary — Slack's `prod-1` hourly canary + `prod-2` → `prod-6` strictly sequential release train; canary is parallel to the train, not at its head, so it tests incremental diffs rather than cumulative ones.
- patterns/signal-triggered-fleet-config-apply — Slack's Chef Librarian → S3 → Chef Summoner pipeline that replaced fixed-cron Chef runs; canonical first wiki instance of S3 as a config-management signal bus.
- patterns/self-update-with-independent-fallback-cron — Slack's solution to the self-update paradox: Chef Summoner updates itself via Chef runs, and a 12-hour fallback cron baked into every AMI triggers Chef runs independently if Summoner hasn't in the last 12 hours.
- concepts/az-bucketed-environment-split — the concept behind the AZ-bucketed pattern.
- concepts/signal-driven-chef-trigger — the concept behind the signal-triggered pattern.
- concepts/s3-signal-bucket-as-config-fanout — the concept behind S3 as a config-fanout substrate.
- concepts/splay-randomised-run-jitter — thundering-herd mitigation for signal-driven fleet-config runs.
- concepts/fallback-cron-for-self-update-safety — the concept behind the fallback-cron pattern.
- concepts/cookbook-artifact-versioning — the Chef-ecosystem rollout unit.
- patterns/ast-plus-llm-hybrid-conversion — compose an AST pre-pass with an LLM for deterministic code transformation; Slack lifted conversion quality from 40-60% (pure LLM) to ~80% (hybrid).
- patterns/in-code-annotation-as-llm-guidance — AST pass writes comments next to hard call sites to shape LLM attention, instead of stuffing everything into the system prompt. Slack: "we successfully minimized hallucinations and nonsensical conversions from the LLM".
- concepts/abstract-syntax-tree — canonical role both as conversion primitive and as hallucination-control primitive for LLM pipelines.
- concepts/llm-hallucination — Slack's named failure mode for pure-prompt-based code conversion.
- concepts/llm-conversion-hallucination-control — the structural problem class this post articulates.
- concepts/dom-context-injection-for-llm — capturing per-test-case rendered DOM via Enzyme render-method instrumentation, injecting into the LLM prompt.
- patterns/a11y-checks-via-playwright-fixture-extension — canonical integration surface for adding cross-cutting testing concerns (a11y, perf, visual regression) to an existing Playwright suite via the custom-fixture model. Slack's 2022-launched Axe integration is the canonical instance.
- patterns/exclusion-list-for-known-issues-and-out-of-scope-rules — two-axis exclusion list (known-issue-ticketed + out-of-scope-by-design) applied as pre-audit filter so automation signal stays high.
- patterns/severity-gated-violation-reporting — report only highest-severity bucket at new-system launch; defer lower severities to explicit future work. Canonical anti-alert-fatigue rollout lever.
- patterns/tri-mode-opt-in-test-execution — single default-off environment flag composing three execution modes (on-demand local + scheduled nightly Buildkite + opt-in CI gate) for new test classes.
- patterns/alert-channel-to-jira-auto-ticket-workflow — violation output terminates in pre-populated Jira ticket rather than free-form alert; the alert channel becomes the ticket-creation UI.
- concepts/wcag-2-1-a-aa-scope — WCAG 2.1 A+AA as the conventional scope-picker for automated a11y; expressed in Axe via `wcag2a` / `wcag2aa` / `wcag21a` / `wcag21aa` tag set.
- concepts/automated-vs-manual-testing-complementarity — automated testing is a layer in a broader strategy, not a substitute for manual testing (especially in a11y where screen-reader UX requires human judgment).
- concepts/playwright-locator-auto-wait — Playwright's Locator auto-wait guarantees element-level readiness but not whole-page readiness; this is why whole-page audits (a11y, visual regression, perf snapshots) can't be embedded into Locator interaction methods.
- concepts/severity-filtered-violation-reporting — the concept behind patterns/severity-gated-violation-reporting; narrow-filter-at-launch + widen-as-calibrated.
- patterns/accessibility-delegate-override-for-semantic-fix — the Android pattern of overriding `AccessibilityDelegate` to correct framework-semantic defaults at the component-library layer; Slack's `SKListAccessibilityDelegate` fixing decorative-divider list-count is the canonical instance.
- patterns/custom-talkback-actions-as-gesture-alternative — every gesture-only interaction (drag-and-drop, swipe-to-dismiss, etc.) gets a custom TalkBack action as its accessibility alternative; Slack's workspace-switcher "Move before" / "Move after" actions paired with Edit-mode six-dot handle is the canonical instance.
- patterns/vpat-driven-a11y-triage — the end-to-end VPAT-audit-to-remediation workflow with four resolution buckets (shovel-ready / component-library theme / platform-convention closure / scope-bounded deferral).
- concepts/vpat-voluntary-product-accessibility-template — the third-party accessibility audit artifact; procurement-facing, engineering-backlog-input.
- concepts/wcag-platform-applicability-gap — WCAG is web-centric; native-platform a11y conventions sometimes legitimately diverge (Slack's top-app-bar-as-heading and strikethrough-announcement closures are the canonical examples).
- concepts/redundant-error-signalling — the WCAG 1.4.1 discipline of never-colour-alone; pair colour with icon, text, and screen-reader announcement.
- concepts/workspace-scoped-to-org-wide-migration — the Unified Grid re-architecture primitive.
- patterns/prototype-the-path — Slack's named methodology for dogfood-driven incremental re-architecture.
- patterns/decouple-frontend-build-from-backend-artifacts — Slack Quip/Canvas's ~35-min-per-build win by cutting the Python-backend → TypeScript-frontend cache-key edge.
- patterns/delete-inner-parallelization-inside-outer-orchestrator — Slack's frontend-bundler refactor: delete in-process worker pool so Bazel can parallelise at bundle granularity across workers.
- patterns/diff-artifact-validator-for-build-refactor — Slack's Rust-based byte-diff harness that served as the correctness oracle during the Bazel migration.
- patterns/upstream-fixes-to-community — Slack's HTTP/3/QUIC support into Prometheus Blackbox Exporter canonicalises the upstream-a-whole-new-feature altitude (distinct from the Shopify × Reanimated fix-existing-feature-at-scale altitude).
- patterns/upstream-contribution-parallel-to-in-house-integration — Slack's in-house BBE integration running in parallel with the upstream PR so the intern-timeline HTTP/3 rollout wasn't gated on maintainer merge.
- concepts/cache-granularity — the "100 parameters, 2-3 always change" failure mode of coarse cache keys that Slack canonicalised at build-system altitude.
- concepts/idempotent-build-action — the precondition Slack's pre-refactor build violated (build steps mutated the working directory).
- concepts/separation-of-concerns — the classical principle Slack applied to build code, release code, and setup code — not just application code.
- concepts/layering-violation — the structural diagnosis for Slack's old frontend bundler (business logic fused with orchestration + parallelization).
- concepts/http-3-probing-gap — the first-order observability failure Slack's HTTP/3 edge rollout surfaced: TCP-shaped black-box probers cannot see QUIC/UDP traffic. Canonical wiki instance.
- concepts/client-side-black-box-probe — the monitoring primitive the HTTP/3 probing gap applies to; Slack's BBE fleet is the canonical implementation substrate.
- concepts/observability-before-migration — Slack's explicit "monitor first, migrate second" takeaway; new wiki concept that generalises the discipline across transport / protocol / platform migrations.
- patterns/director-expert-critic-investigation-loop — Slack Spear's three-persona agent team for security-investigation triage; Director plans / Experts produce findings / Critic audits + condenses. Canonical first wiki instance.
- patterns/one-model-invocation-per-task — Slack's decomposition of the 300-word single-prompt prototype into per-task invocations with per-task structured-output schemas.
- patterns/hub-worker-dashboard-agent-service — Slack's three-component productisation shape for Spear, replacing the coding-agent-CLI-as-harness prototype.
- patterns/phase-gated-investigation-progression — Slack's explicit phase-gated loop (discovery → trace → conclude + Director-Decision meta-phase); each phase has its own model parameters + token budget.
- concepts/knowledge-pyramid-model-tiering — Slack's explicit three-tier cost/capability gradient: cheap Experts / mid-tier Critic / top-tier Director. Canonical first wiki instance.
- concepts/investigation-phase-progression — the concept behind the phase-gated pattern; phase as application state, not prompt state.
- concepts/weakly-adversarial-critic — Slack's named stance for the Critic/Expert relationship. Catches hallucination without degenerating into paranoia.
- concepts/prompt-is-not-control — Slack's verbatim architectural lesson: "prompts are just guidelines; they're not an effective method for achieving fine-grained control."
- patterns/three-channel-context-architecture — Slack Spear's load-bearing 2026-04-13 contribution: three complementary channels (Director's Journal + Critic's Review + Critic's Timeline) replace raw message history between agent invocations; each channel is consumed by different agents with different model tiers.
- patterns/critic-tool-call-introspection-suite — Slack's four-tool suite giving the Critic methodology-audit capability: `get_tool_call`, `get_tool_result`, `get_toolset_info`, `list_toolsets`. Closes the claims-auditor → methodology-auditor gap.
- patterns/timeline-assembly-from-scored-findings — the Critic's second task, run after the Review: assemble credible findings into a chronological narrative with four consolidation rules + 5-level narrative-coherence rubric + top-3 gap identification.
- concepts/structured-journaling-tool — the Director's six-typed-entry working-memory tool (decision / observation / finding / question / action / hypothesis + priority + follow-ups + citations + auto-annotated phase/round/timestamp).
- concepts/credibility-scoring-rubric — Slack's 5-level credibility rubric (Trustworthy 0.9-1.0 / Highly-plausible 0.7-0.89 / Plausible 0.5-0.69 / Speculative 0.3-0.49 / Misguided 0.0-0.29) with disclosed distribution over 170,000 findings: 37.7% / 25.4% / 11.1% / 10.4% / 15.4%. Largest empirical credibility-scoring disclosure on the wiki.
- concepts/narrative-coherence-as-hallucination-filter — Slack's canonical framing of the Timeline task's second hallucination filter: "A hallucination can only survive this process if it is more coherent with the body of evidence than any real observation it competes with."
- concepts/gap-identification-top-n — Slack's deliberate-scarcity design choice: Timeline identifies top-3 gaps across three categories (evidential / temporal / logical). Hard cap prevents reader fatigue + forces Critic triage.
- concepts/no-message-history-carry-forward — Slack's verbatim architectural discipline: "Besides these resources, we do not pass any message history forward between agent invocations." Justified on two grounds: token-budget pressure AND cognitive-load-management (over-sharing "stifles creativity and encourages confirmation bias" even with infinite context).
- concepts/online-context-summarisation — the round-by-round summarisation pattern that replaces raw message history; Timeline task is a fold over `(prev_timeline, latest_review, journal)`.
- patterns/read-time-schema-translation — Slack Notifications 2.0's load-bearing migration mechanism: legacy `desktop: 'off'` is translated at every read site to `desktop: 'mentions'` + `desktop_push_enabled: false`; stored bytes never change; rollback is byte-identical. Canonical alternative to database-level backfill when rollback safety dominates.
- patterns/decouple-what-from-how-in-preferences — the schema-design pattern; split conflated content-selection + delivery-channel enums into orthogonal axes. Slack's legacy `desktop: everything | mentions | nothing` became `desktop: everything | mentions` (what) + `desktop_push_enabled: bool` (how).
- patterns/unified-preference-model-for-cross-client-state — one canonical preference hierarchy across all clients with named override points for legitimate platform divergence (Slack's Advanced section = mobile-specific badge controls). Collapses N per-client models + sync layer into 1 + explicit overrides.
- concepts/read-time-preference-translation — the concept behind the pattern; translate legacy stored values to new-schema semantics at read time, leaving storage untouched.
- concepts/preference-schema-decoupling — Slack's what/how split canonicalised; separate content selection from delivery channel as orthogonal preference axes.
- concepts/cross-platform-preference-parity — single shared preference model across desktop / iOS / Android with explicit overrides where platforms legitimately diverge; the anti-pattern is per-client preference drift (Slack's pre-2026-03 state).
- concepts/mental-model-preference-coherence — the architectural discipline of making storage schema match user mental models so "setting X does Y" is a simple read, not a derivation. The test case: does a user's predicted-outcome survive changing a setting?
- concepts/support-burden-as-architecture-signal — sustained high ticket volume on one feature as a diagnostic for structural architecture failure, not UX polish. Slack: notifications were top-3 CX ticket driver for years; rebuild eliminated the category.
- concepts/explicit-state-over-implicit-sync — storing independent explicit values (`desktop`, `mobile`) beats one-value-plus-sync-derivation-rule. Slack verbatim: "Clarity beats cleverness. Removing the sync parameter and storing explicit desktop and mobile values made behavior predictable."
- concepts/auto-save-modal-ux-coherence — auto-save as the UX discipline that matches fine-grained independently-toggleable settings; coherence fix against the other architectural choices, not a stand-alone UX preference.
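The read-time schema translation pattern listed above can be illustrated with a minimal Python sketch. Only the mapping itself (legacy `desktop: 'off'` read as `desktop: 'mentions'` plus `desktop_push_enabled: false`, with storage left untouched) comes from the Slack post. The function name `translate_on_read`, the dict-shaped record, and the default of `True` for records that lack the new boolean are hypothetical choices made for illustration, not Slack's code.

```python
from typing import Any, Dict


def translate_on_read(stored: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new-schema view of a stored preference record.

    The stored record is never mutated, so a rollback sees exactly the
    bytes that were there before (hypothetical helper, not Slack's code).
    """
    view = dict(stored)  # copy: translation happens only on the read path
    if view.get("desktop") == "off":
        # Legacy 'off' conflated "what" and "how"; express it on two axes.
        view["desktop"] = "mentions"
        view["desktop_push_enabled"] = False
    else:
        # Assumption: other legacy values imply push delivery was enabled.
        view.setdefault("desktop_push_enabled", True)
    return view


legacy_record = {"desktop": "off"}
print(translate_on_read(legacy_record))
print(legacy_record)  # unchanged: storage still holds the legacy value
```

Because every read site goes through the same translation, the new code never has to reason about the legacy value, while the old code can still be redeployed against unmodified storage.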
Recent articles¶
- 2026-05-28 — sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud (Slack AI infrastructure team retrospective on the three-year evolution of the Slack AI LLM serving substrate from single-region AWS SageMaker (early 2023) → fully managed Amazon Bedrock PT (mid-2024) → Hybrid PT+OD with spillover (mid-2025) → multi-cloud Bedrock + GCP Vertex AI (early 2026). Architectural endpoint: Slack's Intelligent Routing Layer with five subsystems (metric-driven model selection / A-B experimental routing / automated circuit breaker with partial-open recovery / API normalisation / secretless cross-cloud auth) hidden behind a unified internal API. Disclosed Phase 4 outcomes: ~10% quality lift on complex reasoning + ~67% latency reduction on high-velocity / low-token workloads via per-feature model-to-feature binding. Three-year arc canonicalises four structural failure modes that drove the evolution: model feature lag (Phase 1 → 2; verbatim "weeks or months before SageMaker availability" for Anthropic models on Bedrock vs SageMaker), over-provisioning cycle (Phase 2 → 3; "persistent efficiency gap" between US-coast peak and APAC/EU mornings; 10× peak/off-peak variance for OD-suitable workloads), commitment lock-in (Phase 2 → 3; verbatim "a state-of-the-art model can be superseded in weeks" but PT contracts run 1–6 months; post-OD migration cadence months → 1 day), concentration risk on single-cloud LLM serving (Phase 3 → 4; "no matter how many failovers we engineered within a single cloud, we remained susceptible to any potential provider-wide outage"). Phase 1 substrate: escrow VPC for Anthropic models on SageMaker, "a strict zero-knowledge environment: our data remained private to Slack, and the provider's proprietary model weights remained inaccessible to us". Phase 2 migration: zero-incident playbook — Compliance → Capacity → Quality → Rollout. Phase 3 hybrid: PT-with-OD-spillover verbatim — "if a sudden surge pushed us beyond our reserved limits, excess requests automatically 'spilled over' to on-demand endpoints, ensuring we never dropped a request due to capacity ceilings".
Phase 4 routing: per-feature model bindings + automated circuit breaker with partial-open recovery state verbatim — "the breaker enters a partial-open state, allowing a small, controlled trickle of requests to reach the degraded endpoint. As the endpoint demonstrates sustained health, the system dynamically expands this trickle, incrementally ramping traffic back up". Three named breaker triggers: TTFT, p90 latency, 5xx error rate (plus "downward trend in customer feedback" as a soft-signal trigger). Slack adopts the MU primitive on the customer side — sibling to Databricks' platform-side coining of the same primitive in the 2026-05-27 Reliable LLM Inference at Scale post; verbatim "Each MU provides a deterministic amount of throughput, measured in tokens per minute. Shifting from GPU instances to MUs allowed us to abstract away the hardware and focus entirely on raw throughput." Five reflections verbatim: "Scaling safely requires XFN parity / The abstraction layer is a core requirement / Treat architecture as a living document / Reliability requires provider agnosticism / Redefining the meaning of 'Failure'" — the last canonicalising soft-failures ("An LLM service that is 'up' but slow is effectively broken") as first-class routing-layer triggers. Four operational taxes of multi-cloud disclosed verbatim: API + behavioural friction, operational monitoring complexity, attribution challenge, on-call knowledge gap.
Canonicalises 3 systems (systems/slack-ai, systems/slack-intelligent-routing-layer, systems/gcp-vertex-ai), 10 concepts (concepts/multi-cloud-llm-serving, concepts/escrow-vpc-llm-serving, concepts/llm-model-feature-lag, concepts/provisioned-throughput-vs-on-demand-llm, concepts/llm-over-provisioning-cycle, concepts/llm-provider-commitment-lock-in, concepts/api-normalization-multi-cloud-llm, concepts/model-to-feature-binding, concepts/concentration-risk-single-cloud-llm, concepts/automated-circuit-breaker-with-partial-open-state), and 5 patterns (patterns/multi-cloud-llm-serving, patterns/provisioned-throughput-with-on-demand-spillover, patterns/api-normalization-layer-cross-provider, patterns/model-fallback-hierarchy-with-circuit-breaker, patterns/zero-incident-llm-migration). Extends 4 pages: systems/aws-sagemaker-ai (Phase 1 escrow-VPC face), systems/amazon-bedrock (Phase 2/3 substrate face + PT/OD primitive disclosure + Slack customer-side MU adoption), concepts/model-units (customer-side adoption sibling to Databricks' platform-side coining), patterns/circuit-breaker (partial-open ramp refinement for LLM serving). Tenth Slack coverage axis on the wiki — the LLM serving infrastructure substrate one altitude above the consumer-facing Slack AI features (developer-productivity LLM codemod, security-investigation Spear, agent-context-engineering, etc.) all of which ride on this substrate. Tier-2 on-scope decisively: multi-cloud LLM serving architecture is genuine distributed-systems / scaling-trade-offs / production-capacity-management content. Sibling at the LLM-serving-substrate altitude to Databricks' Reliable LLM Inference retrospective (sources/2026-05-27-databricks-reliable-llm-inference-at-scale, same-week ingest) — same primitives (MUs, cost-based routing, fallback hierarchy), different altitude (Slack is the customer; Databricks is the platform). Caveats: no specific provider model SKUs / no dollar figures / PT-vs-OD savings characterised as "substantial" without concrete numbers / circuit-breaker thresholds qualitative / per-feature primary/backup bindings not enumerated / cross-cloud traffic share not disclosed / specific secretless-auth federation shape not specified / retrospective framing on a clean four-phase progression with no acknowledgement of false starts.)
- 2026-05-05 — sources/2026-05-05-slack-from-ssh-to-rest-a-security-driven-modernization-of-slacks-emr-data-pipelines (Slack data-platform team retrospective on eliminating SSH entirely from EMR data pipelines — 700+ production jobs across 8 independent data regions, 7 operator types, 5 teams, 3 quarters, zero downtime. The architectural unblocker: a single REST gateway — Quarry — fronting YARN / Trino / Snowflake from one auth surface and one log schema; the architecture shift verbatim: `Airflow → SSH Connection → EMR Master Node → Execute Command` → `Airflow → Quarry REST API → YARN ResourceManager → EMR Container`. The breakthrough enabler for the 300+ shell-class jobs (`aws s3 sync`, `hadoop distcp`, custom Python scripts) was discovering that YARN Distributed Shell — "a little-known feature […] already part of YARN" — could run arbitrary shell scripts in YARN containers via the standard YARN REST API, with no custom remote-execution service to build. Canonicalised as patterns/yarn-distributed-shell-as-universal-shell-executor. SSH's costs surfaced as structural blockers: "Couldn't start the path for Spark on Kubernetes nor EMR on EKS […] Couldn't complete our Whitecastle initiative because we needed to move the last main-account EMR clusters to child accounts." Two named challenges illuminate the latent failure modes SSH was hiding: vmem-check failures (jobs that ran fine via SSH had been silently exceeding YARN's virtual-memory limits — fix: AWS-recommended `yarn.nodemanager.vmem-check-enabled: false` because "virtual memory accounting in Linux can be unreliable, and physical memory limits are sufficient" — canonicalised as concepts/resource-enforcement-bypass-via-ssh) and an EKM connectivity timeout during cross-cluster migration that "surfaced a hidden dependency on network topology that wasn't captured in the job's configuration" — fix: move jobs to clusters with the right routing rather than punching holes through the network. Honest disclosure on multi-region overhead: "effectively 8 parallel migrations, each with its own special set of challenges. […] The effort isn't just N times harder. It's N times harder with unique failure modes for each region."
Disciplined rollout used patterns/incremental-operator-by-operator-migration (CrunchExecOperator, S3SyncOperator, etc., one at a time as mini-projects) over 5 phases — Proof of Concept → Security Review → OKR-Driven Execution (Phase 3 made the migration a Key Result with executive visibility; hit 80% milestone) → Bulk Migration → Final Cleanup — tracked via an analytics dashboard backed by Airflow metadata-DB queries identifying remaining SSH-based tasks per team / per DAG / per region. Verbatim best-practice: "Build monitoring before you migrate" — extends concepts/observability-before-migration to the project-progress altitude (Airflow-DB-backed burndown), sibling to the 2026-03-31 HTTP/3 transport-probe altitude. Operational improvements post-migration verbatim: "Master node resource contention: eliminated. […] Job reliability: dramatically improved. Jobs survive client Kubernetes pod restarts because Quarry maintains server-side job tracking. No more zombie processes." Security improvements verbatim: "replaced SSH key distribution with service-to-service token authentication, and gained proper audit trails through REST API logging. […] No more 'who ran that command?' mysteries." Two-year retrospective verdict: "With two years of production experience since completion, the architectural decisions have proven sound. […] No regrets." Canonicalises 3 systems (systems/slack-quarry, systems/yarn-distributed-shell, systems/apache-yarn), 4 concepts (concepts/rest-based-job-submission — the paradigm shift; concepts/ssh-job-execution-anti-pattern — what's being replaced; the two failure-mode concepts named above), and 3 patterns (patterns/rest-gateway-for-compute-engine-job-submission, patterns/yarn-distributed-shell-as-universal-shell-executor, patterns/incremental-operator-by-operator-migration). 
Extends concepts/attack-surface-minimization (with the "can't modernise anything until SSH dies" structural-blocker framing — complement to Meta's "feature gating" WhatsApp framing), concepts/long-lived-key-risk (industrial-scale elimination of the long-lived-SSH-key class via service-token replacement), and concepts/audit-trail (Quarry's per-submission structured logs as the new audit substrate). Caveats: no exact $-savings figure, no public Quarry open-source release, no Trino/Snowflake adapter internals disclosed, no quantified incident-rate delta, no token-rotation cadence at 700-jobs × 8-regions scale.)
- 2026-04-13 — sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications
(Slack Security Engineering team, series part 2 to the
2025-12-01 Spear post. Canonicalises how
Spear manages context across
long-running multi-agent investigations that can "span
hundreds of inference requests and generate megabytes of
output". Core architectural claim:
three
complementary context channels replace raw message history
— Director's
Journal (typed working memory: decision / observation /
finding / question / action / hypothesis + phase + round +
timestamp; read by all agents, written only by Director),
Critic's Review with the
four-tool
introspection suite (`get_tool_call`, `get_tool_result`, `get_toolset_info`, `list_toolsets`) and a 5-level credibility rubric with a disclosed score distribution over 170,000 findings (37.7% Trustworthy / 25.4% Highly-plausible / 11.1% Plausible / 10.4% Speculative / 15.4% Misguided — 25.8% sub-plausibility rate), and the Critic's Timeline that implements narrative-coherence assembly with four explicit consolidation rules, a second 5-level rubric, and top-3 gap identification across three gap types (evidential / temporal / logical). Canonical load-bearing claim on narrative coherence as hallucination filter: "A hallucination can only survive this process if it is more coherent with the body of evidence than any real observation it competes with." Also: no message history carry-forward between invocations — "Besides these resources, we do not pass any message history forward between agent invocations" — justified both by token-budget and by cognitive-load arguments (over-sharing "stifles creativity and encourages confirmation bias" even with infinite context). Specimen investigation disclosed: 6,046-event false-positive kernel-module-loading alert with 0.83 Timeline confidence and 3 identified gaps.)
- 2025-12-01 — sources/2025-12-01-slack-streamlining-security-investigations-with-agents
(Slack Security Engineering team retrospective on building
Spear, their internal multi-agent
security-investigation service that triages detection-system
alerts during on-call shifts. First post in a promised
series. Load-bearing architectural lesson verbatim:
"prompts are just guidelines; they're not an effective
method for achieving fine-grained control" — canonical
statement of the new
concepts/prompt-is-not-control concept after Slack's
300-word single-prompt prototype produced "wildly variable"
quality. Prototype rewrite moved control out of the prompt
into per-task model invocations with task-specific
structured-output schemas (patterns/one-model-invocation-per-task)
orchestrated by application code. Three-persona agent team
(patterns/director-expert-critic-investigation-loop):
Director (plans + progresses phases + writes final report;
uses a journaling tool), 4 Experts (Access /
Cloud / Code / Threat — each with unique tools + data
sources), Critic ("meta-expert" auditing Experts against
a rubric, scoring findings, condensing into a timeline).
Three phases (patterns/phase-gated-investigation-progression):
Discovery (broadcast to all Experts) → Trace (question one
Expert, may vary model parameters) → Conclude (final report),
with a Director-Decision meta-phase for transition decisions.
Knowledge pyramid model tiering
(concepts/knowledge-pyramid-model-tiering): verbatim
"low, medium, and high-cost models for the expert, critic,
and director functions, respectively." Tool-call-heavy
Expert work runs on cheap models; rubric-application +
condensation runs on mid-tier; strategic decisions on
top-tier. Service architecture
(patterns/hub-worker-dashboard-agent-service): Hub
(API + persistent storage + metrics endpoint) + Worker
(queue consumer, event-stream emitter, scalable) + Dashboard
(real-time observe + replay + per-invocation debugging)
replacing the prototype's coding-agent-CLI harness. Prototype
exposed data sources via an MCP stdio server; production
Worker's MCP persistence not explicitly disclosed.
Critic's weakly-adversarial stance
(concepts/weakly-adversarial-critic) verbatim: "The
weakly adversarial relationship between the Critic and the
expert group helps to mitigate against hallucinations and
variability in the interpretation of evidence." Canonical
worked example: the Critic caught a credential exposure
the Expert missed during a process-ancestry review, the
Director then pivoted the investigation to focus on the
credential issue, final report surfaced both the security
finding and the Expert's "analysis blind spots that require
attention." Verbatim: "What is notable about this result
is that the expert did not raise the credential exposure in
its findings; the Critic noticed it as part of its
meta-analysis of the expert's work." On-call shift
mode-shift: "we're switching to supervising investigation
teams, rather than doing the laborious work of gathering
evidence." 9 canonical wiki primitives: source + 1
system (systems/slack-spear) + 4 concepts
(concepts/knowledge-pyramid-model-tiering,
concepts/investigation-phase-progression,
concepts/weakly-adversarial-critic,
concepts/prompt-is-not-control) + 4 patterns
(patterns/director-expert-critic-investigation-loop,
patterns/one-model-invocation-per-task,
patterns/hub-worker-dashboard-agent-service,
patterns/phase-gated-investigation-progression).
Extends 5 pages:
systems/model-context-protocol (new Seen-in — first wiki
MCP instance inside an internal security-investigation
pipeline),
concepts/structured-output-reliability (new Seen-in —
structured output as multi-agent orchestration-boundary
contract),
patterns/specialized-agent-decomposition (new Seen-in —
canonical security-operations instance at the peer-Expert
layer, with supra-agent Director + meta-agent Critic on top),
patterns/multi-round-critic-quality-gate (new Seen-in —
live-investigation altitude variant, distinguished from
Meta's artifact-production rounds shape),
patterns/drafter-evaluator-refinement-loop (new Seen-in —
investigation-loop-with-third-layer variant, distinguished
from Lyft's retry-only shape by the Director who decides
progress/pivot/conclude). Scope disposition: Tier-2
on-scope decisively on multi-agent-architecture-canonicalisation
grounds. Opens the Slack security-engineering axis on the wiki;
first Slack security-operations ingest; seventh Slack coverage
axis after developer-productivity, deploy-safety,
test-framework-integration, mobile-a11y, fleet-config-management,
build-systems, and edge-networking. Zero production
numbers disclosed (no throughput / latency / cost / FP
rate / token-usage); model families not disclosed;
Critic's rubric opaque; Spear name inferred from image
slugs not stated in post body. URL verbatim:
`https://slack.engineering/streamlining-security-investigations-with-agents/`. Sibling to Cloudflare AI Code Review (patterns/coordinator-sub-reviewer-orchestration) at the code-review altitude; sibling to Redpanda Openclaw (patterns/four-component-agent-production-stack) at the enterprise-agent-substrate altitude; sibling to Lyft localization (patterns/drafter-evaluator-refinement-loop) at the structured-retry-loop altitude.)
- 2026-03-19 — sources/2026-03-19-slack-how-slack-rebuilt-notifications
(Slack Engineering retrospective on the Notifications
2.0 project — a ground-up redesign of Slack's
notification preference system that migrated millions of
users from four conflicting preference models to one
unified hierarchy without a database-level backfill.
The load-bearing mechanism is
read-time schema
translation: a new `desktop_push_enabled` boolean was added and backfilled from the legacy "off" value; at every read site, legacy `desktop: 'off'` is translated to `desktop: 'mentions'` + `desktop_push_enabled: false` — behaviour stays byte-identical, but now expressible in the new decoupled schema. Stored bytes never changed, so rollback is byte-identical and safe. Canonical verbatim on why the database-backfill path was rejected: "With backwards compatibility and the possibility of rollback in mind, we thought it too risky to move people from 'off' to 'mentions' at the database level." The schema itself canonicalises decoupling what from how — the legacy enum conflated content selection (`everything` vs `mentions` vs `nothing`) with delivery channel (push on/off) into one axis; the new schema splits into `desktop: everything | mentions` (what) + `desktop_push_enabled: bool` (how). Canonicalises one canonical preference hierarchy across clients with the Advanced section as the named override dimension for mobile-specific badge controls. Top-3 CX ticket driver pre-project canonicalises support burden as architecture signal: sustained high ticket volume concentrated on one feature is a diagnostic for structural architecture failure, not UX polish. Modal redesign switched from explicit Save button to auto-save — canonicalises auto-save as UX coherence for fine-grained independently-toggleable settings. Legacy "sync" parameter explicitly removed in favour of independent explicit values — canonical instance of concepts/explicit-state-over-implicit-sync: "Clarity beats cleverness. Removing the sync parameter and storing explicit desktop and mobile values made behavior predictable." Post-launch data: 5× settings-page engagement sustained for weeks (not one-time curiosity, active ongoing preference refinement); majority chose "Mentions and DMs" default; per-channel overrides decreased; push-toggle adoption "higher than expected".
One disclosed production incident during migration: "A malformed field once reset preferences to Mentions until we cleaned data and flushed memcache" — surfaces the thin-validation + aggressive-caching + default-value-fallback compound failure shape of heavily-cached preference stores. Opens Slack's 8th coverage axis on the wiki: preference-architecture / notification engineering at scale.)
- 2026-03-31 — sources/2026-03-31-slack-from-custom-to-open-scalable-network-probing-and-http3-readiness
(Slack edge-networking team retrospective on rolling out
HTTP/3 on the public edge and closing the
HTTP/3 probing gap first.
Existing SaaS observability vendors had zero native
HTTP/3 probe support; Slack's own
Prometheus Blackbox
Exporter fleet ("a cornerstone of our monitoring") was
TCP-shaped and couldn't speak QUIC/UDP. Intern Sebastian
Feliciano scoped, implemented, and open-sourced an
`http3` module into BBE upstream built on systems/quic-go — selected for "wide adoption across other open source technologies, as well as the first-class support it provides in creating http clients in go." Integration snippet: `http3Transport := &http3.Transport{TLSClientConfig: tlsConfig, QUICConfig: &quic.Config{}}` wrapped in `http.Client`. Architectural discipline that earned the upstream merge: "had to add this new logic while following the Blackbox Exporter's existing architecture, ensuring the new features maintained the tool's configuration patterns." Because internship timelines ≠ OSS merge timelines, Sebastian "took matters into his own hands and architected an in-house system that utilized the new upstream features for probing out HTTP/3 endpoints" — canonical instance of patterns/upstream-contribution-parallel-to-in-house-integration. Final payoff: HTTP/1.1 + HTTP/2 + HTTP/3 metrics unified in Grafana ("single pane of glass") + reliable HTTP/3 alerts + easier telemetry correlation.
New canonical pages (5): 1 system
(systems/prometheus-blackbox-exporter), 3 concepts
(concepts/client-side-black-box-probe,
concepts/http-3-probing-gap,
concepts/observability-before-migration), 1 pattern
(patterns/upstream-contribution-parallel-to-in-house-integration).
Extended: systems/quic-go (second wiki
instance at upstream-tooling altitude, distinct from the
PlanetScale HTTP/3 driver benchmark instance),
systems/prometheus (BBE axis added to the
Airbnb-observability-ingestion-dominated page), systems/grafana
(single-pane-of-glass unified multi-HTTP-version view),
concepts/http-3 (probing-gap framing added alongside
the existing Cloudflare latency framing),
concepts/observability (observability-as-migration-gate
altitude added), patterns/upstream-fixes-to-community
(upstream-a-whole-new-feature altitude added alongside the
existing fix-at-scale altitude). Scope takeaways
verbatim: "Monitor first, and migrate second. ... getting
observability right as a precursor to migration makes
everything faster"; "Contributing open source pays
dividends"; "Bet on your interns." Opens the Slack
edge-networking-and-synthetic-monitoring axis on the wiki;
sixth Slack coverage axis (after developer-productivity,
deploy-safety, test-framework-integration, mobile-a11y,
fleet-config-management, and build-systems). Tier-2
on-scope decisively: real engineering retrospective with
architecture diagram, code snippet, upstream-PR reference,
specific library selection rationale, and explicit
migration-gate framing. Operational numbers thin (article
is about monitoring, not HTTP/3 edge perf): hundreds of
thousands of HTTP/3 endpoints to probe; zero SaaS vendors
supporting HTTP/3 out-of-box at investigation time; code
merged to BBE at pinned commit
`bee8e9102a106bff63281ee9c64c7b1275ef21d0`. URL verbatim: `https://slack.engineering/from-custom-to-open-scalable-network-probing-and-http-3-readiness-with-prometheus/`.)
- 2025-11-06 — sources/2025-11-06-slack-build-better-software-to-build-software-better (Slack Quip/Canvas team retrospective on taking their build from 60 minutes to as low as 10 minutes (~6× speed-up) by adopting Bazel + Starlark and doing the unglamorous engineering work to benefit from them. Load-bearing insight: "Bazel's magic is contingent on the declared graph actually being a DAG of hermetic, idempotent actions" — the pre-existing build had cycles, non-hermetic action nodes, and cache keys with hundreds of parameters, giving a zero cache hit rate that no build tool could fix. Two concrete wins: severing the Python↔TypeScript dependency edge (saved ~35 min/build on its own; canonical patterns/decouple-frontend-build-from-backend-artifacts) and deleting in-process parallelization inside the frontend bundler so Bazel could schedule across bundles (canonical patterns/delete-inner-parallelization-inside-outer-orchestrator). Correctness oracle during refactor was a Rust byte-diff tool comparing old- and new-build artifacts (canonical patterns/diff-artifact-validator-for-build-refactor). Key new concepts: concepts/cache-granularity ("100 parameters, 2-3 always change" failure mode), concepts/idempotent-build-action (pre-refactor build mutated the working directory), concepts/layering-violation (frontend bundler fused business logic + orchestration + parallelization), concepts/separation-of-concerns applied across backend/frontend, Python/TypeScript, and application/build-code axes. Extends systems/bazel, concepts/hermetic-build, concepts/build-graph, concepts/cache-hit-rate with the zero-hit-rate structural-failure story. Operational numbers: 60 min → 10 min (best-case, cached+parallelised) / 12 min (average) / 30 min (cache miss).
Opens the Slack build-system- engineering axis on the wiki; first Slack build-tooling ingest.)
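The "100 parameters, 2-3 always change" failure mode named in the 2025-11-06 entry can be illustrated with a minimal sketch. Everything here is hypothetical (the parameter names, the hashing scheme, the `simulate_builds` helper); it is not Bazel's or Slack's actual cache-key logic, only a toy model of why a coarse key never hits.

```python
import hashlib
import json
from typing import Optional


def cache_key(params: dict) -> str:
    """Hash a parameter mapping into a deterministic cache key."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def simulate_builds(runs: list, relevant: Optional[set] = None) -> int:
    """Count cache hits across successive builds.

    If `relevant` is given, only those parameters feed the key;
    otherwise every parameter does.
    """
    seen = set()
    hits = 0
    for params in runs:
        if relevant is not None:
            params = {k: v for k, v in params.items() if k in relevant}
        key = cache_key(params)
        if key in seen:
            hits += 1
        seen.add(key)
    return hits


# Hypothetical builds: 100 stable parameters plus two that change every run.
stable = {f"flag_{i}": "on" for i in range(100)}
runs = [
    {**stable, "build_timestamp": 1000 + n, "build_number": n}
    for n in range(5)
]

# Coarse key: the volatile parameters poison every lookup.
coarse_hits = simulate_builds(runs)
# Granular key: only inputs that affect the output feed the hash.
granular_hits = simulate_builds(runs, relevant=set(stable))
print(coarse_hits, granular_hits)
```

In this toy model the coarse key misses on every build, while the key restricted to stable inputs hits on every build after the first.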
- 2025-11-19 — sources/2025-11-19-slack-android-vpat-journey
(Slack Android team retrospective on triaging a 2024 third-party
VPAT audit conducted after the IA4 UI redesign;
8 recurring accessibility themes identified (7 resolved,
1 deferred as future work). Representative-of-all-four-buckets
worked example of [[patterns/vpat-driven-a11y-triage]]:
shovel-ready fixes assigned immediately; recurring
themes resolved at Slack Kit component-library layer
(OutlinedTextField error-announcement, SKBanner error auto-announce, SKListAccessibilityDelegate overrides CollectionInfo to exclude decorative SK divider items from TalkBack's "N items in a list" count — canonical instance of patterns/accessibility-delegate-override-for-semantic-fix; workspace-switcher drag-and-drop via Edit-mode + six-dot handle + TalkBack custom actions "Move before" / "Move after" invoked by three-finger tap or L/r drawing gestures — canonical instance of patterns/custom-talkback-actions-as-gesture-alternative); platform-convention closures for top-app-bar-as-heading (WCAG 1.3.1; verified via native Google apps) and strikethrough-announcement (WCAG 1.3.1; verified via blind-community consultation) — canonical instances of concepts/wcag-platform-applicability-gap; error-icon redundancy resolving the colour-alone P3 theme — concepts/redundant-error-signalling (WCAG 1.4.1); scope-bounded deferral of WCAG 2.1.1 keyboard-nav because Slack Android has no tablet support. New canonical pages: source + 4 systems (systems/slack-android, systems/slack-kit, systems/talkback, systems/android-accessibility-framework) + 3 concepts (concepts/vpat-voluntary-product-accessibility-template, concepts/wcag-platform-applicability-gap, concepts/redundant-error-signalling) + 3 patterns (patterns/accessibility-delegate-override-for-semantic-fix, patterns/custom-talkback-actions-as-gesture-alternative, patterns/vpat-driven-a11y-triage). Extended: concepts/automated-vs-manual-testing-complementarity (new Seen-in: this post is the manual / third-party / periodic layer complementing the 2025-01-07 automated Axe-in-Playwright ingest). Opens the fourth axis of Slack coverage on the wiki: mobile accessibility engineering. Tier-2 borderline include on mobile-a11y-pattern-canonicalisation grounds; user explicit full-ingest override of prior batch-skip; same disposition as the 2025-01-07 and 2024-06-19 Slack developer-productivity ingests.
No incident / latency / scale disclosures; pure engineering-process retrospective. URL verbatim: https://slack.engineering/android-vpat-journey/.)
- 2025-10-23 — sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption
(Archie Gunasekara's follow-up to Slack's 2024 Advancing
Our Chef Infrastructure post describes phase two of
Slack's EC2 / Chef deploy-safety work: instead of migrating
to Chef Policyfiles (which
would have required every service team to rewrite roles,
environments, and cookbooks), Slack extended the existing
EC2 framework in two load-bearing changes. (1) Splitting
one production Chef environment into six AZ-bucketed
environments (prod-1…prod-6) rolled out via a release train with prod-1 as canary: prod-1 receives new versions every hour (hot canary); prod-2→prod-6 advance via release train, with the next version gated on the previous version completing the train. Why the canary is parallel rather than head-of-train: "updating prod-1 frequently with the latest version allows us to detect issues closer to when they were introduced" rather than testing cumulative-change artifacts at the canary. Boot-time mapping from AZ to environment via Poptart Bootstrap (Slack's cloud-init-phase AMI tool) ensures newly provisioned nodes inherit the AZ-bucket boundary from instance 0 — the explicit fix for the scale-out-picks-up-bad-config failure mode that per-node cron staggering didn't address. (2) Replacing cron-driven Chef runs with a signal-driven pull model via a new service called Chef Summoner that watches an S3 bucket populated by the existing Chef Librarian artifact-promotion service at chef-run-triggers/<stack>/<env>. Signal payload carries Splay (randomised jitter), Timestamp, and a full ManifestRecord (artifact version, cookbook-versions map, S3 artifact pointer, upload_complete ordering flag). Summoner deduplicates against local state (last-run-time + artifact-version), applies Splay, and triggers chef-client. Plus a fallback cron baked into every AMI that independently triggers chef-client if Summoner hasn't run Chef in the last 12 hours — the recovery path for broken-Summoner deployments. Also enforces the 12-hour compliance SLA. Closes by marking the legacy EC2 platform feature-complete + maintenance-mode and previews a brand-new EC2 successor called Shipyard (service-level deployments + metric-driven rollouts + automated rollbacks) for teams that can't yet move to Bedrock.
Canonical wiki primitives: 6 new systems (systems/chef + systems/chef-policyfiles + systems/chef-librarian + systems/chef-summoner + systems/poptart-bootstrap + systems/slack-shipyard), 6 new concepts (concepts/az-bucketed-environment-split + concepts/splay-randomised-run-jitter + concepts/signal-driven-chef-trigger + concepts/s3-signal-bucket-as-config-fanout + concepts/fallback-cron-for-self-update-safety + concepts/cookbook-artifact-versioning), 4 new patterns (patterns/split-environment-per-az-for-blast-radius + patterns/release-train-rollout-with-canary + patterns/signal-triggered-fleet-config-apply + patterns/self-update-with-independent-fallback-cron). Extends concepts/blast-radius with the fleet-configuration substrate altitude and systems/aws-s3 with the S3-as-config-fanout-bus altitude (previously canonicalised as object-store + CDC-log-store + tiered-cold-tier). Natural companion to the 2025-10-07 Deploy Safety retrospective — one level below the program-altitude framing, drilling into the EC2 / Chef substrate specifically. Tier-2 on-scope decisively: real distributed-systems internals, scaling-trade-off rationale, concrete operational disclosures (6 production environments / 12-hour compliance SLA / hourly promotions). Opens the Slack fleet-configuration-management axis.)
- 2025-10-07 — sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change
(Retrospective on Slack's 18-month Deploy Safety Program —
90% reduction in customer impact hours from
change-triggered incidents (Feb-Apr 2024 peak → Jan 2025).
Load-bearing framing statistic: 73% of customer-facing
incidents were Slack-induced-change-triggered, particularly
code deploys. Three North Star goals across all deployment
systems for highest-importance services: 10-min automated
MTTR / 20-min manual MTTR / detect before 10% fleet
exposure. Canonical program metric: "Hours of customer
impact from high-severity and selected medium-severity
change-triggered incidents" — explicitly framed as
"imperfect analog of customer sentiment" sitting in a
three-layer chain
Customer sentiment <-> Program Metric <-> Project Metric. Five-axis investment strategy: invest widely + bias for action / known-pain first / invest further based on results / curtail least impactful / flexible roadmap — explicit framing that below-expectation projects are "not failures" but "critical input." Phase-change architectural shift: "Once automatic rollbacks were introduced we observed dramatic improvement in results." Composed Webapp-backend investment sequence (Q1 metric monitoring → Q2 manual rollback via automatic alerts → Q3-Q4 automatic rollback → Q4+ ≤10 min customer impact → pattern copied to Webapp frontend → centralised deployment orchestration system inspired by ReleaseBot + AWS Pipelines). Trailing-metric-patience discipline with 3-6 month delivery-to-impact lag; mid-stream sub-signals. Tool fluency discipline — "Use the tooling often, not just for the infrequent worst case scenarios." Direct-per-team-outreach discipline — "Not all teams and systems are the same." Canonical wiki primitives: (1) systems/slack-deploy-safety-program (new) — the program as a named wiki system. (2) systems/slack-releasebot + systems/slack-bedrock (new) — the named ReleaseBot inspiration + Bedrock substrate as stub pages. (3) concepts/change-triggered-incident-rate (new) — the justifying statistic. (4) concepts/customer-impact-hours-metric (new) — the program-metric-as-sentiment-analog choice. (5) concepts/pre-10-percent-fleet-detection-goal (new) — blast-radius-cap at fleet-level. (6) concepts/trailing-metric-patience (new) — the patience discipline. (7) patterns/automated-detect-remediate-within-10-minutes (new) — the 10-min/20-min MTTR pair. (8) patterns/centralised-deployment-orchestration-across-systems (new) — the multi-substrate deploy-orchestrator pattern. (9) patterns/invest-widely-then-double-down-on-impact (new) — the five-axis strategy.
Extends patterns/fast-rollback with the fully-automated altitude variant, concepts/feedback-control-loop-for-rollouts with the organisational-altitude instance, concepts/blast-radius with the fleet-percentage rollout-gate altitude, concepts/dora-metrics with the "maintain development velocity" co-equal North Star, concepts/observability with the Q1-first-investment framing. Opens the Slack reliability-engineering axis on the wiki; first Slack production-reliability ingest.)
- 2025-01-07 — sources/2025-01-07-slack-automated-accessibility-testing-at-slack
(Slack Frontend Test Frameworks team retrospective on
integrating Axe Core into Slack's
pre-existing Playwright E2E framework
as a custom-fixture extension. Two failed integration attempts
first: baking Axe into RTL's render (blocked by Slack's customised Jest setup) and baking Axe into Playwright's Locator interaction methods (blocked by Locator auto-wait semantics). Landing architecture: slack.utils.a11y.runAxeAndSaveViolations() on the pre-existing custom slack fixture, invoked explicitly at page-ready. Canonicalises five reusable patterns: fixture-extension as integration surface for cross-cutting testing concerns, two-axis exclusion list (ticketed-known-issue + out-of-scope-by-design), severity-gated reporting (critical only at launch, serious/moderate/mild as future work), tri-mode opt-in test execution (A11Y_ENABLE flag composes on-demand local + scheduled nightly Buildkite + opt-in CI gate), alert-channel-to-Jira auto-ticket workflow (Slack alert channel spins up pre-populated Jira ticket with canonical label + Epic placement). Four new concepts: concepts/wcag-2-1-a-aa-scope, concepts/automated-vs-manual-testing-complementarity, concepts/playwright-locator-auto-wait, concepts/severity-filtered-violation-reporting. Three new systems: systems/axe-core, systems/axe-core-playwright, systems/jira. 91 tests in initial suite, non-blocking, WCAG 2.1 A+AA scope, critical impact only. Borderline Tier-2 ingest — developer-productivity rather than distributed-systems internals; same disposition as the 2024-06-19 Enzyme→RTL codemod ingest; the canonicalised patterns generalise to any automated-check-integration-into-existing-test-suite problem.)
- 2024-08-26 — sources/2024-08-26-slack-unified-grid-how-we-re-architected-slack-for-our-largest-customers
(Slack's 2024-08-26 retrospective on the Unified Grid project — a decade-after-launch re-architecture of the Slack client and backend's fundamental tenant-scoping assumption, shifting from workspace-centric to org-wide.)
- 2024-06-19 — sources/2024-06-19-slack-ai-powered-conversion-from-enzyme-to-react-testing-library
(Retrospective on migrating 15,000+ Enzyme tests to React
Testing Library using an AST + LLM hybrid pipeline. AST-only
hit ~45% ceiling; LLM-only (Claude 2.1) was 40-60% with
wild variance; hybrid reached ~80% on selected evaluation
files. Per-test-case DOM captured by instrumenting Enzyme
mount/shallow. In-code annotation comments from AST pass shaped LLM output. Tool later open-sourced as @slack/enzyme-to-rtl-codemod. ~64% adoption across Slack's RTL migration; 338-file CI-nightly run produced ~500 auto-passing test cases (~22% documented developer-time savings, lower-bound.))
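For the 2025-10-23 Chef entry above, the Summoner flow (deduplicate against local state, apply Splay, trigger chef-client) and the independent 12-hour fallback cron can be sketched as below. The field names follow the entry's description of the signal payload, but the exact dedup rule, the reading of Splay as an upper bound on a random delay, and all function and class names are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Optional

FALLBACK_AFTER_SECONDS = 12 * 60 * 60  # 12-hour window from the entry


@dataclass
class Signal:
    splay_seconds: int      # assumed: upper bound for randomised jitter
    timestamp: float        # when the signal was published
    artifact_version: str   # from the manifest record
    upload_complete: bool   # ordering flag from the manifest record


@dataclass
class LocalState:
    last_run_time: float
    artifact_version: Optional[str]


def summoner_delay(
    signal: Signal, state: LocalState, rng: random.Random
) -> Optional[int]:
    """Return seconds to wait before running chef-client, or None to skip."""
    if not signal.upload_complete:
        return None  # artifact not fully uploaded yet
    already_applied = (
        signal.artifact_version == state.artifact_version
        and signal.timestamp <= state.last_run_time
    )
    if already_applied:
        return None  # assumed dedup rule: same version, signal not newer
    return rng.randint(0, signal.splay_seconds)


def fallback_cron_should_run(now: float, state: LocalState) -> bool:
    """Independent safety net: run if Chef has not run within the window."""
    return now - state.last_run_time >= FALLBACK_AFTER_SECONDS


state = LocalState(last_run_time=100.0, artifact_version="v1")
fresh = Signal(splay_seconds=300, timestamp=200.0,
               artifact_version="v2", upload_complete=True)
delay = summoner_delay(fresh, state, random.Random(0))
print(delay is not None, fallback_cron_should_run(100.0 + 3600, state))
```

Keeping the fallback check free of any dependency on the signal path mirrors the entry's point that the cron must still work when Summoner itself is broken.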