Slack

Slack Engineering blog (slack.engineering). Tier-2 source on the sysdesign-wiki. Slack is a workplace-messaging platform (acquisition by Salesforce announced in 2020 and completed in 2021) with substantial engineering output across backend (Flannel, Vitess-for-Slack), mobile (cross-platform client architecture), frontend (TypeScript/React at large scale, shared-edit collaboration), and developer infrastructure (CI, test frameworks, migration tooling).

This wiki's coverage of Slack spans eight axes so far:

Developer-productivity tooling at scale — Slack's public retrospective on using LLMs to automate a 15,000-test Enzyme → RTL migration, which canonicalised a reusable AST + LLM hybrid conversion pattern.
Reliability engineering at scale — Slack's 18-month Deploy Safety Program (mid-2023 → Jan 2025) that reduced customer impact hours from change-triggered incidents by 90%, canonicalised in the 2025-10-07 retrospective.
Test-framework integration at scale — Slack's 2022-launched integration of Axe Core accessibility checks into the existing Playwright E2E suite as a custom-fixture extension, canonicalising several reusable patterns (fixture-extension as integration surface, two-axis exclusion-list, severity-gated reporting, tri-mode opt-in execution, alert-to-Jira workflow).
Mobile accessibility engineering — Slack's 2024 third-party VPAT audit of the IA4-redesigned Android client; 8 recurring themes (7 resolved, 1 deferred) with fixes concentrated at the Slack Kit component-library layer. Canonicalised the patterns/accessibility-delegate-override-for-semantic-fix pattern (Slack's new SKListAccessibilityDelegate fixing CollectionInfo for decorative dividers), the patterns/custom-talkback-actions-as-gesture-alternative pattern (workspace-switcher drag-reorder via TalkBack "Move before" / "Move after" actions + six-dot drag handle + Edit mode), and the patterns/vpat-driven-a11y-triage four-bucket workflow. Complements the 2025-01-07 automated-Axe ingest as the manual / third-party / periodic layer of the same broader a11y strategy (see concepts/automated-vs-manual-testing-complementarity).
Fleet-configuration-management at scale — Slack's 2025-10-23 Chef phase-2 post canonicalises the AZ-bucketed environment split, the signal-driven fleet-config-apply pipeline (Chef Librarian → S3 → Chef Summoner), the release-train-with-canary rollout, and the self-update-with-independent-fallback-cron pattern — one level below the Deploy Safety Program's altitude, in the EC2 / Chef substrate specifically.
Build-system engineering at scale — Slack Quip/Canvas team's 60 min → 10 min (~6×) build speed-up via Bazel + Starlark adoption, canonicalised in the 2025-11-06 retrospective. The load-bearing insight: Bazel gives nothing to a build whose graph has cycles / non-hermetic actions / coarse cache keys; the real wins came from applying classical separation-of-concerns and layering principles to build code itself. Canonicalised three new patterns — patterns/decouple-frontend-build-from-backend-artifacts, patterns/delete-inner-parallelization-inside-outer-orchestrator, patterns/diff-artifact-validator-for-build-refactor — plus the concepts/cache-granularity and concepts/idempotent-build-action concepts.
Edge-networking and synthetic monitoring — Slack's 2026-03-31 retrospective on rolling out HTTP/3 on the public edge. Closed the HTTP/3 probing gap — neither SaaS observability tools nor Slack's Prometheus Blackbox Exporter fleet ("a cornerstone of our monitoring") natively spoke QUIC/UDP before the intern project. Intern Sebastian Feliciano scoped, implemented, and open-sourced QUIC support into Prometheus Blackbox Exporter upstream using systems/quic-go as the client library, then built an in-house integration on the same branch so Slack could ship HTTP/3 probing to production before upstream merge — canonical instance of patterns/upstream-contribution-parallel-to-in-house-integration. Extends the patterns/upstream-fixes-to-community pattern with a new altitude: upstream-a-whole-new-feature, distinct from the Shopify × Reanimated fix-existing-feature-at-scale altitude. Canonicalised Slack's explicit "monitor first, migrate second" takeaway as the concepts/observability-before-migration concept. Final payoff: HTTP/1.1 + HTTP/2 + HTTP/3 metrics in one Grafana "single pane of glass".
Security-engineering + AI-agent operations — Slack Security Engineering team's 2025-12-01 retrospective on Spear, their multi-agent security-investigation service that triages detection-system alerts during on-call shifts. First post in a promised series. Canonical first wiki instance of the patterns/director-expert-critic-investigation-loop pattern (three-persona agent team: Director plans / progresses phases / writes final report, four Experts — Access / Cloud / Code / Threat — produce domain-specific findings, Critic audits + condenses). Canonical first wiki instance of the concepts/knowledge-pyramid-model-tiering concept (Experts on cheap models, Critic on mid-tier, Director on top-tier). Canonical first wiki instance of the patterns/hub-worker-dashboard-agent-service pattern (Hub + Worker + Dashboard productisation shape). Load-bearing architectural lesson verbatim: "prompts are just guidelines; they're not an effective method for achieving fine-grained control" — canonical statement of concepts/prompt-is-not-control. Slack's canonical emergent-behaviour worked example: Critic caught a credential exposure the Expert missed, Director pivoted the investigation. Canonicalised four concepts (concepts/knowledge-pyramid-model-tiering, concepts/investigation-phase-progression, concepts/weakly-adversarial-critic, concepts/prompt-is-not-control) + four patterns (patterns/director-expert-critic-investigation-loop, patterns/one-model-invocation-per-task, patterns/hub-worker-dashboard-agent-service, patterns/phase-gated-investigation-progression) + one system (systems/slack-spear).
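
The knowledge-pyramid tiering named in the last item can be pictured as a plain role-to-tier lookup. The sketch below is illustrative only: the post does not disclose model families, so the `Tier` labels are placeholders, and the `tier_for` helper and the lower-case agent names are hypothetical.

```python
from enum import Enum


class Tier(Enum):
    # Placeholder cost tiers; the source does not name the actual models.
    LOW = "low-cost"
    MEDIUM = "medium-cost"
    HIGH = "high-cost"


# Role-to-tier assignment as described in the post:
# Experts on cheap models, Critic on mid-tier, Director on top-tier.
ROLE_TIER = {
    "expert": Tier.LOW,
    "critic": Tier.MEDIUM,
    "director": Tier.HIGH,
}

# The four Expert domains named in the post.
EXPERTS = ("access", "cloud", "code", "threat")


def tier_for(agent: str) -> Tier:
    """Return the model tier for an agent name (hypothetical naming scheme)."""
    role = "expert" if agent in EXPERTS else agent
    return ROLE_TIER[role]
```

Keeping the assignment in application code rather than in a prompt fits the post's broader point that control belongs outside the prompt.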

Key systems

systems/slack-deploy-safety-program — the 18-month reliability program; 90% reduction in customer impact hours; 10-min automated MTTR / 20-min manual MTTR / detect-before-10%-fleet North Stars; "imperfect analog of customer sentiment" program metric; exec-sponsored; OKR-weighted.
systems/slack-releasebot — Slack's 2018-era metrics-based-deploy + automatic-rollback orchestrator for Webapp backend; the blueprint the 2023-2025 centralised orchestration system generalises across substrates.
systems/slack-bedrock — Slack's internal compute platform over Kubernetes; the substrate on which the first fully-automated metrics-based-deploy-with-auto-rollback regime was built.
systems/slack-spear — Slack's multi-agent security-investigation service that triages detection-system alerts during on-call shifts. Three-persona agent team (Director / 4 Experts / Critic) running across three phases (discovery / trace / conclude) on a Hub + Worker + Dashboard service architecture. Name inferred from image-asset URL slugs; post uses "our service".
systems/slack-shipyard — Slack's upcoming successor to the legacy Chef-based EC2 platform. Preview-only in the 2025-10-23 post; service-level deployments, metric-driven rollouts, fully-automated rollbacks; soft launch Q4 2025 with two pilot teams. For teams that can't yet move to Bedrock.
systems/chef / systems/chef-librarian / systems/chef-summoner / systems/poptart-bootstrap / systems/chef-policyfiles — Slack's legacy EC2 fleet-configuration substrate and its phase-2 components (the Policyfiles alternative was explicitly rejected on blast-radius-of-change grounds).
systems/enzyme-to-rtl-codemod — Slack's open-sourced (@slack/enzyme-to-rtl-codemod) AI-powered test-conversion pipeline: AST codemod handles deterministic Enzyme→RTL rewrites + writes in-code annotation comments for hard cases, rendered DOM is captured per-test-case by instrumenting Enzyme's mount/shallow, and an LLM (Anthropic Claude 2.1 at time of post) consumes the annotated file + DOM + structured prompt to finish the conversion. ~80% quality on evaluation files; ~64% adoption across Slack's RTL migration.
systems/enzyme — testing framework Slack was migrating away from (no React 18 adapter).
systems/react-testing-library — target framework.
systems/claude-2-1 — LLM backend used in the original pipeline (via Slack's internal DevXP AI team integration).
systems/jest — underlying test runner.
systems/playwright — Slack's E2E framework. Used as the integration substrate for Axe accessibility checks via the custom-fixture extension pattern.
systems/axe-core / systems/axe-core-playwright — Deque Systems' accessibility rule engine and its Playwright binding. Slack's 2022-launched integration into the Playwright E2E suite.
systems/buildkite — Slack's CI orchestrator; hosts the nightly a11y regression run (one leg of Slack's tri-mode opt-in execution pattern).
systems/jira — Slack's ticket-tracking tool; receives auto-created tickets from the Slack a11y alert-channel workflow (patterns/alert-channel-to-jira-auto-ticket-workflow).
systems/slack-android — Slack's native Android client; anchors the 2024 VPAT retrospective's mobile-a11y disclosures. Phone-only (no large-form-factor support yet).
systems/slack-kit — Slack's shared mobile component library (SK). Components surfaced on the wiki so far: OutlinedTextField, SKBanner, SKList / SKListAdapter, SK divider, and the SKListAccessibilityDelegate introduced by the VPAT resolution.
systems/talkback — Android's built-in screen reader; the primary assistive-tech target for Slack Android's a11y work.
systems/android-accessibility-framework — Android platform a11y APIs (AccessibilityDelegate, AccessibilityNodeInfoCompat, CollectionInfo, custom-action API) that the Slack Kit fixes plug into.
systems/prometheus-blackbox-exporter — "a cornerstone of our monitoring" at Slack; canonical client-side black-box probing substrate for edge endpoints. Slack intern Sebastian Feliciano open-sourced the http3 module into this upstream project to close the HTTP/3 probing gap, built on systems/quic-go.
systems/quic-go — Go QUIC library Slack's BBE http3 contribution is built on; "wide adoption across other open source technologies, as well as the first-class support it provides in creating http clients in go" was the selection rationale.
systems/prometheus — the TSDB backing Slack's edge-probing metrics pipeline.
systems/grafana — Slack's "single pane of glass" dashboarding layer; canonically unifies HTTP/1.1 + HTTP/2 + HTTP/3 probe metrics side-by-side post-BBE-contribution.
systems/enterprise-grid / systems/unified-grid / systems/slack-rtm — Slack's tenancy substrate before / after / pre-requisite for the 2024 Unified Grid re-architecture.
systems/slack-quip / systems/slack-canvas — the collaborative-document + in-Slack canvas surfaces whose shared Python-backend + TypeScript-frontend build pipeline is the subject of Slack's 2025-11-06 build-system retrospective (60 min → 10 min).
systems/bazel / systems/starlark — the build system and its constrained DSL adopted by the Quip/Canvas team; central to the 60 min → 10 min story. Slack's framing: "Bazel's magic is contingent on the declared graph actually being a DAG of hermetic, idempotent actions" — adopting Bazel alone achieves nothing; the engineering work to meet its preconditions is where the speed-up lives.
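
To make the Bazel precondition concrete, here is a minimal sketch of why a coarse cache key never hits. It is not Bazel's actual key derivation; the input names (`src_digest`, `bundler_version`, `build_number`, `backend_digest`) are hypothetical stand-ins for "the few parameters that always change".

```python
import hashlib
import json


def cache_key(inputs: dict) -> str:
    """Hash a canonical JSON encoding of an action's declared inputs."""
    encoded = json.dumps(inputs, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


# Hypothetical inputs that a frontend bundle action actually reads.
RELEVANT = {"src_digest": "abc123", "bundler_version": "1.4.0"}


def coarse_key(build_number: int) -> str:
    # Coarse key: also folds in values that differ on every build,
    # including an unrelated backend artifact digest.
    volatile = {"build_number": build_number, "backend_digest": f"py-{build_number}"}
    return cache_key({**RELEVANT, **volatile})


def fine_key(build_number: int) -> str:
    # Fine key: depends only on what the action reads, so the build
    # number is deliberately ignored.
    return cache_key(RELEVANT)


# Two consecutive builds with unchanged frontend sources.
coarse_hits = coarse_key(1) == coarse_key(2)
fine_hits = fine_key(1) == fine_key(2)
```

With unchanged frontend sources, `coarse_hits` is `False` and `fine_hits` is `True`: narrowing the key to the inputs an action actually reads is what allows a cache hit at all.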

Key patterns / concepts

patterns/automated-detect-remediate-within-10-minutes — Slack's canonical 10-min-auto / 20-min-manual MTTR pair as deployment-safety North Star; paired with the customer 10-minute-disruption threshold.
patterns/centralised-deployment-orchestration-across-systems — the architectural generalisation of Slack ReleaseBot's Webapp-backend-only automation to Webapp frontend, infra, EC2, Terraform.
patterns/invest-widely-then-double-down-on-impact — the five-axis investment strategy for data-scarce trailing-metric reliability programs; explicit framing that below-expectation projects are "not failures".
concepts/customer-impact-hours-metric — the program-metric choice as "imperfect analog of customer sentiment"; the three-layer sentiment↔program-metric↔project-metric chain.
concepts/change-triggered-incident-rate — 73% of customer-facing incidents were change-triggered at program start; the statistic that justified the program.
concepts/pre-10-percent-fleet-detection-goal — canonical "detect before 10% of fleet" blast-radius-cap goal.
concepts/trailing-metric-patience — 3-6 month lag from project delivery to impact visibility; mid-stream sub-signals discipline.
patterns/split-environment-per-az-for-blast-radius — Slack's 2025-10-23 Chef phase-2 pattern: split one prod Chef environment into six AZ-bucketed environments (prod-1 … prod-6) with boot-time mapping via Poptart Bootstrap. Bounds the scale-out-picks-up-bad-config failure mode.
patterns/release-train-rollout-with-canary — Slack's prod-1 hourly canary + prod-2 → prod-6 strictly sequential release train; canary is parallel to the train, not at its head, so it tests incremental diffs rather than cumulative ones.
patterns/signal-triggered-fleet-config-apply — Slack's Chef Librarian → S3 → Chef Summoner pipeline that replaced fixed-cron Chef runs; canonical first wiki instance of S3 as a config-management signal bus.
patterns/self-update-with-independent-fallback-cron — Slack's solution to the self-update paradox: Chef Summoner updates itself via Chef runs, and a 12-hour fallback cron baked into every AMI triggers a Chef run independently if Summoner has not triggered one in the last 12 hours.
concepts/az-bucketed-environment-split — the concept behind the AZ-bucketed pattern.
concepts/signal-driven-chef-trigger — the concept behind the signal-triggered pattern.
concepts/s3-signal-bucket-as-config-fanout — the concept behind S3 as a config-fanout substrate.
concepts/splay-randomised-run-jitter — thundering-herd mitigation for signal-driven fleet-config runs.
concepts/fallback-cron-for-self-update-safety — the concept behind the fallback-cron pattern.
concepts/cookbook-artifact-versioning — the Chef-ecosystem rollout unit.
patterns/ast-plus-llm-hybrid-conversion — compose an AST pre-pass with an LLM for deterministic code transformation; Slack lifted conversion quality from 40-60% (pure LLM) to ~80% (hybrid).
patterns/in-code-annotation-as-llm-guidance — AST pass writes comments next to hard call sites to shape LLM attention, instead of stuffing everything into the system prompt. Slack: "we successfully minimized hallucinations and nonsensical conversions from the LLM".
concepts/abstract-syntax-tree — canonical role both as conversion primitive and as hallucination-control primitive for LLM pipelines.
concepts/llm-hallucination — Slack's named failure mode for pure-prompt-based code conversion.
concepts/llm-conversion-hallucination-control — the structural problem class this post articulates.
concepts/dom-context-injection-for-llm — capturing per-test-case rendered DOM via Enzyme render-method instrumentation, injecting into the LLM prompt.
patterns/a11y-checks-via-playwright-fixture-extension — canonical integration surface for adding cross-cutting testing concerns (a11y, perf, visual regression) to an existing Playwright suite via the custom-fixture model. Slack's 2022-launched Axe integration is the canonical instance.
patterns/exclusion-list-for-known-issues-and-out-of-scope-rules — two-axis exclusion list (known-issue-ticketed + out-of-scope-by-design) applied as pre-audit filter so automation signal stays high.
patterns/severity-gated-violation-reporting — report only highest-severity bucket at new-system launch; defer lower severities to explicit future work. Canonical anti-alert-fatigue rollout lever.
patterns/tri-mode-opt-in-test-execution — single default-off environment flag composing three execution modes (on-demand local + scheduled nightly Buildkite + opt-in CI gate) for new test classes.
patterns/alert-channel-to-jira-auto-ticket-workflow — violation output terminates in pre-populated Jira ticket rather than free-form alert; the alert channel becomes the ticket-creation UI.
concepts/wcag-2-1-a-aa-scope — WCAG 2.1 A+AA as the conventional scope-picker for automated a11y; expressed in Axe via wcag2a / wcag2aa / wcag21a / wcag21aa tag set.
concepts/automated-vs-manual-testing-complementarity — automated testing is a layer in a broader strategy, not a substitute for manual testing (especially in a11y where screen-reader UX requires human judgment).
concepts/playwright-locator-auto-wait — Playwright's Locator auto-wait guarantees element-level readiness but not whole-page readiness; this is why whole-page audits (a11y, visual regression, perf snapshots) can't be embedded into Locator interaction methods.
concepts/severity-filtered-violation-reporting — the concept behind patterns/severity-gated-violation-reporting; narrow-filter-at-launch + widen-as-calibrated.
patterns/accessibility-delegate-override-for-semantic-fix — the Android pattern of overriding AccessibilityDelegate to correct framework-semantic defaults at the component-library layer; Slack's SKListAccessibilityDelegate fixing decorative-divider list-count is the canonical instance.
patterns/custom-talkback-actions-as-gesture-alternative — every gesture-only interaction (drag-and-drop, swipe-to-dismiss, etc.) gets a custom TalkBack action as its accessibility alternative; Slack's workspace-switcher "Move before" / "Move after" actions paired with Edit-mode six-dot handle is the canonical instance.
patterns/vpat-driven-a11y-triage — the end-to-end VPAT-audit-to-remediation workflow with four resolution buckets (shovel-ready / component-library theme / platform-convention closure / scope-bounded deferral).
concepts/vpat-voluntary-product-accessibility-template — the third-party accessibility audit artifact; procurement-facing, engineering-backlog-input.
concepts/wcag-platform-applicability-gap — WCAG is web-centric; native-platform a11y conventions sometimes legitimately diverge (Slack's top-app-bar-as-heading and strikethrough-announcement closures are the canonical examples).
concepts/redundant-error-signalling — the WCAG 1.4.1 discipline of never-colour-alone; pair colour with icon, text, and screen-reader announcement.
concepts/workspace-scoped-to-org-wide-migration — the Unified Grid re-architecture primitive.
patterns/prototype-the-path — Slack's named methodology for dogfood-driven incremental re-architecture.
patterns/decouple-frontend-build-from-backend-artifacts — Slack Quip/Canvas's ~35-min-per-build win by cutting the Python-backend → TypeScript-frontend cache-key edge.
patterns/delete-inner-parallelization-inside-outer-orchestrator — Slack's frontend-bundler refactor: delete in-process worker pool so Bazel can parallelise at bundle granularity across workers.
patterns/diff-artifact-validator-for-build-refactor — Slack's Rust-based byte-diff harness that served as the correctness oracle during the Bazel migration.
patterns/upstream-fixes-to-community — Slack's HTTP/3/QUIC support into Prometheus Blackbox Exporter canonicalises the upstream-a-whole-new-feature altitude (distinct from the Shopify × Reanimated fix-existing-feature-at-scale altitude).
patterns/upstream-contribution-parallel-to-in-house-integration — Slack's in-house BBE integration running in parallel with the upstream PR so the intern-timeline HTTP/3 rollout wasn't gated on maintainer merge.
concepts/cache-granularity — the "100 parameters, 2-3 always change" failure mode of coarse cache keys that Slack canonicalised at build-system altitude.
concepts/idempotent-build-action — the precondition Slack's pre-refactor build violated (build steps mutated the working directory).
concepts/separation-of-concerns — the classical principle Slack applied to build code, release code, and setup code — not just application code.
concepts/layering-violation — the structural diagnosis for Slack's old frontend bundler (business logic fused with orchestration + parallelization).
concepts/http-3-probing-gap — the first-order observability failure Slack's HTTP/3 edge rollout surfaced: TCP-shaped black-box probers cannot see QUIC/UDP traffic. Canonical wiki instance.
concepts/client-side-black-box-probe — the monitoring primitive the HTTP/3 probing gap applies to; Slack's BBE fleet is the canonical implementation substrate.
concepts/observability-before-migration — Slack's explicit "monitor first, migrate second" takeaway; new wiki concept that generalises the discipline across transport / protocol / platform migrations.
patterns/director-expert-critic-investigation-loop — Slack Spear's three-persona agent team for security-investigation triage; Director plans / Experts produce findings / Critic audits + condenses. Canonical first wiki instance.
patterns/one-model-invocation-per-task — Slack's decomposition of the 300-word single-prompt prototype into per-task invocations with per-task structured-output schemas.
patterns/hub-worker-dashboard-agent-service — Slack's three-component productisation shape for Spear, replacing the coding-agent-CLI-as-harness prototype.
patterns/phase-gated-investigation-progression — Slack's explicit phase-gated loop (discovery → trace → conclude + Director-Decision meta-phase); each phase has its own model parameters + token budget.
concepts/knowledge-pyramid-model-tiering — Slack's explicit three-tier cost/capability gradient: cheap Experts / mid-tier Critic / top-tier Director. Canonical first wiki instance.
concepts/investigation-phase-progression — the concept behind the phase-gated pattern; phase as application state, not prompt state.
concepts/weakly-adversarial-critic — Slack's named stance for the Critic/Expert relationship. Catches hallucination without degenerating into paranoia.
concepts/prompt-is-not-control — Slack's verbatim architectural lesson: "prompts are just guidelines; they're not an effective method for achieving fine-grained control."
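
Among the patterns listed above, the two-axis exclusion list and severity-gated reporting compose naturally into one filter. The sketch below is a simplification under stated assumptions: Slack applies its exclusion list as a pre-audit filter, whereas this version filters results after the audit; the violation shape is a flattened stand-in for Axe output; and every rule id, impact, and selector in the sample data is made up for illustration.

```python
# Axis 1: known issues that already have a ticket, keyed by (rule id, selector).
KNOWN_ISSUES = {("button-name", "#legacy-banner button")}
# Axis 2: rules that are out of scope by design.
OUT_OF_SCOPE_RULES = {"region"}
# Severity gate: at launch, report only the highest-severity bucket
# (assumed here to be "critical").
REPORTED_IMPACTS = {"critical"}


def reportable(violations: list[dict]) -> list[dict]:
    """Drop out-of-scope rules, gated-out severities, and ticketed targets."""
    kept = []
    for violation in violations:
        if violation["id"] in OUT_OF_SCOPE_RULES:
            continue
        if violation["impact"] not in REPORTED_IMPACTS:
            continue
        targets = [
            target
            for target in violation["targets"]
            if (violation["id"], target) not in KNOWN_ISSUES
        ]
        if targets:
            kept.append({**violation, "targets": targets})
    return kept


# Hypothetical sample data.
sample = [
    {"id": "button-name", "impact": "critical",
     "targets": ["#legacy-banner button", "#composer button"]},
    {"id": "region", "impact": "critical", "targets": ["main"]},
    {"id": "image-alt", "impact": "moderate", "targets": ["img.avatar"]},
]
report = reportable(sample)
```

Here only the un-ticketed `#composer button` target survives. Widening `REPORTED_IMPACTS` later corresponds to the "widen as calibrated" step in concepts/severity-filtered-violation-reporting.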

Recent articles

2026-04-13 — sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications (Slack Security Engineering team, part 2 of the series opened by the 2025-12-01 Spear post. Canonicalises how Spear manages context across long-running multi-agent investigations that can "span hundreds of inference requests and generate megabytes of output". Core architectural claim: three complementary context channels replace raw message history — Director's Journal (typed working memory: decision / observation / finding / question / action / hypothesis + phase + round + timestamp; read by all agents, written only by Director), Critic's Review with the four-tool introspection suite (get_tool_call, get_tool_result, get_toolset_info, list_toolsets) and a 5-level credibility rubric with a disclosed score distribution across 170,000 findings (37.7% Trustworthy / 25.4% Highly-plausible / 11.1% Plausible / 10.4% Speculative / 15.4% Misguided — 25.8% sub-plausibility rate), and the Critic's Timeline that implements narrative-coherence assembly with four explicit consolidation rules, a second 5-level rubric, and top-3 gap identification across three gap types (evidential / temporal / logical). Canonical load-bearing claim on narrative coherence as hallucination filter: "A hallucination can only survive this process if it is more coherent with the body of evidence than any real observation it competes with." Also: no message history carry-forward between invocations — "Besides these resources, we do not pass any message history forward between agent invocations" — justified both by token-budget and by cognitive-load arguments (over-sharing "stifles creativity and encourages confirmation bias" even with infinite context). Specimen investigation disclosed: 6,046-event false-positive kernel-module-loading alert with 0.83 Timeline confidence and 3 identified gaps.)
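
As a quick arithmetic check on the rubric figures quoted above: the five shares sum to 100%, and the 25.8% sub-plausibility rate is the Speculative and Misguided shares combined. The approximate count assumes the 170,000 figure is exact, which the post may not intend.

```python
from decimal import Decimal

# Credibility-rubric shares as quoted in the entry above (percent of findings).
RUBRIC_SHARES = {
    "Trustworthy": Decimal("37.7"),
    "Highly-plausible": Decimal("25.4"),
    "Plausible": Decimal("11.1"),
    "Speculative": Decimal("10.4"),
    "Misguided": Decimal("15.4"),
}

total_share = sum(RUBRIC_SHARES.values())

# "Sub-plausibility" here means the two buckets below Plausible.
sub_plausible_share = RUBRIC_SHARES["Speculative"] + RUBRIC_SHARES["Misguided"]

# Rough finding count, assuming exactly 170,000 findings in total.
TOTAL_FINDINGS = 170_000
sub_plausible_count = TOTAL_FINDINGS * sub_plausible_share / 100
```

Under that assumption, roughly 43,860 of the findings fall below the Plausible level.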
2025-12-01 — sources/2025-12-01-slack-streamlining-security-investigations-with-agents (Slack Security Engineering team retrospective on building Spear, their internal multi-agent security-investigation service that triages detection-system alerts during on-call shifts. First post in a promised series. Load-bearing architectural lesson verbatim: "prompts are just guidelines; they're not an effective method for achieving fine-grained control" — canonical statement of the new concepts/prompt-is-not-control concept after Slack's 300-word single-prompt prototype produced "wildly variable" quality. Prototype rewrite moved control out of the prompt into per-task model invocations with task-specific structured-output schemas (patterns/one-model-invocation-per-task) orchestrated by application code. Three-persona agent team (patterns/director-expert-critic-investigation-loop): Director (plans + progresses phases + writes final report; uses a journaling tool), 4 Experts (Access / Cloud / Code / Threat — each with unique tools + data sources), Critic ("meta-expert" auditing Experts against a rubric, scoring findings, condensing into a timeline). Three phases (patterns/phase-gated-investigation-progression): Discovery (broadcast to all Experts) → Trace (question one Expert, may vary model parameters) → Conclude (final report), with a Director-Decision meta-phase for transition decisions. Knowledge pyramid model tiering (concepts/knowledge-pyramid-model-tiering): verbatim "low, medium, and high-cost models for the expert, critic, and director functions, respectively." Tool-call-heavy Expert work runs on cheap models; rubric-application + condensation runs on mid-tier; strategic decisions on top-tier. 
Service architecture (patterns/hub-worker-dashboard-agent-service): Hub (API + persistent storage + metrics endpoint) + Worker (queue consumer, event-stream emitter, scalable) + Dashboard (real-time observe + replay + per-invocation debugging) replacing the prototype's coding-agent-CLI harness. Prototype exposed data sources via an MCP stdio server; production Worker's MCP persistence not explicitly disclosed. Critic's weakly-adversarial stance (concepts/weakly-adversarial-critic) verbatim: "The weakly adversarial relationship between the Critic and the expert group helps to mitigate against hallucinations and variability in the interpretation of evidence." Canonical worked example: the Critic caught a credential exposure the Expert missed during a process-ancestry review, the Director then pivoted the investigation to focus on the credential issue, final report surfaced both the security finding and the Expert's "analysis blind spots that require attention." Verbatim: "What is notable about this result is that the expert did not raise the credential exposure in its findings; the Critic noticed it as part of its meta-analysis of the expert's work." On-call shift mode-shift: "we're switching to supervising investigation teams, rather than doing the laborious work of gathering evidence." 9 canonical wiki primitives: source + 1 system (systems/slack-spear) + 4 concepts (concepts/knowledge-pyramid-model-tiering, concepts/investigation-phase-progression, concepts/weakly-adversarial-critic, concepts/prompt-is-not-control) + 4 patterns (patterns/director-expert-critic-investigation-loop, patterns/one-model-invocation-per-task, patterns/hub-worker-dashboard-agent-service, patterns/phase-gated-investigation-progression). 
Extends 5 pages: systems/model-context-protocol (new Seen-in — first wiki MCP instance inside an internal security-investigation pipeline), concepts/structured-output-reliability (new Seen-in — structured output as multi-agent orchestration-boundary contract), patterns/specialized-agent-decomposition (new Seen-in — canonical security-operations instance at the peer-Expert layer, with supra-agent Director + meta-agent Critic on top), patterns/multi-round-critic-quality-gate (new Seen-in — live-investigation altitude variant, distinguished from Meta's artifact-production rounds shape), patterns/drafter-evaluator-refinement-loop (new Seen-in — investigation-loop-with-third-layer variant, distinguished from Lyft's retry-only shape by the Director who decides progress/pivot/conclude). Scope disposition: Tier-2 on-scope decisively on multi-agent-architecture-canonicalisation grounds. Opens the Slack security-engineering axis on the wiki; first Slack security-operations ingest; eighth Slack coverage axis after developer-productivity, deploy-safety, test-framework-integration, mobile-a11y, fleet-config-management, build-systems, and edge-networking. Zero production numbers disclosed (no throughput / latency / cost / FP rate / token-usage); model families not disclosed; Critic's rubric opaque; Spear name inferred from image slugs not stated in post body. URL verbatim: https://slack.engineering/streamlining-security-investigations-with-agents/. Sibling to Cloudflare AI Code Review (patterns/coordinator-sub-reviewer-orchestration) at the code-review altitude; sibling to Redpanda Openclaw (patterns/four-component-agent-production-stack) at the enterprise-agent-substrate altitude; sibling to Lyft localization (patterns/drafter-evaluator-refinement-loop) at the structured-retry-loop altitude.)
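
The "phase as application state, not prompt state" idea from this entry can be sketched as a small state machine in which the Director only proposes the next phase and application code decides whether the move is legal. The transition table is an assumption for illustration; the post names the phases and the Director-Decision meta-phase but does not publish the allowed transitions.

```python
from enum import Enum


class Phase(Enum):
    DISCOVERY = "discovery"
    TRACE = "trace"
    CONCLUDE = "conclude"


# Assumed transition table (not from the post): trace may repeat,
# and conclude is terminal.
ALLOWED = {
    Phase.DISCOVERY: {Phase.TRACE, Phase.CONCLUDE},
    Phase.TRACE: {Phase.TRACE, Phase.CONCLUDE},
    Phase.CONCLUDE: set(),
}


def advance(current: Phase, requested: Phase) -> Phase:
    """Apply a Director decision, rejecting transitions the table forbids."""
    if requested not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {requested.value}")
    return requested


def run(decisions: list[Phase]) -> list[Phase]:
    """Replay a sequence of Director decisions, returning the phase history."""
    phase = Phase.DISCOVERY
    history = [phase]
    for requested in decisions:
        phase = advance(phase, requested)
        history.append(phase)
        if phase is Phase.CONCLUDE:
            break
    return history
```

Because the table lives in code, a model response that asks for an impossible transition fails loudly instead of silently steering the investigation.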
2026-03-31 — sources/2026-03-31-slack-from-custom-to-open-scalable-network-probing-and-http3-readiness (Slack edge-networking team retrospective on rolling out HTTP/3 on the public edge and closing the HTTP/3 probing gap first. Existing SaaS observability vendors had zero native HTTP/3 probe support; Slack's own Prometheus Blackbox Exporter fleet ("a cornerstone of our monitoring") was TCP-shaped and couldn't speak QUIC/UDP. Intern Sebastian Feliciano scoped, implemented, and open-sourced an http3 module into BBE upstream built on systems/quic-go — selected for "wide adoption across other open source technologies, as well as the first-class support it provides in creating http clients in go." Integration snippet: http3Transport := &http3.Transport{TLSClientConfig: tlsConfig, QUICConfig: &quic.Config{}} wrapped in http.Client. Architectural discipline that earned the upstream merge: "had to add this new logic while following the Blackbox Exporter's existing architecture, ensuring the new features maintained the tool's configuration patterns." Because internship timelines ≠ OSS merge timelines, Sebastian "took matters into his own hands and architected an in-house system that utilized the new upstream features for probing out HTTP/3 endpoints" — canonical instance of patterns/upstream-contribution-parallel-to-in-house-integration. Final payoff: HTTP/1.1 + HTTP/2 + HTTP/3 metrics unified in Grafana ("single pane of glass") + reliable HTTP/3 alerts + easier telemetry correlation. New canonical pages (5): 1 system (systems/prometheus-blackbox-exporter), 3 concepts (concepts/client-side-black-box-probe, concepts/http-3-probing-gap, concepts/observability-before-migration), 1 pattern (patterns/upstream-contribution-parallel-to-in-house-integration). Extended: systems/quic-go (second wiki instance at upstream-tooling altitude, distinct from the PlanetScale HTTP/3 driver benchmark instance), systems/prometheus (BBE axis added to the Airbnb-observability-ingestion-dominated page), systems/grafana (single-pane-of-glass unified multi-HTTP-version view), concepts/http-3 (probing-gap framing added alongside the existing Cloudflare latency framing), concepts/observability (observability-as-migration-gate altitude added), patterns/upstream-fixes-to-community (upstream-a-whole-new-feature altitude added alongside the existing fix-at-scale altitude). Scope takeaways verbatim: "Monitor first, and migrate second. ... getting observability right as a precursor to migration makes everything faster"; "Contributing open source pays dividends"; "Bet on your interns." Opens the Slack edge-networking-and-synthetic-monitoring axis on the wiki; seventh Slack coverage axis (after developer-productivity, deploy-safety, test-framework-integration, mobile-a11y, fleet-config-management, and build-systems). Tier-2 on-scope decisively: real engineering retrospective with architecture diagram, code snippet, upstream-PR reference, specific library selection rationale, and explicit migration-gate framing. Operational numbers thin (article is about monitoring, not HTTP/3 edge perf): hundreds of thousands of HTTP/3 endpoints to probe; zero SaaS vendors supporting HTTP/3 out-of-box at investigation time; code merged to BBE at pinned commit bee8e9102a106bff63281ee9c64c7b1275ef21d0. URL verbatim: https://slack.engineering/from-custom-to-open-scalable-network-probing-and-http-3-readiness-with-prometheus/.)
2025-11-06 — sources/2025-11-06-slack-build-better-software-to-build-software-better (Slack Quip/Canvas team retrospective on taking their build from 60 minutes to as low as 10 minutes (~6× speed-up) by adopting Bazel + Starlark and doing the unglamorous engineering work to benefit from them. Load-bearing insight: "Bazel's magic is contingent on the declared graph actually being a DAG of hermetic, idempotent actions" — the pre-existing build had cycles, non-hermetic action nodes, and cache keys with hundreds of parameters, giving a zero cache hit rate that no build tool could fix. Two concrete wins: severing the Python↔TypeScript dependency edge (saved ~35 min/build on its own; canonical patterns/decouple-frontend-build-from-backend-artifacts) and deleting in-process parallelization inside the frontend bundler so Bazel could schedule across bundles (canonical patterns/delete-inner-parallelization-inside-outer-orchestrator). Correctness oracle during refactor was a Rust byte-diff tool comparing old- and new-build artifacts (canonical patterns/diff-artifact-validator-for-build-refactor). Key new concepts: concepts/cache-granularity ("100 parameters, 2-3 always change" failure mode), concepts/idempotent-build-action (pre-refactor build mutated the working directory), concepts/layering-violation (frontend bundler fused business logic + orchestration + parallelization), concepts/separation-of-concerns applied across backend/frontend, Python/TypeScript, and application/build-code axes. Extends systems/bazel, concepts/hermetic-build, concepts/build-graph, concepts/cache-hit-rate with the zero-hit-rate structural-failure story. Operational numbers: 60 min → 10 min (best-case, cached+parallelised) / 12 min (average) / 30 min (cache miss). Opens the Slack build-system-engineering axis on the wiki; first Slack build-tooling ingest.)
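The cache-granularity failure mode in the entry above ("100 parameters, 2-3 always change") can be shown with a minimal sketch. It is not Bazel's key derivation or Slack's build code. CacheKey, Declared, BuildParams, and the parameter names are hypothetical. The sketch only shows that hashing every parameter, including ones that change on each build, can never produce a hit, while hashing only the inputs an action declares can.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// CacheKey hashes a set of named inputs in sorted order, so equal input sets
// always produce the same key regardless of map iteration order.
func CacheKey(inputs map[string]string) string {
	names := make([]string, 0, len(inputs))
	for name := range inputs {
		names = append(names, name)
	}
	sort.Strings(names)
	h := sha256.New()
	for _, name := range names {
		fmt.Fprintf(h, "%q=%q\n", name, inputs[name])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// Declared keeps only the parameters an action says it actually reads.
func Declared(all map[string]string, names ...string) map[string]string {
	out := make(map[string]string, len(names))
	for _, name := range names {
		if v, ok := all[name]; ok {
			out[name] = v
		}
	}
	return out
}

// BuildParams returns a hypothetical parameter set for one build. The source
// hash and toolchain version are stable; the build ID and start time change
// on every run.
func BuildParams(buildID, startedAt string) map[string]string {
	return map[string]string{
		"src_hash":     "abc123",
		"node_version": "20",
		"build_id":     buildID,
		"started_at":   startedAt,
	}
}

func main() {
	a := BuildParams("1001", "09:00")
	b := BuildParams("1002", "09:30")
	fmt.Println("coarse keys equal:", CacheKey(a) == CacheKey(b))
	fmt.Println("narrow keys equal:",
		CacheKey(Declared(a, "src_hash", "node_version")) ==
			CacheKey(Declared(b, "src_hash", "node_version")))
}
```

The coarse keys differ between the two builds even though nothing relevant changed, while the narrow keys match, which is the difference between a zero hit rate and a usable cache.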
2025-11-19 — sources/2025-11-19-slack-android-vpat-journey (Slack Android team retrospective on triaging a 2024 third-party VPAT audit conducted after the IA4 UI redesign; 8 recurring accessibility themes identified (7 resolved, 1 deferred as future work). Representative-of-all-four-buckets worked example of patterns/vpat-driven-a11y-triage: shovel-ready fixes assigned immediately; recurring themes resolved at Slack Kit component-library layer (OutlinedTextField error-announcement, SKBanner error auto-announce, SKListAccessibilityDelegate overrides CollectionInfo to exclude decorative SK divider items from TalkBack's "N items in a list" count — canonical instance of patterns/accessibility-delegate-override-for-semantic-fix; workspace-switcher drag-and-drop via Edit-mode + six-dot handle + TalkBack custom actions "Move before" / "Move after" invoked by three-finger tap or L/r drawing gestures — canonical instance of patterns/custom-talkback-actions-as-gesture-alternative); platform-convention closures for top-app-bar-as-heading (WCAG 1.3.1; verified via native Google apps) and strikethrough-announcement (WCAG 1.3.1; verified via blind-community consultation) — canonical instances of concepts/wcag-platform-applicability-gap; error-icon redundancy resolving the colour-alone P3 theme — concepts/redundant-error-signalling (WCAG 1.4.1); scope-bounded deferral of WCAG 2.1.1 keyboard-nav because Slack Android has no tablet support. New canonical pages: source + 4 systems (systems/slack-android, systems/slack-kit, systems/talkback, systems/android-accessibility-framework) + 3 concepts (concepts/vpat-voluntary-product-accessibility-template, concepts/wcag-platform-applicability-gap, concepts/redundant-error-signalling) + 3 patterns (patterns/accessibility-delegate-override-for-semantic-fix, patterns/custom-talkback-actions-as-gesture-alternative, patterns/vpat-driven-a11y-triage).
Extended: concepts/automated-vs-manual-testing-complementarity (new Seen-in: this post is the manual / third-party / periodic layer complementing the 2025-01-07 automated Axe-in-Playwright ingest). Opens the fourth axis of Slack coverage on the wiki: mobile accessibility engineering. Tier-2 borderline include on mobile-a11y-pattern-canonicalisation grounds; user explicit full-ingest override of prior batch-skip; same disposition as the 2025-01-07 and 2024-06-19 Slack developer-productivity ingests. No incident / latency / scale disclosures; pure engineering-process retrospective. URL verbatim: https://slack.engineering/android-vpat-journey/.)
2025-10-23 — sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption (Archie Gunasekara's follow-up to Slack's 2024 Advancing Our Chef Infrastructure post describes phase two of Slack's EC2 / Chef deploy-safety work: instead of migrating to Chef Policyfiles (which would have required every service team to rewrite roles, environments, and cookbooks), Slack extended the existing EC2 framework in two load-bearing changes. (1) Splitting one production Chef environment into six AZ-bucketed environments (prod-1 … prod-6) rolled out via a release train with prod-1 as canary: prod-1 receives new versions every hour (hot canary); prod-2 → prod-6 advance via release train, with the next version gated on the previous version completing the train. Why the canary is parallel rather than head-of-train: "updating prod-1 frequently with the latest version allows us to detect issues closer to when they were introduced" rather than testing cumulative-change artifacts at the canary. Boot-time mapping from AZ to environment via Poptart Bootstrap (Slack's cloud-init-phase AMI tool) ensures newly provisioned nodes inherit the AZ-bucket boundary from instance 0 — the explicit fix for the scale-out-picks-up-bad-config failure mode that per-node cron staggering didn't address. (2) Replacing cron-driven Chef runs with a signal-driven pull model via a new service called Chef Summoner that watches an S3 bucket populated by the existing Chef Librarian artifact-promotion service at chef-run-triggers/<stack>/<env>. Signal payload carries Splay (randomised jitter), Timestamp, and a full ManifestRecord (artifact version, cookbook-versions map, S3 artifact pointer, upload_complete ordering flag). Summoner deduplicates against local state (last-run-time + artifact-version), applies Splay, and triggers chef-client.
Plus a fallback cron baked into every AMI that independently triggers chef-client if Summoner hasn't run Chef in the last 12 hours — the recovery path for broken-Summoner deployments. Also enforces the 12-hour compliance SLA. Closes by marking the legacy EC2 platform feature-complete + maintenance-mode and previews a brand-new EC2 successor called Shipyard (service-level deployments + metric-driven rollouts + automated rollbacks) for teams that can't yet move to Bedrock. Canonical wiki primitives: 6 new systems (systems/chef + systems/chef-policyfiles + systems/chef-librarian + systems/chef-summoner + systems/poptart-bootstrap + systems/slack-shipyard), 6 new concepts (concepts/az-bucketed-environment-split + concepts/splay-randomised-run-jitter + concepts/signal-driven-chef-trigger + concepts/s3-signal-bucket-as-config-fanout + concepts/fallback-cron-for-self-update-safety + concepts/cookbook-artifact-versioning), 4 new patterns (patterns/split-environment-per-az-for-blast-radius + patterns/release-train-rollout-with-canary + patterns/signal-triggered-fleet-config-apply + patterns/self-update-with-independent-fallback-cron). Extends concepts/blast-radius with the fleet-configuration substrate altitude and systems/aws-s3 with the S3-as-config-fanout-bus altitude (previously canonicalised as object-store + CDC-log-store + tiered-cold-tier). Natural companion to the 2025-10-07 Deploy Safety retrospective — one level below the program-altitude framing, drilling into the EC2 / Chef substrate specifically. Tier-2 on-scope decisively: real distributed-systems internals, scaling-trade-off rationale, concrete operational disclosures (6 production environments / 12-hour compliance SLA / hourly promotions). Opens the Slack fleet-configuration-management axis.)
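The signal-handling flow described for Chef Summoner (deduplicate against local state, apply Splay, fall back to cron after 12 hours) can be sketched as below. This is a minimal illustration under stated assumptions, not Slack's implementation. The struct layouts, the exact deduplication rule, the choice to skip signals whose upload_complete flag is false, and the names ShouldRun, SplayDelay, FallbackDue, and Epoch are all hypothetical. Only the field names Splay, Timestamp, and ManifestRecord and the 12-hour window come from the entry.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// FallbackAfter is the 12-hour window from the entry above.
const FallbackAfter = 12 * time.Hour

// Epoch is an arbitrary reference time used only by the examples.
var Epoch = time.Date(2025, 10, 23, 0, 0, 0, 0, time.UTC)

// ManifestRecord mirrors the fields listed in the entry; the layout is assumed.
type ManifestRecord struct {
	ArtifactVersion  string
	CookbookVersions map[string]string
	S3ArtifactPath   string
	UploadComplete   bool
}

// Signal is the trigger payload a node reads from the signal bucket.
type Signal struct {
	Splay     time.Duration
	Timestamp time.Time
	Manifest  ManifestRecord
}

// LocalState is what the node remembers about its last Chef run.
type LocalState struct {
	LastRun         time.Time
	ArtifactVersion string
}

// ShouldRun applies an assumed dedup rule: ignore incomplete uploads, and
// ignore a signal for the version already applied unless it is newer than
// the last run.
func ShouldRun(sig Signal, st LocalState) bool {
	if !sig.Manifest.UploadComplete {
		return false
	}
	sameVersion := sig.Manifest.ArtifactVersion == st.ArtifactVersion
	alreadyHandled := !sig.Timestamp.After(st.LastRun)
	return !(sameVersion && alreadyHandled)
}

// SplayDelay picks a pseudo-random delay in [0, splay) so nodes do not all
// run at once. The seed parameter keeps the sketch deterministic.
func SplayDelay(splay time.Duration, seed int64) time.Duration {
	if splay <= 0 {
		return 0
	}
	return time.Duration(rand.New(rand.NewSource(seed)).Int63n(int64(splay)))
}

// FallbackDue reports whether the independent cron path should trigger a run.
func FallbackDue(now time.Time, st LocalState) bool {
	return now.Sub(st.LastRun) >= FallbackAfter
}

func main() {
	st := LocalState{LastRun: Epoch, ArtifactVersion: "v1"}
	sig := Signal{
		Splay:     5 * time.Minute,
		Timestamp: Epoch.Add(time.Hour),
		Manifest:  ManifestRecord{ArtifactVersion: "v2", UploadComplete: true},
	}
	fmt.Println("run:", ShouldRun(sig, st), "after up to", sig.Splay)
	fmt.Println("fallback due:", FallbackDue(Epoch.Add(13*time.Hour), st))
}
```

The fallback check deliberately depends only on the last-run time and not on any signal, which reflects the entry's point that the cron path must keep working when the signal-driven service itself is broken.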
2025-10-07 — sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change (Retrospective on Slack's 18-month Deploy Safety Program — 90% reduction in customer impact hours from change-triggered incidents (Feb-Apr 2024 peak → Jan 2025). Load-bearing framing statistic: 73% of customer-facing incidents were Slack-induced-change-triggered, particularly code deploys. Three North Star goals across all deployment systems for highest-importance services: 10-min automated MTTR / 20-min manual MTTR / detect before 10% fleet exposure. Canonical program metric: "Hours of customer impact from high-severity and selected medium-severity change-triggered incidents" — explicitly framed as "imperfect analog of customer sentiment" sitting in a three-layer chain Customer sentiment <-> Program Metric <-> Project Metric. Five-axis investment strategy: invest widely + bias for action / known-pain first / invest further based on results / curtail least impactful / flexible roadmap — explicit framing that below-expectation projects are "not failures" but "critical input." Phase-change architectural shift: "Once automatic rollbacks were introduced we observed dramatic improvement in results." Composed Webapp-backend investment sequence (Q1 metric monitoring → Q2 manual rollback via automatic alerts → Q3-Q4 automatic rollback → Q4+ ≤10 min customer impact → pattern copied to Webapp frontend → centralised deployment orchestration system inspired by ReleaseBot + AWS Pipelines). Trailing-metric-patience discipline with 3-6 month delivery-to-impact lag; mid-stream sub-signals. Tool fluency discipline — "Use the tooling often, not just for the infrequent worst case scenarios." Direct-per-team-outreach discipline — "Not all teams and systems are the same." Canonical wiki primitives: (1) systems/slack-deploy-safety-program (new) — the program as a named wiki system.
(2) systems/slack-releasebot + systems/slack-bedrock (new) — the named ReleaseBot inspiration + Bedrock substrate as stub pages. (3) concepts/change-triggered-incident-rate (new) — the justifying statistic. (4) concepts/customer-impact-hours-metric (new) — the program-metric-as-sentiment-analog choice. (5) concepts/pre-10-percent-fleet-detection-goal (new) — blast-radius-cap at fleet-level. (6) concepts/trailing-metric-patience (new) — the patience discipline. (7) patterns/automated-detect-remediate-within-10-minutes (new) — the 10-min/20-min MTTR pair. (8) patterns/centralised-deployment-orchestration-across-systems (new) — the multi-substrate deploy-orchestrator pattern. (9) patterns/invest-widely-then-double-down-on-impact (new) — the five-axis strategy. Extends patterns/fast-rollback with the fully-automated altitude variant, concepts/feedback-control-loop-for-rollouts with the organisational-altitude instance, concepts/blast-radius with the fleet-percentage rollout-gate altitude, concepts/dora-metrics with the "maintain development velocity" co-equal North Star, concepts/observability with the Q1-first-investment framing. Opens the Slack reliability-engineering axis on the wiki; first Slack production-reliability ingest.)
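The three North Star goals and the program metric from the Deploy Safety entry above can be written down as a small worked example. This is only an encoding of the stated thresholds, not Slack's tooling. The Incident type, MeetsGoals, ImpactHours, and the choice to treat the MTTR goals as inclusive and the 10% exposure goal as strict are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Thresholds taken from the entry: 10-min automated MTTR, 20-min manual MTTR,
// and detection before 10% of the fleet is exposed.
const (
	AutoMTTRGoal     = 10 * time.Minute
	ManualMTTRGoal   = 20 * time.Minute
	MaxFleetExposure = 0.10
)

// Incident is a hypothetical record of one change-triggered incident.
type Incident struct {
	Automated           bool          // remediated automatically?
	ExposureAtDetection float64       // fraction of fleet on the bad change
	TimeToRemediate     time.Duration // detection to remediation
	CustomerImpact      time.Duration // duration customers were affected
}

// MeetsGoals checks one incident against the exposure goal and the MTTR goal
// that applies to its remediation mode.
func MeetsGoals(in Incident) bool {
	goal := ManualMTTRGoal
	if in.Automated {
		goal = AutoMTTRGoal
	}
	return in.ExposureAtDetection < MaxFleetExposure && in.TimeToRemediate <= goal
}

// ImpactHours sums customer impact across incidents, in hours.
func ImpactHours(incidents []Incident) float64 {
	var total time.Duration
	for _, in := range incidents {
		total += in.CustomerImpact
	}
	return total.Hours()
}

func main() {
	incidents := []Incident{
		{Automated: true, ExposureAtDetection: 0.04, TimeToRemediate: 8 * time.Minute, CustomerImpact: 8 * time.Minute},
		{Automated: false, ExposureAtDetection: 0.25, TimeToRemediate: 45 * time.Minute, CustomerImpact: 52 * time.Minute},
	}
	for i, in := range incidents {
		fmt.Printf("incident %d meets goals: %t\n", i+1, MeetsGoals(in))
	}
	fmt.Printf("customer impact hours: %.2f\n", ImpactHours(incidents))
}
```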
2025-01-07 — sources/2025-01-07-slack-automated-accessibility-testing-at-slack (Slack Frontend Test Frameworks team retrospective on integrating Axe Core into Slack's pre-existing Playwright E2E framework as a custom-fixture extension. Two failed integration attempts first: baking Axe into RTL's render (blocked by Slack's customised Jest setup) and baking Axe into Playwright's Locator interaction methods (blocked by Locator auto-wait semantics). Landing architecture: slack.utils.a11y.runAxeAndSaveViolations() on the pre-existing custom slack fixture, invoked explicitly at page-ready. Canonicalises five reusable patterns: fixture-extension as integration surface for cross-cutting testing concerns, two-axis exclusion list (ticketed-known-issue + out-of-scope-by-design), severity-gated reporting (critical only at launch, serious / moderate / mild as future work), tri-mode opt-in test execution (A11Y_ENABLE flag composes on-demand local + scheduled nightly Buildkite + opt-in CI gate), alert-channel-to-Jira auto-ticket workflow (Slack alert channel spins up pre-populated Jira ticket with canonical label + Epic placement). Four new concepts: concepts/wcag-2-1-a-aa-scope, concepts/automated-vs-manual-testing-complementarity, concepts/playwright-locator-auto-wait, concepts/severity-filtered-violation-reporting. Three new systems: systems/axe-core, systems/axe-core-playwright, systems/jira. 91 tests in initial suite, non-blocking, WCAG 2.1 A+AA scope, critical impact only. Borderline Tier-2 ingest — developer-productivity rather than distributed-systems internals; same disposition as the 2024-06-19 Enzyme→RTL codemod ingest; the canonicalised patterns generalise to any automated-check-integration-into-existing-test-suite problem.)
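The two-axis exclusion list and severity-gated reporting named in the entry above reduce to a filtering step, sketched here in Go for consistency with the other examples on this page (Slack's actual integration lives in a TypeScript Playwright fixture). Everything in the sketch is an assumption for illustration: the Violation and Exclusions types, matching known issues by rule ID and out-of-scope areas by selector prefix, the ticket reference, and the Reportable helper are hypothetical and not Slack's or axe-core's API.

```go
package main

import (
	"fmt"
	"strings"
)

// Violation is a simplified stand-in for one accessibility finding.
type Violation struct {
	RuleID   string // rule identifier, e.g. "button-name"
	Impact   string // severity label, e.g. "critical"
	Selector string // where on the page the finding occurred
}

// Exclusions models the two axes: findings already ticketed as known issues,
// and page areas that are out of scope by design.
type Exclusions struct {
	KnownIssueRules     map[string]string // rule ID -> ticket reference
	OutOfScopeSelectors []string          // selector prefixes to ignore
}

// excluded reports whether a violation falls on either exclusion axis.
func (e Exclusions) excluded(v Violation) bool {
	if _, ok := e.KnownIssueRules[v.RuleID]; ok {
		return true
	}
	for _, prefix := range e.OutOfScopeSelectors {
		if strings.HasPrefix(v.Selector, prefix) {
			return true
		}
	}
	return false
}

// Reportable keeps violations whose impact is in the allowed set and that are
// not excluded. Passing only "critical" models the launch configuration.
func Reportable(vs []Violation, e Exclusions, impacts ...string) []Violation {
	allowed := make(map[string]bool, len(impacts))
	for _, impact := range impacts {
		allowed[impact] = true
	}
	var out []Violation
	for _, v := range vs {
		if allowed[v.Impact] && !e.excluded(v) {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	found := []Violation{
		{RuleID: "button-name", Impact: "critical", Selector: "#composer .send"},
		{RuleID: "image-alt", Impact: "critical", Selector: "#sidebar img"},
		{RuleID: "color-contrast", Impact: "serious", Selector: "#header"},
		{RuleID: "aria-required-attr", Impact: "critical", Selector: "#legacy-embed div"},
	}
	ex := Exclusions{
		KnownIssueRules:     map[string]string{"image-alt": "A11Y-123"},
		OutOfScopeSelectors: []string{"#legacy-embed"},
	}
	for _, v := range Reportable(found, ex, "critical") {
		fmt.Println("report:", v.RuleID, "at", v.Selector)
	}
}
```

Widening the gate later is a matter of passing more impact labels to Reportable, which matches the entry's framing of the lower severities as future work.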
2024-08-26 — sources/2024-08-26-slack-unified-grid-how-we-re-architected-slack-for-our-largest-customers (Slack's 2024-08-26 retrospective on the Unified Grid project — a decade-after-launch re-architecture of the Slack client and backend's fundamental tenant-scoping assumption, shifting from workspace-centric to org-wide.)
2024-06-19 — sources/2024-06-19-slack-ai-powered-conversion-from-enzyme-to-react-testing-library (Retrospective on migrating 15,000+ Enzyme tests to React Testing Library using an AST + LLM hybrid pipeline. AST-only hit ~45% ceiling; LLM-only (Claude 2.1) was 40-60% with wild variance; hybrid reached ~80% on selected evaluation files. Per-test-case DOM captured by instrumenting Enzyme mount/shallow. In-code annotation comments from AST pass shaped LLM output. Tool later open-sourced as @slack/enzyme-to-rtl-codemod. ~64% adoption across Slack's RTL migration; 338-file CI-nightly run produced ~500 auto-passing test cases (~22% documented developer-time savings, lower-bound).)