Slack
Slack Engineering blog (slack.engineering). Tier-2 source on the sysdesign-wiki. Slack is a workplace-messaging platform (acquired by Salesforce; deal announced in 2020 and completed in 2021) with substantial engineering output across backend (Flannel, Vitess-for-Slack), mobile (cross-platform client architecture), frontend (TypeScript/React at large scale, shared-edit collaboration), and developer infrastructure (CI, test frameworks, migration tooling).
This wiki's coverage of Slack spans eight axes so far:
- Developer-productivity tooling at scale — Slack's public retrospective on using LLMs to automate a 15,000-test Enzyme → RTL migration, which canonicalised a reusable AST + LLM hybrid conversion pattern.
- Reliability engineering at scale — Slack's 18-month Deploy Safety Program (mid-2023 → Jan 2025) that reduced customer impact hours from change-triggered incidents by 90%, canonicalised in the 2025-10-07 retrospective.
- Test-framework integration at scale — Slack's 2022-launched integration of Axe Core accessibility checks into the existing Playwright E2E suite as a custom-fixture extension, canonicalising several reusable patterns (fixture-extension as integration surface, two-axis exclusion-list, severity-gated reporting, tri-mode opt-in execution, alert-to-Jira workflow).
- Mobile accessibility engineering — Slack's 2024 third-party VPAT audit of the IA4-redesigned Android client; 8 recurring themes (7 resolved, 1 deferred) with fixes concentrated at Slack Kit component-library layer. Canonicalised the patterns/accessibility-delegate-override-for-semantic-fix pattern (Slack's new `SKListAccessibilityDelegate` fixing `CollectionInfo` for decorative dividers), the patterns/custom-talkback-actions-as-gesture-alternative pattern (workspace-switcher drag-reorder via TalkBack "Move before" / "Move after" actions + six-dot drag handle + Edit mode), and the patterns/vpat-driven-a11y-triage four-bucket workflow. Complements the 2025-01-07 automated-Axe ingest as the manual / third-party / periodic layer of the same broader a11y strategy (see concepts/automated-vs-manual-testing-complementarity).
- Fleet-configuration-management at scale — Slack's 2025-10-23 Chef phase-2 post canonicalises the AZ-bucketed environment split, the signal-driven fleet-config-apply pipeline (Chef Librarian → S3 → Chef Summoner), the release-train-with-canary rollout, and the self-update-with-independent-fallback-cron pattern — one-level-below the Deploy Safety Program's altitude, in the EC2 / Chef substrate specifically.
- Build-system engineering at scale — Slack Quip/Canvas team's 60 min → 10 min (~6×) build speed-up via Bazel + Starlark adoption, canonicalised in the 2025-11-06 retrospective. The load-bearing insight: Bazel gives nothing to a build whose graph has cycles / non-hermetic actions / coarse cache keys; the real wins came from applying classical separation-of-concerns and layering principles to build code itself. Canonicalised three new patterns — patterns/decouple-frontend-build-from-backend-artifacts, patterns/delete-inner-parallelization-inside-outer-orchestrator, patterns/diff-artifact-validator-for-build-refactor — plus the concepts/cache-granularity and concepts/idempotent-build-action concepts.
- Edge-networking and synthetic monitoring — Slack's 2026-03-31 retrospective on rolling out HTTP/3 on the public edge. Closed the HTTP/3 probing gap — neither SaaS observability tools nor Slack's Prometheus Blackbox Exporter fleet ("a cornerstone of our monitoring") natively spoke QUIC/UDP before the intern project. Intern Sebastian Feliciano scoped, implemented, and open-sourced QUIC support into Prometheus Blackbox Exporter upstream using systems/quic-go as the client library, then built an in-house integration on the same branch so Slack could ship HTTP/3 probing to production before upstream merge — canonical instance of patterns/upstream-contribution-parallel-to-in-house-integration. Extends the patterns/upstream-fixes-to-community pattern with a new altitude: upstream-a-whole-new-feature, distinct from the Shopify × Reanimated fix-existing-feature-at-scale altitude. Canonicalised Slack's explicit "monitor first, migrate second" takeaway as the concepts/observability-before-migration concept. Final payoff: HTTP/1.1 + HTTP/2 + HTTP/3 metrics in one Grafana "single pane of glass".
- Security-engineering + AI-agent operations — Slack Security Engineering team's 2025-12-01 retrospective on Spear, their multi-agent security-investigation service that triages detection-system alerts during on-call shifts. First post in a promised series. Canonical first wiki instance of the patterns/director-expert-critic-investigation-loop pattern (three-persona agent team: Director plans / progresses phases / writes final report, four Experts — Access / Cloud / Code / Threat — produce domain-specific findings, Critic audits + condenses). Canonical first wiki instance of the concepts/knowledge-pyramid-model-tiering concept (Experts on cheap models, Critic on mid-tier, Director on top-tier). Canonical first wiki instance of the patterns/hub-worker-dashboard-agent-service pattern (Hub + Worker + Dashboard productisation shape). Load-bearing architectural lesson verbatim: "prompts are just guidelines; they're not an effective method for achieving fine-grained control" — canonical statement of concepts/prompt-is-not-control. Slack's canonical emergent-behaviour worked example: Critic caught a credential exposure the Expert missed, Director pivoted the investigation. Canonicalised four concepts (concepts/knowledge-pyramid-model-tiering, concepts/investigation-phase-progression, concepts/weakly-adversarial-critic, concepts/prompt-is-not-control) + four patterns (patterns/director-expert-critic-investigation-loop, patterns/one-model-invocation-per-task, patterns/hub-worker-dashboard-agent-service, patterns/phase-gated-investigation-progression) + one system (systems/slack-spear).
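The phase-gated loop and model tiering described in the last bullet can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Slack's implementation: the post discloses neither model families nor the Critic's rubric, so `call_model`, the tier labels, the hard-coded Trace target, and the log format below are all hypothetical stand-ins. The point of the sketch is that the phase is held in application code as a plain variable, and each persona gets its own invocation on its own cost tier.

```python
from dataclasses import dataclass, field

# Hypothetical tier labels mirroring "low, medium, and high-cost models for
# the expert, critic, and director functions, respectively".
MODEL_TIER = {"expert": "low-cost", "critic": "mid-cost", "director": "high-cost"}
EXPERTS = ["access", "cloud", "code", "threat"]
PHASES = ["discovery", "trace", "conclude"]


def call_model(role: str, task: str) -> str:
    """Stand-in for a real model call (assumption: one invocation per task)."""
    return f"[{MODEL_TIER[role]}] {role} handled: {task}"


@dataclass
class Investigation:
    alert: str
    phase: str = "discovery"  # phase is application state, not prompt state
    log: list = field(default_factory=list)

    def step(self) -> None:
        if self.phase == "discovery":
            # Discovery broadcasts the question to every Expert.
            findings = [call_model("expert", f"{name}: {self.alert}") for name in EXPERTS]
        elif self.phase == "trace":
            # Trace questions one Expert (hard-coded here for illustration).
            findings = [call_model("expert", f"cloud: follow up on {self.alert}")]
        else:
            findings = []
        if findings:
            # The Critic audits and condenses Expert output.
            self.log.append(call_model("critic", f"review {len(findings)} finding(s)"))
        # The Director decides the transition; this sketch always advances.
        self.log.append(call_model("director", f"decide after {self.phase}"))
        nxt = PHASES.index(self.phase) + 1
        self.phase = PHASES[nxt] if nxt < len(PHASES) else "done"


def run(alert: str) -> Investigation:
    inv = Investigation(alert)
    while inv.phase != "done":
        inv.step()
    return inv
```

Calling `run("kernel module loaded")` walks the three phases in order and leaves a five-entry log (Critic and Director entries for discovery and trace, a Director entry for conclude). A real Director would choose between progressing, pivoting, and concluding rather than always advancing.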
Key systems
- systems/slack-deploy-safety-program — the 18-month reliability program; 90% reduction in customer impact hours; 10-min automated MTTR / 20-min manual MTTR / detect-before-10%-fleet North Stars; "imperfect analog of customer sentiment" program metric; exec-sponsored; OKR-weighted.
- systems/slack-releasebot — Slack's 2018-era metrics-based-deploy + automatic-rollback orchestrator for Webapp backend; the blueprint the 2023-2025 centralised orchestration system generalises across substrates.
- systems/slack-bedrock — Slack's internal compute platform over Kubernetes; the substrate on which the first fully-automated metrics-based-deploy-with-auto-rollback regime was built.
- systems/slack-spear — Slack's multi-agent security-investigation service that triages detection-system alerts during on-call shifts. Three-persona agent team (Director / 4 Experts / Critic) running across three phases (discovery / trace / conclude) on a Hub + Worker + Dashboard service architecture. Name inferred from image-asset URL slugs; post uses "our service".
- systems/slack-shipyard — Slack's upcoming successor to the legacy Chef-based EC2 platform. Preview-only in the 2025-10-23 post; service-level deployments, metric-driven rollouts, fully-automated rollbacks; soft launch Q4 2025 with two pilot teams. For teams that can't yet move to Bedrock.
- systems/chef / systems/chef-librarian / systems/chef-summoner / systems/poptart-bootstrap / systems/chef-policyfiles — Slack's legacy EC2 fleet-configuration substrate and its phase-2 components (the Policyfiles alternative was explicitly rejected on blast-radius-of-change grounds).
- systems/enzyme-to-rtl-codemod — Slack's open-sourced (`@slack/enzyme-to-rtl-codemod`) AI-powered test-conversion pipeline: AST codemod handles deterministic Enzyme→RTL rewrites + writes in-code annotation comments for hard cases, rendered DOM is captured per-test-case by instrumenting Enzyme's `mount` / `shallow`, and an LLM (Anthropic Claude 2.1 at time of post) consumes the annotated file + DOM + structured prompt to finish the conversion. ~80% quality on evaluation files; ~64% adoption across Slack's RTL migration.
- systems/enzyme — testing framework Slack was migrating away from (no React 18 adapter).
- systems/react-testing-library — target framework.
- systems/claude-2-1 — LLM backend used in the original pipeline (via Slack's internal DevXP AI team integration).
- systems/jest — underlying test runner.
- systems/playwright — Slack's E2E framework. Used as the integration substrate for Axe accessibility checks via the custom-fixture extension pattern.
- systems/axe-core / systems/axe-core-playwright — Deque Systems' accessibility rule engine and its Playwright binding. Slack's 2022-launched integration into the Playwright E2E suite.
- systems/buildkite — Slack's CI orchestrator; hosts the nightly a11y regression run (one leg of Slack's tri-mode opt-in execution pattern).
- systems/jira — Slack's ticket-tracking tool; receives auto-created tickets from the Slack a11y alert-channel workflow (patterns/alert-channel-to-jira-auto-ticket-workflow).
- systems/slack-android — Slack's native Android client; anchors the 2024 VPAT retrospective's mobile-a11y disclosures. Phone-only (no large-form-factor support yet).
- systems/slack-kit — Slack's shared mobile component library (SK). Components surfaced on the wiki so far: `OutlinedTextField`, `SKBanner`, `SKList` / `SKListAdapter`, SK divider, and the `SKListAccessibilityDelegate` introduced by the VPAT resolution.
- systems/talkback — Android's built-in screen reader; the primary assistive-tech target for Slack Android's a11y work.
- systems/android-accessibility-framework — Android platform a11y APIs (`AccessibilityDelegate`, `AccessibilityNodeInfoCompat`, `CollectionInfo`, custom-action API) that the Slack Kit fixes plug into.
- systems/prometheus-blackbox-exporter — "a cornerstone of our monitoring" at Slack; canonical client-side black-box probing substrate for edge endpoints. Slack intern Sebastian Feliciano open-sourced the `http3` module into this upstream project to close the HTTP/3 probing gap, built on systems/quic-go.
- systems/quic-go — Go QUIC library Slack's BBE `http3` contribution is built on; "wide adoption across other open source technologies, as well as the first-class support it provides in creating http clients in go" was the selection rationale.
- systems/prometheus — the TSDB backing Slack's edge-probing metrics pipeline.
- systems/grafana — Slack's "single pane of glass" dashboarding layer; canonically unifies HTTP/1.1 + HTTP/2 + HTTP/3 probe metrics side-by-side post-BBE-contribution.
- systems/enterprise-grid / systems/unified-grid / systems/slack-rtm — Slack's tenancy substrate before / after / pre-requisite for the 2024 Unified Grid re-architecture.
- systems/slack-quip / systems/slack-canvas — the collaborative-document + in-Slack canvas surfaces whose shared Python-backend + TypeScript-frontend build pipeline is the subject of Slack's 2025-11-06 build-system retrospective (60 min → 10 min).
- systems/bazel / systems/starlark — the build system and its constrained DSL adopted by the Quip/Canvas team; central to the 60 min → 10 min story. Slack's framing: "Bazel's magic is contingent on the declared graph actually being a DAG of hermetic, idempotent actions" — adopting Bazel alone achieves nothing; the engineering work to meet its preconditions is where the speed-up lives.
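The DAG precondition in the Bazel entry above can be illustrated without Bazel itself. The sketch below uses Python's standard-library `graphlib` (Python 3.9+) as a stand-in scheduler; the target names are hypothetical and are not taken from Slack's build. It shows only the structural point: an acyclic dependency graph yields a build order, while a graph with a frontend ↔ backend cycle cannot be scheduled at all.

```python
from graphlib import CycleError, TopologicalSorter


def schedule(graph: dict[str, set[str]]) -> list[str]:
    """Return a valid build order, or raise CycleError if the graph is not a DAG."""
    return list(TopologicalSorter(graph).static_order())


# Hypothetical targets: each key depends on the targets in its value set.
acyclic = {
    "frontend_bundle": {"ts_sources"},
    "backend_wheel": {"py_sources"},
}
cyclic = {
    "frontend_bundle": {"backend_wheel"},
    "backend_wheel": {"frontend_bundle"},
}

# Dependencies always come before the targets that need them.
order = schedule(acyclic)

try:
    schedule(cyclic)
    cycle_found = False
except CycleError:
    # No ordering exists, so no scheduler can parallelise or cache this graph.
    cycle_found = True
```

In the acyclic graph the two targets share no edge, which is what lets an outer orchestrator build them independently. Hermeticity and idempotence are separate preconditions that this sketch does not model.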
Key patterns / concepts
- patterns/automated-detect-remediate-within-10-minutes — Slack's canonical 10-min-auto / 20-min-manual MTTR pair as deployment-safety North Star; paired with the customer 10-minute-disruption threshold.
- patterns/centralised-deployment-orchestration-across-systems — the architectural generalisation of Slack ReleaseBot's Webapp-backend-only automation to Webapp frontend, infra, EC2, Terraform.
- patterns/invest-widely-then-double-down-on-impact — the five-axis investment strategy for data-scarce trailing-metric reliability programs; explicit framing that below-expectation projects are "not failures".
- concepts/customer-impact-hours-metric — the program-metric choice as "imperfect analog of customer sentiment"; the three-layer sentiment↔program-metric↔project-metric chain.
- concepts/change-triggered-incident-rate — 73% of customer-facing incidents were change-triggered at program start; the statistic that justified the program.
- concepts/pre-10-percent-fleet-detection-goal — canonical "detect before 10% of fleet" blast-radius-cap goal.
- concepts/trailing-metric-patience — 3-6 month lag from project delivery to impact visibility; mid-stream sub-signals discipline.
- patterns/split-environment-per-az-for-blast-radius — Slack's 2025-10-23 Chef phase-2 pattern: split one `prod` Chef environment into six AZ-bucketed environments (`prod-1` … `prod-6`) with boot-time mapping via Poptart Bootstrap. Bounds the scale-out-picks-up-bad-config failure mode.
- patterns/release-train-rollout-with-canary — Slack's `prod-1` hourly canary + `prod-2` → `prod-6` strictly sequential release train; canary is parallel to the train, not at its head, so it tests incremental diffs rather than cumulative ones.
- patterns/signal-triggered-fleet-config-apply — Slack's Chef Librarian → S3 → Chef Summoner pipeline that replaced fixed-cron Chef runs; canonical first wiki instance of S3 as a config-management signal bus.
- patterns/self-update-with-independent-fallback-cron — Slack's solution to the self-update paradox: Chef Summoner updates itself via Chef runs, and a 12-hour fallback cron baked into every AMI triggers Chef runs independently if Summoner hasn't done so in the last 12 hours.
- concepts/az-bucketed-environment-split — the concept behind the AZ-bucketed pattern.
- concepts/signal-driven-chef-trigger — the concept behind the signal-triggered pattern.
- concepts/s3-signal-bucket-as-config-fanout — the concept behind S3 as a config-fanout substrate.
- concepts/splay-randomised-run-jitter — thundering-herd mitigation for signal-driven fleet-config runs.
- concepts/fallback-cron-for-self-update-safety — the concept behind the fallback-cron pattern.
- concepts/cookbook-artifact-versioning — the Chef-ecosystem rollout unit.
- patterns/ast-plus-llm-hybrid-conversion — compose an AST pre-pass with an LLM for deterministic code transformation; Slack lifted conversion quality from 40-60% (pure LLM) to ~80% (hybrid).
- patterns/in-code-annotation-as-llm-guidance — AST pass writes comments next to hard call sites to shape LLM attention, instead of stuffing everything into the system prompt. Slack: "we successfully minimized hallucinations and nonsensical conversions from the LLM".
- concepts/abstract-syntax-tree — canonical role both as conversion primitive and as hallucination-control primitive for LLM pipelines.
- concepts/llm-hallucination — Slack's named failure mode for pure-prompt-based code conversion.
- concepts/llm-conversion-hallucination-control — the structural problem class this post articulates.
- concepts/dom-context-injection-for-llm — capturing per-test-case rendered DOM via Enzyme render-method instrumentation, injecting into the LLM prompt.
- patterns/a11y-checks-via-playwright-fixture-extension — canonical integration surface for adding cross-cutting testing concerns (a11y, perf, visual regression) to an existing Playwright suite via the custom-fixture model. Slack's 2022-launched Axe integration is the canonical instance.
- patterns/exclusion-list-for-known-issues-and-out-of-scope-rules — two-axis exclusion list (known-issue-ticketed + out-of-scope-by-design) applied as pre-audit filter so automation signal stays high.
- patterns/severity-gated-violation-reporting — report only highest-severity bucket at new-system launch; defer lower severities to explicit future work. Canonical anti-alert-fatigue rollout lever.
- patterns/tri-mode-opt-in-test-execution — single default-off environment flag composing three execution modes (on-demand local + scheduled nightly Buildkite + opt-in CI gate) for new test classes.
- patterns/alert-channel-to-jira-auto-ticket-workflow — violation output terminates in pre-populated Jira ticket rather than free-form alert; the alert channel becomes the ticket-creation UI.
- concepts/wcag-2-1-a-aa-scope — WCAG 2.1 A+AA as the conventional scope-picker for automated a11y; expressed in Axe via `wcag2a` / `wcag2aa` / `wcag21a` / `wcag21aa` tag set.
- concepts/automated-vs-manual-testing-complementarity — automated testing is a layer in a broader strategy, not a substitute for manual testing (especially in a11y where screen-reader UX requires human judgment).
- concepts/playwright-locator-auto-wait — Playwright's Locator auto-wait guarantees element-level readiness but not whole-page readiness; this is why whole-page audits (a11y, visual regression, perf snapshots) can't be embedded into Locator interaction methods.
- concepts/severity-filtered-violation-reporting — the concept behind patterns/severity-gated-violation-reporting; narrow-filter-at-launch + widen-as-calibrated.
- patterns/accessibility-delegate-override-for-semantic-fix — the Android pattern of overriding `AccessibilityDelegate` to correct framework-semantic defaults at the component-library layer; Slack's `SKListAccessibilityDelegate` fixing decorative-divider list-count is the canonical instance.
- patterns/custom-talkback-actions-as-gesture-alternative — every gesture-only interaction (drag-and-drop, swipe-to-dismiss, etc.) gets a custom TalkBack action as its accessibility alternative; Slack's workspace-switcher "Move before" / "Move after" actions paired with Edit-mode six-dot handle is the canonical instance.
- patterns/vpat-driven-a11y-triage — the end-to-end VPAT-audit-to-remediation workflow with four resolution buckets (shovel-ready / component-library theme / platform-convention closure / scope-bounded deferral).
- concepts/vpat-voluntary-product-accessibility-template — the third-party accessibility audit artifact; procurement-facing, engineering-backlog-input.
- concepts/wcag-platform-applicability-gap — WCAG is web-centric; native-platform a11y conventions sometimes legitimately diverge (Slack's top-app-bar-as-heading and strikethrough-announcement closures are the canonical examples).
- concepts/redundant-error-signalling — the WCAG 1.4.1 discipline of never-colour-alone; pair colour with icon, text, and screen-reader announcement.
- concepts/workspace-scoped-to-org-wide-migration — the Unified Grid re-architecture primitive.
- patterns/prototype-the-path — Slack's named methodology for dogfood-driven incremental re-architecture.
- patterns/decouple-frontend-build-from-backend-artifacts — Slack Quip/Canvas's ~35-min-per-build win by cutting the Python-backend → TypeScript-frontend cache-key edge.
- patterns/delete-inner-parallelization-inside-outer-orchestrator — Slack's frontend-bundler refactor: delete in-process worker pool so Bazel can parallelise at bundle granularity across workers.
- patterns/diff-artifact-validator-for-build-refactor — Slack's Rust-based byte-diff harness that served as the correctness oracle during the Bazel migration.
- patterns/upstream-fixes-to-community — Slack's HTTP/3/QUIC support into Prometheus Blackbox Exporter canonicalises the upstream-a-whole-new-feature altitude (distinct from the Shopify × Reanimated fix-existing-feature-at-scale altitude).
- patterns/upstream-contribution-parallel-to-in-house-integration — Slack's in-house BBE integration running in parallel with the upstream PR so the intern-timeline HTTP/3 rollout wasn't gated on maintainer merge.
- concepts/cache-granularity — the "100 parameters, 2-3 always change" failure mode of coarse cache keys that Slack canonicalised at build-system altitude.
- concepts/idempotent-build-action — the precondition Slack's pre-refactor build violated (build steps mutated the working directory).
- concepts/separation-of-concerns — the classical principle Slack applied to build code, release code, and setup code — not just application code.
- concepts/layering-violation — the structural diagnosis for Slack's old frontend bundler (business logic fused with orchestration + parallelization).
- concepts/http-3-probing-gap — the first-order observability failure Slack's HTTP/3 edge rollout surfaced: TCP-shaped black-box probers cannot see QUIC/UDP traffic. Canonical wiki instance.
- concepts/client-side-black-box-probe — the monitoring primitive the HTTP/3 probing gap applies to; Slack's BBE fleet is the canonical implementation substrate.
- concepts/observability-before-migration — Slack's explicit "monitor first, migrate second" takeaway; new wiki concept that generalises the discipline across transport / protocol / platform migrations.
- patterns/director-expert-critic-investigation-loop — Slack Spear's three-persona agent team for security-investigation triage; Director plans / Experts produce findings / Critic audits + condenses. Canonical first wiki instance.
- patterns/one-model-invocation-per-task — Slack's decomposition of the 300-word single-prompt prototype into per-task invocations with per-task structured-output schemas.
- patterns/hub-worker-dashboard-agent-service — Slack's three-component productisation shape for Spear, replacing the coding-agent-CLI-as-harness prototype.
- patterns/phase-gated-investigation-progression — Slack's explicit phase-gated loop (discovery → trace → conclude + Director-Decision meta-phase); each phase has its own model parameters + token budget.
- concepts/knowledge-pyramid-model-tiering — Slack's explicit three-tier cost/capability gradient: cheap Experts / mid-tier Critic / top-tier Director. Canonical first wiki instance.
- concepts/investigation-phase-progression — the concept behind the phase-gated pattern; phase as application state, not prompt state.
- concepts/weakly-adversarial-critic — Slack's named stance for the Critic/Expert relationship. Catches hallucination without degenerating into paranoia.
- concepts/prompt-is-not-control — Slack's verbatim architectural lesson: "prompts are just guidelines; they're not an effective method for achieving fine-grained control."
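To make two of the a11y patterns listed above concrete (the two-axis exclusion list and severity-gated reporting), here is a minimal sketch. It assumes violations shaped like axe-core results, where each entry carries a rule `id` and an `impact` of minor, moderate, serious, or critical. The exclusion sets are hypothetical, since Slack's actual lists are not reproduced on this page, and the filter runs after the audit for simplicity, whereas the pattern describes a pre-audit filter.

```python
# Hypothetical exclusion lists; the rule ids are real axe-core rules,
# but which rules Slack excludes is an assumption made for illustration.
KNOWN_ISSUE_RULES = {"color-contrast"}  # axis 1: already ticketed
OUT_OF_SCOPE_RULES = {"region"}         # axis 2: out of scope by design
REPORTED_IMPACTS = {"critical"}         # severity gate used at launch


def violations_to_report(violations: list[dict]) -> list[dict]:
    """Drop excluded rules, then keep only the highest-severity bucket."""
    excluded = KNOWN_ISSUE_RULES | OUT_OF_SCOPE_RULES
    return [
        v for v in violations
        if v["id"] not in excluded and v["impact"] in REPORTED_IMPACTS
    ]


sample = [
    {"id": "color-contrast", "impact": "serious"},  # excluded: known issue
    {"id": "region", "impact": "moderate"},         # excluded: out of scope
    {"id": "button-name", "impact": "critical"},    # reported
    {"id": "image-alt", "impact": "critical"},      # reported
    {"id": "link-name", "impact": "serious"},       # deferred by severity gate
]
reported = violations_to_report(sample)
```

Widening the gate later is a one-line change to `REPORTED_IMPACTS`, which matches the narrow-filter-at-launch, widen-as-calibrated framing of concepts/severity-filtered-violation-reporting.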
Recent articles
- 2026-04-13 — sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications (Slack Security Engineering team, series part 2 to the 2025-12-01 Spear post. Canonicalises how Spear manages context across long-running multi-agent investigations that can "span hundreds of inference requests and generate megabytes of output". Core architectural claim: three complementary context channels replace raw message history — Director's Journal (typed working memory: decision / observation / finding / question / action / hypothesis + phase + round + timestamp; read by all agents, written only by Director), Critic's Review with the four-tool introspection suite (`get_tool_call`, `get_tool_result`, `get_toolset_info`, `list_toolsets`) and a 5-level credibility rubric scored against disclosed distribution of 170,000 findings (37.7% Trustworthy / 25.4% Highly-plausible / 11.1% Plausible / 10.4% Speculative / 15.4% Misguided — 25.8% sub-plausibility rate), and the Critic's Timeline that implements narrative-coherence assembly with four explicit consolidation rules, a second 5-level rubric, and top-3 gap identification across three gap types (evidential / temporal / logical). Canonical load-bearing claim on narrative coherence as hallucination filter: "A hallucination can only survive this process if it is more coherent with the body of evidence than any real observation it competes with." Also: no message history carry-forward between invocations — "Besides these resources, we do not pass any message history forward between agent invocations" — justified both by token-budget and by cognitive-load arguments (over-sharing "stifles creativity and encourages confirmation bias" even with infinite context). Specimen investigation disclosed: 6,046-event false-positive kernel-module-loading alert with 0.83 Timeline confidence and 3 identified gaps.)
- 2025-12-01 — sources/2025-12-01-slack-streamlining-security-investigations-with-agents
(Slack Security Engineering team retrospective on building
Spear, their internal multi-agent
security-investigation service that triages detection-system
alerts during on-call shifts. First post in a promised
series. Load-bearing architectural lesson verbatim:
"prompts are just guidelines; they're not an effective
method for achieving fine-grained control" — canonical
statement of the new
concepts/prompt-is-not-control concept after Slack's
300-word single-prompt prototype produced "wildly variable"
quality. Prototype rewrite moved control out of the prompt
into per-task model invocations with task-specific
structured-output schemas (patterns/one-model-invocation-per-task)
orchestrated by application code. Three-persona agent team
(patterns/director-expert-critic-investigation-loop):
Director (plans + progresses phases + writes final report;
uses a journaling tool), 4 Experts (Access /
Cloud / Code / Threat — each with unique tools + data
sources), Critic ("meta-expert" auditing Experts against
a rubric, scoring findings, condensing into a timeline).
Three phases (patterns/phase-gated-investigation-progression):
Discovery (broadcast to all Experts) → Trace (question one
Expert, may vary model parameters) → Conclude (final report),
with a Director-Decision meta-phase for transition decisions.
Knowledge pyramid model tiering
(concepts/knowledge-pyramid-model-tiering): verbatim
"low, medium, and high-cost models for the expert, critic,
and director functions, respectively." Tool-call-heavy
Expert work runs on cheap models; rubric-application +
condensation runs on mid-tier; strategic decisions on
top-tier. Service architecture
(patterns/hub-worker-dashboard-agent-service): Hub
(API + persistent storage + metrics endpoint) + Worker
(queue consumer, event-stream emitter, scalable) + Dashboard
(real-time observe + replay + per-invocation debugging)
replacing the prototype's coding-agent-CLI harness. Prototype
exposed data sources via an MCP stdio server; production
Worker's MCP persistence not explicitly disclosed.
Critic's weakly-adversarial stance
(concepts/weakly-adversarial-critic) verbatim: "The
weakly adversarial relationship between the Critic and the
expert group helps to mitigate against hallucinations and
variability in the interpretation of evidence." Canonical
worked example: the Critic caught a credential exposure
the Expert missed during a process-ancestry review, the
Director then pivoted the investigation to focus on the
credential issue, final report surfaced both the security
finding and the Expert's "analysis blind spots that require
attention." Verbatim: "What is notable about this result
is that the expert did not raise the credential exposure in
its findings; the Critic noticed it as part of its
meta-analysis of the expert's work." On-call shift
mode-shift: "we're switching to supervising investigation
teams, rather than doing the laborious work of gathering
evidence." 9 canonical wiki primitives: source + 1
system (systems/slack-spear) + 4 concepts
(concepts/knowledge-pyramid-model-tiering,
concepts/investigation-phase-progression,
concepts/weakly-adversarial-critic,
concepts/prompt-is-not-control) + 4 patterns
(patterns/director-expert-critic-investigation-loop,
patterns/one-model-invocation-per-task,
patterns/hub-worker-dashboard-agent-service,
patterns/phase-gated-investigation-progression).
Extends 5 pages:
systems/model-context-protocol (new Seen-in — first wiki
MCP instance inside an internal security-investigation
pipeline),
concepts/structured-output-reliability (new Seen-in —
structured output as multi-agent orchestration-boundary
contract),
patterns/specialized-agent-decomposition (new Seen-in —
canonical security-operations instance at the peer-Expert
layer, with supra-agent Director + meta-agent Critic on top),
patterns/multi-round-critic-quality-gate (new Seen-in —
live-investigation altitude variant, distinguished from
Meta's artifact-production rounds shape),
patterns/drafter-evaluator-refinement-loop (new Seen-in —
investigation-loop-with-third-layer variant, distinguished
from Lyft's retry-only shape by the Director who decides
progress/pivot/conclude). Scope disposition: Tier-2
on-scope decisively on multi-agent-architecture-canonicalisation
grounds. Opens the Slack security-engineering axis on the wiki;
first Slack security-operations ingest; eighth Slack coverage
axis after developer-productivity, deploy-safety,
test-framework-integration, mobile-a11y, fleet-config-management,
build-systems, and edge-networking. Zero production
numbers disclosed (no throughput / latency / cost / FP
rate / token-usage); model families not disclosed;
Critic's rubric opaque; Spear name inferred from image
slugs not stated in post body. URL verbatim:
https://slack.engineering/streamlining-security-investigations-with-agents/. Sibling to Cloudflare AI Code Review (patterns/coordinator-sub-reviewer-orchestration) at the code-review altitude; sibling to Redpanda Openclaw (patterns/four-component-agent-production-stack) at the enterprise-agent-substrate altitude; sibling to Lyft localization (patterns/drafter-evaluator-refinement-loop) at the structured-retry-loop altitude.)
- 2026-03-31 — sources/2026-03-31-slack-from-custom-to-open-scalable-network-probing-and-http3-readiness
(Slack edge-networking team retrospective on rolling out HTTP/3 on the public edge and closing the HTTP/3 probing gap first. Existing SaaS observability vendors had zero native HTTP/3 probe support; Slack's own Prometheus Blackbox Exporter fleet ("a cornerstone of our monitoring") was TCP-shaped and couldn't speak QUIC/UDP. Intern Sebastian Feliciano scoped, implemented, and open-sourced an `http3` module into BBE upstream built on systems/quic-go — selected for "wide adoption across other open source technologies, as well as the first-class support it provides in creating http clients in go." Integration snippet: `http3Transport := &http3.Transport{TLSClientConfig: tlsConfig, QUICConfig: &quic.Config{}}` wrapped in `http.Client`. Architectural discipline that earned the upstream merge: "had to add this new logic while following the Blackbox Exporter's existing architecture, ensuring the new features maintained the tool's configuration patterns." Because internship timelines ≠ OSS merge timelines, Sebastian "took matters into his own hands and architected an in-house system that utilized the new upstream features for probing out HTTP/3 endpoints" — canonical instance of patterns/upstream-contribution-parallel-to-in-house-integration. Final payoff: HTTP/1.1 + HTTP/2 + HTTP/3 metrics unified in Grafana ("single pane of glass") + reliable HTTP/3 alerts + easier telemetry correlation.
New canonical pages (5): 1 system
(systems/prometheus-blackbox-exporter), 3 concepts
(concepts/client-side-black-box-probe,
concepts/http-3-probing-gap,
concepts/observability-before-migration), 1 pattern
(patterns/upstream-contribution-parallel-to-in-house-integration).
Extended: systems/quic-go (second wiki
instance at upstream-tooling altitude, distinct from the
PlanetScale HTTP/3 driver benchmark instance),
systems/prometheus (BBE axis added to the
Airbnb-observability-ingestion-dominated page), systems/grafana
(single-pane-of-glass unified multi-HTTP-version view),
concepts/http-3 (probing-gap framing added alongside
the existing Cloudflare latency framing),
concepts/observability (observability-as-migration-gate
altitude added), patterns/upstream-fixes-to-community
(upstream-a-whole-new-feature altitude added alongside the
existing fix-at-scale altitude). Scope takeaways
verbatim: "Monitor first, and migrate second. ... getting
observability right as a precursor to migration makes
everything faster"; "Contributing open source pays
dividends"; "Bet on your interns." Opens the Slack
edge-networking-and-synthetic-monitoring axis on the wiki;
seventh Slack coverage axis (after developer-productivity,
deploy-safety, test-framework-integration, mobile-a11y,
fleet-config-management, and build-systems). Tier-2
on-scope decisively: real engineering retrospective with
architecture diagram, code snippet, upstream-PR reference,
specific library selection rationale, and explicit
migration-gate framing. Operational numbers thin (article
is about monitoring, not HTTP/3 edge perf): hundreds of
thousands of HTTP/3 endpoints to probe; zero SaaS vendors
supporting HTTP/3 out-of-box at investigation time; code
merged to BBE at pinned commit
bee8e9102a106bff63281ee9c64c7b1275ef21d0. URL verbatim: https://slack.engineering/from-custom-to-open-scalable-network-probing-and-http-3-readiness-with-prometheus/.)
- 2025-11-06 — sources/2025-11-06-slack-build-better-software-to-build-software-better
(Slack Quip/Canvas team retrospective on taking their build from 60 minutes to as low as 10 minutes (~6× speed-up) by adopting Bazel + Starlark and doing the unglamorous engineering work needed to benefit from them. Load-bearing insight: "Bazel's magic is contingent on the declared graph actually being a DAG of hermetic, idempotent actions". The pre-existing build had cycles, non-hermetic action nodes, and cache keys with hundreds of parameters, giving a zero cache hit rate that no build tool could fix. Two concrete wins: severing the Python↔TypeScript dependency edge (saved ~35 min/build on its own; canonical patterns/decouple-frontend-build-from-backend-artifacts) and deleting in-process parallelization inside the frontend bundler so Bazel could schedule across bundles (canonical patterns/delete-inner-parallelization-inside-outer-orchestrator). The correctness oracle during the refactor was a Rust byte-diff tool comparing old- and new-build artifacts (canonical patterns/diff-artifact-validator-for-build-refactor). Key new concepts: concepts/cache-granularity ("100 parameters, 2-3 always change" failure mode), concepts/idempotent-build-action (pre-refactor build mutated the working directory), concepts/layering-violation (frontend bundler fused business logic + orchestration + parallelization), concepts/separation-of-concerns applied across backend/frontend, Python/TypeScript, and application/build-code axes. Extends systems/bazel, concepts/hermetic-build, concepts/build-graph, concepts/cache-hit-rate with the zero-hit-rate structural-failure story. Operational numbers: 60 min → 10 min (best case, cached + parallelised) / 12 min (average) / 30 min (cache miss). Opens the Slack build-system-engineering axis on the wiki; first Slack build-tooling ingest.)
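The cache-granularity failure mode named in the 2025-11-06 entry above ("100 parameters, 2-3 always change") can be illustrated with a toy content-addressed cache. This is a minimal sketch, not Bazel's actual key derivation: the parameter names (`src_digest`, `timestamp`, `build_number`) and the helpers `cache_key` and `hit_rate` are hypothetical, chosen only to show why a key that includes volatile inputs never hits.

```python
from __future__ import annotations

import hashlib
import json
from typing import Optional


def cache_key(params: dict) -> str:
    """Hash an action's inputs into a content-addressed key."""
    # Sorting keys makes the digest independent of dict insertion order.
    canonical = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def hit_rate(builds: list, relevant: Optional[set] = None) -> float:
    """Replay builds against an initially empty cache; return the hit fraction."""
    seen = set()
    hits = 0
    for params in builds:
        if relevant is not None:
            # Narrow the key to the inputs that actually affect the output.
            params = {k: v for k, v in params.items() if k in relevant}
        key = cache_key(params)
        if key in seen:
            hits += 1
        seen.add(key)
    return hits / len(builds)


# Hypothetical inputs: the source digest is identical across five builds,
# but a timestamp and a build number change on every run.
builds = [
    {"src_digest": "abc123", "timestamp": 1700000000 + i, "build_number": i}
    for i in range(5)
]

coarse = hit_rate(builds)                           # volatile params in the key
narrow = hit_rate(builds, relevant={"src_digest"})  # only the relevant input
print(coarse, narrow)
```

With the volatile parameters in the key every build misses (hit rate 0.0); keyed on the source digest alone, four of the five builds hit (0.8). The sketch only shows the shape of the problem; deciding which inputs are truly relevant is the engineering work the entry describes.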
- 2025-11-19 — sources/2025-11-19-slack-android-vpat-journey
(Slack Android team retrospective on triaging a 2024 third-
party VPAT audit conducted after the IA4 UI redesign;
8 recurring accessibility themes identified (7 resolved,
1 deferred as future work). Representative-of-all-four-buckets
worked example of [[patterns/vpat-driven-a11y-triage]]:
shovel-ready fixes assigned immediately; recurring
themes resolved at the Slack Kit component-library layer
(`OutlinedTextField` error-announcement, `SKBanner` error auto-announce, `SKListAccessibilityDelegate` overrides `CollectionInfo` to exclude decorative `SK divider` items from TalkBack's "N items in a list" count, the canonical instance of patterns/accessibility-delegate-override-for-semantic-fix; workspace-switcher drag-and-drop via Edit-mode + six-dot handle + TalkBack custom actions "Move before" / "Move after" invoked by three-finger tap or `L`/`r` drawing gestures, the canonical instance of patterns/custom-talkback-actions-as-gesture-alternative); platform-convention closures for top-app-bar-as-heading (WCAG 1.3.1; verified via native Google apps) and strikethrough-announcement (WCAG 1.3.1; verified via blind-community consultation), both canonical instances of concepts/wcag-platform-applicability-gap; error-icon redundancy resolving the colour-alone P3 theme (concepts/redundant-error-signalling, WCAG 1.4.1); scope-bounded deferral of WCAG 2.1.1 keyboard-nav because Slack Android has no tablet support. New canonical pages: source + 4 systems (systems/slack-android, systems/slack-kit, systems/talkback, systems/android-accessibility-framework) + 3 concepts (concepts/vpat-voluntary-product-accessibility-template, concepts/wcag-platform-applicability-gap, concepts/redundant-error-signalling) + 3 patterns (patterns/accessibility-delegate-override-for-semantic-fix, patterns/custom-talkback-actions-as-gesture-alternative, patterns/vpat-driven-a11y-triage). Extended: concepts/automated-vs-manual-testing-complementarity (new Seen-in: this post is the manual / third-party / periodic layer complementing the 2025-01-07 automated Axe-in-Playwright ingest). Opens the fourth axis of Slack coverage on the wiki: mobile accessibility engineering. Tier-2 borderline include on mobile-a11y-pattern-canonicalisation grounds; explicit user full-ingest override of the prior batch-skip; same disposition as the 2025-01-07 and 2024-06-19 Slack developer-productivity ingests. No incident / latency / scale disclosures; pure engineering-process retrospective. URL verbatim: https://slack.engineering/android-vpat-journey/.)
- 2025-10-23 — sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption
(Archie Gunasekara's follow-up to Slack's 2024 Advancing
Our Chef Infrastructure post describes phase two of
Slack's EC2 / Chef deploy-safety work: instead of migrating
to Chef Policyfiles (which
would have required every service team to rewrite roles,
environments, and cookbooks), Slack extended the existing
EC2 framework in two load-bearing changes. (1) Splitting
one production Chef environment into six AZ-bucketed
environments (`prod-1`…`prod-6`) rolled out via a release train with `prod-1` as canary: `prod-1` receives new versions every hour (hot canary); `prod-2` → `prod-6` advance via release train, with the next version gated on the previous version completing the train. Why the canary is parallel rather than head-of-train: "updating `prod-1` frequently with the latest version allows us to detect issues closer to when they were introduced" rather than testing cumulative-change artifacts at the canary. Boot-time mapping from AZ to environment via Poptart Bootstrap (Slack's cloud-init-phase AMI tool) ensures newly provisioned nodes inherit the AZ-bucket boundary from instance 0, the explicit fix for the scale-out-picks-up-bad-config failure mode that per-node cron staggering didn't address. (2) Replacing cron-driven Chef runs with a signal-driven pull model via a new service called Chef Summoner that watches an S3 bucket populated by the existing Chef Librarian artifact-promotion service at `chef-run-triggers/<stack>/<env>`. The signal payload carries `Splay` (randomised jitter), `Timestamp`, and a full `ManifestRecord` (artifact version, cookbook-versions map, S3 artifact pointer, `upload_complete` ordering flag). Summoner deduplicates against local state (last-run-time + artifact-version), applies Splay, and triggers `chef-client`. A fallback cron baked into every AMI independently triggers `chef-client` if Summoner hasn't run Chef in the last 12 hours; it is the recovery path for broken-Summoner deployments and also enforces the 12-hour compliance SLA. The post closes by marking the legacy EC2 platform feature-complete + maintenance-mode and previewing a brand-new EC2 successor called Shipyard (service-level deployments + metric-driven rollouts + automated rollbacks) for teams that can't yet move to Bedrock.
Canonical wiki primitives: 6 new systems (systems/chef + systems/chef-policyfiles + systems/chef-librarian + systems/chef-summoner + systems/poptart-bootstrap + systems/slack-shipyard), 6 new concepts (concepts/az-bucketed-environment-split + concepts/splay-randomised-run-jitter + concepts/signal-driven-chef-trigger + concepts/s3-signal-bucket-as-config-fanout + concepts/fallback-cron-for-self-update-safety + concepts/cookbook-artifact-versioning), 4 new patterns (patterns/split-environment-per-az-for-blast-radius + patterns/release-train-rollout-with-canary + patterns/signal-triggered-fleet-config-apply + patterns/self-update-with-independent-fallback-cron). Extends concepts/blast-radius with the fleet-configuration substrate altitude and systems/aws-s3 with the S3-as-config-fanout-bus altitude (previously canonicalised as object-store + CDC-log-store + tiered-cold-tier). Natural companion to the 2025-10-07 Deploy Safety retrospective: one level below the program-altitude framing, drilling into the EC2 / Chef substrate specifically. Tier-2 on-scope decisively: real distributed-systems internals, scaling-trade-off rationale, concrete operational disclosures (6 production environments / 12-hour compliance SLA / hourly promotions). Opens the Slack fleet-configuration-management axis.)
- 2025-10-07 — sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change
(Retrospective on Slack's 18-month Deploy Safety Program —
90% reduction in customer impact hours from
change-triggered incidents (Feb-Apr 2024 peak → Jan 2025).
Load-bearing framing statistic: 73% of customer-facing
incidents were Slack-induced-change-triggered, particularly
code deploys. Three North Star goals across all deployment
systems for highest-importance services: 10-min automated
MTTR / 20-min manual MTTR / detect before 10% fleet
exposure. Canonical program metric: "Hours of customer
impact from high-severity and selected medium-severity
change-triggered incidents" — explicitly framed as
"imperfect analog of customer sentiment" sitting in a
three-layer chain
`Customer sentiment <-> Program Metric <-> Project Metric`. Five-axis investment strategy: invest widely + bias for action / known-pain first / invest further based on results / curtail least impactful / flexible roadmap; the post explicitly frames below-expectation projects as "not failures" but "critical input." Phase-change architectural shift: "Once automatic rollbacks were introduced we observed dramatic improvement in results." Composed Webapp-backend investment sequence (Q1 metric monitoring → Q2 manual rollback via automatic alerts → Q3-Q4 automatic rollback → Q4+ ≤10 min customer impact → pattern copied to Webapp frontend → centralised deployment orchestration system inspired by ReleaseBot + AWS Pipelines). Trailing-metric-patience discipline with 3-6 month delivery-to-impact lag; mid-stream sub-signals. Tool fluency discipline: "Use the tooling often, not just for the infrequent worst case scenarios." Direct-per-team-outreach discipline: "Not all teams and systems are the same." Canonical wiki primitives: (1) systems/slack-deploy-safety-program (new): the program as a named wiki system. (2) systems/slack-releasebot + systems/slack-bedrock (new): the named ReleaseBot inspiration + Bedrock substrate as stub pages. (3) concepts/change-triggered-incident-rate (new): the justifying statistic. (4) concepts/customer-impact-hours-metric (new): the program-metric-as-sentiment-analog choice. (5) concepts/pre-10-percent-fleet-detection-goal (new): blast-radius cap at fleet level. (6) concepts/trailing-metric-patience (new): the patience discipline. (7) patterns/automated-detect-remediate-within-10-minutes (new): the 10-min/20-min MTTR pair. (8) patterns/centralised-deployment-orchestration-across-systems (new): the multi-substrate deploy-orchestrator pattern. (9) patterns/invest-widely-then-double-down-on-impact (new): the five-axis strategy.
Extends patterns/fast-rollback with the fully-automated altitude variant, concepts/feedback-control-loop-for-rollouts with the organisational-altitude instance, concepts/blast-radius with the fleet-percentage rollout-gate altitude, concepts/dora-metrics with the "maintain development velocity" co-equal North Star, concepts/observability with the Q1-first-investment framing. Opens the Slack reliability-engineering axis on the wiki; first Slack production-reliability ingest.)
- 2025-01-07 — sources/2025-01-07-slack-automated-accessibility-testing-at-slack
(Slack Frontend Test Frameworks team retrospective on
integrating Axe Core into Slack's
pre-existing Playwright E2E framework
as a custom-fixture extension. Two failed integration attempts
first: baking Axe into RTL's `render` (blocked by Slack's customised Jest setup) and baking Axe into Playwright's `Locator` interaction methods (blocked by Locator auto-wait semantics). Landing architecture: `slack.utils.a11y.runAxeAndSaveViolations()` on the pre-existing custom `slack` fixture, invoked explicitly at page-ready. Canonicalises five reusable patterns: fixture-extension as integration surface for cross-cutting testing concerns, two-axis exclusion list (ticketed-known-issue + out-of-scope-by-design), severity-gated reporting (`critical` only at launch, `serious` / `moderate` / `mild` as future work), tri-mode opt-in test execution (`A11Y_ENABLE` flag composes on-demand local + scheduled nightly Buildkite + opt-in CI gate), alert-channel-to-Jira auto-ticket workflow (Slack alert channel spins up pre-populated Jira ticket with canonical label + Epic placement). Four new concepts: concepts/wcag-2-1-a-aa-scope, concepts/automated-vs-manual-testing-complementarity, concepts/playwright-locator-auto-wait, concepts/severity-filtered-violation-reporting. Three new systems: systems/axe-core, systems/axe-core-playwright, systems/jira. 91 tests in initial suite, non-blocking, WCAG 2.1 A+AA scope, `critical` impact only. Borderline Tier-2 ingest: developer-productivity rather than distributed-systems internals; same disposition as the 2024-06-19 Enzyme→RTL codemod ingest; the canonicalised patterns generalise to any automated-check-integration-into-existing-test-suite problem.)
- 2024-08-26 — sources/2024-08-26-slack-unified-grid-how-we-re-architected-slack-for-our-largest-customers
(Slack's 2024-08-26 retrospective on the Unified Grid project, a decade-after-launch re-architecture of the Slack client and backend's fundamental tenant-scoping assumption, shifting from workspace-centric to org-wide.)
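The severity-gated reporting and two-axis exclusion list from the 2025-01-07 entry above can be sketched as a filter over axe-style violation records. This is a language-neutral illustration in Python, not Slack's TypeScript implementation: the `reportable` helper, the exclusion tables, the ticket id, and the selectors are all hypothetical. The records mirror the `id` / `impact` / `nodes[].target` fields of axe-core results, and the impact names follow axe-core's own labels, where the lowest level is `minor`.

```python
from __future__ import annotations

# Impact levels as axe-core names them, lowest to highest.
IMPACT_ORDER = ["minor", "moderate", "serious", "critical"]

# Axis 1 (hypothetical data): rule ids that already have an open ticket.
KNOWN_ISSUES = {"color-contrast": "A11Y-123"}
# Axis 2 (hypothetical data): selectors that are out of scope by design.
OUT_OF_SCOPE_SELECTORS = {"#third-party-embed"}


def reportable(violations: list, min_impact: str = "critical") -> list:
    """Keep violations at or above min_impact that no exclusion list covers."""
    threshold = IMPACT_ORDER.index(min_impact)
    kept = []
    for v in violations:
        if IMPACT_ORDER.index(v["impact"]) < threshold:
            continue  # severity gate
        if v["id"] in KNOWN_ISSUES:
            continue  # axis 1: already ticketed
        # Axis 2: drop nodes inside out-of-scope regions.
        nodes = [n for n in v["nodes"]
                 if not set(n["target"]) & OUT_OF_SCOPE_SELECTORS]
        if nodes:  # skip the violation entirely if no in-scope node remains
            kept.append({**v, "nodes": nodes})
    return kept


sample = [
    {"id": "button-name", "impact": "critical",
     "nodes": [{"target": ["#send"]}]},
    {"id": "color-contrast", "impact": "serious",
     "nodes": [{"target": ["#nav"]}]},
    {"id": "image-alt", "impact": "critical",
     "nodes": [{"target": ["#third-party-embed"]}]},
    {"id": "region", "impact": "moderate",
     "nodes": [{"target": ["#main"]}]},
]

print([v["id"] for v in reportable(sample)])
print([v["id"] for v in reportable(sample, min_impact="moderate")])
```

At the launch setting (`critical` only) the sample reports just `button-name`; lowering the gate to `moderate` adds `region`, while `color-contrast` stays suppressed by the known-issue list and `image-alt` by the out-of-scope selector. Keeping the gate as a single parameter is what would let the threshold be widened later without touching the exclusion logic.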
- 2024-06-19 — sources/2024-06-19-slack-ai-powered-conversion-from-enzyme-to-react-testing-library
(Retrospective on migrating 15,000+ Enzyme tests to React
Testing Library using an AST + LLM hybrid pipeline. AST-only
hit ~45% ceiling; LLM-only (Claude 2.1) was 40-60% with
wild variance; hybrid reached ~80% on selected evaluation
files. Per-test-case DOM captured by instrumenting Enzyme
`mount`/`shallow`. In-code annotation comments from the AST pass shaped LLM output. Tool later open-sourced as `@slack/enzyme-to-rtl-codemod`. ~64% adoption across Slack's RTL migration; a 338-file CI-nightly run produced ~500 auto-passing test cases (~22% documented developer-time savings, a lower bound).)