title: Slack — Unified Grid: How We Re-Architected Slack for Our Largest Customers
type: source
created: 2026-04-24
updated: 2026-04-24
tier: 2
company: slack
published: 2024-08-26
url: https://slack.engineering/unified-grid-how-we-re-architected-slack-for-our-largest-customers/
raw: raw/slack/2024-08-26-unified-grid-how-we-re-architected-slack-for-our-largest-cus-1170ab7d.md
hn_points: 34
hn_url: https://news.ycombinator.com/item?id=41355864
tags: [slack, enterprise-grid, unified-grid, re-architecture, workspace-sharding, org-wide, tenant-scoping, vitess, sharding, session-token, bounded-fan-out, prototype-the-path, two-pass-api-migration, parallel-integration-tests, dogfooding, ia4, xws, cross-workspace-channels]
systems: [enterprise-grid, unified-grid, slack-rtm, vitess]
concepts: [workspace-scoped-to-org-wide-migration, session-token-embedded-routing-context, bounded-fan-out-relevance-cap, horizontal-sharding]
patterns: [prototype-the-path, two-pass-api-migration, parallel-integration-test-suite-for-context-switch]
related: [systems/enterprise-grid, systems/unified-grid, systems/vitess, concepts/workspace-scoped-to-org-wide-migration, concepts/session-token-embedded-routing-context, concepts/bounded-fan-out-relevance-cap, concepts/horizontal-sharding, patterns/prototype-the-path, patterns/two-pass-api-migration, patterns/parallel-integration-test-suite-for-context-switch, companies/slack]
Slack — Unified Grid: How We Re-Architected Slack for Our Largest Customers
Summary
Slack's Unified Grid project (2021–March 2024) replaced the
workspace-scoped client and backend architecture that had
anchored the product since 2013 with an org-wide architecture
serving Enterprise Grid customers. The motivating problem: by the
late 2010s a significant portion of Enterprise Grid users belonged
to multiple workspaces on the same Grid, but the client forced
them to switch between workspaces to see org-wide data —
causing context switching, missed activity, and an ever-growing
class of bugs from syncing the same data across multiple
workspaces. The fundamental architectural assumption Slack shipped
with in 2013 — "almost all data is particular to a single
workspace" — was no longer true. Rather than stacking more
compatibility layers on top of that assumption, Slack took the
unusual step of revising the foundational assumption: the
client now boots with an org-wide view of channels and DMs, and
every API and permission check that was implicitly workspace-scoped
had to be retaught to operate without a workspace context. This
touched thousands of APIs and permission checks across most of
Slack's product engineering teams. The key architectural
preconditions were already in place: the Vitess
migration had re-sharded the most important tables (notably
messages, by channel ID) along axes other than workspace or org
ID, so many APIs could simply drop the workspace context; and the
real-time messaging (RTM) stack had been reworked to stop
fanning-out org-wide data to every workspace on the Grid.
Slack handled each broken API with one of three strategies: (1) drop
workspace routing entirely if Vitess had re-sharded the underlying
table; (2) prompt the user to pick a workspace when the operation
genuinely acted on one (e.g. channel creation); (3) as a last
resort, iterate over the user's relevant workspaces and try each
shard, with the fan-out bounded by capping "relevant" workspaces at
50 per user (a cap that mainly affects admins, the long tail of
users in hundreds of workspaces). Unified Grid became the
foundational layer beneath IA4 (the Activity/DMs/Later tabs
redesign). Internal dogfooding reached much of the company by
Summer 2023; customer rollout began in Fall 2023 and completed in
March 2024.
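
The three strategies read naturally as a decision tree followed by a bounded fallback loop. The sketch below is illustrative only: `ApiTraits`, its two flags, `choose_strategy`, and `iterate_workspaces` are hypothetical names invented here, and the post does not describe how Slack classified each API in code. Only the cap of 50 comes from the source.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional, Sequence

MAX_RELEVANT_WORKSPACES = 50  # the cap disclosed in the post


class Strategy(Enum):
    DROP_WORKSPACE_ROUTING = auto()  # (1) table is no longer sharded by workspace
    PROMPT_FOR_WORKSPACE = auto()    # (2) operation genuinely targets one workspace
    ITERATE_WORKSPACES = auto()      # (3) last resort: try each relevant shard


@dataclass(frozen=True)
class ApiTraits:
    # Both flags are hypothetical metadata, invented for this sketch.
    table_resharded_off_workspace: bool
    acts_on_single_workspace: bool


def choose_strategy(api: ApiTraits) -> Strategy:
    """Pick the cheapest strategy that applies, in the order the post lists them."""
    if api.table_resharded_off_workspace:
        return Strategy.DROP_WORKSPACE_ROUTING
    if api.acts_on_single_workspace:
        return Strategy.PROMPT_FOR_WORKSPACE
    return Strategy.ITERATE_WORKSPACES


def iterate_workspaces(
    relevant_workspaces: Sequence[str],
    lookup: Callable[[str], Optional[object]],
) -> Optional[object]:
    """Try each relevant workspace in order; return the first hit, or None.

    The slice enforces the cap, so the worst case is 50 lookups even for
    a user who belongs to hundreds of workspaces.
    """
    for workspace_id in relevant_workspaces[:MAX_RELEVANT_WORKSPACES]:
        result = lookup(workspace_id)
        if result is not None:
            return result
    return None
```

For a user in a handful of workspaces the loop ends after a few lookups, which matches the post's observation that the fallback is cheap in the typical case; the slice only matters for the long tail.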
Key takeaways
- The founding architectural assumption can become the bottleneck, and sometimes you have to revise it rather than compensate for it. Slack launched with "each user belongs to a single workspace", and the codebase encoded that assumption everywhere: session tokens carried a workspace ID, the backend parsed the workspace ID and used it to route queries and perform access control, client code stored data per workspace. When the usage pattern shifted and Enterprise Grid users routinely belonged to many workspaces on the same Grid, the compensating patches (Threads view, Unreads view, cross-workspace channels) were starting to strain the workspace-centric shape. The explicit verbatim framing: "when the architecture of an application drifts far enough from how that application is used, prototyping a path towards rewriting the core foundation is actually the best way to achieve your goals." (Source: sources/2024-08-26-slack-unified-grid-how-we-re-architected-slack-for-our-largest-customers)
- Infrastructure migrations are architectural preconditions, not incidental line items. Unified Grid was only viable because two prior infra projects had already decoupled data from the workspace-routing axis. First, the Vitess migration re-sharded key tables (messages are sharded by channel ID, per the earlier scaling-datastores-at-slack-with-vitess post that the source cites), so queries no longer needed a workspace ID to find the right shard. Second, the RTM stack rework removed the fan-out of org-wide data to every workspace on the Grid (some customers have "thousands of workspaces"). Without these two, Unified Grid would have had to ship them concurrently, multiplying the blast radius of each. Lesson: an architectural re-axis is cheaper when the data layer has already been decoupled from the old axis. (Source: sources/2024-08-26-slack-unified-grid-how-we-re-architected-slack-for-our-largest-customers)
- "Prototype the path" — use dogfood before scale-out to validate the re-architecture is possible. Slack coins (and the wiki canonicalises as) patterns/prototype-the-path: rather than start by tackling thousands of broken APIs, they built a proof-of-concept Slack client that could boot in Unified Grid mode and used it themselves — "we are some of the heaviest users of Slack, we knew that if we could use Unified Grid in our day-to-day work, we'd start getting good signals about what did and didn't work." Execs were onboarded; then-CEO Stewart Butterfield's "This is obviously better" was the signal the investment was worth the effort. This differs from patterns/prototype-before-production (Figma multiplayer simulator) in that the prototype becomes the production path via incremental fixes rather than informing a rewrite. The prototype is the wedge, not the research artifact.
- Bound the worst-case fan-out to make the scatter-gather fallback performant. When an API couldn't be re-routed (strategies 1 and 2 didn't apply), Slack's third strategy was to iterate over the user's workspaces and try each shard. "Because most users are in only a handful of workspaces, this approach is surprisingly performant. However, there is a long tail of users in hundreds of workspaces." Slack's solution: cap "relevant" workspaces at 50 per user, with manual user configuration. Admins (who have the highest workspace-membership counts) are the long tail; they don't interact with all their workspaces, so the cap restricts the fan-out to the set they actually use. This is the canonical wiki instance of concepts/bounded-fan-out-relevance-cap — a tenant-centric worst-case-fan-out guardrail. See also concepts/scatter-gather-query for the query shape.
- Two-pass API migration decouples "make it work" from "make it correct." Most Slack APIs were marked as Unified-Grid-compatible via a first pass that made the API work well enough for internal usage; a second pass, "perhaps weeks later", fixed integration tests, permission checks, and edge cases. Rationale: "This two-phase approach allowed us to manually verify and get a feel for functionality which was not entirely ready for primetime." Canonicalised as patterns/two-pass-api-migration. Natural pairing with the prototype-the-path rollout: first-pass APIs ship to employees, second-pass fixes land before customer rollout. This avoids the all-or-nothing gate that would block dogfooding behind perfect correctness.
- Reuse existing integration tests by swapping the context axis. Slack created a "parallel integration test suite which ran all our existing integration tests using org context instead of workspace context." "This let us reuse thousands of tests rather than rewriting them from the ground up." Hundreds of test suites were initially broken — those became the actionable list to fix per API. Canonicalised as patterns/parallel-integration-test-suite-for-context-switch. The discipline: don't fork your test corpus; axis-swap it and inherit coverage.
- Session tokens as routing-context carriers are a hidden coupling to tenant shape. Slack's 2013 design: session tokens ("workspace tokens") contained the user ID and the workspace ID. "The backend then parsed the workspace ID and used it to associate each API request with a workspace, route queries to that workspace's database shard and perform access control." This is efficient and safe in a single-tenant-per-session world — but it couples every authenticated request to the workspace routing axis. Canonicalised as concepts/session-token-embedded-routing-context. Moving away from workspace tokens meant re-teaching thousands of code paths what identity and routing look like when the token doesn't carry a tenant ID. The framing: session tokens are not just authentication envelopes; they are routing-context envelopes, and that's a design decision with decade-scale consequences.
- Unified Grid became a foundational dependency for the IA4 client redesign. Rather than ship IA4 (the Activity / DMs / Later tabs redesign) and Unified Grid as separate large changes — which would have subjected customers to two simultaneous disruptions — Slack coupled them: "Unified Grid became a foundational component of IA4, and with it a top company priority." This is the classic "new information-architecture is invalid under old data-model" forcing function — the IA4 tabs (Activity, DMs, Later) are org-wide-by-design surfaces; they cannot be built on top of a workspace-scoped data model. The rewrite becomes the price of the product bet. Indirect implication: IA4's existence means the old workspace-centric architecture would have had to be rewritten anyway, on a different schedule, to serve IA4 — so the two-for-one bundling is actually the lower-risk path.
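
The session-token coupling described in the takeaways can be made concrete with a small sketch. Everything here is an assumption for illustration: the `SessionToken` fields, the `shard_for_key` placement function, and the shard count are invented, since the post discloses neither the token format nor the shard map. The point it shows is that token-based routing breaks as soon as the token has no workspace, while a table sharded by channel ID can be routed without consulting the token at all.

```python
from dataclasses import dataclass
from typing import Optional

SHARD_COUNT = 8  # hypothetical; real shard counts are not disclosed


@dataclass(frozen=True)
class SessionToken:
    user_id: str
    org_id: Optional[str] = None
    workspace_id: Optional[str] = None  # None models an org-wide session


def shard_for_key(key: str) -> int:
    # Hypothetical deterministic placement, standing in for a real shard map.
    return sum(key.encode("utf-8")) % SHARD_COUNT


def route_by_token(token: SessionToken) -> int:
    """Workspace-token model: the token alone determines the shard."""
    if token.workspace_id is None:
        raise LookupError("token carries no workspace; cannot route by token")
    return shard_for_key(token.workspace_id)


def route_message_query(token: SessionToken, channel_id: str) -> int:
    """Channel-sharded model: routing ignores the token's workspace."""
    return shard_for_key(channel_id)
```

An org-wide token makes `route_by_token` fail, whereas `route_message_query` returns the same shard for workspace and org tokens alike. Access control is deliberately left out of the sketch; it still has to be answered somewhere once the token no longer names a workspace.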
Systems / concepts / patterns extracted
Systems (new canonical pages):
- systems/enterprise-grid: Slack's 2017 product for large customers; introduced the org as a "parent" to multiple workspaces, cross-workspace (XWS) channels stored at org-shard level, permissions layered at workspace and org levels, shared billing and admin surface.
- systems/unified-grid: the 2023–2024 re-architecture. The client boots with an org-wide view of all channels and DMs the user can access across the Grid; no workspace switching is required for cross-workspace data; the workspace is available as a user-facing filter rather than a structural scope.
Concepts (new canonical pages):
- concepts/workspace-scoped-to-org-wide-migration: the specific multi-tenant re-axis, tenant-scoped-per-session → org-scoped-with-intra-org-sub-tenants. Not a generic "monolith-to-multi-tenant" migration but the inverse: consolidating a per-tenant session model into a per-org session model where the tenant (workspace) becomes a sub-filter. Has siblings in any enterprise SaaS whose early design assumed one tenant per user and later needed org-wide views (e.g. GitHub organisation → enterprise, Atlassian workspace → Atlassian Cloud, Notion workspaces).
- concepts/session-token-embedded-routing-context: session tokens that carry tenant/shard routing info in addition to authentication material. Efficient routing at request-parse time but a structural coupling: the session's granularity determines how much data you can expose in a single authenticated request without re-authenticating. Distinct from concepts/idempotency-token (replay-safety) and concepts/discharge-token (attenuation).
- concepts/bounded-fan-out-relevance-cap: the discipline of capping the worst-case set that a scatter-gather fallback fans out over. Slack caps "relevant workspaces" per user at 50; admins (the long tail) can manually configure which subset matters. Generalises: whenever a fallback path is scatter-gather across a per-tenant set, the p99.9 is set by the long-tail tenant-set size; bounding the set restores SLO.
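
A relevance cap needs some rule for choosing which workspaces make the cut. The post says only that the cap is 50 and that users can configure the list manually; it does not say how the default set is chosen. The sketch below assumes, purely for illustration, that manually pinned workspaces come first and the remainder is filled by most recent activity. `relevant_workspaces`, `pinned`, and the last-active timestamps are hypothetical.

```python
from typing import Dict, Iterable, List


def relevant_workspaces(
    last_active: Dict[str, float],
    pinned: Iterable[str] = (),
    cap: int = 50,
) -> List[str]:
    """Return at most `cap` workspace IDs for the scatter-gather fallback.

    last_active maps workspace ID -> last-activity timestamp (an assumed
    relevance signal). Pinned workspaces are kept first, in the order given;
    the rest of the budget goes to the most recently active workspaces.
    """
    # Deduplicate pins while preserving order; ignore pins the user is not in.
    chosen = [w for w in dict.fromkeys(pinned) if w in last_active][:cap]
    already = set(chosen)
    by_recency = sorted(
        (w for w in last_active if w not in already),
        key=lambda w: last_active[w],
        reverse=True,
    )
    chosen.extend(by_recency[: cap - len(chosen)])
    return chosen
```

Whatever the real selection rule is, the property that matters for latency is the same: the returned list never exceeds the cap, so the fallback's worst case no longer grows with the user's total membership count.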
Patterns (new canonical pages):
- patterns/prototype-the-path: Slack's named methodology for re-architecture. Build a barely-working prototype, dogfood it internally, let the painful edge cases surface via real usage, fix them one by one, expand the user base in concentric rings (core team → engineering org → execs → whole company → customers). Distinct from patterns/prototype-before-production in that the prototype becomes the production system, not a throwaway.
- patterns/two-pass-api-migration: first pass makes the API work well enough for internal users; second pass fixes integration tests, permission checks, edge cases. Explicit separation of "ready for employees" from "ready for customers." Avoids the correctness-gate that would block dogfooding.
- patterns/parallel-integration-test-suite-for-context-switch: reuse your existing integration test corpus by running it under a different context (org-context instead of workspace-context). The broken tests are the actionable list; you inherit coverage instead of forking the corpus.
Operational numbers disclosed
- Cap on "relevant" workspaces per user in the worst-case fan-out: 50. Admin users can manually configure the list. This is the single hard number in the post.
- Timeline:
  - 2013: Slack launches with single-workspace model.
  - 2017: Enterprise Grid introduces the org container.
  - Pre-Unified-Grid: Vitess migration re-shards key tables (notably messages, by channel ID); RTM stack reworked to stop org-wide fan-out.
  - ~2021–2022: Unified Grid prototype built; internal dogfooding begins.
  - Summer 2023: "much of the company was using it for their day-to-day work."
  - Fall 2023: customer rollout begins.
  - March 2024: customer rollout complete.
- Fan-out pathology qualitative: "some of our largest customers have thousands of workspaces!" — justifies the RTM org-wide-fan-out removal as prerequisite, and the 50-cap (otherwise worst-case fan-out is thousands).
- API surface: "thousands of APIs, database queries, and permissions checks" impacted. No specific count is given; this is an order-of-magnitude disclosure only.
- Engineering scope: "scores of engineers across most of Slack's product engineering teams."
Caveats
- Single-post vendor retrospective voice. The 2024-08-26 post, published on slack.engineering by unsigned author(s), is framed as a retrospective on a project that completed rollout "in March 2024", five months earlier. It is written in a product-marketing-adjacent register ("Execs were…concerned about the cost", "This is obviously better"), but the architectural content is substantive.
- Zero production numbers outside the 50-cap and the timeline — no latency deltas, no error-rate deltas, no capacity disclosures, no cost disclosures. The fan-out cap is the sole quantitative design parameter disclosed.
- Mechanism depth is thin on the boot API — the post mentions a "new boot API which returns data for all the workspaces and channels the user belongs to across the entire Grid" and that clients store this data at the org-level, but the wire protocol, payload size, paging model, and incremental-update mechanism are not walked. The RTM-stack rework that made this feasible is referenced but not described.
- Vitess-migration-as-precondition is asserted, not mechanism-detailed. The post cites a prior scaling-datastores-with-Vitess post (linked but not mirrored here) for the messages-by-channel-ID sharding decision. The implication is that Slack had already done the work to re-shard the hot tables off the workspace axis; exact shard-key choices for other tables (channels, users, memberships, permissions) are not disclosed.
- The three API-fix strategies are presented as a decision tree but not as a taxonomy with counts. We don't know what fraction of APIs fell into each bucket (drop-workspace-routing vs prompt-for-workspace vs per-workspace-iterate), which would let us sanity-check the performance claim that per-workspace iteration was "surprisingly performant" for the typical case.
- The permissions-check migration is even thinner than the API migration. The post mentions "convenience helpers to correctly fetch channels and perform permissions checks across all a user's workspaces on their Enterprise Grid" but the permission model itself (workspace-admin + org-admin composition across XWS channels) is summarised, not detailed.
- Rollout-risk disclosure is absent. A March-2024 rollout of a multi-year re-architecture of the primary enterprise product's primary client, affecting thousands of APIs, is a candidate for a full production-incident retrospective. The post discloses none. Either the rollout went cleanly or the rollout-risk details were withheld; the post does not say which.
- No client-data-migration depth. "Some clients added an org-level data store but continued to save some data in workspace-scoped repositories, while other clients moved everything to an org-wide store." The three clients (desktop, iOS, Android?) are not named; which took which path is not disclosed; the data-migration mechanism for in-place users is not walked.
- No alternative considered. The post does not engage with the question "could you have kept the workspace-centric model and added an aggregated-view compatibility layer?" explicitly. The implicit answer is the 2010s patches (Threads view, Unreads view) were that compatibility layer and they failed to keep up — but the failure mode is narrated, not systematically analysed.
Source
- Original: https://slack.engineering/unified-grid-how-we-re-architected-slack-for-our-largest-customers/
- Raw markdown:
raw/slack/2024-08-26-unified-grid-how-we-re-architected-slack-for-our-largest-cus-1170ab7d.md
Related
- companies/slack — second Slack ingest on the wiki (concurrent pipeline completed the 2024-06-19 AI-powered Enzyme→RTL ingest during this session); anchors the Slack backend-architecture axis alongside the frontend-developer-productivity-tooling axis.
- systems/enterprise-grid — the 2017 multi-workspace org product that this post re-architects the client over.
- systems/unified-grid — the 2024 re-architected client / backend shape.
- systems/vitess — the prior sharding substrate that made dropping workspace-based routing feasible for key tables.
- systems/slack-rtm — the RTM stack that was reworked to stop org-wide fan-out before Unified Grid was viable.
- concepts/workspace-scoped-to-org-wide-migration — the specific multi-tenant axis-change this post canonicalises.
- concepts/session-token-embedded-routing-context — the routing-coupling the workspace-scoped model bakes into session tokens.
- concepts/bounded-fan-out-relevance-cap — the 50-workspace cap pattern for admin long-tail users.
- concepts/horizontal-sharding — the re-sharding axis (workspace ID → channel ID via Vitess) that made the re-architecture feasible.
- concepts/scatter-gather-query — the query shape the per-workspace-iteration fallback embodies.
- patterns/prototype-the-path — Slack's named methodology for dogfood-driven incremental re-architecture.
- patterns/two-pass-api-migration: the make-it-work-then-make-it-correct discipline.
- patterns/parallel-integration-test-suite-for-context-switch — the test-reuse pattern via context-axis swap.
- patterns/prototype-before-production — the research-simulator sibling pattern; contrast: prototype is throwaway vs prototype-the-path where the prototype becomes the production path.
- concepts/incremental-delivery — the broader posture of shipping architectural change as small reversible steps; Unified Grid is a large-scale instance.