
Bounded fan-out relevance cap

Bounded fan-out relevance cap is the discipline of capping the worst-case set that a scatter-gather fallback fans out over: define a user-specific "relevance" subset and restrict the fallback to that subset, rather than letting p99.9 latency be set by long-tail users whose fan-out sets are pathologically large.

The canonical wiki instance is Slack's Unified Grid (Source: sources/2024-08-26-slack-unified-grid-how-we-re-architected-slack-for-our-largest-customers): when a legacy API's last-resort fallback is "iterate over the user's workspaces and try each shard," most users have a handful of workspaces and the fallback is "surprisingly performant." But Slack has a long tail of admin users in "hundreds of workspaces." Slack's solution: cap "relevant" workspaces per user at 50, with a manual-configuration escape hatch for the long-tail user. "Because such users are generally administrators who do not interact with all those workspaces, we decided to cap the number of 'relevant' workspaces at 50 and allow users to manually configure this list."

The pathology the cap addresses

A scatter-gather fallback's latency and cost scale linearly with fan-out size. When fan-out size is a per-user property ("iterate over the user's N tenants"), the worst case is set by the user with the largest N, not by the average. In a multi-tenant SaaS this means:

  • Regular users — 2–5 tenants. Fan-out is cheap.
  • Team admins — 10–30 tenants. Fan-out is noticeable.
  • Org / enterprise admins — 100s to 1000s of tenants. Fan-out is pathological — they generate a p99.9 tail.

Without a cap, the worst-case user dominates the SLO. Scaling the backend to handle their fan-out is wasteful because they don't actually use all their tenants interactively — they manage them.
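
The linear scaling above can be sketched as a minimal cost model. The per-shard cost and the sequential iteration here are assumptions for illustration, not figures from the Slack post:

```python
# Illustrative cost model for a sequential scatter-gather fallback:
# latency grows linearly with the size of the user's fan-out set.
PER_SHARD_MS = 8  # assumed per-shard query cost, not a measured figure

def fallback_latency_ms(num_tenants: int) -> int:
    """Worst-case latency of 'iterate over the user's tenants and try each shard'."""
    return PER_SHARD_MS * num_tenants

regular_user = fallback_latency_ms(3)    # 24 ms
org_admin = fallback_latency_ms(500)     # 4000 ms -- this user sets p99.9
```

Even with parallel fan-out, cost (connections, shard load) still grows with N; only wall-clock latency shifts from sum to max over shards.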

Why the cap works

The cap exploits a usage asymmetry: the users with pathological tenant-set sizes (admins) don't actually interact with all their tenants. Their day-to-day work happens in a smaller "relevant" subset. Capping the fallback's fan-out to that subset:

  • Restores the SLO for admin users' fallback calls.
  • Preserves full access to the long tail via an explicit configuration surface ("allow users to manually configure this list") rather than silently dropping it.
  • Keeps the backend's worst-case bounded at the cap, not at the per-user tenant-count distribution's tail.

The three design knobs

  1. Cap value (Slack: 50). Set high enough that a typical admin doesn't hit it; low enough that fan-out stays within SLO.
  2. Default selection policy — when a user belongs to more tenants than the cap allows, which ones go in the relevance set? Likely signals: recency, activity count, membership role.
  3. User-configurable escape hatch — Slack exposes a manual list. Without this, admins whose genuinely-active set exceeds the cap would lose functionality.

Generalisation

The pattern is load-bearing for any multi-tenant fallback path where:

  • The fallback is scatter-gather over a per-user entity set.
  • The per-user entity-set size is heavy-tailed (admin users have 10× or 100× the count of typical users).
  • The long-tail users don't actually use all their entities interactively.

Examples likely to benefit:

  • GitHub organisations / repos — a user in 500 orgs; a fan-out over them is pathological; a recency-based relevance cap is a sensible mitigation.
  • Permissions-check fan-out in multi-cluster Kubernetes — a cluster-admin's fan-out over 1000s of namespaces; cap to recently-used.
  • Notification inbox aggregation across tenants — an operator in many tenants; cap to tenants with recent activity.

The anti-patterns: solving the admin long-tail problem by building tenant-independent aggregate indexes (expensive), or by rejecting admin-user access above some threshold (breaks the product). The relevance cap is cheaper than the first and, unlike the second, preserves the product.

Distinction from rate-limiting / shedding

A relevance cap is not rate-limiting (which throttles request volume) and not load shedding (which drops work). It narrows the computational work per request for the long-tail users by shrinking the fallback's fan-out set. The long-tail user still gets a successful response — just over their relevant subset.
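
A toy contrast, with all names and thresholds invented for illustration:

```python
# Three different responses to an over-sized user (all values hypothetical):
def rate_limited(request_count: int, limit: int = 100) -> str:
    # Rate-limiting throttles request volume: the request is rejected.
    return "429" if request_count > limit else "200"

def load_shed(queue_depth: int, threshold: int = 1000) -> str:
    # Load shedding drops work under pressure: the request is discarded.
    return "dropped" if queue_depth > threshold else "processed"

def relevance_capped(tenant_count: int, cap: int = 50) -> dict:
    # The relevance cap shrinks the work per request: the request succeeds,
    # just over a bounded fan-out set.
    return {"status": "200", "fanout": min(tenant_count, cap)}
```

Only the third mechanism returns a successful response to the long-tail user while keeping the backend's worst-case bounded.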
