
SLACK 2026-03-31 Tier 2


Slack — From Custom to Open: Scalable Network Probing and HTTP/3 Readiness with Prometheus

Summary

Slack's edge-networking team hit a monitoring gap when they started rolling out HTTP/3 on their public edge: existing SaaS and internal black-box probes could not speak HTTP/3, because HTTP/3 runs over QUIC, which runs over UDP rather than TCP. None of the SaaS observability tools Slack evaluated had native QUIC support, and neither did the internal Prometheus Blackbox Exporter (BBE), Slack's canonical client-side black-box prober. An intern, Sebastian Feliciano, scoped and contributed HTTP/3 / QUIC support upstream to Prometheus Blackbox Exporter, using systems/quic-go as the underlying client. Because merging into an upstream OSS project can't be timed to an internship, Sebastian also built an in-house BBE fork / integration on the new upstream code path so Slack could ship probing to production before the upstream PR merged. Slack now monitors HTTP/1.1, HTTP/2, and HTTP/3 edge endpoints side by side in a Grafana "single pane of glass", with reliable alerts on HTTP/3 health and easier correlation with other telemetry.

Key takeaways

  • HTTP/3 breaks existing black-box probes because the transport is UDP. Existing SaaS observability tools and Prometheus BBE probe HTTP over TCP; QUIC / HTTP/3 requires a UDP-speaking client with TLS 1.3 and the HTTP/3 request multiplexing semantics. Without that, there is zero client-side visibility into hundreds of thousands of HTTP/3 endpoints — no way to detect regressions to HTTP/2 or measure accurate round-trip times on the new transport. Canonical concepts/http-3-probing-gap datum (Source: this article).
  • "Monitor first, migrate second." Slack explicitly frames observability-before-migration as a takeaway: "getting observability right as a precursor to migration makes everything faster." This generalises beyond HTTP/3 — any transport- or protocol-level migration that loses visibility mid-flight stalls on fear of regression (Source: this article; canonical concepts/observability-before-migration instance).
  • Choice of QUIC client matters. Slack's upstream contribution picked systems/quic-go as the foundation for the new BBE HTTP/3 probe type, citing "wide adoption across other open source technologies, as well as the first-class support it provides in creating http clients in go" (Source: this article). Second canonical wiki instance of quic-go in production-adjacent tooling (first was PlanetScale's HTTP/3 MySQL driver benchmark).
  • Composability with existing tool shape. Sebastian "had to add this new logic while following the Blackbox Exporter's existing architecture, ensuring the new features maintained the tool's configuration patterns". This upstream-friendly architectural discipline is what earns community buy-in and lands a new protocol as a first-class module (Source: this article).
  • Parallel paths for internship-limited upstream contribution. "Making an open-source contribution as an intern is a huge accomplishment. As many of us know, maintainers don't always merge PRs quickly, especially for new features. Sebastian's internship timeline was limited, so he couldn't wait. Sebastian took matters into his own hands and architected an in-house system that utilized the new upstream features for probing out HTTP/3 endpoints" (Source: this article). Canonical patterns/upstream-contribution-parallel-to-in-house-integration instance.
  • "Single pane of glass" payoff. Slack ended up with HTTP/1.1, HTTP/2, and HTTP/3 metrics unified in Grafana, enabling side-by-side comparison, reliable HTTP/3 alerts, and easier correlation with other telemetry (Source: this article).
  • Open-sourcing pays dividends. "When a game changing protocol like QUIC comes through, and there's a gap in existing technologies supporting it, everyone wins when we fill the gap, and we win when everyone decides to support it long term." Extends the existing patterns/upstream-fixes-to-community pattern (previously canonical Shopify × Reanimated instance) with a second wiki instance at the upstream-a-whole-new-feature altitude rather than upstream-fixes-to-existing-feature.

Systems extracted

  • systems/prometheus — Tier-1 CNCF metrics TSDB, Slack's canonical metrics backend. Already a wiki page; this ingest extends it with the black-box probing / Blackbox Exporter axis.
  • systems/prometheus-blackbox-exporter (BBE) — new canonical page. Prometheus's official black-box prober for HTTP, HTTPS, DNS, TCP, ICMP, and (post-Sebastian) HTTP/3/QUIC. Documented as "a cornerstone of our monitoring" at Slack.
  • systems/quic-go — existing page (PlanetScale HTTP/3 benchmark instance). Extended with this ingest as the upstream-integrated QUIC library for the new HTTP/3 BBE probe.
  • systems/grafana — existing page. Extended with this ingest as the unified HTTP/1.1 + HTTP/2 + HTTP/3 dashboarding surface.

Concepts extracted

  • concepts/http-3 — existing page (Cloudflare-ingested). Extended with this ingest's monitoring-gap angle.
  • concepts/observability — existing canonical concept. Extended with the monitor-first-migrate-second discipline.
  • concepts/client-side-black-box-probe — new concept canonicalising the synthetic-monitoring altitude Slack operates at (agent-driven, request-shaped, protocol-aware probing from outside the service-under-test).
  • concepts/http-3-probing-gap — new concept naming the UDP-transport-breaks-TCP-probers failure class when protocols migrate.
  • concepts/observability-before-migration — new concept crystallising Slack's explicit takeaway that visibility must be in place before a transport / protocol / platform migration proceeds, else the migration stalls on fear of regression.

Patterns extracted

  • patterns/upstream-fixes-to-community — existing pattern (Shopify × Reanimated canonical). Extended with the Slack × BBE instance at the upstream-a-whole-new-feature altitude, a distinct shape from the existing fix-existing-bug-at-scale altitude.
  • patterns/upstream-contribution-parallel-to-in-house-integration — new canonical pattern. When you need a feature shipped into upstream OSS and cannot wait on the maintainer-merge timeline, build both paths in parallel: (a) open-source PR following existing project conventions to earn buy-in, (b) in-house integration that uses the same code path so you can ship to production immediately. When the upstream merges, the in-house integration becomes a thin veneer over the now-upstream code.

Operational numbers

  • Hundreds of thousands of HTTP/3 endpoints — the probe-target scale Slack needed to cover client-side before HTTP/3 could roll out safely (verbatim: "Without the ability to probe hundreds of thousands of HTTP/3 endpoints in our new infrastructure…").
  • Three HTTP versions unified — HTTP/1.1 + HTTP/2 + HTTP/3 in one Grafana view post-integration.
  • Zero SaaS observability vendors supported HTTP/3 probing "out of the box" at the time of the investigation.
  • One open-source PR landed — the HTTP/3 configuration for Prometheus Blackbox Exporter is documented at the pinned ref bee8e9102a106bff63281ee9c64c7b1275ef21d0.
  • Code shape disclosed:
    http3Transport := &http3.Transport{
        TLSClientConfig: tlsConfig,
        QUICConfig:      &quic.Config{},
    }
    client = &http.Client{Transport: http3Transport}
    
    — a thin wrapper over systems/quic-go's http3.Transport slotted into BBE's existing HTTP client abstraction.

Caveats

  • No disclosed production latency / success-rate numbers on the HTTP/3 edge rollout itself. The article is about the monitoring-gap and its resolution, not about HTTP/3 performance at Slack's edge specifically.
  • No disclosed scale of the probing fleet — number of probers, probe frequency, per-endpoint fan-out, sampling policy.
  • No disclosed incident or regression count — the probing enabled "reliable alerts" but the post does not quantify how many HTTP/3 regressions the new alerts caught.
  • PR-merge status is not disclosed in the body. The configuration-doc link shows the feature was eventually merged (the pinned commit SHA exists in github.com/prometheus/blackbox_exporter), but the post doesn't give a merge date or say how long the in-house fork ran before the merge. Either way, this remains the load-bearing patterns/upstream-contribution-parallel-to-in-house-integration instance: the in-house integration shipped in parallel with the PR and did not gate on the merge.
  • No ClickHouse / warehouse / long-term-storage stack disclosure — probe metrics presumably go into Slack's Prometheus remote-write backend, but that isn't spelled out.
  • Recruiting-pitch tail — the final two paragraphs are internship-program PR and a hiring link; non-architectural, not ingested. The body proper is real engineering content (architecture diagram, code snippet, upstream-PR reference, operational-improvements bullet list, takeaways), comfortably on-scope for Tier-2 ingest.

Contradictions

None. Strictly additive to the existing systems/prometheus and concepts/http-3 corpora. The patterns/upstream-fixes-to-community pattern is extended, not contradicted.

Source
