CONCEPT Cited by 1 source
Observability before migration¶
Principle¶
Close the observability gap on the new path before the migration proceeds. Orgs that switch transports / protocols / platforms without first extending their telemetry to cover the new path find themselves unable to distinguish "is the new thing healthy?" from "is the new thing broken in a way we can't see?" — and that uncertainty is what stalls migrations at small traffic percentages.
Slack's 2026-03-31 post crystallises this as an explicit takeaway (verbatim):
"Monitor first, and migrate second. This should go without saying, but getting observability right as a precursor to migration makes everything faster. We know that the industry is going towards QUIC, but proving to ourselves that it's the right move long term enables us to invest more into its future."
(Source: sources/2026-03-31-slack-from-custom-to-open-scalable-network-probing-and-http3-readiness)
Why the discipline¶
Absent the discipline, the typical failure shape is:
- Turn on the new path at 1% of traffic.
- Notice the in-process metrics look healthy.
- Try to ramp to 5%, see something anomalous in user reports, panic-roll-back.
- Can't tell whether the anomaly was a real regression on the new path or noise in the old path, because you can't compare apples to apples — the new path isn't probed the same way the old path is.
- Stall at 1% indefinitely, or revert the migration.
With the discipline:
- Before turning on the new path, build probes that speak it natively and surface its metrics alongside the old path's.
- Turn on 1%; compare new-path metrics to old-path metrics directly ("single pane of glass").
- Ramp based on measured equivalence or improvement; auto-rollback on measured regression.
- Migration completes because the telemetry substrate lets you tell, at each step, whether it is safe to keep ramping.
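The "ramp on measured equivalence, roll back on measured regression" step can be sketched as a small decision function. This is a minimal illustration, not Slack's implementation: the `PathSamples` type, the `ramp_decision` name, and every threshold below are hypothetical, and it assumes both paths are probed over the same time window.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PathSamples:
    """Probe results for one path over a shared time window (hypothetical schema)."""
    successes: int
    failures: int
    latencies_ms: list[float]

    @property
    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 0.0


def ramp_decision(old: PathSamples, new: PathSamples,
                  max_success_drop: float = 0.005,
                  max_latency_ratio: float = 1.10,
                  min_samples: int = 100) -> str:
    """Return 'hold', 'rollback', or 'ramp' from side-by-side probe data.

    All thresholds are illustrative assumptions, not recommended values.
    """
    new_total = new.successes + new.failures
    # Too little data on either path: do not act on noise.
    if new_total < min_samples or not new.latencies_ms or not old.latencies_ms:
        return "hold"
    # Measured regression in availability relative to the old path.
    if old.success_rate - new.success_rate > max_success_drop:
        return "rollback"
    # Measured regression in median latency relative to the old path.
    if median(new.latencies_ms) > max_latency_ratio * median(old.latencies_ms):
        return "rollback"
    return "ramp"
```

The point of the sketch is that every branch compares the new path against the old path rather than against an absolute target, which is only possible when both are probed the same way.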
Canonical instance¶
Slack rolling out HTTP/3 on the edge. The monitoring gap — documented as concepts/http-3-probing-gap — existed because SaaS observability tools and Slack's own Prometheus Blackbox Exporter fleet both lacked QUIC support.
The team closed the gap before scaling HTTP/3 rollout:
- Built an http3 BBE module on systems/quic-go following existing BBE configuration patterns.
- Open-sourced it to Prometheus Blackbox Exporter upstream.
- Ran the contribution on the in-house fork in parallel while waiting for upstream merge (see patterns/upstream-contribution-parallel-to-in-house-integration).
- Unified HTTP/1.1 + HTTP/2 + HTTP/3 probe metrics in Grafana — the direct payoff of having observability ready before scale-up.
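One way to picture the unified view is probe output that reuses the same metric names for every protocol and distinguishes them only by a label. The sketch below renders such output in the Prometheus text exposition format. `probe_success` and `probe_duration_seconds` are standard Blackbox Exporter metric names, but the `protocol` label, the `render_probe_metrics` helper, and the sample values are assumptions made for illustration; the source does not describe how Slack labels its series.

```python
def render_probe_metrics(results):
    """Render probe results in Prometheus text exposition format.

    `results` maps a protocol label value to (success, duration_seconds).
    The `protocol` label is a hypothetical naming choice for this sketch.
    """
    lines = ["# TYPE probe_success gauge"]
    for proto, (ok, _) in sorted(results.items()):
        lines.append(f'probe_success{{protocol="{proto}"}} {1 if ok else 0}')
    lines.append("# TYPE probe_duration_seconds gauge")
    for proto, (_, duration) in sorted(results.items()):
        lines.append(f'probe_duration_seconds{{protocol="{proto}"}} {duration}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample values, only to show the shape of the output.
    print(render_probe_metrics({
        "http1.1": (True, 0.051),
        "http2": (True, 0.047),
        "http3": (True, 0.042),
    }))
```

Because the metric names match across protocols, a single dashboard query grouped by the label can put all three side by side.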
Generalises to¶
- Transport migrations — TCP→QUIC (this post), raw-socket → gRPC, HTTP → WebSocket / SSE.
- Protocol version migrations — MySQL-wire → MySQL-over-HTTP, HTTP/1.1→HTTP/2→HTTP/3.
- Auth-scheme migrations — session-cookie → OAuth token → JWT → mTLS.
- Platform / substrate migrations — EC2 → Kubernetes, VM → container → serverless, monolith → services.
- Region / topology migrations — single-region → multi-region, active-passive → active-active.
In each case, the same question applies: "before we turn on the new path, can we compare its behaviour to the old path's with the same fidelity?"
Versus just "instrument the new thing"¶
The discipline is stricter than "add monitoring to the new thing":
- Monitoring has to be comparable — the new-path metrics must line up with old-path metrics so side-by-side comparison is real, not symbolic. Slack's payoff metric is literally "unified view of HTTP/1.1, HTTP/2, and HTTP/3 metrics in Grafana, allowing for easier correlation with other telemetry and comparison."
- Monitoring has to be black-box / client-side — the old path had client-side probes, so the new path needs the same, not just in-process emitted metrics. Otherwise you're comparing a view-from-inside (new) to a view-from-outside (old).
- Monitoring has to land upstream where possible — otherwise every follow-on org hitting the same transport migration rebuilds the same probing code. Slack's open-source PR into BBE is the payoff vector: "this contribution benefits the wider Prometheus community, helping other organizations facing the same challenges with HTTP/3 adoption."
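The first two requirements (comparable metrics, same vantage point) can be checked mechanically before any traffic moves. The sketch below is a hypothetical pre-migration check: the `ProbeCoverage` schema, the `comparability_gaps` helper, and the vantage strings are invented for illustration and do not come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeCoverage:
    """What telemetry exists for one path (illustrative schema)."""
    metrics: frozenset[str]
    vantage: str  # "black-box" (outside-in probe) or "in-process"


def comparability_gaps(old: ProbeCoverage, new: ProbeCoverage) -> list[str]:
    """List reasons the new path cannot yet be compared to the old one."""
    gaps = []
    # Every metric the old path reports must also exist for the new path.
    for name in sorted(old.metrics - new.metrics):
        gaps.append(f"new path is missing metric: {name}")
    # Inside-out and outside-in views are not directly comparable.
    if old.vantage != new.vantage:
        gaps.append(f"vantage mismatch: old={old.vantage}, new={new.vantage}")
    return gaps
```

An empty result means the side-by-side comparison is real; any entry is a gap to close before turning on the new path.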
Composes with¶
- patterns/upstream-contribution-parallel-to-in-house-integration — when the observability-extension lives in an OSS tool, run the upstream PR in parallel with in-house integration so your migration isn't gated on maintainer timeline.
- patterns/upstream-fixes-to-community — contribute the observability extension back so downstream orgs don't repeat the work.
- concepts/observability-stack-partial-dependency — closing the observability gap on the new path shouldn't introduce a new dependency that itself becomes a SPOF for the migration.
Anti-pattern¶
"Ship it and we'll see." Turn on new-path traffic without new-path probes; rely on user complaints to detect regressions; roll back on the first user complaint without being able to tell if it was a real regression or a coincidence. The migration either stalls at 1% or reverses; the org concludes the new technology "isn't ready" when the actual problem was observability readiness.
Seen in¶
- sources/2026-03-31-slack-from-custom-to-open-scalable-network-probing-and-http3-readiness — Slack's HTTP/3 edge rollout explicitly gated on first closing the QUIC probing gap in systems/prometheus-blackbox-exporter.
Related¶
- concepts/observability — the umbrella concept.
- concepts/http-3-probing-gap — the specific gap Slack closed.
- concepts/client-side-black-box-probe — the probing primitive that has to cover the new path.
- patterns/upstream-contribution-parallel-to-in-house-integration — how to close the gap without stalling.
- companies/slack — canonical adopter.