CONCEPT Cited by 3 sources
Observability before migration¶
Principle¶
Close the observability gap on the new path before the migration proceeds. Orgs that switch transports / protocols / platforms without first extending their telemetry to cover the new path find themselves unable to distinguish "is the new thing healthy?" from "is the new thing broken in a way we can't see?" — and that uncertainty is what stalls migrations at small traffic percentages.
Slack's 2026-03-31 post crystallises this as an explicit takeaway (verbatim):
"Monitor first, and migrate second. This should go without saying, but getting observability right as a precursor to migration makes everything faster. We know that the industry is going towards QUIC, but proving to ourselves that it's the right move long term enables us to invest more into its future."
Source: sources/2026-03-31-slack-from-custom-to-open-scalable-network-probing-and-http3-readiness.
Why the discipline¶
Absent the discipline, the typical failure shape is:
- Turn on the new path at 1% of traffic.
- Notice the in-process metrics look healthy.
- Try to ramp to 5%, see something anomalous in user reports, panic-roll-back.
- Can't tell whether the anomaly was a real regression on the new path or noise in the old path, because you can't compare apples to apples — the new path isn't probed the same way the old path is.
- Stall at 1% indefinitely, or revert the migration.
With the discipline:
- Before turning on the new path, build probes that speak it natively and surface its metrics alongside the old path's.
- Turn on 1%; compare new-path metrics to old-path metrics directly ("single pane of glass").
- Ramp based on measured equivalence or improvement; auto-rollback on measured regression.
- Migration completes because the telemetry lets you tell, at every ramp step, whether it is safe to proceed.
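The "ramp based on measured equivalence, roll back on measured regression" step can be sketched as a small decision function. This is a minimal illustration, not Slack's implementation: the `PathStats` fields, the `ramp_decision` name, and every threshold (`min_probes`, `max_failure_delta`, `max_latency_ratio`) are assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    RAMP = "ramp"
    HOLD = "hold"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class PathStats:
    """Probe results for one path, collected over the same time window."""
    probes: int
    failures: int
    p99_latency_ms: float

    @property
    def failure_rate(self) -> float:
        # With no probes we know nothing, so treat the path as fully failing.
        return self.failures / self.probes if self.probes else 1.0


def ramp_decision(old: PathStats, new: PathStats,
                  min_probes: int = 500,
                  max_failure_delta: float = 0.001,
                  max_latency_ratio: float = 1.10) -> Decision:
    """Compare new-path probes against old-path probes (hypothetical thresholds)."""
    # Too few samples on either path: the comparison is not meaningful yet.
    if old.probes < min_probes or new.probes < min_probes:
        return Decision.HOLD
    failure_regressed = new.failure_rate - old.failure_rate > max_failure_delta
    latency_regressed = new.p99_latency_ms > old.p99_latency_ms * max_latency_ratio
    if failure_regressed or latency_regressed:
        return Decision.ROLLBACK
    return Decision.RAMP
```

The function only works if both `PathStats` values come from the same kind of probe. That is the point of the discipline: the rollback decision is driven by a measured difference between paths rather than by user reports.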
Canonical instance¶
Slack rolling out HTTP/3 on the edge. The monitoring gap — documented as concepts/http-3-probing-gap — existed because SaaS observability tools and Slack's own Prometheus Blackbox Exporter fleet both lacked QUIC support.
The team closed the gap before scaling HTTP/3 rollout:
- Built an http3 BBE module on systems/quic-go following existing BBE configuration patterns.
- Open-sourced it to Prometheus Blackbox Exporter upstream.
- Ran the contribution on the in-house fork in parallel while waiting for upstream merge (see patterns/upstream-contribution-parallel-to-in-house-integration).
- Unified HTTP/1.1 + HTTP/2 + HTTP/3 probe metrics in Grafana — the direct payoff of having observability ready before scale-up.
Generalises to¶
- Transport migrations — TCP→QUIC (this post), raw-socket → gRPC, HTTP → WebSocket / SSE.
- Protocol version migrations — MySQL-wire → MySQL-over-HTTP, HTTP/1.1→HTTP/2→HTTP/3.
- Auth-scheme migrations — session-cookie → OAuth token → JWT → mTLS.
- Platform / substrate migrations — EC2 → Kubernetes, VM → container → serverless, monolith → services.
- Region / topology migrations — single-region → multi-region, active-passive → active-active.
In each case, the same question applies: "before we turn on the new path, can we compare its behaviour to the old path's with the same fidelity?"
Versus just "instrument the new thing"¶
The discipline is stricter than "add monitoring to the new thing":
- Monitoring has to be comparable — the new-path metrics must line up with old-path metrics so side-by-side comparison is real, not symbolic. Slack's payoff metric is literally "unified view of HTTP/1.1, HTTP/2, and HTTP/3 metrics in Grafana, allowing for easier correlation with other telemetry and comparison."
- Monitoring has to be black-box / client-side — the old path had client-side probes, so the new path needs the same, not just in-process emitted metrics. Otherwise you're comparing a view-from-inside (new) to a view-from-outside (old).
- Monitoring has to land upstream where possible — otherwise every follow-on org hitting the same transport migration rebuilds the same probing code. Slack's open-source PR into BBE is the payoff vector: "this contribution benefits the wider Prometheus community, helping other organizations facing the same challenges with HTTP/3 adoption."
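The "comparable" and "black-box" requirements can be illustrated with one probe routine shared by both paths, so that only the transport differs. This is a sketch under assumptions: `ProbeSample`, `run_probe`, and the injected `fetch` callable are hypothetical names, and the real protocol-specific client (for example an HTTP/2 or HTTP/3 client) is deliberately left out.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProbeSample:
    """One outside-in measurement; the schema is identical for every path."""
    path: str                # label for the path probed, e.g. "http2" or "http3"
    target: str
    success: bool
    duration_seconds: float


def run_probe(path: str, target: str, fetch: Callable[[str], int],
              clock: Callable[[], float] = time.monotonic) -> ProbeSample:
    """Time one request from the client side and record it in the shared schema.

    `fetch` is a hypothetical callable that performs the request over the
    given path and returns an HTTP status code.
    """
    start = clock()
    try:
        status = fetch(target)
        success = 200 <= status < 300
    except Exception:
        # Any transport-level failure is a failed probe, on either path.
        success = False
    return ProbeSample(path, target, success, clock() - start)
```

Because success and duration are computed by the same code for every path, a difference between two `ProbeSample` streams reflects the paths themselves rather than a difference in how they were measured.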
Composes with¶
- patterns/upstream-contribution-parallel-to-in-house-integration — when the observability-extension lives in an OSS tool, run the upstream PR in parallel with in-house integration so your migration isn't gated on maintainer timeline.
- patterns/upstream-fixes-to-community — contribute the observability extension back so downstream orgs don't repeat the work.
- concepts/observability-stack-partial-dependency — closing the observability gap on the new path shouldn't introduce a new dependency that itself becomes a SPOF for the migration.
Anti-pattern¶
"Ship it and we'll see." Turn on new-path traffic without new-path probes; rely on user complaints to detect regressions; roll back on the first user complaint without being able to tell if it was a real regression or a coincidence. The migration either stalls at 1% or reverses; the org concludes the new technology "isn't ready" when the actual problem was observability readiness.
Seen in¶
- sources/2026-03-31-slack-from-custom-to-open-scalable-network-probing-and-http3-readiness — Slack's HTTP/3 edge rollout explicitly gated on first closing the QUIC probing gap in systems/prometheus-blackbox-exporter.
- sources/2026-04-07-yelp-zero-downtime-cassandra-4x-upgrade — datastore-upgrade application of the discipline. Yelp built observability dashboards covering both Cassandra major versions + both Stargate proxy fleets before the upgrade started, monitoring "p99 latency and errors per keyspace for 3.11 and 4.1 instances". Specifically called out the load-bearing payoff: the dashboards caught the Stargate 2.x regression on range / multi-partition queries "early in our non-production environments" — before it reached prod, before it was ambiguous what was causing latency. The same dashboards distinguished the transient mixed-version latency from the Stargate regression, making one safe to ride out and the other urgent to address. Fifth named criterion in Yelp's upfront qualification list is verbatim: "Sufficient observability must be available to track the progress of upgrades." See patterns/production-qualification-criteria-upfront.
- sources/2026-05-05-slack-from-ssh-to-rest-a-security-driven-modernization-of-slacks-emr-data-pipelines — project-progress altitude application of the discipline, distinct from the transport-protocol-probe altitude of the prior Slack HTTP/3 instance. Slack's SSH-deprecation initiative built an analytics dashboard backed by Airflow metadata-DB queries to identify remaining SSH-based tasks per team / per DAG / per region — before the migration started in earnest. Verbatim best-practice from the post: "Build monitoring before you migrate. Set up tracking dashboards early so you always know what's left to migrate. Airflow database queries made it easy to identify remaining work. Progress visibility kept the project moving." The dashboard's load-bearing role was at the organisational axis: 700+ jobs × 8 regions × 7 operator types × 5 teams is the kind of project where, without per-cell visibility, the long tail never gets retired and the migration's value never materialises. Distinct application of the same underlying discipline — don't migrate until you can see what's still on the old path.
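The "what's left to migrate" view from the SSH-deprecation instance amounts to a grouped count over task metadata. The sketch below assumes the rows have already been fetched into dictionaries. The field names (`operator`, `team`, `region`) and the `"SSHOperator"` default are illustrative assumptions, not Slack's schema or Airflow's metadata-DB layout.

```python
from collections import Counter
from typing import Iterable, Mapping


def remaining_by_cell(tasks: Iterable[Mapping[str, str]],
                      old_operator: str = "SSHOperator") -> Counter:
    """Count tasks still on the old path, keyed by (team, region)."""
    remaining = Counter()
    for task in tasks:
        # Only tasks that still use the old operator count as remaining work.
        if task["operator"] == old_operator:
            remaining[(task["team"], task["region"])] += 1
    return remaining


# Hypothetical rows standing in for a metadata query result.
tasks = [
    {"team": "ads", "region": "us-east", "operator": "SSHOperator"},
    {"team": "ads", "region": "us-east", "operator": "SSHOperator"},
    {"team": "ads", "region": "eu-west", "operator": "RestOperator"},
    {"team": "search", "region": "us-east", "operator": "SSHOperator"},
]
backlog = remaining_by_cell(tasks)
```

A cell that drops out of `backlog` is fully migrated. The cells that remain are the long tail that a dashboard of this kind keeps visible.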
Related¶
- concepts/observability — the umbrella concept.
- concepts/http-3-probing-gap — the specific gap Slack closed.
- concepts/client-side-black-box-probe — the probing primitive that has to cover the new path.
- concepts/performance-regression-from-mid-upgrade-state — what you cannot distinguish without observability-before-migration (transient vs genuine regression).
- patterns/upstream-contribution-parallel-to-in-house-integration — how to close the gap without stalling.
- patterns/production-qualification-criteria-upfront — observability-sufficient is one of the standard criteria.
- companies/slack — canonical adopter.
- companies/yelp — datastore-tier adopter.