CONCEPT Cited by 1 source
Time to know vs. time to recover¶
Definition¶
Time to know (TTK) and time to recover (TTR) are two adjacent but distinct incident-response KPIs. Both clocks start at detection, so they overlap rather than run back to back; each marks a different milestone in the operational lifecycle of an incident:
- TTK — from detection (paged / alerted) to understanding (a probable cause + blast radius is known to the responder).
- TTR — from detection to mitigation complete (service back within SLO or fully recovered).
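The two definitions above can be sketched as simple timestamp arithmetic. This is a minimal illustration, not anything taken from the Expedia post: the `IncidentTimeline` class, its field names, and the sample timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    """Hypothetical record of the three timestamps the definitions rely on."""

    detected_at: datetime    # responder paged / alerted
    understood_at: datetime  # probable cause + blast radius known
    mitigated_at: datetime   # service back within SLO or fully recovered

    def ttk(self) -> timedelta:
        # Time to know: detection -> understanding.
        return self.understood_at - self.detected_at

    def ttr(self) -> timedelta:
        # Time to recover: detection -> mitigation complete.
        return self.mitigated_at - self.detected_at


# Invented example values, for illustration only.
incident = IncidentTimeline(
    detected_at=datetime(2026, 1, 1, 10, 0),
    understood_at=datetime(2026, 1, 1, 10, 25),
    mitigated_at=datetime(2026, 1, 1, 10, 40),
)

print(incident.ttk())  # 0:25:00
print(incident.ttr())  # 0:40:00
```

Because both methods subtract the same `detected_at`, the TTK interval sits inside the TTR interval whenever understanding precedes mitigation.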
The adjacent industry terms are MTTD (mean time to detect — upstream of TTK), MTTK / MTTI (mean time to know / identify), and MTTR (mean time to recover). The Expedia STAR post uses TTK / TTR verbatim as the load-bearing optimisation target:
"Our objective with this service was to minimize the time to know (TTK) and time to recover (TTR). By enabling rapid analysis of observability data and evaluation of hypotheses, this service proved to be a valuable time-saving tool." (Source: sources/2026-04-28-expedia-expedias-service-telemetry-analyzer)
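The "mean" variants named above (MTTK, MTTR) are averages of the per-incident values over a set of incidents. A minimal sketch, assuming the per-incident durations have already been collected; `mean_duration` and the sample durations are hypothetical and not measurements from any source:

```python
from datetime import timedelta


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta(0)) / len(durations)


# Invented per-incident values, for illustration only.
ttk_samples = [timedelta(minutes=10), timedelta(minutes=20), timedelta(minutes=30)]
ttr_samples = [timedelta(minutes=30), timedelta(minutes=40), timedelta(minutes=80)]

mttk = mean_duration(ttk_samples)  # 20 minutes
mttr = mean_duration(ttr_samples)  # 50 minutes
```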
Why the split matters¶
TTK and TTR are compressible by different interventions:
| KPI | Compressed by | Canonical tool |
|---|---|---|
| TTK | Faster / richer diagnosis | Observability (concepts/observability), automated RCA (concepts/automated-root-cause-analysis), STAR-style analyzers |
| TTR | Faster action once diagnosed | Pre-approved runbooks (concepts/incident-playbook), auto-remediation, feature-flag rollback, deploy-reverse |
A system that only compresses TTR (e.g. one-click rollback) doesn't help if responders don't know what to roll back. A system that only compresses TTK (e.g. better dashboards) doesn't help if the diagnosis points to a problem that the responder has no tool to mitigate.
STAR optimises the TTK side — it is a diagnosis tool, not a remediation tool.
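One way to see which side deserves investment is to look at how much of the detection-to-recovery window is spent reaching understanding. The sketch below is an assumption of this page, not a metric from the Expedia post; `diagnosis_share` and the sample values are hypothetical.

```python
from datetime import timedelta
from statistics import mean


def diagnosis_share(ttk: timedelta, ttr: timedelta) -> float:
    """Fraction of the detection-to-recovery window spent reaching understanding."""
    if ttr <= timedelta(0):
        raise ValueError("TTR must be positive")
    return ttk / ttr  # timedelta / timedelta yields a float


# Invented (TTK, TTR) pairs, for illustration only.
incidents = [
    (timedelta(minutes=30), timedelta(minutes=40)),  # diagnosis-dominated
    (timedelta(minutes=5), timedelta(minutes=50)),   # action-dominated
]

shares = [diagnosis_share(ttk, ttr) for ttk, ttr in incidents]
mean_share = mean(shares)
```

A share close to 1 suggests diagnosis tooling (the TTK row in the table) is the larger lever; a share close to 0 suggests the time is going into mitigation, where runbooks and rollback tooling apply.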
Seen in¶
- Expedia STAR (2026-04-28) — canonical wiki instance. STAR names TTK + TTR explicitly as its headline KPIs for the incident-investigation use case. Positions automated telemetry analysis + LLM-assisted hypothesis evaluation as a TTK lever. No numbers published — the post is a design retrospective, not a measurement study.
Related¶
- concepts/observability — the upstream capability TTK depends on.
- concepts/automated-root-cause-analysis — the discipline STAR belongs to; directly compresses TTK.
- concepts/incident-playbook — pre-approved mitigation procedures compress TTR.
- concepts/incident-mitigation-lifecycle — the lifecycle view on which TTK and TTR mark milestones.
- patterns/multi-step-rca-workflow — STAR's 4-step workflow is a TTK-compression mechanism.
- systems/expedia-star — canonical wiki consumer.