CONCEPT Cited by 1 source

Symptom vs cause metric¶

Definition¶

A symptom metric summarises the health of a system by measuring the downstream observable of a chain of underlying queues and contention points, without identifying which one is the bottleneck. A cause metric (or direct metric) measures the state of a specific resource directly.

Example: "p99 query latency" is a symptom metric — slow queries could be caused by disk I/O, lock contention, page-cache misses, network jitter, or any combination. "Disk I/O wait time" is a cause metric — it identifies one specific queue.

The paradox: symptom metrics are more useful for alerting¶

Intuitively, cause metrics seem strictly better: they point you at the problem. But for alerting and throttling, symptom metrics are often preferable:

They catch unknown combinations. If a cause metric set is incomplete (and in practice it always is — some queue in the stack lacks instrumentation), an alert on cause metrics alone misses problems that manifest only as symptoms.
They match user perception. Users experience symptoms (slow, errors, timeouts). Alerting on symptoms aligns the operator's attention with the user's experience.
They summarise the whole chain. One symptom-level metric replaces N cause-level metrics that the operator would otherwise need to correlate by hand.
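A minimal sketch of the first point above (incomplete cause coverage), not taken from the source. It assumes query latency is the sum of the waits in each queue of the chain, and that one of those queues has no instrumentation. All names and thresholds (`Sample`, `uninstrumented_wait_ms`, the 50 ms and 100 ms limits) are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical alert limits, chosen only for illustration.
CAUSE_THRESHOLDS_MS = {"disk_io_wait_ms": 50.0, "lock_wait_ms": 50.0}
SYMPTOM_THRESHOLD_MS = 100.0


@dataclass
class Sample:
    disk_io_wait_ms: float         # instrumented cause metric
    lock_wait_ms: float            # instrumented cause metric
    uninstrumented_wait_ms: float  # a queue nobody measures (assumption)

    @property
    def query_latency_ms(self) -> float:
        # Symptom metric: the user sees the sum of every wait in the chain,
        # including waits that no cause metric covers.
        return self.disk_io_wait_ms + self.lock_wait_ms + self.uninstrumented_wait_ms


def cause_alerts(sample: Sample) -> list:
    """Names of instrumented cause metrics that are over their limit."""
    return [
        name
        for name, limit in CAUSE_THRESHOLDS_MS.items()
        if getattr(sample, name) > limit
    ]


def symptom_alert(sample: Sample) -> bool:
    """True when the end-to-end symptom is over its limit."""
    return sample.query_latency_ms > SYMPTOM_THRESHOLD_MS


# An incident in the queue that lacks instrumentation.
hidden_incident = Sample(disk_io_wait_ms=5.0, lock_wait_ms=3.0,
                         uninstrumented_wait_ms=400.0)
# An incident in a queue that is instrumented.
disk_incident = Sample(disk_io_wait_ms=200.0, lock_wait_ms=3.0,
                       uninstrumented_wait_ms=1.0)
```

In this toy model the symptom alert fires for both incidents, while the cause alerts fire only for `disk_incident`. The cause metric still earns its place: once the symptom fires, it is what names the disk as the bottleneck.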

Shlomi Noach's framing¶

In Anatomy of a Throttler, part 1, Shlomi Noach articulates this directly about MySQL throttling metrics:

"What's different about this metric compared with replication lag is that it is much more of a symptom than an actual cause. If all of a sudden we see a sharp spike in active queries, this can indicate some possible causes: perhaps all are held by the commit queue, which for some reason stalls. Or, the queries happen to compete over a specific hotspot and wait on locks. … So what is it exactly that we need to monitor? Is the metric itself useless? Not necessarily. An experienced administrator may only need to take one look at this metric on the database dashboard to say 'we're having an issue'."

And the deeper insight:

"Circling back to replication lag, much like concurrent queries, it is a symptom. E.g. disk I/O is saturated on the replica, hence the replica cannot keep up replaying the changelog, thereby accumulating lag. … Whatever the case is, what's interesting is that the replication mechanism itself is a queue: the changelog event queue. … Each of these can be the major contributor to the overall replication lag, and yet, we can still look at replication lag as a whole — as a clear indicator for database health."

The key shift: a symptom metric is useful because it summarises an underlying queue. See concepts/queueing-theory.
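The queue framing can be illustrated with a toy discrete-time model. This is a sketch under stated assumptions, not Noach's code and not how MySQL computes replication lag: the primary writes a fixed number of changelog events per tick, the replica applies a varying number per tick, and the backlog is whatever has not yet been applied. The function name and the rates are hypothetical.

```python
def simulate_backlog(write_rate: int, apply_rates: list) -> list:
    """Return the number of unapplied changelog events after each tick.

    write_rate  -- events the primary produces per tick (assumed constant)
    apply_rates -- events the replica can apply in each tick
    """
    backlog = 0
    history = []
    for apply_rate in apply_rates:
        # The queue grows by what arrives and shrinks by what is applied;
        # it can never go below empty.
        backlog = max(0, backlog + write_rate - apply_rate)
        history.append(backlog)
    return history


# The replica keeps up, then slows for three ticks, then recovers.
# The slowdown could be disk, network, or commit contention: the model
# does not know, and the backlog rises all the same.
history = simulate_backlog(
    write_rate=100,
    apply_rates=[120, 120, 60, 60, 60, 120, 120],
)
```

Here `history` comes out as `[0, 0, 40, 80, 120, 100, 80]`. The backlog rises while the replica is slow and drains once it recovers, whatever the underlying cause was, which is why the symptom works as a single health indicator.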

How to use symptom metrics well¶

Alert / throttle on symptoms, diagnose on causes. A spike in p99 latency triggers an investigation; cause metrics narrow it down.
Prefer narrower symptoms when available. concepts/transaction-commit-delay is a narrower symptom than concepts/threads-running-mysql — both are symptoms, but commit delay isolates the disk-flush queue and therefore has a more stable threshold.
Prefer wait-time over queue-length when a choice is available. See concepts/queue-length-vs-wait-time.
Combine multiple symptoms. A throttler that reads only one symptom is over-fitted to a single failure mode. Multi-metric throttling catches orthogonal failure modes.
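The guidelines above can be sketched as a multi-metric throttle check. This is a hypothetical illustration, not the throttler described in the source: the metric names, the thresholds, and the choice to treat a missing reading as a breach (fail closed) are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical symptom thresholds. Real values depend on the workload.
THRESHOLDS: Dict[str, float] = {
    "replication_lag_s": 5.0,
    "commit_delay_ms": 20.0,
    "threads_running": 50.0,
}


def check(readings: Dict[str, float],
          thresholds: Dict[str, float] = THRESHOLDS) -> Tuple[bool, List[str]]:
    """Return (allowed, breached_metric_names).

    Work is allowed only when every symptom is at or under its threshold.
    A metric with no reading counts as breached (assumption: fail closed).
    """
    breached = sorted(
        name
        for name, limit in thresholds.items()
        if readings.get(name, float("inf")) > limit
    )
    return (not breached, breached)


healthy = check({"replication_lag_s": 1.0, "commit_delay_ms": 5.0,
                 "threads_running": 10.0})
commit_stall = check({"replication_lag_s": 1.0, "commit_delay_ms": 80.0,
                      "threads_running": 10.0})
```

`healthy` is `(True, [])`, and `commit_stall` is `(False, ["commit_delay_ms"])`. The list of breached symptoms is a starting point for diagnosis, not a diagnosis: it says which symptom to follow into the cause metrics.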

Canonical examples¶

Symptom metric	Underlying queue(s)
concepts/replication-lag	Changelog event queue (net + disk + apply + commit)
concepts/threads-running-mysql	Commit queue OR lock-wait queue OR page-cache-miss queue
concepts/transaction-commit-delay	Commit / fsync queue
concepts/load-average	Run-queue + uninterruptible-I/O queue
p99 query latency	All of the above, composed

Seen in¶

sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-1 — canonical wiki framing. Noach's Part 1 is the load-bearing articulation: he walks through a series of MySQL metrics (replication lag, threads_running, transaction commit delay, queue length, load average, pool usage) and identifies each as a symptom that summarises some underlying queue. The useful ones (replication lag, pool exhaustion) are useful because they summarise a queue, not in spite of being symptoms.

concepts/queueing-theory — parent framing. Every symptom metric is the residence time (or length) of some queue in the stack.
concepts/use-method — Brendan Gregg's Utilization-Saturation-Errors triage sequence is a cause-metric workflow applied after a symptom-metric alert.
concepts/automated-root-cause-analysis — the systematic mapping of symptoms → causes in production.
concepts/database-throttler — the use case in this source.
concepts/queue-length-vs-wait-time — the wait-time-vs-length axis that cuts across symptom-vs-cause.