PATTERN Cited by 1 source
Full-stack IO instrumentation¶
Intent¶
Instrument every IO at multiple points in every subsystem of the storage stack — plus run continuous canary workloads — so that (a) the true source of latency/variance can be localized across layers and (b) every subsequent change is falsifiable.
Motivation (EBS, 2012)¶
"At this point in EBS's history (2012), we only had rudimentary telemetry. To know what to fix, we had to know what was broken, and then prioritize those fixes based on effort and rewards."
EBS had a performance problem but couldn't attribute it. The turnaround began not with a code change but with a measurement change.
Components¶
- Per-IO, per-layer instrumentation. Timestamps and counters at:
  - the client initiator (software in the EC2 instance/hypervisor),
  - the network stack,
  - the storage durability engine,
  - the operating system (kernel + driver queues).
- Continuous canary workloads. Known-shape benchmarks running 24×7 against the live fleet. A performance regression introduced by a deploy should show up in the canary before customers see it.
- Customer-workload monitoring. The shape of real traffic is tracked against the canary baseline, so divergence between the two acts as a drift detector.
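The per-IO, per-layer component can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen for illustration: the `IoTrace` class, the layer labels, and the nearest-rank percentile are assumptions, not EBS's actual telemetry design. The idea is only that each layer records how long one IO spent inside it, so tail latency can be compared layer by layer.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class IoTrace:
    """Time one IO spent in each layer, in microseconds (hypothetical schema)."""
    io_id: int
    spans_us: dict = field(default_factory=dict)  # layer name -> microseconds

    def record(self, layer, start_us, end_us):
        # Accumulate, since an IO may pass through the same layer more than once.
        self.spans_us[layer] = self.spans_us.get(layer, 0) + (end_us - start_us)


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def per_layer_percentile(traces, pct):
    """Group span durations by layer and take the same percentile of each."""
    by_layer = defaultdict(list)
    for trace in traces:
        for layer, duration_us in trace.spans_us.items():
            by_layer[layer].append(duration_us)
    return {layer: percentile(samples, pct) for layer, samples in by_layer.items()}


def slowest_layer(traces, pct=99):
    """Name the layer with the largest latency at the given percentile."""
    stats = per_layer_percentile(traces, pct)
    return max(stats, key=stats.get)
```

Calling `slowest_layer(traces)` on a batch of traces names the layer whose p99 is largest, which is the first candidate to investigate. A real system would also need clock alignment between hosts, which this sketch ignores.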
Why it earns every other change its keep¶
- Changes ship with a falsifiable hypothesis. "This change should reduce p99 at layer X by Y%" can be checked.
- Regressions are caught early. Canaries fail before customer tickets land.
- Cross-layer attribution becomes possible. Combined with patterns/loopback-isolation you can say "the Xen dom0 queue accounts for N µs of our current p99."
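The first and third points above can be sketched as two small checks. This is an illustrative sketch under stated assumptions: the function names `hypothesis_holds` and `attributed_us` are hypothetical, the percentile is nearest-rank, and subtracting p99 values is a simplification, since tail latencies of separate layers do not add exactly.

```python
import math


def p99(samples_us):
    """Nearest-rank 99th percentile of latency samples in microseconds."""
    if not samples_us:
        raise ValueError("no samples")
    ordered = sorted(samples_us)
    rank = math.ceil(99 * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def hypothesis_holds(before_us, after_us, expected_reduction_pct):
    """Check a claim of the form 'this change reduces p99 by at least Y%'.

    Returns (holds, actual_reduction_pct).
    """
    before = p99(before_us)
    after = p99(after_us)
    actual_pct = 100 * (before - after) / before
    return actual_pct >= expected_reduction_pct, actual_pct


def attributed_us(full_stack_us, layer_stubbed_us):
    """Estimate a layer's share of p99 from a loopback-isolation run.

    full_stack_us: samples with the real layer in place.
    layer_stubbed_us: samples with that layer replaced by a near-zero-latency stub.
    """
    return p99(full_stack_us) - p99(layer_stubbed_us)
```

If the measured reduction falls short of the stated one, the hypothesis is refuted and the change has not earned its place; that outcome is as useful as a confirmation.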
Related¶
- patterns/loopback-isolation: the method that complements full-stack instrumentation. Replace a layer with a near-zero-latency stub, remeasure, and attribute the difference to that layer.
- concepts/observability: the broader pattern, of which this is the storage-IO specialization.
Seen in¶
- sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws — named in the "If you can't measure it, you can't manage it" section as the enabler of everything that followed at EBS.