PATTERN
Custom benchmarking harness¶
Intent¶
When a vendor-supplied benchmark tool doesn't match your workload shape or reports the wrong latency metric, write a narrowly scoped custom harness against the same API. The cost is small (hours to days of engineering) and the upside is a reliable measurement substrate that actually answers your performance questions.
Context¶
Vendor benchmarks are usually written for performance-regression testing of the vendor's own system, not for comparing configuration variants on an existing deployment. Two typical failure modes:
- Unrealistic query distribution. Vendor tools emit templated queries, not the long tail of randomized real production shapes.
- Client-side latency only. They measure latency from their own client process, which convolves network, parser, and client GC noise into the result, and they ignore server-side latency fields (concepts/metric-granularity-mismatch) that would be the cleaner signal.
If your goal is apples-to-apples comparison across 5+ configs, non-reproducible results are a dead end — you can't tell if the config moved latency or the noise did.
Mechanism¶
- Use the target system's native client library (to reproduce real connection/TLS/compression behavior).
- Drive with real query shapes: either replay sampled production queries or synthesize across your known shape distribution.
- Record the server-side latency field (e.g. OpenSearch "took", gRPC response trailers), not wall-clock time around the client call, as the primary metric.
- Emit simple CSV / JSON output so the variance across N runs is obvious.
- Small harness, simple code. Go / Rust / Python are all fine — the value is the methodology, not the architecture.
Canonical instance (Figma, 2026)¶
Figma tried OpenSearch's own opensearch-benchmark for shard-count / node-type / compression sweeps. Two specific gaps:
> [it's] designed to do performance regression testing for OpenSearch development, and isn't as good at sending huge numbers of randomized queries to existing OpenSearch instances.

> strangely, it doesn't really like to use the server-side "took" latency number, which means that all latency metrics are based on client-side performance.
They wrote a custom Go load generator in an afternoon and got consistent, server-side-took-based measurements for their sweep of shard counts, node types, zstd compression, and concurrent-segment search — the data that drove the 450→180 shard reduction (patterns/fewer-larger-shards-for-latency).
Anti-patterns¶
- Assuming vendor defaults are workload-appropriate. opensearch-benchmark's client-side-latency default is the clue that it's built for a different use case.
- Over-engineering the harness. A day of Go is the right budget; if you're building a distributed load-gen control plane you've lost the plot.
- Measuring once. You need the noise floor, not a point estimate — three runs minimum per config.
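The three-runs-minimum rule can be made mechanical. A rough Go sketch — runSpread and distinguishable are hypothetical helpers, and a serious analysis would use more runs and a proper significance test: if the per-run latency ranges of two configs overlap, you cannot claim the config moved the number.

```go
package main

import (
	"fmt"
	"sort"
)

// runSpread returns the min and max of per-run latencies for one config:
// the crude noise floor you get from repeating the measurement.
func runSpread(runs []int) (lo, hi int) {
	s := append([]int(nil), runs...)
	sort.Ints(s)
	return s[0], s[len(s)-1]
}

// distinguishable reports whether two configs' per-run latency ranges
// fail to overlap — a cheap check that the config, not the noise, moved.
func distinguishable(a, b []int) bool {
	aLo, aHi := runSpread(a)
	bLo, bHi := runSpread(b)
	return aHi < bLo || bHi < aLo
}

func main() {
	// Hypothetical p50s (ms) from three runs per config.
	baseline := []int{120, 126, 131}
	candidate := []int{88, 92, 95}
	fmt.Println("baseline vs candidate:", distinguishable(baseline, candidate))

	noisy := []int{118, 134, 125} // overlaps baseline's range: inconclusive
	fmt.Println("baseline vs noisy:", distinguishable(baseline, noisy))
}
```

A point estimate per config skips this check entirely, which is exactly the "measuring once" anti-pattern.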
Seen in¶
- sources/2026-04-21-figma-the-search-for-speed-in-figma — Go harness written in an afternoon to compare OpenSearch shard counts / node types / compression modes using the server-side "took" field; produced the shard-sweep data that justified 450 → 180 shards.
Related¶
- patterns/measurement-driven-micro-optimization — the broader discipline. Cloudflare's per-function flavor (criterion microbench + production stack-trace sampling) and this workload-level harness flavor (custom Go OpenSearch driver) are two canonical shapes.
- patterns/load-test-at-scale — the broader pattern of testing at production-equivalent load before a config flip.
- patterns/fewer-larger-shards-for-latency — the decision Figma's custom harness enabled.
- concepts/metric-granularity-mismatch — the reason the vendor-benchmark client-side-only latency is insufficient.
- systems/criterion-rust — the complementary per-function microbench crate for the Rust side of the wiki (sources/2024-09-10-cloudflare-a-good-day-to-trie-hard).