PATTERN Cited by 3 sources
Custom benchmarking harness¶
Intent¶
When a vendor-supplied benchmark tool doesn't match your workload shape or reports the wrong latency metric, write a narrowly-scoped custom harness against the same API. The cost is small (hours to days of engineering) and the upside is a reliable measurement substrate that actually answers your performance questions.
Context¶
Vendor benchmarks are usually written for performance-regression testing of the vendor's own system, not for comparing configuration variants on an existing deployment. Two typical failure modes:
- Unrealistic query distribution. Vendor tools emit templated queries, not the long tail of randomized real production shapes.
- Client-side latency only. They measure latency from their own client process, which convolves network, parser, and client GC noise into the result, and they ignore server-side latency fields (concepts/metric-granularity-mismatch) that would be the cleaner signal.
If your goal is apples-to-apples comparison across 5+ configs, non-reproducible results are a dead end — you can't tell if the config moved latency or the noise did.
Mechanism¶
- Use the target system's native client library (to reproduce real connection/TLS/compression behavior).
- Drive with real query shapes: either replay sampled production queries or synthesize across your known shape distribution.
- Record the server-side latency field (e.g. OpenSearch's took field, gRPC response trailers), not wall-clock time around the client call, as the primary metric.
- Emit a simple CSV / JSON so the variance across N runs is obvious.
- Small language, simple code. Go / Rust / Python is fine — the value is methodology, not architecture.
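The steps above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not Figma's Go harness: `send` is a hypothetical callable standing in for a request made through the native client, `SHAPES` is a made-up shape distribution, and `fake_send` exists only so the example runs without a cluster. It assumes the response carries a top-level `took` value in milliseconds, as OpenSearch search responses do.

```python
import csv
import io
import random
import time
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical shape distribution: (shape name, relative weight).
# In a real harness these would come from sampled production queries.
SHAPES: Sequence[Tuple[str, float]] = (("term", 0.6), ("prefix", 0.3), ("fuzzy", 0.1))

FIELDS = ["config", "run", "shape", "server_took_ms", "client_wall_ms"]


def run_once(
    send: Callable[[str], Dict],
    shapes: Sequence[Tuple[str, float]],
    n_queries: int,
    seed: int,
    config: str,
    run_id: int,
) -> List[Dict]:
    """Drive n_queries through `send`, recording server-side took per query."""
    rng = random.Random(seed)  # fixed seed: every config sees the same sequence
    names = [name for name, _ in shapes]
    weights = [weight for _, weight in shapes]
    rows = []
    for _ in range(n_queries):
        shape = rng.choices(names, weights=weights, k=1)[0]
        start = time.perf_counter()
        response = send(shape)
        wall_ms = (time.perf_counter() - start) * 1000.0
        rows.append({
            "config": config,
            "run": run_id,
            "shape": shape,
            "server_took_ms": response["took"],   # primary metric
            "client_wall_ms": round(wall_ms, 3),  # kept only for comparison
        })
    return rows


def to_csv(rows: List[Dict]) -> str:
    """Serialize rows so per-run variance can be inspected with any tool."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def fake_send(shape: str) -> Dict:
    """Stand-in for a real client call; returns a canned `took` per shape."""
    return {"took": {"term": 3, "prefix": 7, "fuzzy": 15}[shape]}
```

To point this at a real cluster, `fake_send` would be replaced by a function that issues the query through the system's own client and returns the parsed response. Seeding the generator is a design choice that keeps the query sequence identical across configs, so differences in the output come from the config rather than from the sampled queries.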
Canonical instance (Figma, 2026)¶
Figma tried OpenSearch's own opensearch-benchmark for shard-count / node-type / compression sweeps. Two specific gaps:
[it's] designed to do performance regression testing for OpenSearch development, and isn't as good at sending huge numbers of randomized queries to existing OpenSearch instances.
strangely, it doesn't really like to use the server-side "took" latency number, which means that all latency metrics are based on client-side performance.
They wrote a custom Go load generator in an afternoon and got consistent, server-side-took-based measurements for their sweep of shard counts, node types, zstd compression, and concurrent-segment search. That data drove the 450→180 shard reduction (patterns/fewer-larger-shards-for-latency).
Anti-patterns¶
- Assuming vendor defaults are workload-appropriate. The opensearch-benchmark tool's metric default is the clue that it's built for a different use case.
- Over-engineering the harness. A day of Go is the right budget; if you're building a distributed load-gen control plane you've lost the plot.
- Measuring once. You need the noise floor, not a point estimate — three runs minimum per config.
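The "measuring once" point can be made concrete with a small noise-floor check. The sketch below is one simple heuristic, not a statistical test and not something the sources prescribe: it treats the largest min-to-max spread across repeated runs as the noise floor and only calls two configs different when their medians are further apart than that. The config names and numbers in `RUNS` are invented purely for illustration.

```python
import statistics
from typing import Dict, List

# Invented example data: one summary latency (ms, server-side) per run.
RUNS: Dict[str, List[float]] = {
    "config-a": [41.0, 43.5, 42.0],
    "config-b": [30.5, 31.0, 29.0],
    "config-c": [40.0, 44.0, 42.5],
}

MIN_RUNS = 3  # three runs minimum per config


def summarize(runs: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Return the median and min-to-max spread for each config."""
    summary = {}
    for config, values in runs.items():
        if len(values) < MIN_RUNS:
            raise ValueError(f"{config}: need at least {MIN_RUNS} runs, got {len(values)}")
        summary[config] = {
            "median": statistics.median(values),
            "spread": max(values) - min(values),
        }
    return summary


def differs_beyond_noise(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if the medians differ by more than the larger run-to-run spread."""
    noise_floor = max(a["spread"], b["spread"])
    return abs(a["median"] - b["median"]) > noise_floor


if __name__ == "__main__":
    s = summarize(RUNS)
    print(differs_beyond_noise(s["config-a"], s["config-b"]))  # medians 42.0 vs 30.5
    print(differs_beyond_noise(s["config-a"], s["config-c"]))  # medians 42.0 vs 42.5
```

With the invented numbers above, config-a and config-b are separated by far more than the run-to-run spread, while config-a and config-c are not, which is the distinction a single point estimate cannot make.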
Seen in¶
- sources/2026-04-21-figma-the-search-for-speed-in-figma: Go harness written in an afternoon to compare OpenSearch shard counts / node types / compression modes using the server-side took field; produced the shard-sweep data that justified 450 → 180 shards.
- sources/2024-08-05-meta-dcperf-open-source-benchmark-suite: hyperscale-microarchitecture sibling of the same design stance at a different altitude. Meta built DCPerf because SPEC CPU doesn't represent hyperscale workloads at the microarchitectural level (IPC, core frequency). Where Figma's harness is application-layer (OpenSearch took), DCPerf is microarchitecture-layer (CPU-vendor + capacity-planning consumers). Both follow the same sequence: the vendor's default benchmark biases the signal → build one that matches our workload shape → validate at the level our consumer cares about. See patterns/workload-representative-benchmark-from-production for DCPerf's generalised statement.
- sources/2026-04-21-planetscale-benchmarking-postgres: database-vendor-comparison sibling. Telescope is PlanetScale's internal benchmarking harness, used to compare PlanetScale for Postgres against Amazon Aurora, Google AlloyDB, CrunchyData, Supabase, TigerData, and Neon. The harness drives three benchmarks (latency via SELECT 1;, TPCC via Percona sysbench-tpcc at 500 GB, OLTP via sysbench oltp_read_only at 300 GB) with full reproduction instructions published. Same pattern as Figma's OpenSearch harness and Meta's DCPerf: build a harness that drives your workload shape, at an altitude the vendor-supplied tools don't reach, then combine with patterns/reproducible-benchmark-publication to make the results auditable.
Related¶
- patterns/measurement-driven-micro-optimization — the broader discipline. Cloudflare's per-function flavor (criterion microbench + production stack-trace sampling) and this workload-level harness flavor (custom Go OpenSearch driver) are two canonical shapes.
- patterns/load-test-at-scale — the broader pattern of testing at production-equivalent load before a config flip.
- patterns/fewer-larger-shards-for-latency — the decision Figma's custom harness enabled.
- concepts/metric-granularity-mismatch — the reason the vendor-benchmark client-side-only latency is insufficient.
- systems/criterion-rust — the complementary per-function microbench crate for the Rust side of the wiki (sources/2024-09-10-cloudflare-a-good-day-to-trie-hard).