CONCEPT Cited by 1 source
Vector search has no scale-to-zero¶
Definition¶
A production-cost observation about hosted vector-search systems: unlike stateless web services or general-purpose serverless compute, vector-search endpoints typically cannot scale their compute to zero during idle periods. The vector index must stay warm in memory (or quickly reachable on NVMe) to meet the latency SLOs that justify using a vector index in the first place. The economic consequence: a vector endpoint incurs near-constant cost whether or not queries are flowing, which is "particularly relevant for the bursty, event-driven nature of industrial security data" and for any other workload whose query rate is highly variable.
The wiki canonicalises this from the Claroty CPS Library production disclosure:
"One area of strategic focus is the cost-efficiency of our Vector Search indices. While the performance is world-class, the current lack of a 'scale-to-zero' model for vector endpoints — a nuance particularly relevant for the bursty, event-driven nature of industrial security data — requires us to design specific architectural patterns to maintain high ROI during idle periods." (Source: sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library)
Why vector search resists scale-to-zero¶
A vector index is a stateful, memory-resident data structure optimised for low-latency similarity queries. Most production indexes are some variant of HNSW, IVF, ScaNN, or SPANN — graph or partition structures that depend on resident memory access patterns to hit single-digit-millisecond p99.
Three structural reasons make cold-starting a vector endpoint expensive enough that providers typically don't offer scale-to-zero by default:
- Index loading is large and IO-bound. A billion-vector index runs from gigabytes to terabytes per replica, and loading it from object storage into RAM can take minutes to tens of minutes. Any user query that arrives during the load waits out the cold-start window, which dwarfs the normal latency budget.
- Cache warming needs real traffic. Even after the index is in memory, the OS page cache and query-plan caches only warm up as queries arrive. The first batch of queries after a cold restart sees materially worse p99 than steady-state.
- Multi-replica costs aggregate. A scaled-to-zero endpoint that wakes on first query has to spin up every replica it needs for HA before serving — the cold-start cost is multiplied by the replication factor, not amortised over it.
This is structurally analogous to, but distinct from, concepts/gpu-scale-to-zero-cold-start: the GPU case is about loading model weights and initialising CUDA; the vector-search case is about loading the index and warming caches. Both share the property that "the resource that takes the longest to acquire is the dominant cost driver during cold start."
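The index-load reason above can be made concrete with back-of-envelope arithmetic. The sketch below is an illustration only: the vector count, dimensionality (768), float32 storage, and the 1 GiB/s sustained read throughput are all assumed numbers, not figures from the Claroty source, and the function names are hypothetical.

```python
GIB = 1024 ** 3


def index_bytes(num_vectors: int, dim: int, bytes_per_component: int = 4) -> int:
    """Raw vector payload only; graph links, partitions and metadata are ignored."""
    return num_vectors * dim * bytes_per_component


def load_seconds(size_bytes: int, throughput_bytes_per_s: float) -> float:
    """Time to stream the index into RAM at a sustained read throughput."""
    return size_bytes / throughput_bytes_per_s


# Assumed workload: one billion 768-dimensional float32 vectors.
size = index_bytes(num_vectors=1_000_000_000, dim=768)

# Assumed sustained throughput from object storage: 1 GiB/s per replica.
minutes = load_seconds(size, throughput_bytes_per_s=1 * GIB) / 60

print(f"{size / GIB:.0f} GiB per replica, roughly {minutes:.0f} min to load")
```

Under these assumptions a single replica lands in the "tens of minutes" range before it can serve its first query, and that is before any cache warming. Real indexes add graph or partition overhead on top of the raw vectors, so this is a lower bound on size rather than an estimate of any particular product.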
Economic consequence¶
For workloads with steady or near-steady query volume, the no-scale-to-zero property is fine — the always-on cost is amortised across many queries and the per-query cost is acceptable.
For bursty / event-driven workloads, such as Claroty's industrial-security context, where vulnerability advisories arrive in unpredictable spikes and the index sits idle the other 99% of the time, the no-scale-to-zero property becomes a meaningful economic drag. The endpoint costs the same whether it served 10 queries or 10,000 queries that day.
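The effect on unit economics can be sketched with simple arithmetic. The hourly rate below is a hypothetical placeholder, not a price quoted by any provider or by the Claroty source; the point is only that a fixed daily bill divided by a variable query count produces a per-query cost that swings by orders of magnitude.

```python
def cost_per_query(hourly_rate: float, queries_per_day: int, hours_up: float = 24.0) -> float:
    """Per-query cost of an always-on endpoint: fixed daily bill / daily query count."""
    if queries_per_day <= 0:
        raise ValueError("queries_per_day must be positive")
    return hourly_rate * hours_up / queries_per_day


HOURLY_RATE = 2.00  # hypothetical endpoint price in dollars per hour

for queries in (10, 10_000):
    print(f"{queries:>6} queries/day -> ${cost_per_query(HOURLY_RATE, queries):.4f} per query")
```

With this assumed rate the daily bill is the same in both cases, so the quiet day's per-query cost is a thousand times the busy day's. A scale-to-zero endpoint would instead bill only for the hours it was up, which is the `hours_up` parameter in the sketch.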
Architectural mitigations (named, not detailed)¶
The Claroty source acknowledges the problem requires "specific architectural patterns to maintain high ROI during idle periods" but does not enumerate them. Architectural shapes that have appeared in adjacent wiki sources for analogous cold-start problems include:
- Pre-warmed pools. Keep a small fixed pool warm across zero-traffic windows; rely on it to serve the first burst while a larger fleet wakes. Cousins: patterns/regional-pre-warmed-do-container-pair-pool.
- Cache the answer, not the index. When a small set of queries dominate, cache the top-K results and short-circuit the index for cache hits. Index access becomes the cache-miss path.
- Co-tenant amortisation. A single endpoint serves multiple tenants whose burst patterns are uncorrelated, so the aggregate query rate is more even than any single tenant. Requires careful blast-radius and noisy-neighbour management (concepts/noisy-neighbor / concepts/cell-based-architecture).
- Shift work to batch where latency permits. Not every vector-search call needs sub-100 ms p99. Bulk classification of overnight CSAF-advisory ingest can run against an offline / temporary index spun up just for the batch.
- Lower-precision indexing when possible. Quantised embeddings (for example int8 or fp8 scalar quantisation) reduce the RAM footprint of the index, lowering the always-on cost, at a measurable accuracy tradeoff.
The Claroty source does not say which of these (if any) it uses; this is reserved for future ingests.
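Of the shapes listed above, "cache the answer, not the index" is the easiest to illustrate. The following is a minimal sketch, not Claroty's or Databricks' implementation: `TopKCache`, `fake_index_search`, and the exact-match key normalisation are all hypothetical, and the search function stands in for whatever client call a real endpoint exposes.

```python
from collections import OrderedDict
from typing import Callable, List, Tuple

Result = List[Tuple[str, float]]  # (document id, similarity score)


class TopKCache:
    """LRU cache of top-K results; the index is only consulted on a miss."""

    def __init__(self, search_fn: Callable[[str, int], Result], max_entries: int = 1024):
        self._search_fn = search_fn
        self._max_entries = max_entries
        self._entries: "OrderedDict[Tuple[str, int], Result]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def query(self, text: str, k: int) -> Result:
        # Exact-match key after trivial normalisation; paraphrased queries still miss.
        key = (text.strip().lower(), k)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        result = self._search_fn(text, k)  # cache-miss path: hit the index
        self._entries[key] = result
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return result


calls: List[str] = []


def fake_index_search(text: str, k: int) -> Result:
    """Hypothetical stand-in for a real vector-search client call."""
    calls.append(text)
    return [(f"doc-{i}", 1.0 - 0.1 * i) for i in range(k)]


cache = TopKCache(fake_index_search, max_entries=2)
cache.query("plc firmware advisory", 3)
cache.query("PLC firmware advisory ", 3)  # same key after normalisation
print(cache.hits, cache.misses, len(calls))
```

The second call is served from the cache, so the index is queried once. The sketch ignores invalidation: cached results go stale when the index is updated, so a real deployment would need a TTL or an explicit flush on re-index, and the cache only helps when a small set of queries genuinely dominates.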
Adjacent concepts¶
- concepts/scale-to-zero — the general capability vector search is observed to lack.
- concepts/gpu-scale-to-zero-cold-start — structurally analogous in the GPU-inference context.
- concepts/cold-start — the broader cold-start phenomenon.
- concepts/bursty-query-pattern — the workload shape that makes no-scale-to-zero economically painful.
- concepts/serverless-workload-churn-cardinality — related operational pain in TSDBs from frequent serverless workload turnover.
Seen in¶
- sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library — Canonical wiki source for the observation. Claroty's CPS Library disclosure explicitly names "the current lack of a 'scale-to-zero' model for vector endpoints" as a cost-efficiency concern "particularly relevant for the bursty, event-driven nature of industrial security data." Architectural patterns to mitigate are referenced but not enumerated. This is the first wiki source to canonicalise the structural mismatch between vector-search-as-a-service cost models and bursty event-driven workloads.
Related¶
- concepts/scale-to-zero · concepts/gpu-scale-to-zero-cold-start · concepts/cold-start — the cold-start / scale-to-zero family of concepts.
- concepts/bursty-query-pattern — the workload shape that makes no-scale-to-zero economically painful.
- systems/mosaic-ai-vector-search — Databricks' hosted vector search; the system Claroty's observation pertains to.
- systems/claroty-cps-library — canonical wiki instance.