CONCEPT Cited by 1 source

Digital-twin backtesting

Digital-twin backtesting is the technique of running counterfactual or evaluation workloads against a high-fidelity simulated replica of a production system, seeded with real production state. The twin "replicates the state of real-world clusters" (or databases, networks, fleets) closely enough that answers produced inside the twin are treated as authoritative for the question being asked — even though no production traffic is touched.

Typical uses

  • Training-data generation for ML models that predict production behaviour. Run the authoritative solver / simulator inside the twin across many scenarios; harvest (input, output) pairs for supervised learning.
  • Counterfactual policy evaluation. Before rolling out a scheduler / placement / pricing change, replay historical state inside the twin under the new policy; compare outcomes to the actual history.
  • Pre-deployment validation. Stress-test a proposed change against the richest available production-like state without exposing real customers.
  • Fallback ground truth. Pair with a cheap ML approximator; when the approximator is uncertain, invoke the twin's authoritative solver instead.
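The fallback pattern in the last bullet can be sketched as follows. Everything named here is a hypothetical stand-in: `approximator` is assumed to return a `(prediction, confidence)` pair, `twin_solver` is the twin's authoritative (but expensive) solver, and the threshold is an assumed tuning knob:

```python
def answer(state, approximator, twin_solver, threshold=0.9):
    """Serve from the cheap ML approximator when it is confident;
    otherwise fall back to the twin's authoritative solver.

    approximator(state) -> (prediction, confidence in [0, 1])  (assumed shape)
    twin_solver(state)  -> authoritative answer (ground truth)
    """
    prediction, confidence = approximator(state)
    if confidence >= threshold:
        return prediction      # fast path: trust the model
    return twin_solver(state)  # slow path: invoke the twin
```

The design point is that the twin makes the approximator safe to deploy: low-confidence queries pay the twin's compute cost instead of risking a wrong answer.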

Why it's distinct from "simulation"

  • A general-purpose simulator can be too abstract to be trusted as ground truth — it encodes the modeller's assumptions, not production reality.
  • A digital twin is specifically seeded with real state — the same inputs the production system saw — and the twin's fidelity is validated against production outcomes. It's the replication of production state that makes outputs trustworthy.
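Fidelity validation can be made concrete as a comparison between twin outputs and recorded production outcomes. The case shape, numeric outcomes, and tolerance below are illustrative assumptions, not a description of any real twin's interface:

```python
def fidelity(cases, twin_solver, tolerance=0.01):
    """Fraction of historical cases where the twin reproduces production.

    `cases` is an iterable of (production_state, recorded_outcome) pairs
    with numeric outcomes -- an assumed shape for illustration only.
    """
    cases = list(cases)
    hits = sum(
        1
        for state, outcome in cases
        if abs(twin_solver(state) - outcome) <= tolerance
    )
    return hits / len(cases)
```

A fidelity score near 1.0 on held-out history is what licenses treating the twin's outputs as ground truth for new questions.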

Canonical wiki instance

The 2025-07-29 Google Research post names Google's Borg digital twin as "a sophisticated backtesting framework to replicate the state of real-world clusters."

The post does not describe the twin's internals — it is used as a black box whose outputs are accepted as ground truth.

Prerequisites

  • Reproducible production state. The twin has to ingest real cluster / database / network state in sufficient detail that the authoritative solver's output matches production's.
  • Authoritative solver / policy. The twin is only useful if it runs the same scheduler / cost model / pricing engine as production. A twin that runs a different algorithm answers a different question.
  • Scale discipline. At Google cluster scale, backtesting at production fidelity is itself an expensive compute workload — not free.
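With those prerequisites in place, the counterfactual policy evaluation described under "Typical uses" reduces to replaying the same historical states under both policies inside the twin. The sketch below assumes policies are functions from state to a decision and `score` is some outcome metric (higher is better); all names are hypothetical:

```python
def backtest(historical_states, baseline_policy, candidate_policy, score):
    """Replay recorded production states under both policies and compare.

    score(decision, state) -> numeric outcome metric  (assumed shape)
    Returns the mean score delta of the candidate over the baseline.
    """
    deltas = [
        score(candidate_policy(state), state)
        - score(baseline_policy(state), state)
        for state in historical_states
    ]
    return sum(deltas) / len(deltas)
```

Note that both policies must run against the twin's replicated state, not a synthetic workload — otherwise the comparison answers a different question than "what would have happened in production?"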

Contrast with

  • Deterministic simulation is a testing discipline that replaces the scheduler / network / time with a seeded PRNG for reproducibility — the goal is reproducible behaviour under adversarial schedules, not production fidelity.
  • Shadow traffic mirrors live requests to a candidate implementation. Digital-twin backtesting uses historical state, not live traffic, and runs the authoritative solver, not the candidate.
  • Replay testing replays a recorded workload against a candidate — same shape at a smaller scale.

Seen in
