
SYSTEM Cited by 1 source

Netflix Latency Monkey

Latency Monkey is the Simian Army member that induces artificial delays in Netflix's RESTful client-server communication layer to simulate service degradation and, at large enough delays, full dependency downtime, without physically disabling any instance. It was introduced alongside Chaos Monkey in Netflix's 2011 TechBlog post (Source: sources/2026-01-02-netflix-the-netflix-simian-army).

Purpose

"Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, by making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down."

Why latency injection is a distinct tool

Chaos Monkey kills an instance; every caller of that instance experiences a binary state change (alive → gone) through the load balancer. Latency Monkey is more surgical:

  • It injects failure at the client-server boundary rather than at the instance level: the service under test sees slow or unreachable dependencies, while the rest of the fleet's experience of those dependencies is unchanged.
  • Graduated intensity: modest delays simulate degradation and exercise timeout / circuit-breaker behaviour; large delays simulate outage and exercise graceful-degradation paths.
  • No teardown and restart cost: injection is reversed by removing the configured delay, rather than by waiting for an instance to come back up in the Auto Scaling Group (ASG).

This makes Latency Monkey particularly useful for the case Netflix explicitly calls out: "testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system."
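
The three properties listed above (injection at the client boundary, graduated intensity, cheap reversal) can be illustrated with a minimal sketch. This is not Netflix's implementation, which the 2011 post does not describe; the `LatencyInjector` class, its method names, and the service names in the usage note are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple, TypeVar

T = TypeVar("T")


class LatencyInjector:
    """Hypothetical registry of artificial delays, keyed by (caller, dependency).

    Assumption: every outbound request passes through `call`, so a delay can
    be applied for one caller without touching the dependency itself.
    """

    def __init__(self, sleep: Callable[[float], None] = time.sleep) -> None:
        self._delays: Dict[Tuple[str, str], float] = {}
        self._sleep = sleep  # injectable so the sketch can be tested without waiting

    def configure(self, caller: str, dependency: str, delay_s: float) -> None:
        # Modest values model degradation; very large values model an outage.
        self._delays[(caller, dependency)] = delay_s

    def unconfigure(self, caller: str, dependency: str) -> None:
        # Reversal is a config change; nothing has to be restarted.
        self._delays.pop((caller, dependency), None)

    def call(self, caller: str, dependency: str, request: Callable[[], T]) -> T:
        delay = self._delays.get((caller, dependency), 0.0)
        if delay > 0:
            self._sleep(delay)  # only the configured caller pays the delay
        return request()
```

With `injector.configure("new-service", "ratings", 0.25)`, only requests issued by `new-service` to `ratings` are slowed; any other caller of `ratings` goes through `call` with no delay, and `unconfigure` restores normal behaviour immediately.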

Design notes from the 2011 post

  • Injection point: the RESTful client-server communication layer — i.e. Netflix's inter-service RPC path.
  • Two use modes: service degradation (modest delays) and simulated outage (very large delays).
  • Observability claim: "measures if upstream services respond appropriately" — the monkey includes verification, not just injection.
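
The post does not say how the measurement is done. One plausible reading, sketched below under that assumption, is a probe that slows a dependency, invokes the caller, and records whether the caller still answered within a latency budget. `ProbeResult`, `probe`, the two example callers, and all thresholds are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from dataclasses import dataclass
from typing import Callable, Optional

Dependency = Callable[[], str]


@dataclass
class ProbeResult:
    injected_delay_s: float
    elapsed_s: float
    value: Optional[str]
    raised: bool
    responded_in_budget: bool


def probe(caller: Callable[[Dependency], str], dependency: Dependency,
          delay_s: float, budget_s: float) -> ProbeResult:
    """Run `caller` against a slowed dependency and check its response time."""

    def slowed() -> str:
        time.sleep(delay_s)  # the injected latency
        return dependency()

    start = time.monotonic()
    value, raised = None, False
    try:
        value = caller(slowed)
    except Exception:
        raised = True
    elapsed = time.monotonic() - start
    ok = (not raised) and elapsed <= budget_s
    return ProbeResult(delay_s, elapsed, value, raised, ok)


_pool = ThreadPoolExecutor(max_workers=4)


def caller_with_timeout(dep: Dependency) -> str:
    # Example caller that bounds its wait and degrades to a fallback value.
    future = _pool.submit(dep)
    try:
        return future.result(timeout=0.05)
    except FutureTimeout:
        return "fallback"


def caller_without_timeout(dep: Dependency) -> str:
    # Example caller that blocks for as long as the dependency takes.
    return dep()
```

Probing both callers with `delay_s=0.3` and `budget_s=0.2` separates them: the first returns its fallback well inside the budget, while the second inherits the full injected delay and fails the check.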

Architectural implications

  • Requires a dependency-failure posture in every caller (timeouts, retries, circuit breakers, fallbacks). See concepts/graceful-degradation.
  • Surfaces implicit tight coupling: services whose own availability collapses when a non-critical dependency becomes slow are failing the graceful-degradation contract.
  • Complements systems/netflix-chaos-monkey — Chaos Monkey validates "I can lose any instance"; Latency Monkey validates "I can tolerate any dependency becoming slow."
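
As an illustration of the dependency-failure posture in the first bullet, here is a minimal circuit-breaker sketch with a fallback. It is a generic textbook form, not a Netflix component; the class, its parameters, and the assumption that a slow dependency surfaces as an exception (for example a client-side timeout) are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a pause."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so tests need not wait
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, request: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: skip the dependency entirely and degrade gracefully.
                return fallback()
            # Otherwise half-open: fall through and allow one trial request.
        try:
            result = request()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Under injected latency, a caller wrapped this way stops waiting on the slow dependency after `failure_threshold` timeouts and serves the fallback until `reset_after_s` has passed, which is the behaviour a latency-injection run is meant to exercise.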

Operational numbers

None disclosed in the 2011 post.

Seen in
