Skip to content

SYSTEM Cited by 1 source

Netflix Doctor Monkey

Doctor Monkey is the Simian Army member that detects unhealthy instances via health checks + external signals and evicts them from service, eventually terminating them. Introduced in Netflix's 2011 TechBlog post (Source: sources/2026-01-02-netflix-the-netflix-simian-army).

Purpose

"Doctor Monkey taps into health checks that run on each instance as well as monitors other external signs of health (e.g. CPU load) to detect unhealthy instances. Once unhealthy instances are detected, they are removed from service and after giving the service owners time to root-cause the problem, are eventually terminated."

Role in the Simian Army

Doctor Monkey is a drift detector — like systems/netflix-conformity-monkey and systems/netflix-security-monkey — not a fault injector. The failure mode is "instance is silently broken but still serving." Doctor Monkey's job is to catch that silent-broken state and make it visible-broken, by removing the instance from rotation and (eventually) terminating it.

Two-phase eviction

The post describes a deliberate two-phase eviction flow:

  1. Remove from service — immediate action once the unhealthy signal fires. Traffic stops flowing to the instance.
  2. Eventually terminate — after "giving the service owners time to root-cause the problem." The instance is kept alive (out of rotation) long enough for the owning team to log in and investigate; termination is the cleanup step once investigation is done.

This two-phase shape balances automated remediation (stop serving bad requests immediately) with operational learning (keep the evidence around long enough to debug). Classic SRE-before-the-term pattern.

Health signals

Two signal categories are named:

  • Health checks that run on each instance — the service-level health probe, custom per service.
  • External signs of health (e.g. CPU load) — infrastructure- level observability independent of the service's own health endpoint.

Doctor Monkey fuses both. The post doesn't enumerate all signals; CPU is the only explicit example.

Implementation gaps in the 2011 post

  • Health-check query protocol undocumented.
  • "Eventually terminate" delay not quantified.
  • Decision-to-evict threshold policy undocumented.
  • Integration with auto-scaling replacement undocumented.
  • Notification flow to service owners undocumented.

Operational numbers

None disclosed.

Seen in

Last updated · 319 distilled / 1,201 read