PATTERN Cited by 1 source

Automated cluster standup and decommission

Fully automate the end-to-end lifecycle of serving clusters: from hardware arriving at a data center, through configuration, readiness testing, and gateway registration, to eventual decommission. Standardise on base configurations per use case so that a new cluster is spun up by generating configs from a template with minimal overrides. Wire the workflow into the data-center hardware pipeline so the fleet follows hardware supply without human intervention.
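To make "generating configs from a template with minimal overrides" concrete, here is a minimal sketch. It is not Meta's tooling: the `BASE_CONFIGS` table, its keys and values, and the `generate_config` and `detect_drift` helpers are all hypothetical names chosen for illustration.

```python
from copy import deepcopy

# Hypothetical base templates, one per use case. Keys and values are
# illustrative only and do not correspond to real Presto properties.
BASE_CONFIGS = {
    "interactive": {"max_query_runtime": "10m", "worker_count": 100},
    "batch": {"max_query_runtime": "24h", "worker_count": 400},
}


def generate_config(use_case, overrides=None):
    """Build a cluster config from the use-case template plus overrides."""
    config = deepcopy(BASE_CONFIGS[use_case])  # never mutate the shared template
    config.update(overrides or {})
    return config


def detect_drift(use_case, overrides, deployed):
    """Return {key: (expected, deployed)} for every key that differs."""
    expected = generate_config(use_case, overrides)
    keys = expected.keys() | deployed.keys()
    return {
        k: (expected.get(k), deployed.get(k))
        for k in keys
        if expected.get(k) != deployed.get(k)
    }


if __name__ == "__main__":
    overrides = {"worker_count": 50}
    cfg = generate_config("batch", overrides)
    print(cfg)
    # Simulate someone hand-editing the deployed config.
    deployed = {**cfg, "max_query_runtime": "1h"}
    print(detect_drift("batch", overrides, deployed))
```

Because each cluster is fully described by a use case plus a small override set, the expected config can be regenerated at any time and compared with what is deployed, which is what makes drift detection cheap.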

Why automate

At small fleet sizes, operators can turn up a new cluster and tear down an old one by hand. As the fleet grows to "dozens of clusters across multiple regions", the manual path stops scaling: tracking every change by hand becomes infeasible, people-hours saturate, and hardware sits idle between physical delivery and actual use.

The win Meta reports for its Presto deployment:

"from new hardware showing up at a data center, to Presto clusters being online and serving queries, then being shut off when hardware is decommissioned, is fully automated. Implementing this has saved valuable people-hours, reduced hardware idle time, and minimizes human error."

Machinery required

Base configurations per use case. Every cluster of a given type (interactive, batch, etc.) starts from a shared template with minimal per-cluster overrides, which keeps config generation and drift detection tractable.
Readiness gate: test queries. Before a cluster takes production traffic, automation runs test queries and verifies that they succeed. The cluster is registered with the routing tier (e.g. Meta's Presto Gateway) only if it passes.
Gateway register/deregister as the lifecycle switch. Visibility to the routing plane is the single binary signal of whether a cluster is in or out of the serving fleet.
Integration with the hardware pipeline. The cluster-lifecycle workflow is triggered by the data-center hardware standup/decommission pipeline, not by a separate human process.
Decommission as the reverse. Deregister from the Gateway, wait for running queries to drain, shut down processes, and delete configs.
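The pieces above can be sketched as one small workflow. This is an in-memory simulation under stated assumptions, not the real Presto Gateway or Tupperware APIs: the `Cluster` and `Gateway` classes, the `hardware_ready` and `hardware_retiring` event names, and the way queries "drain" are all hypothetical stand-ins.

```python
import time
from dataclasses import dataclass, field

TEST_QUERIES = ["SELECT 1"]  # illustrative readiness queries


@dataclass
class Cluster:
    """In-memory stand-in for a serving cluster (hypothetical)."""
    name: str
    healthy: bool = True
    processes_up: bool = False
    running_queries: int = 0
    config: dict = field(default_factory=dict)

    def run_test_query(self, sql):
        # A real implementation would submit `sql` and check the result.
        return self.processes_up and self.healthy

    def poll_running_queries(self):
        # Simulation only: report the current count, then let one query finish.
        current = self.running_queries
        if self.running_queries > 0:
            self.running_queries -= 1
        return current


class Gateway:
    """Stand-in routing tier: registration is the in/out-of-fleet switch."""
    def __init__(self):
        self.registered = set()

    def register(self, cluster):
        self.registered.add(cluster.name)

    def deregister(self, cluster):
        self.registered.discard(cluster.name)


def stand_up(cluster, gateway, config):
    """Configure, start, run the readiness gate, then register."""
    cluster.config = dict(config)
    cluster.processes_up = True
    if not all(cluster.run_test_query(q) for q in TEST_QUERIES):
        return False  # failed the gate: never becomes visible to routing
    gateway.register(cluster)
    return True


def decommission(cluster, gateway, poll_seconds=0.0, max_polls=1000):
    """Reverse of stand_up: deregister, drain, stop processes, delete configs."""
    gateway.deregister(cluster)  # stop new traffic first
    for _ in range(max_polls):
        if cluster.poll_running_queries() == 0:
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"{cluster.name} did not drain")
    cluster.processes_up = False
    cluster.config = {}


def on_hardware_event(event, cluster, gateway, config=None):
    """Entry point called by the (hypothetical) hardware pipeline."""
    if event == "hardware_ready":
        return stand_up(cluster, gateway, config or {})
    if event == "hardware_retiring":
        decommission(cluster, gateway)
        return True
    raise ValueError(f"unknown event: {event}")


if __name__ == "__main__":
    gw, c = Gateway(), Cluster("presto-batch-1")
    on_hardware_event("hardware_ready", c, gw, {"worker_count": 400})
    print(sorted(gw.registered))
    c.running_queries = 3
    on_hardware_event("hardware_retiring", c, gw)
    print(sorted(gw.registered), c.processes_up, c.config)
```

Deregistering before draining matters in this sketch: once the routing tier can no longer see the cluster, the running-query count can only fall, so the drain loop terminates unless a query hangs, in which case the poll limit raises instead of waiting forever.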

Seen in

sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale — Meta's Presto fleet. Base configurations per Presto use case; cluster standup integrates with automation hooks for company-wide infrastructure services including Tupperware (Meta's container/cluster manager). Test queries gate readiness; Gateway registration is the lifecycle switch. Workflow is wired into the data-warehouse hardware standup/decommission pipeline, yielding end-to-end automation from hardware arrival to serving.
