
HIGHSCALABILITY 2023-07-16 Tier 1


High Scalability — Lessons Learned Running Presto at Meta Scale

Summary

A Meta-authored operational retrospective (Neerad Somanchi, Production Engineer; Philip Bell, Developer Advocate), published by High Scalability, on running Presto at Meta scale inside the company's internal data warehouse / data lakehouse. The piece is structured around four scaling pain points — release deployment, cluster lifecycle automation, bad-host detection, and load-balancer robustness — each describing a class of production incident and the automation Meta built in response. It closes with four pieces of advice for teams scaling a Presto fleet: establish well-defined customer-facing SLAs, invest in monitoring plus automated debugging, invest in load balancing, and treat configuration management as a first-class concern (hot-reloadable where possible). The article is light on numbers and diagrams, but dense in operational lessons that generalise well beyond Presto to any large multi-tenant SQL query service.

Key takeaways

  1. Every Presto query at Meta traverses a load-balancer-style Gateway. The Gateway (Meta Presto Gateway) fronts "tens of thousands of machines" across dozens of Presto clusters at Meta and is the single routing plane: "our Presto clusters sit behind load balancers which route every single Presto query at Meta." This makes the Gateway both the convergence point for scaling concerns and, historically, a single point of failure that had to be hardened. (Source: §Load balancer robustness.)
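The routing decision such a Gateway makes per query could be sketched as follows. The signals (queue state, hardware distribution, data locality) are the ones the article names under queueing; the field names, scoring, and weights here are illustrative, not Meta's implementation:

```python
def pick_cluster(clusters, table_regions):
    """Pick the Presto cluster to route a query to. Each candidate is
    scored by its current queue depth plus a penalty when the cluster's
    region holds none of the query's tables (data locality).
    Field names and the weight of 10 are illustrative assumptions."""
    def score(cluster):
        locality_penalty = 0 if cluster["region"] in table_regions else 10
        return cluster["queued"] + locality_penalty
    return min(clusters, key=score)["name"]
```

A locality hit outweighs a modest queue-depth difference under these weights; a real Gateway would tune this trade-off per workload.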

  2. Deploy new Presto builds with a canary + shadow cluster pipeline, not a big-bang rollout. A new release is first rolled into a Canary tier that catches most correctness/perf regressions, then promoted to production. For long-running queries that can only be validated after they complete, a Shadow Presto cluster runs alongside prod and receives a mirror of prod queries. Shadow results are compared against prod's for correctness; performance counters and resource usage are compared as well. Only when both Canary and Shadow signals are green does Meta promote the build to the general fleet. This dual-stage pipeline catches classes of regressions a simple canary cannot. (Source: §Deploying new Presto releases. See patterns/canary-and-shadow-cluster-rollout.)
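The shadow-vs-prod comparison step could be sketched like this. The comparison criteria (result correctness, performance counters) come from the article; the order-insensitive fingerprinting and the 20% CPU-regression threshold are assumptions:

```python
import hashlib

def fingerprint(rows):
    # Order-insensitive hash of a result set, so equivalent results
    # returned in a different row order do not flag as a mismatch.
    digests = sorted(hashlib.sha256(repr(r).encode()).hexdigest() for r in rows)
    return hashlib.sha256("".join(digests).encode()).hexdigest()

def compare_shadow(prod_rows, shadow_rows, prod_counters, shadow_counters,
                   cpu_regression_factor=1.2):
    """Compare a mirrored query's shadow run against the prod run.
    cpu_regression_factor (flag >20% CPU growth) is an assumed threshold."""
    return {
        "results_match": fingerprint(prod_rows) == fingerprint(shadow_rows),
        "cpu_regression": shadow_counters["cpu_ms"]
                          > cpu_regression_factor * prod_counters["cpu_ms"],
    }
```

A release gate would then require `results_match` and no `cpu_regression` across the mirrored query corpus before promotion.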

  3. Fully automate cluster standup/decommission end-to-end as the fleet grows. Meta standardised on base configurations per Presto use case; a new cluster is spun up by generating configs from the base template with minimal overrides, then running test queries and auto-registering the cluster with the Gateway once they pass. Decommission is the exact reverse: de-register from the Gateway, drain running queries, shut down Presto processes, delete configs. This workflow is wired into the hardware-standup/decommission pipeline for the data warehouse, so "from new hardware showing up at a data center, to Presto clusters being online and serving queries, then being shut off when hardware is decommissioned, is fully automated." The named win: "saved valuable people-hours, reduced hardware idle time, and minimizes human error." (Source: §Automating standup and decommission of Presto clusters. See patterns/automated-cluster-standup-decommission.)
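The standup gate and its exact-reverse decommission could be sketched as below. The base-config + override merge, test-query readiness gate, and register/deregister lifecycle are from the article; all callables and the smoke queries are illustrative stand-ins:

```python
def stand_up_cluster(cluster, base_config, overrides, run_query, register):
    """Merge the per-use-case base config with minimal overrides, run smoke
    queries, and register with the Gateway only once all of them pass."""
    config = {**base_config, **overrides}
    for sql in ("SELECT 1", "SHOW SCHEMAS"):   # illustrative smoke queries
        if not run_query(cluster, config, sql):
            return False                       # cluster never becomes routable
    register(cluster)
    return True

def decommission_cluster(cluster, deregister, drain, stop, delete_configs):
    """Exact reverse of standup, per the article's description."""
    deregister(cluster)      # Gateway stops routing new queries here
    drain(cluster)           # wait for in-flight queries to finish
    stop(cluster)            # shut down Presto processes
    delete_configs(cluster)  # remove generated configs
```

The value of the ordering is that a failing cluster is never registered, and a decommissioning cluster stops receiving traffic before anything is torn down.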

  4. Attribute each query failure to a host, then auto-drain "bad" hosts when their error rate spikes. Meta observed a recurring failure mode where single "bad" hosts produced disproportionate query failures. Two named root causes: "hardware-level issues which hadn't yet been caught by fleet-wide monitoring systems due to lack of coverage" and "obscure JVM bugs which would sometimes lead to a steady drip of query failures." Their remediation: attribute each failure to the host that caused it where possible, alert when a host's error attribution exceeds a threshold, and auto-drain that host from the Presto fleet. This is a bad-host-detection loop sitting above the standard cluster-health-check machinery — it catches failures the cluster-health probe misses because the host still responds, just incorrectly. (Source: §Bad host detection. See patterns/bad-host-auto-drain.)
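The core of this loop, per-host failure attribution with a drain threshold, might be sketched as follows. The attribute-alert-drain flow is from the article; the threshold value is an assumption (the article names no numbers):

```python
from collections import Counter

def hosts_to_drain(failure_attributions, threshold=25):
    """failure_attributions: iterable of (query_id, host) records over a
    rolling window, one per failed query attributed to a host. A host whose
    attributed-failure count reaches `threshold` (an assumed value) would
    be alerted on and auto-drained from the Presto fleet."""
    counts = Counter(host for _query_id, host in failure_attributions)
    return {host for host, n in counts.items() if n >= threshold}
```

Because this keys off query failures rather than probe responses, it catches the "host responds, just incorrectly" cases that a liveness check misses.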

  5. Queueing problems at scale require purpose-built analyzers, not dashboards. Presto routing at Meta considers multiple signals: "current state of queuing on Presto clusters, distribution of hardware across different datacenters, the data locality of the tables that the query uses." With this complexity, when users complain that queries are queued too long, the oncall cannot eyeball the root cause. Meta built analyzers — tools that pull data from monitoring (ODS, Scuba), host logs, and cluster state, then apply custom logic to narrow the root cause. Monitoring systems fire alerts on customer-facing SLA breaches; analyzers trigger on those alerts and hand the oncall a probable root cause plus mitigation options. For some failure classes Meta has gone further: "we have completely automated both the debugging and the remediation so that the oncall doesn't even need to get involved." (Source: §Debugging queueing issues. See patterns/oncall-analyzer, concepts/automated-root-cause-analysis.)
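A toy version of such an analyzer is a rule chain fired by an SLA-breach alert. The cross-system signals (ODS/Scuba metrics, host logs, cluster state) are stubbed here as plain arguments, and every rule and threshold is illustrative, not Meta's:

```python
def analyze_queueing_alert(cluster_state, drained_hosts, recent_deploys):
    """Walk a priority-ordered rule chain to a probable root cause plus a
    suggested mitigation for a queueing alert. Signals are pre-fetched by
    the caller (in reality from monitoring systems and host logs)."""
    if len(drained_hosts) > 0.2 * cluster_state["total_hosts"]:
        return ("capacity loss: >20% of hosts drained",
                "investigate drains / re-add healthy hosts")
    if recent_deploys:
        return ("new release rolled out during the queueing spike",
                "consider rolling back " + recent_deploys[-1])
    if cluster_state["queued"] > 10 * cluster_state["running"]:
        return ("demand spike: queue far exceeds running queries",
                "shift traffic to sibling clusters via the Gateway")
    return ("unknown", "escalate to oncall with collected signals")
```

For failure classes where the mitigation is safe to apply unattended, the returned action can be executed directly, which is the fully automated remediation the article describes.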

  6. Make the Gateway itself robust to unintended DDoS-style traffic. An early Meta outage class was "one service unintentionally bombarding the Gateway with millions of queries in a short span, resulting in the Gateway processes crashing and unable to route any queries." Two defences were added: (a) Throttling — reject queries when the Gateway is under heavy load, configurable per-user, per-source, per-IP, and globally. (b) Autoscaling — Gateway instance count is now dynamic, riding a Meta-wide autoscaling service so the Gateway can grow under load and shrink when idle. Together these ensure the Gateway "can withstand adverse unpredictable traffic patterns." The architectural insight: even a control-plane-ish component like a query gateway needs both admission control and horizontal elasticity when it becomes every query's single hop. (Source: §Load balancer robustness. See patterns/gateway-throttling-by-dimension, patterns/gateway-autoscaling.)
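Multi-dimension admission control of this shape can be sketched with per-key rate windows: a query is admitted only if every dimension it maps to (user, source, IP, plus a global bucket) is under its limit. The fixed-window counter and all limit values here are assumptions; only the dimensions come from the article:

```python
import time

class DimensionThrottle:
    """Fixed-window request counter per key. `limits` maps dimension keys
    (e.g. "user:alice", "ip:10.0.0.1", "global") to a max count per window;
    keys with no configured limit are unthrottled."""
    def __init__(self, limits, window_s=1.0, clock=time.monotonic):
        self.limits = limits
        self.window_s = window_s
        self.clock = clock
        self.counts = {}
        self.window_start = clock()

    def admit(self, keys):
        now = self.clock()
        if now - self.window_start >= self.window_s:
            self.counts.clear()          # start a fresh window
            self.window_start = now
        # Check every dimension before incrementing any, so a rejected
        # query consumes no budget in any bucket.
        if any(self.counts.get(k, 0) >= self.limits.get(k, float("inf"))
               for k in keys):
            return False
        for k in keys:
            self.counts[k] = self.counts.get(k, 0) + 1
        return True
```

The point of per-dimension keys is that one runaway service exhausts only its own budget, while the global bucket still caps aggregate load on the Gateway.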

  7. Configuration management becomes a pain point at scale — design for hot-reloadability. The closing advice explicitly calls out config management: "Where possible, configurations should be made hot reloadable so that Presto instances do not have to be restarted or updated in a disruptive manner which will result in query failures and customer dissatisfaction." This is the same failure-avoidance logic that drives staged rollouts — restart-to-reconfigure burns in-flight queries and erodes customer trust. (Source: §Advice. See concepts/hot-reloadable-configuration.)
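A minimal hot-reload pattern rereads the config file when its mtime changes, so reconfiguration never requires a process restart. This is a generic sketch of the idea, not how Presto or Meta implements it:

```python
import json
import os

class HotConfig:
    """Serve config values from a JSON file, reloading it whenever the
    file's mtime changes. In-flight work is never interrupted because the
    process keeps running across reloads."""
    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._cfg = {}

    def get(self, key, default=None):
        mtime = os.stat(self.path).st_mtime
        if mtime != self._mtime:         # file changed (or first read)
            with open(self.path) as f:
                self._cfg = json.load(f)
            self._mtime = mtime
        return self._cfg.get(key, default)
```

Production implementations typically add atomic-write handling and validation before swapping in the new config, so a half-written or invalid file cannot take down the service.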

  8. Customer-facing SLAs are the keystone of scale-ops. The first piece of scaling advice: "Defining SLAs around important metrics like queueing time and query failure rate in a manner that tracks customer pain points becomes crucial as Presto is scaled up. When there is a large number of users, the lack of proper SLAs can greatly hinder efforts to mitigate production issues because of confusion in determining the impact of an incident." SLAs at Meta-scale are not merely contracts — they are the trigger for the monitoring → analyzer → automated-remediation stack. Without a shared notion of what "bad" looks like to the customer, every subsequent automation choice is under-defined. (Source: §Advice.)
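An SLA breach check over a window of query samples might look like the sketch below. The two metrics (queueing time, query failure rate) are the ones the article names; the threshold values are assumptions:

```python
def sla_breached(samples, p99_queue_s=60.0, max_failure_rate=0.01):
    """samples: list of (queue_time_s, failed) pairs, one per query in the
    evaluation window. Returns True when either customer-facing SLA metric
    is violated. Thresholds are illustrative, not Meta's."""
    queue_times = sorted(q for q, _ in samples)
    p99 = queue_times[min(len(queue_times) - 1, int(0.99 * len(queue_times)))]
    failure_rate = sum(1 for _, failed in samples if failed) / len(samples)
    return p99 > p99_queue_s or failure_rate > max_failure_rate
```

A check like this is what arms the rest of the stack: the alert it raises is the trigger for the analyzers and automated remediation described above.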

Systems & concepts extracted

Systems

  • Presto — Meta's distributed SQL query engine, the system under operation.
  • Meta Presto Gateway — Meta's internal Gateway fronting every Presto cluster; distinct from the Lyft-origin Trino Gateway open-source lineage though related in role. Routes every Presto query at Meta, supports throttling + autoscaling.
  • Meta Data Warehouse — the multi-datacenter data lakehouse Presto serves. Standup/decommission of Presto clusters is wired into the DW's hardware pipeline.
  • Tupperware — named once: cluster-turnup automation integrates with "automation hooks in order to integrate with the various company-wide infrastructure services like Tupperware." Tupperware is Meta's container/cluster management system.

Concepts

  • Customer-facing SLA — queueing time, query failure rate, defined against user pain, the trigger for alerts and analyzers.
  • Cluster health check — (updated) Meta's standup workflow sends test queries before Gateway-registration; shadow cluster runs parallel health signals against prod mirror.
  • Bad-host detection — attribute query failures to hosts; alert + auto-drain on anomalous attribution count.
  • Queueing — (extended) queueing time as an SLA metric, multi-signal routing to minimize queueing, analyzer-driven RCA when queues back up.
  • Shadow cluster — mirror of production traffic to a separate cluster running the candidate release, for post-compilation / long-running-query validation that a canary can't catch alone.
  • Automated root-cause analysis — analyzer pulls cross-system signals, applies rules, reports probable root cause; some failure classes are auto-remediated with no oncall touch.
  • Hot-reloadable configuration — reconfigure without process restart to avoid burning in-flight queries.
  • Blast radius — canary + shadow + staged promotion are the direct application of blast-radius bounding to a fleet-wide SQL engine release.

Patterns

  • Staged rollout — (extended) canary tier → shadow cluster → production promotion is the Presto-specific realisation.
  • Canary + shadow cluster rollout — dual validation (quick canary + long-running shadow) to catch both compile-time and runtime regressions before general release.
  • Bad-host auto-drain — per-host failure attribution + alert + auto-drain closes the gap between cluster-health probes and subtle per-host failure modes.
  • Automated cluster standup & decommission — base config + override model, test queries as the readiness gate, Gateway register/deregister as the lifecycle switch.
  • Gateway throttling by dimension — admission control at a query gateway with per-user / per-source / per-IP / global knobs.
  • Gateway autoscaling — horizontal elasticity for the query-gateway tier, not just the backend.
  • Oncall analyzer — monitoring-triggered cross-system diagnostic with probable-root-cause output; optionally fully automated remediation.

Operational numbers & architecture notes

Meta's Presto fleet is "spread out across tens of thousands of machines" across "multiple regions"; the Gateway routes "every single Presto query at Meta."

Signals feeding queueing/routing analyzers (named):

  • Operational Data Store (ODS) — Meta's internal metrics system.
  • Scuba — Meta's real-time log/event analytics system.
  • Host-level logs — fetched per-host for failure attribution.

The cluster-lifecycle automation named hooks into:

  • Cluster-turnup / decommission automation.
  • Tupperware (container/cluster manager).
  • Meta-wide autoscaling service (used by the Gateway).

The article describes five figures (not reproduced here; see original URL) illustrating Canary-vs-Shadow deployment, automated hardware add-to-cluster, bad-host detection, load-balancer robustness, and Presto architecture scaling.

Caveats

  • Low on numbers. The article is operational-lessons prose; no query/sec, no p99 latency, no fleet-wide failure-rate stats. Treat size claims ("tens of thousands of machines", "every query") as the article's own framing.
  • Not a Presto internals post. Zero detail on Presto's coordinator/worker architecture, distributed-query planner, exchange operator, or memory model. This is purely an operational-at-scale post.
  • Meta-specific. Tupperware, ODS, and Scuba are internal systems; the generic principles (canary+shadow, bad-host drain, gateway throttling+autoscaling, SLAs-drive-analyzers) port cleanly but the implementation details do not.
  • No contradiction with existing wiki content. The Meta Presto Gateway predates the open-source Trino Gateway lineage but is a distinct internal Gateway — the post does not claim shared code. The Presto page already classifies Presto as legacy / historical predecessor of Trino for the Expedia/Lyft context; this article adds that Presto is still actively operated at Meta-scale in 2023, so the "legacy" framing applies to the open-source split, not to Meta's internal deployment.
