Meta Presto Gateway
Definition
Meta Presto Gateway is Meta's internal load-balancer / proxy tier sitting in front of every Presto cluster at Meta. It is the single routing plane for all Presto queries inside the company: "our Presto clusters sit behind load balancers which route every single Presto query at Meta" (sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale).
It is distinct from the open-source Trino Gateway, which originated at Lyft as Presto Gateway and was later integrated into the Trino ecosystem. The Meta-authored 2023 High Scalability post does not claim any shared lineage; Meta operates its own internal Gateway built for its scale and integration needs.
Role at Meta scale
- Every Presto query traverses the Gateway.
- Routing decisions consume multiple signals: current queueing state of downstream Presto clusters, "distribution of hardware across different datacenters," and "the data locality of the tables that the query uses." This is workload-aware routing extended with data-locality awareness (see also concepts/locality-aware-scheduling).
- Gives clients a single endpoint abstraction over "tens of thousands of machines" spread over multiple regions.
- Cluster lifecycle integration: new Presto clusters register with the Gateway to start receiving traffic; decommissioning Presto clusters deregister before draining (see patterns/automated-cluster-standup-decommission).
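The routing and lifecycle behaviour above can be sketched in a few lines of Python. This is a minimal illustration, not Meta's implementation: the source names the signals (queue state, hardware distribution, data locality) but not how they are combined, so the `Cluster` fields, the `GatewayRegistry` class, and the additive scoring formula are all hypothetical. Hardware distribution is modelled here simply as a per-cluster machine count.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # All fields are illustrative stand-ins for the signals named in the source.
    name: str
    machines: int          # hardware available to this cluster
    queued_queries: int    # current queueing state
    local_tables: set = field(default_factory=set)  # tables stored close to it


class GatewayRegistry:
    """Hypothetical registry: only registered clusters receive traffic."""

    def __init__(self):
        self._clusters = {}

    def register(self, cluster):
        # A newly stood-up cluster registers to start receiving queries.
        self._clusters[cluster.name] = cluster

    def deregister(self, name):
        # A cluster being decommissioned deregisters first, then drains.
        self._clusters.pop(name, None)

    def route(self, query_tables):
        if not self._clusters:
            raise RuntimeError("no registered cluster can accept the query")
        tables = set(query_tables)
        # Lowest score wins; the name breaks ties deterministically.
        return min(
            self._clusters.values(),
            key=lambda c: (self._score(c, tables), c.name),
        )

    @staticmethod
    def _score(cluster, tables):
        # Assumed formula: queue depth relative to hardware, plus the
        # fraction of the query's tables that are not local to the cluster.
        queue_pressure = cluster.queued_queries / cluster.machines
        remote_fraction = len(tables - cluster.local_tables) / max(len(tables), 1)
        return queue_pressure + remote_fraction


registry = GatewayRegistry()
registry.register(Cluster("east", machines=100, queued_queries=10, local_tables={"orders"}))
registry.register(Cluster("west", machines=100, queued_queries=10, local_tables={"clicks"}))
chosen = registry.route(["orders"])  # "east": same queue pressure, better locality
```

With equal queue pressure the query follows its data; if `east` were heavily backlogged, the queue term would outweigh locality and the same query would go to `west`. Clients only ever call `route`, which is the single-endpoint abstraction in miniature.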
Robustness: throttling + autoscaling
Early in Meta's Presto scale-up the Gateway was a single point of failure. The source names one incident class: "one service unintentionally bombarding the Gateway with millions of queries in a short span, resulting in the Gateway processes crashing and unable to route any queries." Two defences were added:
- Throttling by dimension: the Gateway rejects queries under heavy load. The throttle knobs operate across multiple axes: "per user, per source, per IP, and also at a global level for all queries." A runaway batch job therefore cannot starve an interactive dashboard user, and the global knob prevents total collapse.
- Gateway autoscaling: "leaning on a Meta-wide service that supports scaling up and down of jobs, the number of Gateway instances are now dynamic." The Gateway tier scales out under load rather than exhausting CPU and memory on a fixed fleet, "thus preventing the crashing scenario described above."
Together, throttling + autoscaling make the Gateway robust against unintended DDoS-style internal traffic — a class of failure typical for internal shared-infrastructure gateways at hyperscale.
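As a rough illustration of throttling by dimension, the sketch below admits a query only if it is within the limit on every configured axis. The source lists the axes (user, source, IP, global) but not the algorithm, so the sliding-window counter, the `DimensionThrottle` class, and the limit values are assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class DimensionThrottle:
    """Hypothetical sliding-window throttle keyed by (dimension, value)."""

    def __init__(self, limits, window_seconds=1.0, clock=time.monotonic):
        # limits maps a dimension name to the max queries per window,
        # e.g. {"user": 5, "source": 50, "ip": 20, "global": 1000}.
        self.limits = limits
        self.window = window_seconds
        self.clock = clock
        self.events = defaultdict(deque)  # (dimension, value) -> timestamps

    def allow(self, user, source, ip):
        now = self.clock()
        keys = [("user", user), ("source", source), ("ip", ip), ("global", "*")]
        keys = [k for k in keys if k[0] in self.limits]
        for key in keys:
            timestamps = self.events[key]
            # Drop timestamps that have aged out of the window.
            while timestamps and now - timestamps[0] >= self.window:
                timestamps.popleft()
            if len(timestamps) >= self.limits[key[0]]:
                return False  # rejected queries are not counted anywhere
        # Record the query only once every dimension has accepted it.
        for key in keys:
            self.events[key].append(now)
        return True


fake_now = [0.0]  # injectable clock so the example is deterministic
throttle = DimensionThrottle({"user": 2, "global": 3}, clock=lambda: fake_now[0])
results = [throttle.allow("batch-job", "etl", "10.0.0.1") for _ in range(3)]
dashboard_ok = throttle.allow("analyst", "dashboard", "10.0.0.2")
```

Here the third query from `batch-job` is rejected by the per-user limit, while the `analyst` query is still admitted because the global budget has not been used up. Checking all dimensions before recording anything keeps a rejected query from consuming quota on the axes it did pass.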
Distinction from Trino Gateway
| Aspect | Meta Presto Gateway | Trino Gateway |
|---|---|---|
| Origin | Meta-internal | Lyft-origin (Presto Gateway), now Trino OSS |
| Engine | PrestoDB | Trino |
| Code base | Proprietary | Open source (trinodb/trino-gateway) |
| Routing signals | Queue state, DC topology, data locality | Routing rules on query body + headers + cluster health |
| Admission control | Per-user/-source/-IP/-global throttling | Health-based cluster selection |
| Elasticity | Meta-wide autoscaling service | Operator-managed |
Both implement the query-gateway pattern and share the "single connection URL + route-per-query" shape; the specifics above diverge.
Seen in
- sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale — primary source for this page. Named scale: "tens of thousands of machines" behind the Gateway; the Gateway routes every Presto query at Meta. Robustness story: throttling + autoscaling after initial outage class.
Related
- systems/presto — the query engine being fronted.
- systems/meta-data-warehouse — the data platform this Gateway serves.
- systems/trino-gateway — the open-source cousin with a related but distinct lineage.
- patterns/query-gateway — the general architectural pattern.
- patterns/gateway-throttling-by-dimension — the admission-control pattern applied here.
- patterns/gateway-autoscaling — the elasticity pattern applied here.
- concepts/workload-aware-routing — the routing discipline.
- companies/meta