Meta Data Warehouse
Definition
Meta's Data Warehouse (a.k.a. the Meta data lakehouse) is the company-wide analytical data platform served by Presto and fronted by Meta Presto Gateway. The 2023 High Scalability retrospective frames it as "our Data Lakehouse" spanning multiple data-center regions with distinct data-locality characteristics that drive query-routing decisions.
Distinguishing characteristics (from the source)
- Multi-datacenter. "The distribution of the data warehouse at Meta across different regions is constantly evolving."
- Evolving footprint. Presto clusters are continuously stood up and decommissioned as hardware cycles through data centers, so the hardware standup and decommission workflow is wired directly into the Presto cluster lifecycle automation (see patterns/automated-cluster-standup-decommission).
- Data locality informs query routing. The Gateway's routing decision considers "the data locality of the tables that the query uses" — queries are preferentially dispatched to Presto clusters whose underlying storage holds the referenced tables, minimising cross-region I/O.
- Mixed workload. Both interactive and long-running queries run through the Presto / Gateway stack. The long-running class is what forces Meta to run shadow clusters at release-validation time, since ordinary canary clusters do not exercise long-running queries for long enough.
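
The data-locality point above can be illustrated with a small sketch. This is not Meta's Gateway implementation: the source only says that routing considers "the data locality of the tables that the query uses". The `Cluster` type, the `pick_cluster` function, the `running_queries` load signal, and the rule "most local tables first, lowest load as tie-breaker" are all hypothetical, chosen only to show the shape of a locality-aware routing decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical description of one Presto cluster."""
    name: str
    region: str
    running_queries: int  # assumed load signal, not from the source


def pick_cluster(query_tables, table_regions, clusters):
    """Pick the cluster co-located with the most referenced tables.

    query_tables:  table names referenced by the query
    table_regions: dict mapping table name -> set of regions holding it
    clusters:      candidate Cluster objects
    Ties on locality are broken by the lowest running_queries (an assumption).
    """
    if not clusters:
        raise ValueError("no clusters available")

    def local_table_count(cluster):
        # Count referenced tables stored in this cluster's region.
        return sum(
            1 for table in query_tables
            if cluster.region in table_regions.get(table, set())
        )

    return max(
        clusters,
        key=lambda c: (local_table_count(c), -c.running_queries),
    )


# Example data (entirely made up for illustration).
clusters = [
    Cluster("presto-a", "region-1", running_queries=10),
    Cluster("presto-b", "region-2", running_queries=50),
    Cluster("presto-c", "region-2", running_queries=5),
]
table_regions = {
    "events": {"region-2"},
    "users": {"region-1", "region-2"},
}

chosen = pick_cluster(["events", "users"], table_regions, clusters)
print(chosen.name)  # presto-c: both tables are local to region-2, and it is less loaded
```

A real router would weigh more signals than this (the source also mentions an evolving regional footprint), so the scoring rule here should be read as a placeholder rather than a description of production behaviour.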
Integration with Meta infra
- Tupperware — Meta's container/cluster manager, named as the integration hook for automated Presto cluster turn-up: "Cluster turnup also required integration with automation hooks in order to integrate with the various company-wide infrastructure services like Tupperware and data warehouse-specific services."
- Operational Data Store (ODS) — monitoring metrics consumed by oncall analyzers.
- Scuba — real-time event analytics, also consumed by analyzers.
Seen in
- sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale: the primary source for the Meta Data Warehouse on the wiki. It describes the warehouse as the context in which the Presto fleet operates, with the hardware-lifecycle pipeline feeding Presto cluster standup and decommission.
Related
- systems/presto — primary SQL engine over the warehouse.
- systems/meta-presto-gateway — query router for the warehouse.
- systems/tupperware — cluster management substrate.
- companies/meta