Meta Data Warehouse

Definition

Meta's Data Warehouse (a.k.a. the Meta data lakehouse) is the company-wide analytical data platform served by Presto and fronted by Meta Presto Gateway. The 2023 High Scalability retrospective frames it as "our Data Lakehouse" spanning multiple data-center regions with distinct data-locality characteristics that drive query-routing decisions.

Distinguishing characteristics (from the source)

  • Multi-datacenter. "The distribution of the data warehouse at Meta across different regions is constantly evolving."
  • Evolving footprint. Presto clusters are continuously stood up and decommissioned as hardware cycles through data centers. The hardware standup and decommission workflow is wired directly into the Presto cluster lifecycle automation (see patterns/automated-cluster-standup-decommission).
  • Data locality informs query routing. The Gateway's routing decision considers "the data locality of the tables that the query uses" — queries are preferentially dispatched to Presto clusters whose underlying storage holds the referenced tables, minimising cross-region I/O.
  • Mixed workload. Both interactive and long-running queries run through the Presto / Gateway stack. The long-running class is what forces Meta to run shadow clusters at release-validation time, because ordinary canary clusters do not run long enough to exercise those queries.
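
The locality-aware routing described above can be illustrated with a minimal sketch. This is not the Gateway's actual implementation, which the source does not describe in detail; the `Cluster` type, the `route_query` function, the `accepting_queries` flag, and the cluster, region, and table names are all hypothetical. The sketch assumes the router knows which regions hold each table and simply prefers the available cluster whose region holds the most tables referenced by the query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical view of a Presto cluster as seen by a router."""
    name: str
    region: str
    # Assumed flag: False while a cluster is being stood up or decommissioned.
    accepting_queries: bool = True


def route_query(query_tables, table_regions, clusters):
    """Return the available cluster whose region holds the most referenced tables.

    query_tables:  table names used by the query
    table_regions: mapping of table name -> set of regions storing that table
    clusters:      candidate Cluster objects
    """
    candidates = [c for c in clusters if c.accepting_queries]
    if not candidates:
        raise RuntimeError("no cluster is accepting queries")

    def local_table_count(cluster):
        # Count referenced tables that are stored in this cluster's region.
        return sum(
            1 for table in query_tables
            if cluster.region in table_regions.get(table, set())
        )

    # max() keeps the first maximal element, so ties go to the earliest candidate.
    return max(candidates, key=local_table_count)


clusters = [
    Cluster("presto-a", "region-1"),
    Cluster("presto-b", "region-2"),
    Cluster("presto-c", "region-2", accepting_queries=False),
]
table_regions = {
    "events": {"region-2"},
    "users": {"region-1", "region-2"},
}

chosen = route_query(["events", "users"], table_regions, clusters)
print(chosen.name)
```

A real router would presumably weigh other signals as well, such as cluster load; counting local tables is only the simplest way to show locality influencing the choice. Skipping clusters that are not accepting queries mirrors the evolving footprint noted above, where clusters enter and leave service continuously.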

Integration with Meta infra

  • Tupperware: Meta's container/cluster manager, named as one of the company-wide services that automated Presto cluster turn-up integrates with through automation hooks: "Cluster turnup also required integration with automation hooks in order to integrate with the various company-wide infrastructure services like Tupperware and data warehouse-specific services."
  • Operational Data Store (ODS): source of monitoring metrics consumed by oncall analyzers.
  • Scuba: real-time event analytics, also consumed by oncall analyzers.
