SYSTEM Cited by 3 sources
Apache Hive¶
Apache Hive is the Hadoop-era SQL-on-HDFS query engine + metastore — the system that popularised the "tables over a data lake" idea at Facebook in the late 2000s and became the default metadata catalog for early cloud data lakes. The Hive Metastore (HMS) protocol is still the catalog interface many newer engines implement (or federate with).
Role for this wiki¶
Hive appears in two guises:
- As a legacy SQL engine over HDFS / S3 — commonly superseded by newer engines (systems/apache-spark, systems/amazon-athena, systems/amazon-redshift, Trino) but still running in many enterprise stacks.
- As a catalog lineage ancestor — the "table over data files" concept that open table formats like systems/apache-iceberg generalised and made transactional.
Seen in¶
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — Amazon BDT's post-Oracle BI stack started with "a mix of Amazon Redshift, Amazon RDS, and Apache Hive on systems/amazon-emr" for compute over S3. Hive is named as one of the compute frameworks a table subscriber could pick from.
- sources/2026-01-06-lyft-feature-store-architecture-optimization-and-evolution
— Hive tables are the offline storage lane of Lyft's
Feature Store. Each batch feature
is defined with a SparkSQL query reading from Hive tables, and
the Airflow-generated DAG writes the dataframe output back to
Hive for historical analysis + ML model training (plus to the
online
dsfeatureslayer). Concrete instance of Hive-as-data-lake-catalog in an ML feature pipeline. - sources/2024-12-02-meta-built-large-scale-cryptographic-monitoring — Hive's origin company (Meta / Facebook) cites its internal Hive-based warehouse as the cold-storage tier for cryptographic-monitoring telemetry — the long-retention layer beneath Scuba's warm tier. The public reference is the 2010 Hive paper from which Apache Hive descends. See systems/meta-hive for the Meta-internal variant.
Related¶
- systems/apache-spark — the successor generalist engine.
- systems/amazon-emr — typical AWS managed substrate.
- systems/apache-iceberg — Iceberg modernises what Hive's metastore + partitioned-directory layout did.
- systems/aws-glue — AWS's managed Hive-Metastore-compatible catalog service.