SYSTEM Cited by 4 sources
AWS Glue
AWS Glue is AWS's serverless ETL and data-catalog offering. It bundles a Hive-Metastore-compatible catalog (the "Glue Data Catalog") with a serverless Spark runtime and, more recently, a serverless Ray runtime (see systems/aws-glue-for-ray). It sits alongside systems/amazon-emr as the serverless option for Spark jobs on AWS, and its catalog is commonly used as the metadata catalog for data-lake engines (Athena, Redshift Spectrum, EMR, Spark, Databricks).
Role for this wiki
Glue appears in two roles:
- Serverless Spark / Ray runtime — a managed alternative to running Ray or Spark on raw systems/aws-ec2.
- Catalog substrate — Iceberg tables on AWS Glue are the canonical "tables on S3 with a catalog" shape outside Databricks.
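
As a rough illustration of the "catalog substrate" role, the sketch below builds the Spark properties commonly used to point an Apache Iceberg catalog at the Glue Data Catalog. The catalog name `glue` and the warehouse path are hypothetical placeholders, and the helper functions are illustrative only; none of the sources cited on this page describe their exact configuration.

```python
from typing import Dict, List


def iceberg_glue_catalog_conf(catalog_name: str, warehouse: str) -> Dict[str, str]:
    """Return Spark properties for an Iceberg catalog backed by the Glue Data Catalog.

    catalog_name and warehouse are caller-supplied; the values used in the
    example at the bottom are placeholders, not taken from any source.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg's SQL extensions in the Spark session.
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers a Spark catalog implemented by Iceberg.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Tells Iceberg to keep table metadata pointers in Glue.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # Reads and writes data files through Iceberg's S3 FileIO.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        # Default S3 location for new tables.
        f"{prefix}.warehouse": warehouse,
    }


def as_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten the properties into repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    conf = iceberg_glue_catalog_conf("glue", "s3://example-bucket/warehouse")
    print(" ".join(as_spark_submit_args(conf)))
```

With properties like these in place, tables would typically be addressed as `glue.<database>.<table>` from Spark SQL. The Iceberg AWS and Spark runtime jars must also be on the classpath, which this sketch does not handle.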
Seen in
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — named as one of the compute-framework choices for Amazon BDT table subscribers; and systems/aws-glue-for-ray called out as one of the two managed Ray runtimes that mean users don't need to build their own serverless Ray job management (alongside systems/anyscale-platform).
- sources/2026-04-20-databricks-mercedes-benz-cross-cloud-data-mesh — Mercedes-Benz's ~60 TB after-sales dataset is stored as Iceberg-on-AWS-Glue on the producer side, then federated into systems/unity-catalog for cross-cloud sharing via Delta Sharing.
- sources/2025-05-27-yelp-revenue-automation-series-testing-an-integration-with-third-party-system — Yelp's Staging Pipeline publishes Revenue Data Pipeline output to AWS Glue data catalog tables on S3, queryable immediately via Redshift Spectrum. This bypasses the ~10-hour Redshift Connector latency for same-day verification.
- sources/2025-09-26-yelp-s3-server-access-logs-at-scale —
  Glue Data Catalog as the cross-account Athena query fabric for S3 Server Access Logs at fleet scale. Yelp uses a single "querying" AWS account that registers Glue Data Catalogs from every source account (via `ListDataCatalogs`), so one IAM role can query `FROM "catalog"."database"."table_region"` across the fleet without role-pivoting. Canonical wiki instance of choosing Glue partition projection over managed partitions (patterns/projection-partitioning-over-managed-partitions): `enum` for `bucket_name` (with a Lambda + SQS + EventBridge loop keeping the partition list fresh) and date-granular `timestamp`. Glue table SERDE properties carry the SAL regex (`input.regex`) with Yelp's optional-non-capturing-tail fix for user-controlled fields.
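
The partition-projection shape in the last entry can be sketched as the table properties below. This is a minimal illustration, not Yelp's actual configuration: the `projection.*` and `storage.location.template` keys are the standard Athena partition-projection properties, but the bucket names, date format, start date, S3 layout, and both helper functions are assumptions made for the example.

```python
from datetime import date
from typing import Dict, Iterable


def projection_parameters(
    bucket_names: Iterable[str],
    start: date,
    location_template: str,
) -> Dict[str, str]:
    """Build Glue table parameters for enum + date partition projection.

    All argument values are hypothetical; the date format and one-day
    interval are assumptions, not details taken from the source.
    """
    values = sorted(set(bucket_names))
    if not values:
        raise ValueError("enum projection needs at least one bucket name")
    return {
        "projection.enabled": "true",
        # enum: the full list of partition values is stored on the table,
        # so it has to be refreshed whenever buckets are added or removed.
        "projection.bucket_name.type": "enum",
        "projection.bucket_name.values": ",".join(values),
        # date: partitions are computed from a range instead of registered.
        "projection.timestamp.type": "date",
        "projection.timestamp.format": "yyyy/MM/dd",
        "projection.timestamp.range": f"{start:%Y/%m/%d},NOW",
        "projection.timestamp.interval": "1",
        "projection.timestamp.interval.unit": "DAYS",
        # Maps projected partition values onto S3 prefixes.
        "storage.location.template": location_template,
    }


def qualified_table(catalog: str, database: str, table: str) -> str:
    """Return the three-part, double-quoted name used in a cross-catalog query."""
    return f'"{catalog}"."{database}"."{table}"'


if __name__ == "__main__":
    params = projection_parameters(
        ["example-bucket-b", "example-bucket-a"],
        date(2024, 1, 1),
        "s3://example-log-bucket/${bucket_name}/${timestamp}/",
    )
    for key, value in params.items():
        print(f"{key} = {value}")
    print("SELECT * FROM", qualified_table("source_account", "logs", "sal_us_west_2"))
```

A dictionary like this would typically be supplied as the table's `Parameters` when creating or updating the Glue table. A refresh loop such as the Lambda + SQS + EventBridge one described above would then only need to rewrite `projection.bucket_name.values`.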
Related
- systems/aws-glue-for-ray — managed Ray subflavor.
- systems/apache-spark — Glue's longer-established runtime.
- systems/apache-iceberg — typical catalog-resident table format on Glue.
- systems/apache-hive — Glue Data Catalog speaks the Hive Metastore protocol.