SYSTEM Cited by 4 sources

AWS Glue¶

AWS Glue is AWS's serverless ETL and data-catalog offering. It bundles a Hive-Metastore-compatible catalog (the "Glue Data Catalog") with a serverless Spark runtime and, more recently, a serverless Ray runtime (see systems/aws-glue-for-ray). It sits alongside systems/amazon-emr as the serverless option for Spark jobs on AWS, and is commonly used as the metadata catalog for data-lake engines (Athena, Redshift Spectrum, EMR, Spark, Databricks).

Role for this wiki¶

Glue appears in two roles:

Serverless Spark / Ray runtime — a managed alternative to running Ray or Spark on raw systems/aws-ec2.
Catalog substrate — Iceberg tables on AWS Glue are the canonical "tables on S3 with a catalog" shape outside Databricks.
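
As a rough illustration of the catalog-substrate role, the sketch below assembles the Spark settings commonly used to point an Apache Iceberg catalog at the Glue Data Catalog. The catalog name `glue_catalog` and the warehouse path are hypothetical placeholders, and none of the sources cited on this page describe their exact configuration, so treat this as one plausible setup rather than what any of them run.

```python
# Sketch only: Spark settings that point an Iceberg catalog at the Glue Data Catalog.
# The catalog name and warehouse path used below are hypothetical placeholders.

def iceberg_glue_spark_conf(catalog_name: str, warehouse: str) -> dict:
    """Return Spark config entries for an Iceberg catalog backed by AWS Glue."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg's SQL extensions (MERGE INTO, CALL procedures, ...).
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers a Spark catalog implemented by Iceberg.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Tells Iceberg to keep table metadata pointers in the Glue Data Catalog.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # Reads and writes table files through Iceberg's S3 file IO.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        # S3 prefix under which new tables are created (assumed value).
        f"{prefix}.warehouse": warehouse,
    }


conf = iceberg_glue_spark_conf("glue_catalog", "s3://example-bucket/warehouse")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

The printed `--conf` pairs could be passed to `spark-submit` or set on a `SparkSession` builder; tables would then be addressed as `glue_catalog.<database>.<table>`.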

Seen in¶

sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — Glue is named as one of the compute-framework choices for Amazon BDT table subscribers, and systems/aws-glue-for-ray is called out as one of the two managed Ray runtimes (alongside systems/anyscale-platform) that spare users from building their own serverless Ray job management.
sources/2026-04-20-databricks-mercedes-benz-cross-cloud-data-mesh — Mercedes-Benz's ~60 TB after-sales dataset is stored as Iceberg-on-AWS-Glue on the producer side, then federated into systems/unity-catalog for cross-cloud sharing via Delta Sharing.
sources/2025-05-27-yelp-revenue-automation-series-testing-an-integration-with-third-party-system — Yelp's Staging Pipeline publishes Revenue Data Pipeline output to AWS Glue Data Catalog tables on S3, queryable immediately via Redshift Spectrum. This bypasses the ~10-hour Redshift Connector latency, which makes same-day verification possible.
sources/2025-09-26-yelp-s3-server-access-logs-at-scale — Glue Data Catalog serves as the cross-account Athena query fabric for S3 Server Access Logs (SAL) at fleet scale. Yelp uses a single "querying" AWS account that registers Glue Data Catalogs from every source account (via ListDataCatalogs), so one IAM role can query FROM "catalog"."database"."table_region" across the fleet without role-pivoting. This is the wiki's canonical instance of choosing Glue partition projection over managed partitions (patterns/projection-partitioning-over-managed-partitions): an enum projection for bucket_name (with a Lambda + SQS + EventBridge loop keeping the partition list fresh) and a date-granular projection for timestamp. The Glue table's SERDE properties carry the SAL regex (input.regex) with Yelp's optional-non-capturing-tail fix for user-controlled fields.
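
The partition-projection and cross-catalog query shapes in the last entry can be sketched as follows. The property keys are the standard Athena partition-projection table properties. The bucket names, start date, S3 location template, and the catalog, database, and table names are hypothetical stand-ins, not Yelp's actual values. In the setup described above, the enum value list would be regenerated by the Lambda + SQS + EventBridge loop; here it is just a function argument.

```python
# Sketch only: Athena partition-projection properties for a table with an enum
# partition (bucket_name) and a date partition (timestamp), plus a helper that
# builds a fully qualified cross-catalog query. All concrete values are hypothetical.

def projection_properties(bucket_names, start_date, location_template):
    """Build the table-properties map for enum + date partition projection."""
    if not bucket_names:
        raise ValueError("enum projection needs at least one value")
    return {
        "projection.enabled": "true",
        # Enum projection: Athena enumerates exactly these partition values.
        "projection.bucket_name.type": "enum",
        "projection.bucket_name.values": ",".join(sorted(bucket_names)),
        # Date projection at day granularity, open-ended up to the current time.
        "projection.timestamp.type": "date",
        "projection.timestamp.format": "yyyy/MM/dd",
        "projection.timestamp.range": f"{start_date},NOW",
        "projection.timestamp.interval": "1",
        "projection.timestamp.interval.unit": "DAYS",
        # Maps partition values onto S3 prefixes without stored partition metadata.
        "storage.location.template": location_template,
    }


def fleet_query(catalog, database, table, limit=10):
    """Return a SELECT against "catalog"."database"."table" with quoted identifiers."""
    for name in (catalog, database, table):
        if '"' in name:
            raise ValueError(f"identifier must not contain a double quote: {name!r}")
    return f'SELECT * FROM "{catalog}"."{database}"."{table}" LIMIT {int(limit)}'


props = projection_properties(
    ["example-bucket-b", "example-bucket-a"],
    "2024/01/01",
    "s3://example-sal-logs/${bucket_name}/${timestamp}/",
)
query = fleet_query("example_catalog", "example_db", "example_table_us_west_2", 5)
print(props["projection.bucket_name.values"])
print(query)
```

Because projected partitions are computed from these properties at query time, nothing needs to be written to the catalog per partition; only the enum value list has to be refreshed when a bucket is added or removed.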

Related¶

systems/aws-glue-for-ray — managed Ray subflavor.
systems/apache-spark — Glue's longer-established runtime.
systems/apache-iceberg — typical catalog-resident table format on Glue.
systems/apache-hive — Glue Data Catalog is Hive-Metastore-compatible.
