SYSTEM Cited by 2 sources

Apache Hudi¶

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is one of the three canonical open table formats alongside systems/apache-iceberg and systems/delta-lake — all layered over systems/apache-parquet on object storage to provide transactional row-level inserts, updates, deletes, and schema evolution on top of immutable objects.

Minimum viable framing for this wiki: same structural role as Iceberg — see concepts/open-table-format for the shared shape. Hudi's distinguishing emphasis is near-real-time upsert ingestion with two on-disk storage modes: Copy-on-Write (concepts/copy-on-write-merge) and Merge-on-Read (read-time merge of a base file + a log of delta records).

Seen in¶

sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — named alongside Iceberg and Delta Lake as the open table formats systems/deltacat (and therefore the Amazon BDT Flash Compactor architecture) is designed to extend to.
sources/2024-03-14-highscalability-brief-history-of-scaling-uber — Clemm's Uber retrospective attributes Hudi's origin to Uber's need to "power business critical data pipelines at low latency and high efficiency" — Hudi was built at Uber and graduated to Apache Top-Level Project status.

systems/apache-iceberg, systems/delta-lake — peer OTFs.
systems/apache-parquet — base data-file format.
concepts/open-table-format — the category these formats occupy.
concepts/copy-on-write-merge — one of Hudi's two storage-mode strategies.
concepts/change-data-capture — Hudi's canonical upstream workload is CDC from an OLTP source.
companies/uber — origin org.

Apache Hudi¶

Seen in¶

Related¶