SYSTEM Cited by 1 source

Databricks Auto Loader

Databricks Auto Loader is a high-throughput Spark Structured Streaming source that incrementally discovers and processes new files landing in cloud object storage (S3 / ADLS / GCS) without requiring manual file listing or state management.

Official docs.

Core mechanics

  • File discovery: Auto Loader finds new files through object-storage file notifications (where available) or directory listing, and records which files it has already processed in a state store kept in the stream's checkpoint location.
  • Incremental processing: new files are processed in micro-batches as they arrive, driving downstream Structured Streaming transformations.
  • Automatic state: Auto Loader persists metadata about discovered files itself, so users don't implement their own watermark / checkpoint logic for file discovery.
  • Near-real-time arrival patterns: designed for steady streams of new files (cloud-native logs, metric exports, CDC outputs).
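
The mechanics above can be sketched as a minimal PySpark pipeline. This is an illustration under stated assumptions, not the configuration used by any system cited here: the helper names (`auto_loader_options`, `start_ingest`), the paths, and the table name are hypothetical. The `cloudFiles` source and the `cloudFiles.*` option keys are Auto Loader's documented ones, and the stream itself only runs on a Databricks runtime.

```python
from typing import Dict


def auto_loader_options(
    file_format: str,
    schema_location: str,
    use_notifications: bool = False,
) -> Dict[str, str]:
    """Build reader options for the `cloudFiles` (Auto Loader) source."""
    opts = {
        # Format of the incoming files, e.g. "json", "csv", "parquet".
        "cloudFiles.format": file_format,
        # Where Auto Loader stores the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
    }
    if use_notifications:
        # Switch discovery from directory listing to file notifications.
        opts["cloudFiles.useNotifications"] = "true"
    return opts


def start_ingest(spark, source_path: str, target_table: str, checkpoint_path: str):
    """Start a stream that picks up new files under `source_path`.

    `spark` is an active SparkSession on a Databricks runtime; all paths
    and the table name are placeholders supplied by the caller.
    """
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options("json", checkpoint_path))
        .load(source_path)
    )
    return (
        stream.writeStream
        # Processed-file state is kept here, so a restart resumes
        # without re-reading files that were already ingested.
        .option("checkpointLocation", checkpoint_path)
        .toTable(target_table)
    )
```

The only state-related setting the caller provides is the checkpoint location. Which files have been seen is tracked by Auto Loader itself, which is the "automatic state" property described in the list.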

Seen in

  • sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring: canonical observability ingestion use case. Hydra uses Auto Loader as its Structured Streaming source to "efficiently discover and ingest millions of object storage files" at the 20-billion-active-timeseries scale. Auto Loader "automatically persists metadata about discovered files and scales to handle near-real-time arrival patterns," making it viable as the front door for a lakehouse-native observability platform.