PATTERN Cited by 1 source

Managed table as default storage layer

Pattern

Use managed tables (catalog-owned storage discipline) as the default storage primitive across the entire data platform — not just for BI-serving Gold-layer tables, but across Bronze / Silver / Gold layers of the medallion architecture. The substrate (e.g. Unity Catalog) takes ownership of read, write, storage, and optimisation responsibilities, unlocking automatic features (Predictive Optimization, automatic Liquid Clustering, always-on metadata caching) that are not available with external tables. Reserve external tables for the cases where customer-owned storage paths are strictly required.

Forces

  • Optimisation pipelines the user has to maintain (OPTIMIZE, VACUUM, ANALYZE schedules).
  • Stats drift when manual ANALYZE lags behind write churn.
  • Inconsistent storage layout across tables — some clustered, some partitioned, some neither — making cross-table optimisations awkward.
  • Metadata-listing latency at query-planning time when the catalog isn't aware of underlying file layout.
  • Governance drift when external tables have different access control surfaces than catalog-tracked tables.

Solution shape (verbatim from the source)

"Unity Catalog managed tables are the foundation for everything else in this stack. Unity Catalog manages all read, write, storage, and optimization responsibilities for managed tables. This unlocks automatic features you don't get with external tables: Predictive Optimization (covered below) is enabled by default. Automatic liquid clustering selects clustering keys that adapt as query patterns change. Metadata caching is always on, reducing cloud storage requests and speeding up query planning."

"Use managed tables throughout the platform — not just for BI-serving, but across Bronze, Silver, and Gold layers. They're the default table type in Unity Catalog, and the performance and governance benefits compound with every other optimization in this stack."

The recommendation is default-on across all medallion layers, not just the Gold tier. The rationale is that optimisation benefits compound: Bronze tables that are well clustered and have fresh statistics make Silver pipelines faster, which makes Gold materializations faster, which in turn makes BI consumer queries faster.
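
As a rough illustration of "managed across every layer", the sketch below builds CREATE TABLE statements for Bronze, Silver, and Gold. It relies on one Unity Catalog behaviour: a CREATE TABLE with no LOCATION clause yields a managed table. The catalog name `main`, the per-layer schema names, the `orders` table, and its columns are hypothetical, and the helper `managed_table_ddl` is not from the source.

```python
LAYERS = ("bronze", "silver", "gold")  # hypothetical schema names, one per medallion layer


def managed_table_ddl(catalog: str, layer: str, table: str, columns: dict[str, str]) -> str:
    """Build a CREATE TABLE statement with no LOCATION clause.

    Omitting LOCATION leaves storage to the catalog, i.e. a managed table.
    """
    if layer not in LAYERS:
        raise ValueError(f"unknown medallion layer: {layer!r}")
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {catalog}.{layer}.{table} ({cols})"


# Same table shape in every layer; none of the statements pins a storage path.
statements = [
    managed_table_ddl("main", layer, "orders", {"order_id": "BIGINT", "order_ts": "TIMESTAMP"})
    for layer in LAYERS
]
```

The statements would be submitted through whatever SQL client the platform uses; the point of the sketch is only that the Bronze and Silver DDL looks no different from the Gold DDL.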

Three properties unlocked by managed tables

Property | What it does | Why default-on
Predictive Optimization | Auto OPTIMIZE / VACUUM / stats collection on tables that would benefit. | "One of the highest-return, lowest-effort optimizations you can make."
Automatic Liquid Clustering | Selects clustering keys based on observed query patterns (CLUSTER BY AUTO). | Removes the need to predict clustering keys at table-creation time.
Always-on metadata caching | Reduces cloud storage requests and speeds up query planning. | Free win; nothing to configure.

All three are unavailable on external tables, so the choice between managed and external is not symmetric: managed tables come with substrate-owned optimisation as a default property.
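
One way to make that asymmetry concrete is a small helper that emits the opt-in statements for a managed table and refuses an external one. `CLUSTER BY AUTO` comes from the table above; the `ALTER TABLE ... CLUSTER BY AUTO` and `ALTER SCHEMA ... ENABLE PREDICTIVE OPTIMIZATION` forms are written from the Databricks SQL reference as best understood here and should be verified before use. The function name and the schema and table names are hypothetical.

```python
def optimisation_statements(schema_fqn: str, table: str, table_type: str) -> list[str]:
    """Return SQL that opts a managed table into catalog-driven optimisation."""
    if table_type.upper() != "MANAGED":
        # Per the source, these features are not available on external tables.
        raise ValueError(f"{schema_fqn}.{table} is {table_type}, not MANAGED")
    return [
        # The source says Predictive Optimization is on by default; this only makes it explicit.
        f"ALTER SCHEMA {schema_fqn} ENABLE PREDICTIVE OPTIMIZATION",
        # Let the catalog pick and adapt clustering keys instead of fixing them up front.
        f"ALTER TABLE {schema_fqn}.{table} CLUSTER BY AUTO",
    ]


sql = optimisation_statements("main.silver", "orders", "MANAGED")
```

Metadata caching has no statement in the sketch because, per the source, there is nothing to configure.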

Concrete steps

  1. Default new tables to managed. This is already the default in Unity Catalog; the pattern is to not override it without reason.
  2. Migrate existing external tables where the customer-owned storage path is not load-bearing. The Databricks source notes: "To move existing data, see the migration guide for converting external tables to managed." (See UC Open APIs source for the version-coordination contract.)
  3. Apply across medallion layers. Don't reserve managed tables for the Gold tier — Bronze and Silver also benefit from automatic optimisation, even though their access patterns differ.
  4. Reserve external tables for three cases: shared-storage scenarios (multi-platform read); customer-managed-key requirements where the catalog can't host the data; and vendor-lock-in concerns where the user explicitly wants storage decoupled from the catalog.
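
The steps above can be sketched as a triage over a table inventory: anything already managed is left alone, external tables that hit one of the step 4 exceptions stay external, and the rest become migration candidates. The `TableInfo` fields, the `triage` function, and the sample inventory are all hypothetical; a real inventory would have to be assembled from the catalog and from the platform's own compliance records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    name: str
    table_type: str                            # "MANAGED" or "EXTERNAL"
    shared_storage: bool = False               # other platforms read the same path
    customer_managed_keys: bool = False        # catalog cannot host the data
    storage_must_stay_decoupled: bool = False  # explicit lock-in concern


def triage(table: TableInfo) -> str:
    """Classify a table as already managed, a justified external, or a migration candidate."""
    if table.table_type == "MANAGED":
        return "already_managed"
    if table.shared_storage or table.customer_managed_keys or table.storage_must_stay_decoupled:
        return "keep_external"
    # External with no load-bearing storage path: candidate for conversion.
    return "migrate_to_managed"


inventory = [
    TableInfo("main.bronze.events", "MANAGED"),
    TableInfo("main.silver.orders", "EXTERNAL"),
    TableInfo("main.gold.partner_feed", "EXTERNAL", shared_storage=True),
]
plan = {t.name: triage(t) for t in inventory}
```

The sketch deliberately stops at classification; the conversion itself is covered by the migration guide the source points to.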

Trade-offs

Choice | Implication
Default-managed | Substrate owns optimisation; user accepts catalog-coupling for the storage.
Default-external | User owns optimisation; loses Predictive Optimization + automatic clustering + metadata caching.
Managed across all medallion layers | Compounding optimisation wins; uniform governance + lineage.
Managed only at Gold tier | Bronze / Silver pay the manual-optimisation cost; loses some compounding benefit.

Failure modes

  • Vendor coupling concerns. Managed tables tie storage to the catalog (Unity Catalog). For organisations explicitly avoiding vendor lock-in at the storage layer, the trade-off is harder to call. Mitigation: the source argues the Open APIs surface (catalog-managed commits + Delta Kernel) preserves external-engine access despite the managed-table substrate.
  • Migration cost on existing fleets. Converting external to managed requires a one-time data migration. The source mentions a migration guide but the cost on multi-petabyte fleets is non-trivial.
  • Metadata-cache freshness contract. "Always on" metadata caching has a freshness contract under concurrent writes that the source doesn't document — reserved for future ingests.
  • Customer-owned storage path requirements. Some compliance scenarios (regulated data with specific storage-path requirements, customer-managed keys) make external tables necessary; default-managed is wrong for those.

Sibling patterns

  • The Cloudflare-side parallel: Cloudflare Town Lake sits on R2 Data Catalog (managed Iceberg on R2) and has the same architectural shape: substrate-owned optimisation as a default property of the managed-table substrate (recency-tiered recompaction in Cloudflare's case; Predictive Optimization in Databricks' case).
  • concepts/external-engine-write-to-managed-table — the related concept that addresses the vendor-coupling concern by making managed tables externally writeable through Delta Kernel.

Seen in

Last updated · 542 distilled / 1,571 read