
PATTERN Cited by 1 source

Scan planning as policy enforcement point

What it is

The architectural pattern of using the catalog's scan-planning request as the chokepoint at which governance policies are evaluated and a filtered scan plan is returned — instead of relying on each compute engine to apply policies client-side after reading data.

The pattern is the structural answer to "how do you enforce row-level / column-level governance uniformly across many compute engines that all read the same tables?" Engines speak a common scan-planning protocol to the catalog. The catalog is the policy decision point (PDP): it evaluates the policies, applies row filters, applies column masks, prunes columns, and returns a scan plan that already reflects the policy. Engines are policy enforcement points (PEPs) only insofar as they honour the plan they received.

Canonical instance: Unity Catalog cross-engine ABAC (2026-05-28)

The 2026-05-28 announcement of cross-engine ABAC for Unity Catalog (UC), released in Beta, is the canonical wiki instance:

"When an external Iceberg engine requests access, UC evaluates the applicable policies during server-side scan planning. UC then returns a filtered scan plan so the engine only reads authorized data when processing the query."

(Source: sources/2026-05-28-databricks-advancing-apache-iceberg-on-databricks-iceberg-v3-ga-open-sharing-and-unified-governance)

The wire-protocol surface is the Iceberg REST Catalog Scan Planning API (added in Iceberg 1.11). Any engine that implements the scan-planning client gets ABAC enforcement on its queries; engines on Iceberg versions below 1.11 either don't speak the API at all or fall back to a non-policy-aware scan path.

Architectural shape

                   ┌──────────────────────────────────┐
                   │  Catalog server (PDP)            │
                   │                                  │
   plan-scan(...)  │  for each table referenced:      │
   ──────────────► │    1. resolve identity +         │
                   │       request context            │
                   │    2. fetch table tags +         │
                   │       inherited tags             │
                   │    3. evaluate ABAC policies     │
                   │       → row filter predicates    │
                   │       → column mask UDFs         │
                   │       → projection pruning       │
                   │    4. apply to scan plan         │
                   │                                  │
   ◄────────────── │  Returns: filtered scan plan     │
   filtered plan   │   - file list (subset)           │
                   │   - column projections           │
                   │   - residual predicates          │
                   └──────────────────────────────────┘
                              │  engine reads only files
                              │  in the plan; applies
                              │  residual predicates;
                              │  produces result
                       result set (already
                       reflects policy)
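
The four catalog-side steps in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Unity Catalog's implementation or the Iceberg REST API: `DataFile`, `Policy`, `Table`, `ScanPlan`, and `plan_scan` are hypothetical names, policies are reduced to a partition-level row filter plus hidden columns (no mask UDFs), and identity resolution (step 1) is assumed to have already produced the `principal` dict.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataFile:
    path: str
    partition: dict  # partition values, e.g. {"region": "eu"}; may be empty


@dataclass
class Policy:
    """Hypothetical ABAC policy, attached to tables by tag."""
    tag: str
    filter_column: Optional[str] = None   # column the row filter constrains
    principal_attr: Optional[str] = None  # principal attribute listing allowed values
    hidden_columns: frozenset = frozenset()


@dataclass
class Table:
    name: str
    tags: set
    inherited_tags: set
    columns: list
    files: list


@dataclass
class ScanPlan:
    files: list                                    # paths the engine may read
    columns: list                                  # projection after pruning
    residuals: list = field(default_factory=list)  # (column, allowed values) the engine must still apply


def plan_scan(principal: dict, table: Table, requested: list, policies: list) -> ScanPlan:
    # Step 2: effective tags are the table's own tags plus inherited ones.
    tags = table.tags | table.inherited_tags
    files = list(table.files)
    columns = [c for c in requested if c in table.columns]
    residuals = []
    # Step 3: evaluate every policy whose tag applies to this table.
    for policy in policies:
        if policy.tag not in tags:
            continue
        if policy.filter_column is not None:
            col = policy.filter_column
            allowed = set(principal.get(policy.principal_attr, ()))
            # Drop files whose partition value is known to be disallowed.
            files = [f for f in files if col not in f.partition or f.partition[col] in allowed]
            # Files with no partition value on `col` cannot be pruned, so the
            # row filter is also returned as a residual for the engine to apply.
            residuals.append((col, sorted(allowed)))
        columns = [c for c in columns if c not in policy.hidden_columns]
    # Step 4: the plan already reflects the policy.
    return ScanPlan([f.path for f in files], columns, residuals)


table = Table(
    name="sales",
    tags={"pii"},
    inherited_tags={"regional"},
    columns=["id", "region", "email"],
    files=[
        DataFile("eu.parquet", {"region": "eu"}),
        DataFile("us.parquet", {"region": "us"}),
        DataFile("mixed.parquet", {}),
    ],
)
policies = [
    Policy("regional", filter_column="region", principal_attr="regions"),
    Policy("pii", hidden_columns=frozenset({"email"})),
]
plan = plan_scan({"name": "alice", "regions": ["eu"]}, table, ["id", "region", "email"], policies)
print(plan)
```

In this sketch `us.parquet` never reaches the engine and `email` is removed from the projection, so neither depends on engine behaviour. Only `mixed.parquet`, which carries no partition value, relies on the engine applying the residual.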

Why it works

  • One PDP, many PEPs. Authoring policy once in the catalog is the only sustainable shape when N engines (and growing) all read the same tables. The N-engine fan-out doesn't multiply the policy-authoring or policy-consistency burden.
  • No engine-side policy library required. Engines only need to honour a filtered scan plan — a much smaller and more stable contract than implementing the catalog's policy schema.
  • Audit substrate. Every scan-plan request becomes an auditable event recording the principal, table, predicate, and projected columns. The catalog sees the scan intent of every query that goes through scan planning.
  • Policy can use catalog-side state. The policy can reference table tags, classification metadata, inherited tags from parent catalog/schema — all of which the catalog already holds. Engines wouldn't easily have access to this context.
  • Policy evaluation is centralised, so cache-friendly. The catalog can cache policy evaluations per (principal, table, query-fingerprint) to amortise the cost of repeated identical queries.
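
The caching point above can be illustrated with a small sketch. Everything here is hypothetical: `PlanCache` and `query_fingerprint` are illustrative names, the fingerprint is a simple hash of sorted columns plus the predicate text, and the `policy_version` component of the key is an added assumption (not from the source) so that a policy change does not serve stale decisions.

```python
import hashlib
import json


def query_fingerprint(columns, predicate):
    """Stable hash of the scan request shape (hypothetical normalisation)."""
    canonical = json.dumps({"columns": sorted(columns), "predicate": predicate}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


class PlanCache:
    """Caches policy evaluations per (principal, table, query fingerprint)."""

    def __init__(self, evaluate):
        self._evaluate = evaluate  # the expensive policy evaluation
        self._entries = {}
        self.misses = 0

    def get(self, principal, table, columns, predicate, policy_version):
        # policy_version is an assumption added here for invalidation.
        key = (principal, table, query_fingerprint(columns, predicate), policy_version)
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = self._evaluate(principal, table, columns, predicate)
        return self._entries[key]


def evaluate(principal, table, columns, predicate):
    # Stand-in for real policy evaluation.
    return {"principal": principal, "table": table, "columns": sorted(columns)}


cache = PlanCache(evaluate)
first = cache.get("alice", "sales", ["id", "region"], "region = 'eu'", policy_version=1)
again = cache.get("alice", "sales", ["region", "id"], "region = 'eu'", policy_version=1)
other = cache.get("bob", "sales", ["id", "region"], "region = 'eu'", policy_version=1)
print(cache.misses)
```

The second call reuses the first result because column order is normalised away in the fingerprint, while a different principal always triggers a fresh evaluation. A real cache would likely also need to key on the table snapshot and on the principal's attributes, which this sketch omits.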

Constraints / trade-offs

  • Compatibility floor. The pattern requires the catalog and the engine to speak a scan-planning protocol. The Iceberg 1.11 floor in the canonical instance is a hard line: engines below it can't participate.
  • Catalog-side latency on the query path. Server-side scan planning adds a catalog-side compute step before each query. The catalog must be sized for query-rate scale, not just metadata-fetch scale.
  • Catalog-side complexity. The catalog must implement query planning to a useful depth (predicate pushdown, partition pruning, column projection, statistics) — otherwise the scan plan it returns is naive and engines lose optimisation opportunities they previously did client-side.
  • Trust model. Engines must honour the residual parts of the plan (predicates, masks). A misbehaving engine that ignores residuals can read more than authorised. Mitigations: (1) push as much of the policy as possible into the file/column list directly so residual evasion has no effect; (2) audit deltas between scan-plan output and engine-reported reads; (3) restrict storage credentials to the file list the plan returned (credential vending scoped to the planned file list).
  • Some policy types may be hard to express at scan-plan time. Column masks that require stateful evaluation (cross-row computations) or randomised UDFs may not lower cleanly into a scan plan; partial enforcement falls back to the engine.
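
Mitigations (2) and (3) in the trust-model bullet reduce to set comparisons against the planned file list. The sketch below assumes hypothetical helpers `audit_delta` and `may_read` and made-up storage paths; how an engine reports its reads, and how a real storage layer scopes vended credentials, are not specified by the source.

```python
def audit_delta(planned_files, reported_reads):
    """Compare the plan's file list with what the engine says it read."""
    planned = set(planned_files)
    reported = set(reported_reads)
    return {
        "unauthorised": sorted(reported - planned),  # read but never planned
        "unread": sorted(planned - reported),        # planned but not read
    }


def may_read(vended_scope, path):
    """Hypothetical storage-side check: the credential covers planned paths only."""
    return path in vended_scope


planned = ["s3://bucket/sales/eu.parquet", "s3://bucket/sales/mixed.parquet"]
reported = ["s3://bucket/sales/eu.parquet", "s3://bucket/sales/us.parquet"]

delta = audit_delta(planned, reported)
scope = frozenset(planned)
print(delta)
print(may_read(scope, "s3://bucket/sales/us.parquet"))
```

The audit delta only detects an over-read after the fact, whereas the scoped credential check would block it at the storage layer. Neither helps with residual predicates inside an authorised file, which is why mitigation (1) pushes as much policy as possible into the file and column lists.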

Comparison to alternatives

Approach | Where policy lives | Trust model | Engine fan-out cost
---------|--------------------|-------------|--------------------
Policy in the engine | Each engine implements policies | Trust every engine | Linear in engine count
Policy in a sidecar / proxy | A query-rewriter / proxy in front of the engine | Trust the proxy | Per-engine deployment of proxy
Single-engine governance | Only the catalog vendor's compute is allowed | Implicit (one engine) | Doesn't scale to multi-engine
Scan planning as PEP (this pattern) | Catalog PDP + engine honours plan | Trust the catalog + engine residual honour | Constant; engines just speak the API

The pattern's trust model is better than per-engine policies because there's only one PDP, better than single-engine governance because it preserves engine choice, and structurally simpler than per-engine sidecars.

When this is the right shape

  • Multi-engine open lakehouse with a central catalog.
  • Customers / regulators require uniform governance regardless of engine.
  • Engines are willing to update to a recent enough Iceberg version to speak the scan-planning client.

When it isn't

  • Single-engine deployments — overhead of catalog-side scan planning isn't justified.
  • Engines that can't update to the scan-planning client floor.
  • Workloads where the catalog isn't the trust root (catalog-as-PDP isn't viable).
  • Policies that fundamentally require per-row UDF evaluation that can't be expressed at plan time.

Caveats

  • First canonical instance is Beta. Cross-engine ABAC on UC is in Beta as of 2026-05-28. Production reliability characteristics, latency profiles, and failure-mode behaviour are not yet disclosed.
  • Iceberg-specific surface. The pattern as canonicalised here rides on the Iceberg REST scan-planning API. Equivalent patterns for Delta or Hudi would need their own scan-planning protocol additions.
  • Policy expression-level constraints undisclosed. Which subset of UC ABAC policies is fully evaluable at scan-plan time vs partially evaluable is not specified.
  • No latency / throughput numbers. The 2026-05-28 source provides no benchmarks for scan-plan latency overhead.

Seen in
