Skip to content

PATTERN Cited by 1 source

Single VARIANT UDF for multi-type masking

Pattern summary

Write one column-masking UDF that accepts and returns VARIANT, not type-specific signatures, so that a single UDF can mask INT, DOUBLE, DECIMAL, STRUCT, and other column types at once. The ABAC policy that invokes this UDF can apply uniformly to many columns of different types — "cuts down on the number of policies organizations need to maintain" (Source: sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags).

The architectural lever is type erasure via a dynamically-typed wrapper: VARIANT carries a runtime type tag plus a value, so a single function signature can dispatch on the runtime type internally and return a value of the same type without per-type definitions.

What it replaces

-- WITHOUT this pattern: one UDF per column type
CREATE FUNCTION mask_int(v INT) RETURNS INT      AS ...;
CREATE FUNCTION mask_double(v DOUBLE) RETURNS DOUBLE AS ...;
CREATE FUNCTION mask_decimal(v DECIMAL(...)) RETURNS DECIMAL(...) AS ...;
CREATE FUNCTION mask_struct(v STRUCT<...>) RETURNS STRUCT<...> AS ...;
-- … and one ABAC policy per type, each invoking the matching UDF.

-- WITH this pattern: one UDF over VARIANT
CREATE FUNCTION mask_any(v VARIANT) RETURNS VARIANT AS
  CASE
    WHEN typeof(v) = 'INT'    THEN cast(0 AS VARIANT)
    WHEN typeof(v) = 'DOUBLE' THEN cast(0.0 AS VARIANT)
    WHEN typeof(v) = 'DECIMAL' THEN cast(0 AS VARIANT)
    WHEN typeof(v) = 'STRUCT' THEN ...
  END;
-- ONE ABAC policy invokes mask_any() over all matching columns.

The pattern collapses N UDFs and N ABAC policies into 1 UDF and 1 policy. Per-policy ceiling pressure (the GA-disclosed [[systems/unity-catalog-abac|10K-per-metastore / 100-per-catalog policy ceiling]]) eases proportionally.

Why VARIANT

VARIANT is the right type-erasure carrier because:

  • It carries the runtime type alongside the value — the UDF can dispatch on typeof(v) at runtime.
  • The query engine preserves the type through the UDF call, so the masked return value flows back to the surrounding plan as if it were the original column type.
  • It composes with STRUCT columns transparently — nested fields inside a STRUCT<VARIANT> keep their typing.

Other erasure shapes (STRING, BINARY) lose type information; the mask would need to roundtrip through string serialisation and the query plan would have to cast back to the original type, which breaks optimisation.

Operational properties

  • One masking convention per category, not per type. A team that wants to "replace numeric values with zero" writes one UDF; ABAC policies referencing that UDF cover every numeric column tagged with the relevant governed tag — regardless of whether the column is INT, DOUBLE, or DECIMAL.
  • Lower policy count, which matters under the policy-ceiling constraints — see systems/unity-catalog-abac for the 10K/100 limits.
  • Easier evolution: when the masking convention changes, one UDF is updated rather than N type-specific UDFs.
  • Type checking deferred to runtime: the dispatch on typeof(v) happens at query time. The cost is per-row dispatch; the savings come from policy + UDF count reduction.

When it fits

  • Many columns of different types share the same masking semantics (replace with zero, replace with empty, replace with hash, etc.).
  • The masking output is type-stable — the function returns a value of the same type as the input.
  • The query engine has good VARIANT support (Unity Catalog SQL, Databricks SQL, Snowflake VARIANT — but not all SQL dialects).

When it doesn't fit

  • Type-specific masking semantics. "Mask SSN as XXX-XX- + last four" requires string-typed input/output; mixing it with numeric masking via VARIANT obscures the semantics.
  • Cost-sensitive UDF evaluation. The dispatch + VARIANT roundtrip cost matters at very high query volumes; per-type UDFs can be inlined or pre-compiled more aggressively.
  • Legacy SQL dialects without VARIANT. The pattern is not portable to engines that don't carry runtime type tags.

Sibling patterns

  • One mask UDF per data type — what this pattern replaces; the pre-VARIANT shape.
  • Cast-and-roundtrip via STRING — a poor erasure proxy that loses type information.
  • Type-class polymorphism in language UDFs (Scala, Python) — same idea but the dispatch happens in the host language; the UDF catalog still needs separate signatures per type unless the engine compiles a generic catalog entry.

Seen in

Last updated · 542 distilled / 1,571 read