CONCEPT Cited by 1 source

Feature taxonomy alignment¶

Definition¶

Feature taxonomy alignment is the data-engineering activity of ensuring that the same conceptual feature carries the same semantic meaning across source and target domains in a transfer-learning system — most commonly by mapping or unifying catalog category trees, attribute vocabularies, and feature encodings between the source and target catalogs.

Quote (Source: sources/2026-05-04-instacart-empowering-carrot-ads-with-domain-adaptive-learning):

"Common contextual and catalog-level features between the Instacart Marketplace's catalog data and the Carrot Ads Partner's catalog are aligned (e.g. ensuring product category uses the same taxonomy) to ensure the source domain knowledge is transferable."

Alignment is the data-level pre-condition for transfer learning to work. If the same feature name carries different semantic meaning across domains, embeddings trained on the source misfit the target, which drives negative transfer.

Why it matters¶

The math of transfer learning presumes that an input feature "product_category = sports_nutrition" means the same thing on both source and target. If on the source it's a leaf category covering protein powders, while on the target it's a parent category covering protein, hydration, and electrolytes, the embeddings encode different concepts under the same string — silently breaking transfer.

The alignment work is upstream of training and unglamorous but load-bearing. Without it, the rest of the transfer-learning pipeline produces correct-looking but wrong predictions.

Common alignment scenarios¶

Catalog category taxonomies¶

Source and target catalogs likely have different category hierarchies. Alignment options:

Map both to a canonical taxonomy (e.g., GS1 Global Product Classification, Google Product Taxonomy).
Use one domain's taxonomy as the canonical and translate the other's into it.
Project both into a shared embedding space trained on text descriptions, sidestepping the explicit taxonomy.
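The first option above (mapping both catalogs to a canonical taxonomy) can be sketched as a lookup keyed by domain and local category. This is a minimal illustration, not Instacart's implementation: the `CANONICAL_MAP` table, the `canon:` IDs, and the category paths are all hypothetical. The point it shows is that items with no mapping are surfaced for review rather than silently passed through.

```python
from typing import Optional

# Hypothetical mapping table: (domain, local category path) -> canonical ID.
CANONICAL_MAP: dict[tuple[str, str], str] = {
    ("source", "grocery/sports_nutrition"): "canon:protein_powder",
    ("target", "health/sports_nutrition/protein"): "canon:protein_powder",
    ("target", "health/sports_nutrition/hydration"): "canon:hydration",
}


def to_canonical(domain: str, local_category: str) -> Optional[str]:
    """Return the canonical category ID, or None when no mapping exists."""
    return CANONICAL_MAP.get((domain, local_category))


def align_items(domain: str, items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split items into (aligned, unmapped).

    Aligned items get their category replaced by the canonical ID;
    unmapped items are returned unchanged so a human can review them.
    """
    aligned, unmapped = [], []
    for item in items:
        canonical = to_canonical(domain, item["category"])
        if canonical is None:
            unmapped.append(item)
        else:
            aligned.append({**item, "category": canonical})
    return aligned, unmapped
```

Note that the source's `sports_nutrition` leaf and the target's `sports_nutrition/protein` child resolve to the same canonical ID here, while the target's hydration child does not, which is the distinction a string match on "sports_nutrition" would miss.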

Attribute schemas¶

Typical mismatches: the source tracks "size" in cm while the target tracks it in inches; some fields carry implicit units; one side stores free text where the other stores structured values. Alignment requires unit normalisation and value canonicalisation before training.
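A minimal sketch of unit normalisation for a "size" attribute, assuming values arrive either as strings like "10 in" or as bare numbers with an implicit unit. The function name `size_to_cm`, the set of accepted units, and the choice of centimetres as the canonical unit are illustrative assumptions.

```python
import re
from typing import Optional, Union

# Conversion factors to centimetres (the assumed canonical unit).
CM_PER_UNIT = {"cm": 1.0, "mm": 0.1, "in": 2.54, "inch": 2.54, "inches": 2.54}

_SIZE_RE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*([a-zA-Z]+)?\s*$")


def size_to_cm(raw: Union[str, int, float], default_unit: Optional[str] = None) -> float:
    """Convert a raw size value to centimetres.

    `default_unit` covers implicit-unit fields; without it, a bare number
    is rejected rather than guessed at.
    """
    if isinstance(raw, (int, float)):
        value, unit = float(raw), default_unit
    else:
        match = _SIZE_RE.match(raw)
        if match is None:
            raise ValueError(f"unparseable size: {raw!r}")
        value, unit = float(match.group(1)), match.group(2) or default_unit
    if unit is None or unit.lower() not in CM_PER_UNIT:
        raise ValueError(f"unknown or missing unit for size: {raw!r}")
    return value * CM_PER_UNIT[unit.lower()]
```

Raising on a missing unit is a deliberate choice in this sketch: a silently assumed unit is exactly the kind of mismatch that alignment is meant to catch.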

User-segment definitions¶

The source domain may segment users by an in-house RFM model; the target by a different model. Aligning these requires either re-segmenting users on a shared schema or treating segment IDs as untransferable.

Behavioral signal definitions¶

A "click" on the source might mean impression-with-pause; on the target, click-with-redirect. Same name, different meaning. Alignment may require introducing new event types or filtering behaviors to a shared definition.
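Filtering both domains' events down to one shared definition might look like the sketch below. The event shape (a dict with `type` and `redirected` fields) and the rule that a shared click requires a redirect are assumptions made for illustration; real event schemas will differ.

```python
def filter_shared_clicks(events: list[dict]) -> list[dict]:
    """Keep only events that satisfy the shared click definition.

    Assumed shared definition: the event is typed "click" AND it led to a
    redirect. Pause-style "clicks" without a redirect are dropped.
    """
    return [
        event
        for event in events
        if event.get("type") == "click" and event.get("redirected") is True
    ]
```

Applying the same predicate to both domains before training means the label "click" refers to one behavior, at the cost of discarding events that only one domain records.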

Feature-encoding conventions¶

Categorical features may be encoded as IDs, string hashes, or one-hot vectors, and the encoding must be the same in both domains. Source-domain pre-trained embeddings keyed by integer IDs are useless if the target domain assigns those IDs differently.
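The re-keying problem is easy to reproduce. In the sketch below, `build_vocab` (a hypothetical helper) assigns integer IDs in order of first appearance, so two domains that see the same values in a different order end up with incompatible keys. One possible alternative, shown as `stable_bucket`, derives the index from a cryptographic hash of the already-aligned string value, which is the same in every process and every domain. Python's built-in `hash()` is not suitable for this because string hashing is randomised per process by default.

```python
import hashlib


def build_vocab(values: list[str]) -> dict[str, int]:
    """Order-dependent ID assignment: fragile across domains."""
    vocab: dict[str, int] = {}
    for value in values:
        vocab.setdefault(value, len(vocab))
    return vocab


def stable_bucket(value: str, num_buckets: int = 2**20) -> int:
    """Map a string to an embedding-table index that is stable everywhere.

    Uses the first 8 bytes of SHA-256, so the result depends only on the
    string itself. Distinct values can collide in the same bucket.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets
```

Hashing only helps once the string values themselves are aligned; it fixes inconsistent keying, not inconsistent meaning.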

Where it sits in a DAL pipeline¶

[Source domain catalog]      [Target domain catalog]
        │                              │
        ▼                              ▼
        └─────── alignment work ───────┘
                       │
                       ▼
            Aligned feature schemas
                       │
                       ▼
   Pre-train shared embeddings on source
                       │
                       ▼
   Fine-tune target-specific layers on aligned features
                       │
                       ▼
       [DAL](<./domain-adaptive-learning.md>) pipeline runs cleanly

Alignment is one of two gating activities that must succeed for DAL to work. The other is model alignment verification: checking whether the output of the transferred model matches the expected target-domain calibration.
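The source does not say how calibration is verified. One common way to quantify it, offered here only as an assumption, is expected calibration error (ECE) computed on held-out target-domain data: predictions are grouped into equal-width probability bins, and the gap between mean predicted probability and observed positive rate is averaged, weighted by bin size.

```python
def expected_calibration_error(
    probs: list[float], labels: list[int], n_bins: int = 10
) -> float:
    """Weighted mean |predicted probability - observed rate| over bins."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equal length")
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        # p == 1.0 would index past the end, so clamp to the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        observed_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_prob - observed_rate)
    return ece
```

A transferred model whose ECE on the target is much worse than on the source is a candidate for the human review described under Mitigations; the acceptable threshold is a per-system decision.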

What can go wrong¶

Silent semantic drift — the alignment looks superficially correct but a small fraction of categories carry different meaning. Result: subtle, hard-to-detect negative transfer.
Alignment becomes stale — source or target taxonomies evolve independently after deployment. Without re-alignment, the schema slowly diverges.
Long tail of unalignable features — some target features have no source counterpart. These can't be transferred but may be valuable; deciding whether to include them in a fine-tuned target-specific layer or drop them is a design choice.
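The long-tail case starts with knowing which features fall where. A small sketch, assuming each domain's schema is available as a list of feature names (the function name and the example feature names are hypothetical):

```python
def partition_features(
    source_features: list[str], target_features: list[str]
) -> dict[str, list[str]]:
    """Group feature names by which domain(s) they appear in."""
    source, target = set(source_features), set(target_features)
    return {
        "shared": sorted(source & target),       # candidates for transfer
        "target_only": sorted(target - source),  # target-specific layer, or drop
        "source_only": sorted(source - target),  # unused on the target
    }
```

A shared name is only a candidate for transfer: as the silent-drift case above notes, two features with the same name can still carry different meaning, so the `shared` group still needs semantic review.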

Mitigations¶

Human-in-the-loop verification — Instacart Carrot Ads' current production stance: "the complexity of mapping data schemas and verifying model alignment currently requires human-in-the-loop verification to prevent negative transfer."
Automated taxonomy diff — detect when source / target taxonomies diverge over time and trigger re-alignment.
Shared-text-embedding fallback — when explicit taxonomies can't be aligned, embed item text instead. This sidesteps taxonomy mismatches but requires a strong text encoder.
Side-by-side evaluation against from-scratch training on the target — if alignment is broken, the from-scratch baseline may beat the transferred model on the target.
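The automated taxonomy diff above can be sketched by comparing two snapshots of a taxonomy, each represented as a mapping from category name to parent name (with `None` for roots). The representation and the function names are assumptions; the source does not describe a specific mechanism.

```python
from typing import Optional

Taxonomy = dict[str, Optional[str]]  # category -> parent (None for a root)


def diff_taxonomy(old: Taxonomy, new: Taxonomy) -> dict[str, list[str]]:
    """Report categories added, removed, or moved to a different parent."""
    old_names, new_names = set(old), set(new)
    return {
        "added": sorted(new_names - old_names),
        "removed": sorted(old_names - new_names),
        "moved": sorted(
            name for name in old_names & new_names if old[name] != new[name]
        ),
    }


def needs_realignment(old: Taxonomy, new: Taxonomy) -> bool:
    """True when any structural change was detected between snapshots."""
    return any(diff_taxonomy(old, new).values())
```

Run on a schedule against each domain's latest snapshot, a non-empty diff becomes the trigger for re-alignment. A structural diff like this cannot see a category whose name and parent are unchanged but whose contents have shifted, so it complements human verification rather than replacing it.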

Generalisation across the wiki¶

Feature taxonomy alignment shows up in many cross-domain ML contexts:

Multi-tenant ad platforms — Carrot Ads (this article). Each new partner's catalog must be aligned to Instacart Marketplace.
Cross-marketplace ranking — Amazon, Walmart, eBay all run cross-marketplace ML and confront the same alignment problem.
Multilingual / multi-region recsys — same task, different language taxonomies.
Federated learning across institutions — each institution has its own schema; alignment is the upstream blocker.
Industrial entity resolution — see concepts/entity-resolution for the related case where the same physical entity is referenced under different schemas across data sources (analogous problem at the entity level rather than the feature level).

Seen in¶

sources/2026-05-04-instacart-empowering-carrot-ads-with-domain-adaptive-learning: first wiki canonicalisation. Catalog-level features are aligned between the Instacart Marketplace catalog and the partner catalog, with the product-category taxonomy specifically named as the alignment target. Alignment is named as the data-level pre-condition for cross-domain knowledge transfer.