
Pinterest URL Normalizer

What it is

The URL Normalizer is the runtime component in Pinterest's content ingestion pipeline that rewrites incoming merchant URLs into a deduplicated canonical form before the expensive fetch + render + process work runs. It is the consumer of the MIQPS map produced by the offline MIQPS computation job (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication).

Runtime behaviour

"At runtime, the URL processor loads the MIQPS map from the config store at initialization. For each URL it processes, it looks up the query pattern, retrieves the non-neutral parameter set, and strips all parameters not matched by any of the four normalization layers."

The runtime step is a pure in-memory lookup; the expensive rendering and content-ID computation has already happened offline in the MIQPS job.
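The lookup-and-strip flow described above can be sketched roughly as follows. This is a minimal illustration, not Pinterest's implementation: the shape of the pattern key (domain plus sorted parameter names) and the `non_neutral` map standing in for the precomputed MIQPS map are assumptions, and the real system also consults the other normalization layers.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def query_pattern(url: str) -> tuple[str, ...]:
    """Hypothetical pattern key: (domain, sorted parameter names)."""
    parts = urlsplit(url)
    names = sorted({k for k, _ in parse_qsl(parts.query)})
    return (parts.netloc, *names)

def normalize(url: str, non_neutral: dict[tuple, set[str]]) -> str:
    """Strip query parameters not in the looked-up non-neutral set.

    `non_neutral` stands in for the MIQPS map loaded at initialization.
    An unknown pattern falls back to keeping everything (the
    conservative default described below).
    """
    parts = urlsplit(url)
    keep = non_neutral.get(query_pattern(url))
    if keep is None:
        return url  # no MIQPS entry: keep all parameters
    params = [(k, v) for k, v in parse_qsl(parts.query) if k in keep]
    return urlunsplit(parts._replace(query=urlencode(params)))
```

The point of the sketch is the cost profile: per-URL work is a dictionary lookup plus query-string filtering, with no network or render step.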

Four normalization layers

A URL parameter is kept if any layer preserves it, and stripped only if all layers agree it is not needed — an instance of the canonical patterns/multi-layer-normalization-strategy pattern. The layers:

  1. Static platform allowlists — curated lists for well-known e-commerce platforms (Shopify variants, Salesforce Commerce Cloud start / sz / prefn1 / prefv1).
  2. Regex patterns — regexes matching "widely used parameter naming schemes."
  3. MIQPS non-neutral set — the learned long-tail classifier.
  4. Conservative default — when MIQPS has insufficient samples for a (domain, pattern, parameter), the parameter is treated as non-neutral by default.

The OR semantics (keep if any layer votes keep) is load-bearing: the cost of stripping a non-neutral parameter (silently merging distinct items) is catastrophic; the cost of keeping a neutral parameter is just a redundant render. Every layer acts as an independent guardrail.

Initialization + freshness

The runtime loads the MIQPS map from the config store at initialization. Publication cadence (how often the runtime reloads, and whether the map is hot-reloadable) is not disclosed. The offline job runs per-domain, per content-processing cycle, so the config store is updated at that cadence; staleness is acceptable because "URL parameter conventions change infrequently — on the order of weeks or months."

Upstream / downstream

  • Upstream: content ingestion pipeline — produces the URLs being normalized and is also the source of the URL corpus that drives MIQPS.
  • Downstream: the rest of the ingestion pipeline (fetch → render → process → deduplicate). Post-normalization, URLs are routed by content-ID-based dedup downstream; URL normalization is the upstream cheap-dedup layer that prevents redundant render work.

Caveats

  • No latency budget disclosed. The post says runtime lookup is "fast" and "in-memory" but doesn't quantify per-URL overhead.
  • No hot-reload story disclosed. Behaviour on MIQPS publish (reload cadence, consistency, rolling updates across replicas) not documented.
  • No error semantics disclosed. What the normalizer does on an unknown-domain URL, an unparseable query string, or a pattern outside the MIQPS map is not documented.
