Pinterest — Smarter URL Normalization at Scale: How MIQPS Powers Content Deduplication¶
Summary¶
Pinterest Engineering post (Shanhai Liao, Di Ruan, Evan Li — Content Acquisition and Media Platform) documenting MIQPS — Minimal Important Query Param Set, a per-domain algorithm that learns which URL query parameters actually matter for content identity and strips the rest, enabling URL-level deduplication before the expensive content-fetch / render / process pipeline runs. The motivating waste: merchant sites expose the same product page under many URL variants decorated with tracking params (utm_source, session, ref, click_id, tracking) and session tokens; without URL normalisation every variant is independently fetched, rendered, and processed. Curated allowlists work for well-known platforms (Shopify: variants; Salesforce Commerce Cloud: start / sz / prefn1 / prefv1) but can't scale to Pinterest's long tail of merchant domains. The algorithm: group observed URLs by query parameter pattern (the sorted set of param names present), then for each param in each pattern sample up to S URLs with distinct values, render the page with and without the param, compare visual content IDs, and classify the param non-neutral if removal changes the content ID in ≥T% of samples. Output is a MIQPS map (pattern → non-neutral-param set) published to a config store and consumed at runtime. Production design: multi-layer normalisation (static allowlist + regex + MIQPS), anomaly detection before publishing (reject new MIQPS if more than A% of existing patterns had a previously-non-neutral param flip to neutral — the dangerous case — while additions + pattern-disappearances are allowed), early-exit sampling (stop testing a param once mismatch rate already exceeds T% after N successful tests), and conservative defaults (fewer than N samples → treat as non-neutral by default). 
Three-phase architecture: continuous ingestion writes per-domain URL corpus to S3, offline job runs the MIQPS algorithm + anomaly checks + publishes to config store + archives, runtime URL processor loads the map at init and does in-memory lookups. Load-bearing rationale for offline over real-time: each content-ID compute is a full page render (seconds); testing every param on every new URL multiplies this; offline scales with domain count, real-time would scale with URL count (orders of magnitude more expensive); transient render failures are retryable in an offline job, but would directly block content processing in the realtime path. Scale: "hundreds of thousands of domains", "billions of URLs". No quantified deduplication-ratio, cost-savings, or latency numbers disclosed.
Key takeaways¶
- URL normalisation is a cost-control problem, not a metadata problem. The wasted-work framing: "every variation is independently fetched, rendered, and processed. At scale, this redundant ingestion and processing represents a significant waste of computational resources — rendering the same page dozens of times simply because its URLs differ in irrelevant parameters." Deduplication by content identity catches variants eventually, but only after paying the render + process cost; URL-level deduplication catches them before. Canonical concepts/url-normalization framing — the cost is paid upstream, not downstream (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication).
- Static allowlists work for head domains; the long tail needs adaptive learning. "For well-known e-commerce platforms, this can be solved with curated rules. Shopify URLs, for example, use `variants` as the key product differentiator. Salesforce Commerce Cloud uses parameters like `start`, `sz`, `prefn1`, and `prefv1`. For these platforms, static allowlists are sufficient. But Pinterest ingests content from a large number of domains, operating on a wide variety of platforms. For this long tail of domains, URL parameter conventions vary wildly. Static rules cannot scale to cover them all. We need a dynamic, data-driven approach." Canonical patterns/per-domain-adaptive-config-learning motivation: head-curated + long-tail-adaptive is the same shape as Instacart's head-cache + tail-finetuned model, but at the configuration-learning layer rather than the model layer (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication).
- The core insight — parameter importance via removal test. "If removing a query parameter changes the content of a page, that parameter is important; if it doesn't, the parameter is noise and can be safely stripped." The operational definition:
- Neutral parameter: removal doesn't change content (`utm_source`, `session`, `ref`, `click_id`, `tracking`).
- Non-neutral parameter: removal changes content (`id`, `color`). Canonical concepts/neutral-vs-non-neutral-parameter classification (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication).
- Classification is per-domain, per-pattern, per-parameter. "This analysis runs independently per domain — each merchant site gets its own MIQPS map, because the same parameter name can be meaningful on one domain and irrelevant on another." And further: "Moreover, the same parameter name can play different roles depending on its context. Consider the parameter `ref`: on a product page URL like `example.com/product?id=42&ref=homepage`, `ref` is purely a tracking parameter and is neutral — removing it doesn't change the product displayed. But on a comparison page URL like `example.com/compare?ref=99`, the same `ref` parameter identifies which items to compare and is non-neutral. By grouping URLs by their full parameter pattern, the algorithm evaluates each parameter within its specific context, correctly classifying it as neutral in one pattern and non-neutral in another." This is why the MIQPS key is `(domain, query-parameter-pattern)`, not just `(domain, parameter-name)`. Canonical concepts/query-parameter-pattern (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication).
- Algorithm — three steps. (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication)
- Collect per-domain URL corpus. Continuous ingestion records each observed URL to a durable per-domain store.
- Group URLs by query parameter pattern. Each pattern = sorted set of param names present. Analyse top K patterns by URL count ("focusing computational resources on the patterns that matter most").
- For each parameter in each pattern, test. Sample up to S URLs with distinct values for the parameter under test. For each sample, compute content ID (visual-content fingerprint) for the original URL and for the URL with the parameter removed. If removal changes the content ID in ≥T% of samples → non-neutral; else neutral.
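The three steps above can be sketched in Python. The grouping key and removal test follow the post; the values of `S` and `T`, the simplified sampling (the real system picks S URLs with distinct values for the parameter under test), and the `content_id` function are illustrative stand-ins, since the post discloses no tunables. Early-exit and conservative-default handling are omitted here for brevity.

```python
from collections import defaultdict
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def pattern_of(url):
    """A URL's query-parameter pattern: the sorted tuple of param names."""
    return tuple(sorted(k for k, _ in parse_qsl(urlparse(url).query)))

def without_param(url, param):
    """The same URL with one query parameter removed."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunparse(parts._replace(query=urlencode(kept)))

def compute_miqps(urls, content_id, S=20, T=0.5):
    """pattern -> non-neutral param set, via the removal test.
    content_id(url) must satisfy the contract: same content -> same ID."""
    by_pattern = defaultdict(list)
    for url in urls:
        by_pattern[pattern_of(url)].append(url)
    miqps = {}
    for pattern, group in by_pattern.items():
        non_neutral = set()
        for param in pattern:
            samples = group[:S]  # simplification; see lead-in
            mismatches = sum(
                content_id(u) != content_id(without_param(u, param))
                for u in samples)
            if samples and mismatches / len(samples) >= T:
                non_neutral.add(param)
        miqps[pattern] = non_neutral
    return miqps
```

In a toy run where `content_id` depends only on the `id` parameter, `id` is classified non-neutral and `utm_source` neutral, matching the post's examples.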
- Content ID = visual-content fingerprint. "The content ID is a hash of the page's visual representation, meaning two URLs that render the same visible content will produce the same content ID, even if their underlying HTML differs slightly. This particular fingerprinting approach leverages Pinterest's in-house page rendering infrastructure, which is tailored to our content pipeline. The core MIQPS algorithm, however, is agnostic to how the content fingerprint is produced — it only requires a function that returns the same identifier for the same page content." Suggested alternatives for third parties: "DOM tree hashing, HTTP response body checksums, or even simpler heuristics like comparing the `<title>` and Open Graph metadata across URL variants." The key contract is same content → same ID, not how that ID is computed. Canonical concepts/content-id-fingerprint (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication).
- Why not `<link rel="canonical">`? "A natural question is: why not simply use the canonical URL declared in the page's HTML (via the `<link rel='canonical'>` tag) to resolve duplicates? … In practice, however, canonical URLs are unreliable at scale. Many merchant sites omit them entirely, set them incorrectly (e.g., pointing every page to the homepage), or include tracking parameters in the canonical URL itself. Because we cannot assume canonical URLs are present or correct across the long tail of merchant domains, MIQPS uses visual content comparison as a ground-truth signal that works regardless of how well-maintained a site's metadata is." Canonical concepts/canonical-url-unreliability — metadata-declared canonicality cannot be trusted across the long tail; ground truth is derived from the content itself (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication).
- Early-exit + conservative default. "Early exit optimization: If the mismatch rate already exceeds T% after N successful tests, we stop testing that parameter early. This avoids unnecessary page rendering calls for parameters that are clearly non-neutral. Conservative default: When fewer than N sample URLs are available for a parameter, it is treated as non-neutral by default. The system errs on the side of keeping parameters rather than dropping ones that might matter." The asymmetry is load-bearing: dropping a non-neutral parameter silently merges distinct items (bad); keeping a neutral parameter wastes a rendering call (tolerable). Every tuning choice biases toward the tolerable failure mode (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication).
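A hedged sketch of the early-exit and conservative-default logic; the values of `N` and `T` are illustrative stand-ins for Pinterest's undisclosed tunables, and `content_id_pair` is an assumed helper that renders a URL with and without the parameter under test.

```python
def classify_param(sample_urls, content_id_pair, N=5, T=0.5):
    """Removal test for one parameter.
    content_id_pair(url) -> (id_with_param, id_without_param).
    N, T are illustrative; the post leaves both undisclosed."""
    if len(sample_urls) < N:
        # conservative default: too little evidence to justify stripping
        return "non-neutral"
    mismatches = tested = 0
    for url in sample_urls:
        with_id, without_id = content_id_pair(url)
        tested += 1
        if with_id != without_id:
            mismatches += 1
        # early exit: once the mismatch rate already exceeds T after N
        # successful tests, the param is clearly non-neutral; stop paying
        # for further page renders
        if tested >= N and mismatches / tested > T:
            return "non-neutral"
    return "non-neutral" if mismatches / tested >= T else "neutral"
```

Both shortcuts bias toward "non-neutral", i.e. toward the tolerable failure mode of over-keeping parameters.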
- Multi-layer normalisation — MIQPS is the long-tail tier. "MIQPS does not operate in isolation. In production, URL normalization combines static rules with the dynamically computed MIQPS. Static rules capture known conventions — curated allowlists for recognized e-commerce platforms and regex patterns for widely used parameter naming schemes. These rules handle cases where we already have high confidence about which parameters matter. MIQPS complements these static rules by covering the long tail of domains where no predefined rules exist. A URL parameter is kept if it is matched by either the static rules or the MIQPS non-neutral set. Only parameters that pass neither check are stripped." Canonical patterns/multi-layer-normalization-strategy — the OR semantics on keep-decisions is critical: static + MIQPS each have veto rights, which biases the ensemble toward over-keeping (aligned with the asymmetric-cost framing above) (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication).
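The OR semantics on keep-decisions might look like the following sketch; the allowlist and regex contents are invented placeholders, not Pinterest's actual rules.

```python
import re

# Illustrative layer contents only; the real rules are Pinterest-internal.
STATIC_ALLOWLIST = {"shopify.com": {"variants"}}
PARAM_REGEXES = [re.compile(r"^page$"), re.compile(r"^sku")]

def keep_param(domain, pattern, param, miqps):
    """OR semantics: a parameter survives if ANY layer votes keep;
    it is stripped only if every layer votes strip."""
    if param in STATIC_ALLOWLIST.get(domain, set()):
        return True   # layer 1: curated platform allowlist
    if any(rx.match(param) for rx in PARAM_REGEXES):
        return True   # layer 2: regex naming conventions
    if param in miqps.get((domain, pattern), set()):
        return True   # layer 3: learned MIQPS non-neutral set
    return False
```

Because every layer has independent veto rights over stripping, the ensemble over-keeps by construction, consistent with the asymmetric-cost framing.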
- Anomaly detection — three-rule comparison before publish. "Computing MIQPS is inherently dependent on external page rendering. Pages can change, rendering infrastructure can have transient issues, and a domain's URL structure can shift between analysis runs. Without safeguards, a bad MIQPS computation could cause the system to start dropping parameters that are actually important — leading to content deduplication errors and degraded catalog quality." Comparison rules:
- Parameter removed from non-neutral set → anomaly. "This is the dangerous case — it means we would start stripping a parameter that we previously determined was important."
- Parameter added to non-neutral set → not anomaly. "It simply means we discovered a new important parameter, and the worst case is keeping slightly more parameters than necessary."
- Pattern removed entirely → not anomaly. "Patterns can naturally disappear as a domain's URL structure evolves." Threshold: "If more than A% of existing patterns are flagged as anomalous, the entire MIQPS update is rejected and the previous version is retained." Canonical patterns/conservative-anomaly-gated-config-update — every classification of what counts as an anomaly aligns with the cost asymmetry (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication).
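The three comparison rules reduce to a small publish gate, sketched below; the exact denominator for "existing patterns" and the value of `A` are assumptions the post does not pin down.

```python
def anomaly_gate(old_miqps, new_miqps, A=0.1):
    """Reject a new MIQPS map if too many existing patterns had a
    previously non-neutral param flip to neutral (the dangerous case).
    A=0.1 is an illustrative stand-in for the undisclosed threshold."""
    # Patterns that disappeared entirely are not anomalies (rule 3),
    # so only patterns present in both versions are compared.
    existing = [p for p in old_miqps if p in new_miqps]
    anomalous = 0
    for pattern in existing:
        # Rule 1: a param dropped from the non-neutral set is the anomaly.
        # Rule 2: params *added* to the non-neutral set are fine and ignored.
        if old_miqps[pattern] - new_miqps[pattern]:
            anomalous += 1
    if existing and anomalous / len(existing) > A:
        return old_miqps   # reject the update, retain the previous version
    return new_miqps
```

Note that rejection is all-or-nothing: one bad computation cannot partially corrupt the published map.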
- Three-phase architecture — continuous ingest → offline compute → runtime lookup. (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication)
- Content Ingestion: "As URLs are processed from domains, the system writes each unique URL to a per-domain corpus stored in S3. This happens continuously as part of normal content processing."
- MIQPS Computation: "After a content processing cycle completes for a domain, an offline job is triggered. This job downloads the URL corpus, runs the MIQPS algorithm (grouping, sampling, content ID comparison), performs anomaly detection, and publishes the result to both a config store (for runtime consumption) and S3 (for archival and debugging)."
- URL Normalization: "At runtime, the URL processor loads the MIQPS map from the config store at initialization. For each URL it processes, it looks up the query pattern, retrieves the non-neutral parameter set, and strips all parameters not matched by any of the four normalization layers." "This separation of concerns means the expensive content ID comparison happens offline and asynchronously, while runtime URL normalization is a fast, in-memory lookup." Canonical patterns/offline-compute-online-lookup-config instance.
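The runtime phase reduces to string parsing plus a dictionary hit, with no rendering on the hot path. A minimal sketch (function shapes are assumed, not Pinterest's actual interfaces):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def normalize_url(url, miqps, keep_param):
    """Runtime path: pure in-memory lookup.
    miqps maps (domain, pattern) -> non-neutral param set, loaded from the
    config store at init; keep_param applies the full multi-layer decision."""
    parts = urlparse(url)
    params = parse_qsl(parts.query)
    pattern = tuple(sorted(k for k, _ in params))   # the lookup key
    kept = [(k, v) for k, v in params
            if keep_param(parts.netloc, pattern, k, miqps)]
    return urlunparse(parts._replace(query=urlencode(kept)))
```

For example, with a MIQPS entry marking only `id` non-neutral, `http://example.com/p?id=42&utm_source=x` normalises to `http://example.com/p?id=42`.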
- Offline over real-time — three load-bearing reasons. "An alternative design would be to determine parameter importance in realtime — rendering the page with and without each parameter at the moment a URL is first encountered. This would eliminate staleness entirely and provide immediate coverage for newly discovered domains. However, we chose the offline approach for several reasons: Latency — Each content ID computation requires rendering a full page, which takes seconds. Testing every parameter in a URL would multiply this cost, adding unacceptable latency to the content processing pipeline. Cost — Offline analysis scales with the number of domains, while realtime analysis would scale with the number of URLs — orders of magnitude more expensive. Reliability — Transient rendering failures in an offline job are isolated and retryable. In a realtime path, they would directly block content processing." And the acceptability argument: "URL parameter conventions change infrequently — on the order of weeks or months. The small amount of staleness between computation cycles is an acceptable tradeoff for the massive savings in cost, latency, and operational complexity." Canonical concepts/offline-compute-online-lookup framing (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication).
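A back-of-envelope illustration of the domains-vs-URLs scaling argument, with entirely hypothetical numbers (the post discloses none of these):

```python
# Hypothetical scale and tunables, chosen only to illustrate the shape
# of the argument; none of these figures come from the post.
domains, urls = 100_000, 1_000_000_000
K, params_per_pattern, S = 10, 5, 20

# Offline: renders scale with domains x patterns x params x samples.
renders_offline = domains * K * params_per_pattern * S * 2   # with + without

# Realtime: every newly encountered URL pays a render per parameter.
renders_realtime = urls * params_per_pattern * 2

print(f"offline:  {renders_offline:.1e} renders")
print(f"realtime: {renders_realtime:.1e} renders")
```

Even with these modest made-up numbers the realtime design needs ~50× the renders, and the gap widens as URL volume grows faster than domain count.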
Systems introduced¶
- systems/pinterest-miqps — the Minimal Important Query Param Set algorithm + offline job + published MIQPS map. Per-domain, per-query-parameter-pattern classification of each parameter as neutral (safe to strip) or non-neutral (keep). Runs offline against a per-domain URL corpus stored in S3; publishes to a config store + archives to S3; anomaly-gated before publish.
- systems/pinterest-url-normalizer — the runtime URL processor. Loads the MIQPS map at initialisation, processes each incoming URL by looking up its query parameter pattern, retrieving the non-neutral parameter set, and stripping all parameters not preserved by any of the four normalisation layers (static allowlists + regex + MIQPS + conservative defaults). Fast in-memory lookup.
- systems/pinterest-content-ingestion-pipeline — Pinterest's content acquisition pipeline. The system whose cost is being optimised — fetches, renders, and processes content from merchant URLs; without URL normalisation, would render the same product page once per URL variant. Also the upstream producer of the per-domain URL corpus MIQPS consumes.
Concepts extracted¶
- concepts/url-normalization — the class of problem MIQPS solves: collapse many URLs for the same underlying content into one canonical form before expensive downstream work. Upstream of content-ID-based deduplication, which catches duplicates after paying the render cost.
- concepts/content-id-fingerprint — same-content → same-ID function over rendered page output. Pinterest uses a hash of the page's visual representation; the algorithm is agnostic to the specific fingerprinter (DOM hash, body checksum, `<title>` + Open Graph metadata are all valid substitutes).
- concepts/query-parameter-pattern — the sorted set of query-parameter names present in a URL. MIQPS's grouping key alongside the domain, because the same parameter name can play different roles depending on the pattern it sits in (Pinterest's `ref` on a product page vs a compare page).
- concepts/neutral-vs-non-neutral-parameter — the classification Pinterest's removal test assigns to each `(domain, pattern, parameter)` triple. Neutral = safe to strip; non-neutral = must be preserved.
- concepts/canonical-url-unreliability — why `<link rel="canonical">` isn't enough. Metadata-declared canonicality is unreliable across the long tail (omitted / misconfigured / contaminated with tracking params); content-based fingerprinting is the ground-truth alternative.
- concepts/anomaly-gated-config-update — the discipline of comparing newly computed config against previously published config before allowing the update, with explicit rules that bias toward the tolerable failure mode (MIQPS: additions and pattern disappearances are fine; removals from the non-neutral set are the dangerous case).
- concepts/offline-compute-online-lookup — the architectural split where expensive analysis runs asynchronously offline and produces a small configuration artefact that's loaded into runtime memory for fast lookup. Acceptable when the underlying phenomenon changes slowly (Pinterest: URL parameter conventions change on the order of weeks / months).
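The lightweight `<title>` + Open Graph fingerprint substitute mentioned under concepts/content-id-fingerprint could be sketched as follows; the regex-based extraction is a deliberate simplification (a real implementation would use an HTML parser), and collisions are more likely than with full visual rendering since everything outside these fields is ignored.

```python
import hashlib
import re

def lightweight_content_id(html: str) -> str:
    """Same-content -> same-ID over <title> + Open Graph tags only.
    A cheap substitute for full page rendering; differences outside
    these fields are invisible to it."""
    title = re.search(r"<title[^>]*>(.*?)</title>", html, re.S | re.I)
    og = re.findall(
        r'<meta[^>]+property=["\']og:[^"\']+["\'][^>]+content=["\']([^"\']*)["\']',
        html, re.I)
    signal = (title.group(1).strip() if title else "") + "|" + "|".join(sorted(og))
    return hashlib.sha256(signal.encode()).hexdigest()
```

Two pages that differ only in body markup hash identically here, which is exactly the contract MIQPS requires, at the cost of a higher collision risk (the collision caveat below applies even more strongly to fingerprints this coarse).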
Patterns introduced¶
- patterns/per-domain-adaptive-config-learning — hybrid head-curated + long-tail-learned configuration strategy. Static rules for well-known platforms (Shopify `variants`; Salesforce Commerce Cloud `start`/`sz`/`prefn1`/`prefv1`) plus empirical learning for the long tail of domains. Same shape as head-cache / tail-finetuned-model patterns but applied at the config / rules layer.
- patterns/visual-fingerprint-based-parameter-classification — use content rendering as ground truth for "does parameter X matter?" by computing a content ID with and without the parameter. The rendering cost is amortised across many URLs sharing the same pattern on the same domain via offline batch analysis.
- patterns/multi-layer-normalization-strategy — OR semantics on keep-decisions across multiple independent classifiers. Pinterest: static allowlist + regex + MIQPS + conservative default; a parameter is kept if any layer votes keep, stripped only if all layers vote strip. Bias toward over-keeping because the asymmetric-cost math favours the tolerable failure mode.
- patterns/conservative-anomaly-gated-config-update — before publishing a new version of a learned config, compare it against the previous version and reject if more than A% of existing entries have a "dangerous" change. Asymmetric classification: some changes (new entries, disappeared entries) are explicitly allowed; one specific change direction (previously-non-neutral flipping to neutral) is the anomaly.
- patterns/offline-compute-online-lookup-config — three-phase architecture: continuous runtime producers write observations to a durable store (S3); an offline job consumes observations, computes an expensive analytical artefact (MIQPS map), and publishes it to a small fast config store + archives; runtime consumers load the artefact at init and do in-memory lookups. Canonical "hot/cold path separation" for learned configuration.
Operational numbers¶
- Scale (qualitative only): "Pinterest ingests content from a large number of domains"; "with a large number of domains and billions of URLs." Conclusion-section phrasing: "hundreds of thousands of domains." No exact domain count disclosed.
- Algorithm tunables: referred to but not numerically disclosed — K (top patterns per domain to analyse), S (max samples per parameter), T (neutral/non-neutral mismatch-rate threshold as %), N (minimum tests for early-exit and conservative-default), A (anomaly-fraction-rejection threshold as %). Every number left as a tunable.
- Pattern-change cadence: "URL parameter conventions change infrequently — on the order of weeks or months." Load-bearing rationale for offline over realtime.
- Content-ID compute cost: "Each content ID computation requires rendering a full page, which takes seconds." Load-bearing cost driver.
- Storage: per-domain URL corpus + MIQPS archive → S3. Runtime MIQPS → config store (unnamed).
- No quantified wins disclosed: no deduplication ratio, cost saving, latency improvement, catalog-quality metric, or before/after numbers.
Caveats¶
- No numbers. The post is architecture + algorithm + rationale; no numerical outcomes reported. No deduplication-ratio disclosed, no compute-saved %, no render-cost savings, no catalog-quality metric, no false-positive-parameter rate, no before/after comparison. Tier-2 post pitched at the conceptual / algorithmic level, not the production-outcomes level.
- All tunables abstract. K (top patterns), S (sample size), T (mismatch threshold), N (early-exit / conservative-default bound), A (anomaly rejection threshold) are all referred to but not disclosed. Reproducing the system means choosing all five yourself.
- Rendering infra assumed. Pinterest says the algorithm is agnostic to the fingerprinter, but the practical cost model (seconds per render, offline only) assumes rendering infrastructure at the scale needed to fingerprint at sample rate × domains × patterns × parameters. Third parties adopting MIQPS with lightweight fingerprinters (DOM hash / `<title>` + Open Graph) will land in a different cost-latency regime.
- Pattern-count scalability unstated. "The top K patterns by URL count are selected for analysis, focusing computational resources on the patterns that matter most." Domains with more than K distinct patterns have unanalysed patterns that default to conservative (keep all params). The fraction of URLs covered vs uncovered is not disclosed.
- Drift-handling through staleness. If a domain changes its URL structure between analysis runs, the MIQPS map is stale. "URL parameter conventions change infrequently" assertion supports this, but no SLO on freshness / no canary metrics / no alerting story for drift-caused deduplication errors disclosed.
- Content ID collision risk unaddressed. If two different product pages accidentally render to the same content ID (e.g. because variants differ only in a small visual region the fingerprint doesn't emphasise), the classifier will incorrectly flag the differentiating parameter as neutral. Pinterest doesn't discuss collision characterisation.
- Shopify + Salesforce Commerce Cloud named but not audited. Static allowlist examples named; no discussion of how static rules are maintained, how new platforms enter the allowlist, or whether MIQPS itself is used to validate the static rules.
- Offline cycle cadence unstated. "After a content processing cycle completes for a domain, an offline job is triggered." Per-domain trigger implied, but cadence (hourly? daily? per-ingest-burst?) not disclosed.
- No MIQPS failure modes from production. The anomaly detection implies they've seen bad MIQPS computations in the wild, but the post doesn't walk through a real example (what went wrong, was anomaly detection triggered, what did they do).
- Author list = Content Acquisition and Media Platform. Shanhai Liao, Di Ruan, Evan Li. Sibling to Pinterest's broader content-acquisition infrastructure; no cross-references to other Pinterest content-ingestion posts on the wiki corpus.
Source¶
- Original: https://medium.com/pinterest-engineering/smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication-at-pinterest-4aa42e807d7d?source=rss----4c5a5f6279b6---4
- Raw markdown:
raw/pinterest/2026-04-20-smarter-url-normalization-at-scale-how-miqps-powers-content-9671bd07.md
Related¶
- Pinterest wiki corpus: companies/pinterest
- Upstream ingestion pipeline this optimises: systems/pinterest-content-ingestion-pipeline
- Sibling head-curated + long-tail-learned patterns: patterns/head-cache-plus-tail-finetuned-model (Instacart — different layer, same shape); patterns/per-domain-adaptive-config-learning (this post)
- Storage substrate: systems/aws-s3