Pinterest MIQPS — Minimal Important Query Param Set¶
What it is¶
MIQPS (Minimal Important Query Param Set) is Pinterest's per-domain
algorithm + offline job + published configuration artefact for learning
which URL query parameters actually matter for content identity on each
merchant domain. Its output — the MIQPS map — maps (domain,
query-parameter-pattern) to the set of non-neutral parameters that must
be preserved during URL normalisation; every other parameter is safe to
strip.
The system exists because Pinterest ingests content from a long tail of
merchant domains whose URL parameter conventions vary wildly. Static
allowlists (Shopify variants, Salesforce Commerce Cloud start / sz
/ prefn1 / prefv1) handle the head but "cannot scale to cover" the
long tail. MIQPS is the learned-configuration layer that complements the
static rules (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication).
Algorithm¶
Three steps per domain:
- Collect URL corpus. As the content ingestion pipeline processes URLs, each unique URL is written to a per-domain corpus on S3.
- Group by query parameter pattern. Pattern = the sorted set of parameter names present. Select the top K patterns by URL count, "focusing computational resources on the patterns that matter most."
- For each parameter, test. Sample up to S URLs with distinct
values for the parameter. For each sample, compute the content ID
of the original URL and of the URL with the parameter removed. If
removal changes the content ID in ≥T% of samples → classify
non-neutral; else neutral. The classification key is
(domain, pattern, parameter), because the same parameter name can play different roles across patterns on the same domain (Pinterest's example: ref is neutral on a product page but non-neutral on a comparison page).
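The pattern key from step 2 can be sketched with Python's standard URL parsing (the URLs below are hypothetical; only "sorted set of parameter names" comes from the source):

```python
from urllib.parse import parse_qsl, urlsplit

def query_param_pattern(url: str) -> tuple[str, ...]:
    """Pattern = the sorted set of query parameter names present in the URL."""
    names = {name for name, _ in parse_qsl(urlsplit(url).query)}
    return tuple(sorted(names))

# Two URLs with the same parameter names share a pattern,
# regardless of values or ordering:
a = query_param_pattern("https://shop.example/p/1?color=red&size=m")
b = query_param_pattern("https://shop.example/p/2?size=l&color=blue")
c = query_param_pattern("https://shop.example/p/3?ref=home")
# a == b == ("color", "size"); c == ("ref",)
```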
Two practical optimisations:
- Early exit: if mismatch rate already exceeds T% after N successful tests, stop — the parameter is clearly non-neutral.
- Conservative default: if fewer than N samples are available for a parameter, treat as non-neutral. "The system errs on the side of keeping parameters rather than dropping ones that might matter."
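A minimal sketch of the per-parameter test with both optimisations. `content_id` stands in for Pinterest's render-based content fingerprint (a hypothetical callable here), `remove_param` is a helper introduced for illustration, and the S/T/N defaults are placeholders — Pinterest does not disclose its values:

```python
import random
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def remove_param(url: str, param: str) -> str:
    """Rebuild the URL without the given query parameter (hypothetical helper)."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunsplit(parts._replace(query=urlencode(kept)))

def classify_parameter(urls, param, content_id, S=50, T=0.5, N=10):
    """Classify one (domain, pattern, parameter) as 'neutral' or 'non-neutral'."""
    sample = random.sample(urls, min(S, len(urls)))
    if len(sample) < N:
        return "non-neutral"            # conservative default: too few samples
    mismatches = 0
    for tested, url in enumerate(sample, start=1):
        # Does removing the parameter change the content fingerprint?
        if content_id(url) != content_id(remove_param(url, param)):
            mismatches += 1
        if tested >= N and mismatches / tested > T:
            return "non-neutral"        # early exit: already clearly non-neutral
    return "non-neutral" if mismatches / len(sample) > T else "neutral"
```

In production the expensive part is `content_id` (a full page render per call), which is exactly why this loop runs offline.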
Anomaly detection¶
Before publishing a new MIQPS, compare against the previously-published version. Three rules:
- Parameter removed from non-neutral set → anomaly (dangerous — would start stripping a parameter previously judged important).
- Parameter added to non-neutral set → not anomaly (we just discovered a new important parameter; worst case we keep slightly more than necessary).
- Pattern removed entirely → not anomaly (domains legitimately evolve).
Threshold: if more than A% of existing patterns are flagged as anomalous, reject the new MIQPS and retain the previous version.
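The three rules plus the threshold gate can be sketched as follows, assuming a per-domain MIQPS shaped as pattern → set of non-neutral parameter names (the shape and the value of A are assumptions; the rule asymmetry is from the source):

```python
def is_anomalous_update(old: dict, new: dict, A: float = 0.2) -> bool:
    """Gate a freshly computed MIQPS against the previously published one.

    old/new: pattern -> set of non-neutral parameter names for one domain.
    A: anomaly-fraction threshold (undisclosed; placeholder value).
    """
    if not old:
        return False                    # nothing to regress against
    flagged = 0
    for pattern, old_params in old.items():
        if pattern not in new:
            continue                    # pattern removed entirely: not an anomaly
        if old_params - new[pattern]:
            flagged += 1                # non-neutral param dropped: anomaly
        # params added to new[pattern] are ignored: not an anomaly
    return flagged / len(old) > A

# Publish flow: if is_anomalous_update(prev, candidate) is True,
# reject the candidate and retain the previous MIQPS.
```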
This is canonical patterns/conservative-anomaly-gated-config-update — the rule asymmetry mirrors the cost asymmetry of the underlying problem (silently dropping an important parameter corrupts the catalog; keeping an unimportant parameter just wastes a render).
Deployment topology¶
Three phases (canonical patterns/offline-compute-online-lookup-config):
- Content Ingestion (continuous) — the Pinterest content pipeline writes each observed URL to the per-domain corpus on S3 as a side effect of normal processing. No extra runtime overhead beyond the append.
- MIQPS Computation (offline, per-domain-cycle-triggered) — the offline job downloads the URL corpus, runs the algorithm (grouping + sampling + content-ID comparison), runs anomaly detection against the previous version, publishes the result to the config store (for runtime consumption) + S3 (archival + debugging).
- URL Normalization (runtime) — the URL processor loads the MIQPS map from the config store at init and performs in-memory lookups for every incoming URL. Lookup is fast; the expensive rendering + content-ID work has already happened offline.
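The runtime phase reduces to a dictionary lookup. A sketch, assuming a (domain, pattern) → non-neutral-set map shape (the map contents below are hypothetical):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Loaded from the config store at init; shape is an assumption:
# (domain, pattern) -> set of non-neutral parameters to preserve.
MIQPS = {
    ("shop.example", ("color", "ref", "size")): {"color", "size"},
}

def normalize(url: str) -> str:
    """Strip every parameter outside the non-neutral set; pure in-memory lookup."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query)
    pattern = tuple(sorted({k for k, _ in params}))
    keep = MIQPS.get((parts.hostname, pattern))
    if keep is None:
        return url                      # unknown pattern: leave the URL unchanged
    kept = [(k, v) for k, v in params if k in keep]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

All the expensive rendering work is baked into `MIQPS` offline; the hot path never computes a content ID.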
Why offline, not realtime¶
"Each content ID computation requires rendering a full page, which takes seconds. Testing every parameter in a URL would multiply this cost, adding unacceptable latency … Offline analysis scales with the number of domains, while realtime analysis would scale with the number of URLs — orders of magnitude more expensive … Transient rendering failures in an offline job are isolated and retryable. In a realtime path, they would directly block content processing."
The staleness argument: "URL parameter conventions change infrequently — on the order of weeks or months. The small amount of staleness between computation cycles is an acceptable tradeoff for the massive savings in cost, latency, and operational complexity." Canonical concepts/offline-compute-online-lookup framing (Source: sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication).
Tunables¶
Pinterest names but does not disclose values for:
| Knob | Meaning |
|---|---|
| K | Top patterns per domain to analyse (by URL count) |
| S | Max URL samples per parameter |
| T | Mismatch-rate threshold (%) for classifying non-neutral |
| N | Minimum tests for early-exit / conservative-default floor |
| A | Anomaly-fraction (%) above which new MIQPS is rejected |
Scale¶
Qualitative only: "Pinterest ingests content from a large number of domains"; "hundreds of thousands of domains"; "billions of URLs." No exact domain count, deduplication ratio, cost saving, or latency improvement is disclosed in the post.
Load-bearing architectural choices¶
- Per-domain, per-pattern classification (not per-parameter-name globally) — because the same parameter name plays different roles on different domains and in different patterns.
- Visual-content fingerprinting as ground truth — sidesteps the unreliable <link rel="canonical"> approach used by head-focused systems.
- Multi-layer OR semantics (see patterns/multi-layer-normalization-strategy) — a parameter is kept if any layer votes keep. The bias toward over-keeping matches the cost asymmetry.
- Anomaly-gated publish — the system never regresses by accident via a bad MIQPS compute; at worst the old config is retained.
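The multi-layer OR semantics admit a one-line sketch (the static allowlist contents are hypothetical; the OR rule is from the source):

```python
# Hypothetical static allowlist layer (head domains: Shopify, SFCC, etc.).
STATIC_KEEP = {"variant", "start", "sz", "prefn1", "prefv1"}

def keep_param(name: str, miqps_nonneutral: set[str]) -> bool:
    """A parameter survives normalization if ANY layer votes keep."""
    return name in STATIC_KEEP or name in miqps_nonneutral
```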
Seen in¶
- sources/2026-04-20-pinterest-smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication — introduction + algorithm + architecture + production rationale.
Related¶
- Runtime consumer: systems/pinterest-url-normalizer
- Upstream producer + downstream consumer: systems/pinterest-content-ingestion-pipeline
- Archival + corpus substrate: systems/aws-s3
- Concepts: concepts/url-normalization, concepts/query-parameter-pattern, concepts/neutral-vs-non-neutral-parameter, concepts/content-id-fingerprint, concepts/canonical-url-unreliability, concepts/anomaly-gated-config-update, concepts/offline-compute-online-lookup
- Patterns: patterns/per-domain-adaptive-config-learning, patterns/visual-fingerprint-based-parameter-classification, patterns/multi-layer-normalization-strategy, patterns/conservative-anomaly-gated-config-update, patterns/offline-compute-online-lookup-config
- Company: companies/pinterest