Use case — Duplicates

Find the duplicates hiding in
your catalog before a shopper does.

Near-duplicate products dilute your catalog signal for AI shopping agents — and they are invisible to manual review once your store passes 200 SKUs.

Install Legible

The problem

Near-duplicate products enter Shopify catalogs in three predictable ways. First, imports: a supplier sends a CSV with 600 products, you upload it, then the supplier updates the CSV and you upload again — now you have two versions of the same item, subtly different titles, possibly different prices. Second, bulk operations: a developer migrates products from a legacy platform and the mapping produces duplicates for color variants that should have been Shopify variant options instead of separate products. Third, theme variants leaking: some storefront themes, particularly headless setups with custom PDPs, create ghost products as artifacts of build processes.

The result is a catalog that looks cluttered from the outside — and AI shopping agents penalize cluttered catalogs. When an agent sees three products that appear to describe the same item, it does one of two things: it surfaces all three to the shopper (creating cart confusion) or it surfaces none of them (treating them as unreliable data). Neither outcome is good for conversion.

Why agents penalize duplicates

AI shopping agents reason about catalog quality before they reason about individual products. A store with 40 near-identical SKUs from a botched import signals low data quality. The agent's ranking logic depresses the entire store's products, not just the duplicates themselves. This is not a hypothetical — it mirrors how search engine crawlers have penalized duplicate content for years. The mechanism is different, but the outcome is the same.

There is also a cart confusion problem. A shopper asking for "the sand linen shirt" gets back three results: "Linen Shirt — Sand," "Washed Linen Shirt (Sand)," and "Summer Linen Shirt — Natural." They look the same. They may be the same. The agent cannot confidently recommend any of them, so it picks a competitor with a clean canonical listing instead.

How Legible finds them

Legible embeds every product in your catalog using Voyage AI's voyage-3-lite model — a dense vector embedding that captures semantic meaning, not just keyword overlap. Two products with titles "Linen Shirt Sand" and "Sand Linen Shirt — Washed" will score near-identical cosine similarity even though they share no exact substring. That is the point: the engine finds meaning-level duplicates, not just string matches.

Products are then clustered by similarity score. Any cluster above the 0.88 threshold appears in your duplicates report. Each cluster shows you:

Every product in the group with its SKU and last updated date
A recommended action per product — keep the canonical, merge the near-identical, archive the variant nobody is buying
A confidence score for the cluster (0.88 to 1.0)
Deep-links into Shopify admin so your team can action the merge or differentiation directly

Sample output

Linen Shirt — Sand

SKU-1042 · updated 3 days ago

keep

Washed Linen Shirt (Sand)

SKU-2210 · updated 6 months ago

merge

Summer Linen Shirt — Natural

SKU-2844 · updated 11 months ago

Run your first duplicate scan.

14-day free trial. You approve every write. No developer required.