You pull a list of customers to send an email campaign.
2,847 contacts. But when you look closer, there are three "John Smith" entries with slight email variations.
You send anyway. John gets three copies. He unsubscribes from all of them.
Your data looked complete. It was actually polluted.
Duplicates compound through every downstream process that touches your data.
Deduplication is the process of identifying records that represent the same real-world entity and consolidating them into one. Not obvious copies where every field matches, but functional duplicates: 'John Smith' and 'J. Smith' at the same address. 'Acme Corp' and 'ACME Corporation' with the same phone number.
The challenge isn't deletion. It's detection. Two records might share an email but have different names. Same name, different phone numbers. Which fields matter? How similar is 'similar enough'? These aren't technical questions. They're business decisions about what makes something the same.
Skip this step and every report inflates counts, every email sends multiple times, and every AI system trains on noise. Get it right and you have a clean foundation everything else can trust.
Deduplication solves a universal problem: how do you find things that are functionally identical even when they look different on the surface?
Define what "same" means (matching rules). Compare records against those rules. When matches are found, decide which record wins (survivor selection) and what happens to the rest (merge or delete).
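Here's a minimal sketch of that loop in Python. The records, field names, and the email-based matching rule are illustrative assumptions, not a prescribed schema:

```python
from datetime import date

# Hypothetical customer records with a deliberately simple schema.
records = [
    {"id": 1, "name": "John Smith", "email": "john@acme.com", "updated": date(2024, 1, 5)},
    {"id": 2, "name": "J. Smith",   "email": "John@Acme.com", "updated": date(2024, 6, 2)},
    {"id": 3, "name": "Jane Doe",   "email": "jane@acme.com", "updated": date(2024, 3, 9)},
]

def is_match(a, b):
    """Matching rule: here, 'same' means the same normalized email."""
    return a["email"].lower() == b["email"].lower()

def pick_survivor(group):
    """Survivor selection: the most recently updated record wins."""
    return max(group, key=lambda r: r["updated"])

# Compare each record against the rule and cluster the matches.
groups = []
for record in records:
    for group in groups:
        if is_match(record, group[0]):
            group.append(record)
            break
    else:
        groups.append([record])

survivors = [pick_survivor(g) for g in groups]
print([r["id"] for r in survivors])  # [2, 3]
```

The pairwise loop is O(n²); production systems block records into candidate groups first, but the shape of the decision (rule, compare, survive, merge) stays the same.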
The key decision is the similarity threshold. A loose threshold catches more duplicates but may merge unrelated records; a strict threshold produces fewer false positives but misses subtle duplicates.
Exact matching: same values, same record
Compare specific fields byte-for-byte. If email addresses match exactly, it's a duplicate. Fast and certain, but misses 'john@acme.com' vs 'john@acme.co' or slight typos.
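As a sketch, exact matching reduces to keying a dictionary on the normalized field. The `dedupe_exact` function and its keep-first policy are illustrative, not a library API:

```python
def dedupe_exact(records, key="email"):
    """Keep the first record seen for each normalized key value."""
    seen = {}
    for record in records:
        # Exact matching is byte-for-byte, so normalize first:
        # 'John@Acme.com' and 'john@acme.com' only collide after lowercasing.
        value = record[key].strip().lower()
        seen.setdefault(value, record)
    return list(seen.values())
```

Note that no amount of normalization rescues 'john@acme.com' vs 'john@acme.co': that typo is invisible to exact matching.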
Fuzzy matching: close enough counts
Use similarity algorithms (Levenshtein, Jaro-Winkler, phonetic matching) to score how alike two values are. 'Jon Smith' and 'John Smith' score 90% similar. You set the threshold.
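To make the scoring concrete, here is a from-scratch Levenshtein similarity; the 0.85 cutoff is an arbitrary assumption for illustration, not a recommended value:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize distance into a 0.0-1.0 score (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

THRESHOLD = 0.85  # assumed cutoff: lower catches more, risks false merges
print(similarity("Jon Smith", "John Smith"))               # 0.9
print(similarity("Jon Smith", "John Smith") >= THRESHOLD)  # True
```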
Rule-based matching: business logic decides
Combine multiple conditions: 'Same phone number OR (same name AND same city).' Weight different fields by importance: a name match plus an address match might count for more than a name match alone.
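One way to encode that exact rule as weights. The numbers here are assumptions, tuned so that a phone match alone, or name plus city together, clears the bar:

```python
# Assumed weights: phone alone is decisive; name and city only count together.
WEIGHTS = {"phone": 1.0, "name": 0.5, "city": 0.5}
MATCH_THRESHOLD = 1.0

def rule_score(a: dict, b: dict) -> float:
    """Sum the weights of every field whose normalized values match."""
    return sum(
        weight for field, weight in WEIGHTS.items()
        if a.get(field, "").strip().lower() == b.get(field, "").strip().lower()
    )

a = {"name": "Acme Corp",        "city": "Boston", "phone": "555-0100"}
b = {"name": "ACME Corporation", "city": "Boston", "phone": "555-0100"}
print(rule_score(a, b))                     # 1.5: phone and city match
print(rule_score(a, b) >= MATCH_THRESHOLD)  # True: treat as a duplicate
```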
Marketing needs a headcount for campaign budgeting. Sales says 2,847 contacts. But 'John Smith', 'J. Smith', and 'John A. Smith' are all the same person. This flow cleans the list and gives you the real number: 2,312 unique customers.
You merge two records. A month later, someone asks what happened to customer #4892. You have no idea. It might have been merged into #4891, or deleted, or maybe it was the survivor. Without logs, you're guessing.
Instead: Log every merge, recording which records were involved, which survived, what data was combined, and when and why.
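One lightweight way to do this is an append-only audit log. The JSONL format, entry schema, and `log_merge` helper here are illustrative assumptions:

```python
import json
from datetime import datetime, timezone

def log_merge(log_path, survivor_id, merged_ids, fields_taken, reason):
    """Append one audit entry per merge: who won, who was absorbed, and why."""
    entry = {
        "survivor": survivor_id,
        "merged": merged_ids,
        "fields_taken": fields_taken,  # e.g. {"phone": 4892}: value came from #4892
        "reason": reason,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

# A month later, "what happened to #4892?" is a grep away:
log_merge("merges.jsonl", 4891, [4892], {"phone": 4892}, "exact email match")
```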
You cleaned your customer database six months ago. Since then, 500 new leads came in. Nobody ran the deduplication rules on them. You're back to sending duplicate emails.
Instead: Run deduplication on ingest (new records) and periodically on the full dataset.
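A sketch of the ingest-time half, with a hypothetical email-based rule and an in-memory list standing in for the real datastore:

```python
def is_duplicate(a: dict, b: dict) -> bool:
    """Assumed matching rule: identical normalized email addresses."""
    return a["email"].strip().lower() == b["email"].strip().lower()

def ingest(new_record: dict, existing_records: list) -> dict:
    """Run the matching rules before a new record is saved."""
    for record in existing_records:
        if is_duplicate(new_record, record):
            return record  # duplicate: merge into this record instead of inserting
    existing_records.append(new_record)
    return new_record
```

The periodic full-dataset pass still matters: rules evolve, and records that were distinct at ingest can converge as they're updated.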
Two customer records exist. You pick one as the winner and hard-delete the other. A week later, an old order reference points to the deleted ID. Now you have orphaned data.
Instead: Soft-delete or redirect. Keep the losing record with a pointer to the survivor.
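Sketched with a hypothetical `merged_into` pointer, so old references resolve instead of dangling:

```python
# Losing records are kept, marked with a pointer to their survivor.
customers = {
    4891: {"name": "John Smith", "merged_into": None},
    4892: {"name": "J. Smith",   "merged_into": 4891},  # soft-deleted
}

def resolve(customer_id: int) -> int:
    """Follow merge pointers until we reach a live record."""
    while customers[customer_id]["merged_into"] is not None:
        customer_id = customers[customer_id]["merged_into"]
    return customer_id

print(resolve(4892))  # 4891: the old order reference still works
```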
You've learned how to find and consolidate duplicate records. The natural next step is understanding how to recognize when different records represent the same real-world entity across systems.