You pull a list of customers to send an email campaign.
2,847 contacts. But when you look closer, there are three "John Smith" entries with slight email variations.
You send anyway. John gets three copies. He unsubscribes from all of them.
Your data looked complete. It was actually polluted.
Duplicates compound through every downstream process that touches your data.
Deduplication is the process of identifying records that represent the same real-world entity and consolidating them into one. Not obvious copies where every field matches, but functional duplicates: 'John Smith' and 'J. Smith' at the same address. 'Acme Corp' and 'ACME Corporation' with the same phone number.
The challenge isn't deletion. It's detection. Two records might share an email but have different names. Same name, different phone numbers. Which fields matter? How similar is 'similar enough'? These aren't technical questions. They're business decisions about what makes something the same.
Skip this step and every report inflates counts, every email sends multiple times, and every AI system trains on noise. Get it right and you have a clean foundation everything else can trust.
Deduplication solves a universal problem: how do you find things that are functionally identical even when they look different on the surface?
Define what "same" means (matching rules). Compare records against those rules. When matches are found, decide which record wins (survivor selection) and what happens to the rest (merge or delete).
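Here's a minimal sketch of that loop in Python. The records, field names, and the email-based matching rule are illustrative assumptions, not a prescribed schema:

```python
from datetime import date

# Hypothetical customer records with a deliberately simple schema.
records = [
    {"id": 1, "name": "John Smith", "email": "john@acme.com", "updated": date(2024, 1, 5)},
    {"id": 2, "name": "J. Smith",   "email": "John@Acme.com", "updated": date(2024, 6, 2)},
    {"id": 3, "name": "Jane Doe",   "email": "jane@acme.com", "updated": date(2024, 3, 9)},
]

def is_match(a, b):
    """Matching rule: here, 'same' means the same normalized email."""
    return a["email"].lower() == b["email"].lower()

def pick_survivor(group):
    """Survivor selection: the most recently updated record wins."""
    return max(group, key=lambda r: r["updated"])

# Compare each record against the rule and cluster the matches.
groups = []
for record in records:
    for group in groups:
        if is_match(record, group[0]):
            group.append(record)
            break
    else:
        groups.append([record])

survivors = [pick_survivor(g) for g in groups]
print([r["id"] for r in survivors])  # [2, 3]
```

The pairwise loop is O(n²); production systems block records into candidate groups first, but the shape of the decision (rule, compare, survive, merge) stays the same.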
The key decision is the similarity threshold. A loose threshold catches more duplicates but may merge unrelated records; a strict threshold produces fewer false positives but misses subtle duplicates.
Exact matching: same values, same record
Compare specific fields byte-for-byte. If email addresses match exactly, it's a duplicate. Fast and certain, but misses 'john@acme.com' vs 'john@acme.co' or slight typos.
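As a sketch, exact matching reduces to keying a dictionary on the normalized field. The `dedupe_exact` function and its keep-first policy are illustrative, not a library API:

```python
def dedupe_exact(records, key="email"):
    """Keep the first record seen for each normalized key value."""
    seen = {}
    for record in records:
        # Exact matching is byte-for-byte, so normalize first:
        # 'John@Acme.com' and 'john@acme.com' only collide after lowercasing.
        value = record[key].strip().lower()
        seen.setdefault(value, record)
    return list(seen.values())
```

Note that no amount of normalization rescues 'john@acme.com' vs 'john@acme.co': that typo is invisible to exact matching.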
Fuzzy matching: close enough counts
Use similarity algorithms (Levenshtein, Jaro-Winkler, phonetic matching) to score how alike two values are. 'Jon Smith' and 'John Smith' score 90% similar. You set the threshold.
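To make the scoring concrete, here is a from-scratch Levenshtein similarity; the 0.85 cutoff is an arbitrary assumption for illustration, not a recommended value:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize distance into a 0.0-1.0 score (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

THRESHOLD = 0.85  # assumed cutoff: lower catches more, risks false merges
print(similarity("Jon Smith", "John Smith"))               # 0.9
print(similarity("Jon Smith", "John Smith") >= THRESHOLD)  # True
```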
Rule-based matching: business logic decides
Combine multiple conditions: 'Same phone number OR (same name AND same city).' Weight different fields by importance: a name match plus an address match might count for more than a name match alone.
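One way to encode that exact rule as weights. The numbers here are assumptions, tuned so that a phone match alone, or name plus city together, clears the bar:

```python
# Assumed weights: phone alone is decisive; name and city only count together.
WEIGHTS = {"phone": 1.0, "name": 0.5, "city": 0.5}
MATCH_THRESHOLD = 1.0

def rule_score(a: dict, b: dict) -> float:
    """Sum the weights of every field whose normalized values match."""
    return sum(
        weight for field, weight in WEIGHTS.items()
        if a.get(field, "").strip().lower() == b.get(field, "").strip().lower()
    )

a = {"name": "Acme Corp",        "city": "Boston", "phone": "555-0100"}
b = {"name": "ACME Corporation", "city": "Boston", "phone": "555-0100"}
print(rule_score(a, b))                     # 1.5: phone and city match
print(rule_score(a, b) >= MATCH_THRESHOLD)  # True: treat as a duplicate
```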
Marketing needs a headcount for campaign budgeting. Sales says 2,847 contacts. But 'John Smith', 'J. Smith', and 'John A. Smith' are all the same person. This flow cleans the list and gives you the real number: 2,312 unique customers.
You merge two records. A month later, someone asks what happened to customer #4892. You have no idea. It might have been merged into #4891, or deleted, or maybe it was the survivor. Without logs, you're guessing.
Instead: Log every merge, recording which records were involved, which survived, what data was combined, and when and why.
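One lightweight way to do this is an append-only audit log. The JSONL format, entry schema, and `log_merge` helper here are illustrative assumptions:

```python
import json
from datetime import datetime, timezone

def log_merge(log_path, survivor_id, merged_ids, fields_taken, reason):
    """Append one audit entry per merge: who won, who was absorbed, and why."""
    entry = {
        "survivor": survivor_id,
        "merged": merged_ids,
        "fields_taken": fields_taken,  # e.g. {"phone": 4892}: value came from #4892
        "reason": reason,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

# A month later, "what happened to #4892?" is a grep away:
log_merge("merges.jsonl", 4891, [4892], {"phone": 4892}, "exact email match")
```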
You cleaned your customer database six months ago. Since then, 500 new leads came in. Nobody ran the deduplication rules on them. You're back to sending duplicate emails.
Instead: Run deduplication on ingest (new records) and periodically on the full dataset.
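A sketch of the ingest-time half, with a hypothetical email-based rule and an in-memory list standing in for the real datastore:

```python
def is_duplicate(a: dict, b: dict) -> bool:
    """Assumed matching rule: identical normalized email addresses."""
    return a["email"].strip().lower() == b["email"].strip().lower()

def ingest(new_record: dict, existing_records: list) -> dict:
    """Run the matching rules before a new record is saved."""
    for record in existing_records:
        if is_duplicate(new_record, record):
            return record  # duplicate: merge into this record instead of inserting
    existing_records.append(new_record)
    return new_record
```

The periodic full-dataset pass still matters: rules evolve, and records that were distinct at ingest can converge as they're updated.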
Two customer records exist. You pick one as the winner and hard-delete the other. A week later, an old order reference points to the deleted ID. Now you have orphaned data.
Instead: Soft-delete or redirect. Keep the losing record with a pointer to the survivor.
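Sketched with a hypothetical `merged_into` pointer, so old references resolve instead of dangling:

```python
# Losing records are kept, marked with a pointer to their survivor.
customers = {
    4891: {"name": "John Smith", "merged_into": None},
    4892: {"name": "J. Smith",   "merged_into": 4891},  # soft-deleted
}

def resolve(customer_id: int) -> int:
    """Follow merge pointers until we reach a live record."""
    while customers[customer_id]["merged_into"] is not None:
        customer_id = customers[customer_id]["merged_into"]
    return customer_id

print(resolve(4892))  # 4891: the old order reference still works
```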
You've learned how to find and consolidate duplicate records. The natural next step is understanding how to recognize when different records represent the same real-world entity across systems.