Record Matching/Merging

Your company just acquired a competitor. Now you have two customer databases - yours with 50,000 records, theirs with 40,000.

Marketing says 'just merge them.' But John Smith in System A with jsmith@gmail.com might be the same person as Jonathan Smith in System B with john.s@work.com. Or they might be completely different people.

You can't email the same customer twice with conflicting offers. You can't have two sales reps calling the same account. You need to match and merge - but getting it wrong destroys customer trust.

Matching finds pairs that represent the same entity. Merging combines them without losing information either side had.

8 min read

intermediate

Relevant If You're

Merging data from acquisitions or partnerships

Consolidating records from multiple systems

Building a single customer view from fragmented data

LAYER 1 - Record matching/merging creates unified records from fragmented sources.

Where This Sits

Category 1.3: Entity & Identity

Layer 1

Data Infrastructure

Entity Resolution · Record Matching/Merging · Deduplication · Master Data Management · Relationship Mapping

What It Is

Two steps: find the pairs, then combine them

Record matching is the process of identifying which records from different sources represent the same real-world entity. It goes beyond exact matches - it handles variations in names, addresses, typos, and incomplete data. A good matching algorithm says 'these two records are 94% likely to be the same person.'

Record merging is what happens next. Once you've identified a match, you need to combine the records intelligently. Which email address is more recent? Which phone number is the primary? Do you keep both addresses or pick one? Merging creates a single 'golden record' that contains the best information from all sources.
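
One way to picture a merge rule is "newest non-empty value wins." The sketch below is only an illustration of that single rule: the record layout, the `updated` dates, and the phone number are hypothetical and do not come from any particular system.

```python
from datetime import date

def merge_most_recent(records):
    """Build a golden record by taking, per field, the newest non-empty value."""
    golden = {}
    # Process oldest first so that newer records overwrite older ones.
    for rec in sorted(records, key=lambda r: r["updated"]):
        for field, value in rec["fields"].items():
            if value:  # skip empty values so a gap never erases real data
                golden[field] = value
    return golden

# Hypothetical records for the same person in two systems.
system_a = {"updated": date(2023, 1, 10),
            "fields": {"name": "John Smith", "email": "jsmith@gmail.com",
                       "phone": "555-0100"}}
system_b = {"updated": date(2024, 6, 2),
            "fields": {"name": "Jonathan Smith", "email": "john.s@work.com",
                       "phone": ""}}

golden = merge_most_recent([system_a, system_b])
print(golden)
```

Here the newer name and email win, while the phone number survives from the older record because the newer one is blank. A rule like this discards the losing values from the golden record, so it only makes sense when the source records are kept alongside it.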

The goal is to go from 'Customer A in System 1, Customer B in System 2' to 'This is one customer, and here's everything we know about them.'

The Lego Block Principle

Record matching/merging solves the universal problem of data fragmentation: how do you unify scattered information about the same entity?

The core pattern:

Define matching criteria (name similarity, email overlap, address proximity).

Score pairs of records on likelihood of being matches.

Set a threshold for 'definite match' vs 'needs review.'

For confirmed matches, apply merge rules to create a golden record.

Track the source of each field.
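
The core pattern can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions: `score_pair`, the two thresholds (0.9 and 0.7), and the "prefer System A" merge rule are all hypothetical choices, and the name comparison uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever similarity metric you would actually pick.

```python
from difflib import SequenceMatcher
from itertools import product

def score_pair(a, b):
    """Hypothetical scorer: an identical email is decisive, else use name similarity."""
    if a["email"] and a["email"].lower() == b["email"].lower():
        return 1.0
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def match_and_merge(system_a, system_b, match_at=0.9, review_at=0.7):
    golden_records, review_queue = [], []
    for a, b in product(system_a, system_b):  # every cross-system pair
        score = score_pair(a, b)
        if score >= match_at:
            golden = {}
            for field in sorted(a.keys() | b.keys()):
                # Prefer System A's value, fall back to B, and record the source.
                if a.get(field):
                    golden[field] = {"value": a[field], "source": "A"}
                else:
                    golden[field] = {"value": b.get(field), "source": "B"}
            golden_records.append(golden)
        elif score >= review_at:
            review_queue.append((a["name"], b["name"], round(score, 2)))
    return golden_records, review_queue

system_a = [{"name": "Sarah Johnson", "email": "sarah.j@company.com", "city": ""}]
system_b = [{"name": "Sarah M. Johnson", "email": "sarah.j@company.com",
             "city": "Chicago, IL"},
            {"name": "Robert Brown", "email": "rbrown@email.com", "city": "Seattle"}]

golden_records, review_queue = match_and_merge(system_a, system_b)
print(golden_records)
print(review_queue)
```

Comparing every pair is fine for a handful of records but grows with the product of the two table sizes, so a real 50,000 by 40,000 job would need a way to narrow the candidate pairs first.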

Where else this applies:

CRM consolidation - Match customer records across sales, marketing, and support systems.

M&A integration - Combine customer and product databases from acquired companies.

Healthcare records - Link patient data across hospitals, labs, and insurance systems.

Financial services - Match accounts and transactions across banking platforms.

Interactive: Match & Merge Records

Adjust the match threshold and see which records pair up

3 customers in System A, 4 in System B. Some are the same person across systems. Adjust the threshold to control matching sensitivity.

Match Threshold: 70%

More matches (risky) · Fewer matches (conservative)

System A (3 records)

John Smith

jsmith@gmail.com · New York

Sarah Johnson

sarah.j@company.com · Chicago

Mike Williams

mikew@outlook.com · Los Angeles

System B (4 records)

JOHN SMITH

john.s@work.com · NYC

Sarah M. Johnson

sarah.j@company.com · Chicago, IL

Michael Williams

michael.w@gmail.com · LA

Robert Brown

rbrown@email.com · Seattle

Match Results

2 matched · 0 need review

Matched

96%

System A:

Sarah Johnson

sarah.j@company.com

System B:

Sarah M. Johnson

sarah.j@company.com

Matched

70%

System A:

John Smith

jsmith@gmail.com

System B:

JOHN SMITH

john.s@work.com

Try it: Adjust the threshold slider to see how matching sensitivity affects results. A lower threshold catches more matches but risks false positives. Higher thresholds are conservative but may miss valid matches.

How It Works

Three matching strategies by data quality

Deterministic Matching

Exact matches on key fields

Match when specific fields are identical: same email address, same SSN, same phone number. Simple and fast. Works when you have reliable unique identifiers. Misses matches when data has typos or variations.
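
A minimal sketch of deterministic matching on one key field, using the sample records from the demo above. The function name and the choice of `email` as the key are assumptions for illustration; the only cleanup applied is trimming and lowercasing the key.

```python
def deterministic_matches(source_a, source_b, key="email"):
    """Pair records whose key field is identical after trimming and lowercasing."""
    # Index System B by key so each System A record needs a single lookup.
    index = {}
    for rec in source_b:
        k = rec.get(key, "").strip().lower()
        if k:  # never index blank keys, or every blank would "match"
            index.setdefault(k, []).append(rec)

    pairs = []
    for rec in source_a:
        k = rec.get(key, "").strip().lower()
        for other in index.get(k, []):
            pairs.append((rec, other))
    return pairs

system_a = [
    {"name": "John Smith", "email": "jsmith@gmail.com"},
    {"name": "Sarah Johnson", "email": "sarah.j@company.com"},
    {"name": "Mike Williams", "email": "mikew@outlook.com"},
]
system_b = [
    {"name": "JOHN SMITH", "email": "john.s@work.com"},
    {"name": "Sarah M. Johnson", "email": "sarah.j@company.com"},
    {"name": "Michael Williams", "email": "michael.w@gmail.com"},
    {"name": "Robert Brown", "email": "rbrown@email.com"},
]

pairs = deterministic_matches(system_a, system_b)
for a, b in pairs:
    print(a["name"], "<->", b["name"])
```

Only the Sarah Johnson pair is found. John Smith and Mike Williams use different email addresses in each system, which is exactly the kind of match this strategy misses.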

Pro: High precision, fast execution

Con: Misses fuzzy matches, requires clean data

Probabilistic Matching

Scoring based on multiple signals

Score potential matches across multiple fields using similarity metrics. 'Name is 85% similar, email domain matches, same city' might score 92%. Set thresholds for auto-match, auto-reject, and manual review. More flexible but requires tuning.
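
As a rough illustration, multi-field scoring can be approximated with a weighted average of per-field similarities. This is a simplification rather than a formal probabilistic model: the weights, the two thresholds, and the use of `difflib.SequenceMatcher` as the similarity metric are all assumptions you would tune against your own data.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; they must sum to 1 for the score to stay in [0, 1].
WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}

def similarity(a, b):
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b):
    """Weighted average of the per-field similarities."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())

def classify(score, auto_match=0.85, auto_reject=0.5):
    """Route a pair to auto-match, auto-reject, or the manual review queue."""
    if score >= auto_match:
        return "match"
    if score < auto_reject:
        return "reject"
    return "review"

sarah_a = {"name": "Sarah Johnson", "email": "sarah.j@company.com", "city": "Chicago"}
sarah_b = {"name": "Sarah M. Johnson", "email": "sarah.j@company.com",
           "city": "Chicago, IL"}

score = match_score(sarah_a, sarah_b)
print(round(score, 2), classify(score))
```

The three-way split matters more than the exact scoring formula: pairs in the middle band go to a person instead of being silently merged or dropped.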

Pro: Handles variations and fuzzy data

Con: Requires threshold tuning, can produce false positives

ML-Based Matching

Trained on your labeled matches

Train a model on examples of known matches and non-matches from your data. The model learns which field combinations indicate matches in your specific domain. Most accurate but requires training data and model maintenance.
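
To make the idea concrete, here is a tiny logistic-regression classifier written in plain Python so it runs without extra packages. Everything in it is hypothetical: the three features (name similarity, email equality, same city), the six labeled pairs, and the training settings. A production system would use far more labeled data and an established ML library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(features, labels, epochs=1000, lr=0.5):
    """Fit logistic-regression weights with simple stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = pred - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def predict(model, x):
    """Return the estimated probability that a pair is a true match."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical labeled pairs: [name_similarity, email_equal, same_city] -> is_match
X = [[0.95, 1, 1], [0.90, 0, 1], [0.85, 1, 0],
     [0.40, 0, 0], [0.90, 0, 0], [0.30, 0, 1]]
y = [1, 1, 1, 0, 0, 0]

model = train(X, y)
likely_match = predict(model, [0.95, 1, 1])
likely_different = predict(model, [0.30, 0, 0])
print(round(likely_match, 2), round(likely_different, 2))
```

The labeled set includes a pair with a very similar name but no other agreement marked as a non-match, which is how the model learns that name similarity alone is not enough in this toy domain.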

Pro: Highest accuracy for complex cases

Con: Needs labeled training data, ongoing maintenance

Connection Explorer

"90,000 records → 72,000 unique customers → one unified view per person"

After acquiring a competitor, the combined customer databases (50,000 + 40,000 = 90,000 records) overlapped heavily. Record matching identified 18,000 pairs that were the same customer across systems, leaving 90,000 − 18,000 = 72,000 unique customers. Merging created golden records with the best data from both sources, which ended duplicate outreach and conflicting account histories.

Diagram components: Relational DB, Normalization, Entity Resolution, Record Matching/Merging (you are here), Master Data, Unified Customer View (outcome). Legend: Foundation, Data Infrastructure, Outcome.

Upstream (Requires)

Normalization Entity Resolution

Downstream (Enables)

Deduplication Master Data Management

Common Mistakes

What breaks when matching and merging goes wrong

Don't match without normalization first

You tried to match 'John Smith' to 'JOHN SMITH' and they didn't match because the comparison was case-sensitive. Now you have two records for the same person, and they're getting duplicate emails. The sales team just contacted the same lead twice with different pricing.

Instead: Always normalize data before matching: lowercase, trim whitespace, standardize formats. The matching step should compare normalized values, not raw input.
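
A minimal normalization sketch, assuming three hypothetical helpers for names, emails, and phone numbers. Real pipelines usually go further (nicknames, address standardization), but even this much removes the case-sensitivity failure described above.

```python
import re

def normalize_name(name):
    """Lowercase, trim, and collapse runs of whitespace to a single space."""
    return re.sub(r"\s+", " ", name).strip().lower()

def normalize_email(email):
    """Lowercase and trim; email comparison is treated as case-insensitive here."""
    return email.strip().lower()

def normalize_phone(phone):
    """Keep digits only so formatting differences do not block a match."""
    return re.sub(r"\D", "", phone)

print(normalize_name("  JOHN   SMITH ") == normalize_name("John Smith"))
print(normalize_email(" Sarah.J@Company.com "))
print(normalize_phone("(555) 010-0100"))
```

Keep the raw values as well: the normalized form is for comparison, not for display.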

Don't merge destructively

You merged two customer records and kept only the 'newer' address. Turns out that was the customer's vacation home - you just lost their primary shipping address. Now packages are going to the wrong place and the original address is gone forever.

Instead: Preserve source data. Keep a link to original records. Store all values with timestamps and sources. Let business rules decide which to display, but never throw away data during merge.
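
One possible shape for a non-destructive merge is to store every value per field with its source and timestamp, and to apply the display rule separately. The record layout, the addresses, and the "most recent wins" display rule below are hypothetical.

```python
from datetime import date

def merge_preserving(records):
    """Collect every non-empty value per field, tagged with source and date."""
    golden = {}
    for rec in records:
        for field, value in rec["fields"].items():
            if value:
                golden.setdefault(field, []).append(
                    {"value": value, "source": rec["source"], "as_of": rec["updated"]}
                )
    return golden

def display_value(golden, field):
    """Hypothetical display rule: show the most recently updated value."""
    candidates = golden.get(field, [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["as_of"])["value"]

records = [
    {"source": "System A", "updated": date(2022, 3, 1),
     "fields": {"address": "12 Main St"}},
    {"source": "System B", "updated": date(2024, 7, 15),
     "fields": {"address": "9 Beach Rd"}},
]

golden = merge_preserving(records)
print(display_value(golden, "address"))
print([c["value"] for c in golden["address"]])
```

If the newer address turns out to be a vacation home, changing the display rule is enough to recover the primary address, because nothing was discarded at merge time.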

Don't set thresholds without review samples

You set the match threshold to 80% without testing. Now you're auto-matching 'John Smith in NYC' with 'John Smith in LA' - different people, same common name. Your single customer view is actually multiple customers mashed together. Marketing is sending personalized emails with the wrong purchase history.

Instead: Sample potential matches at different thresholds. Review false positives and false negatives. Tune thresholds based on your data, not defaults. Consider a manual review queue for borderline cases.
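
Reviewing a sample can be as simple as hand-labeling a few scored pairs and computing precision and recall at each candidate threshold. The six scored pairs below are made-up illustration data, not results from any real system.

```python
def precision_recall(scored_pairs, threshold):
    """scored_pairs: (score, is_true_match) tuples from a hand-reviewed sample."""
    tp = sum(1 for s, truth in scored_pairs if s >= threshold and truth)
    fp = sum(1 for s, truth in scored_pairs if s >= threshold and not truth)
    fn = sum(1 for s, truth in scored_pairs if s < threshold and truth)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Hypothetical reviewed sample: match score and the reviewer's verdict.
sample = [(0.96, True), (0.88, True), (0.82, False),
          (0.74, True), (0.71, False), (0.55, False)]

for threshold in (0.7, 0.8, 0.9):
    p, r = precision_recall(sample, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this sample, raising the threshold from 0.7 to 0.9 lifts precision from 0.60 to 1.00 while recall falls from 1.00 to 0.33, which is the trade-off a review queue for borderline scores is meant to soften.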

What's Next

Now that you understand record matching/merging

You've learned how to identify and combine records that represent the same entity. The natural next step is deduplication - systematically removing duplicates to maintain data quality at scale.

Recommended Next

Deduplication

Systematically detect and remove duplicate records
