Your company just acquired a competitor. Now you have two customer databases - yours with 50,000 records, theirs with 40,000.
Marketing says 'just merge them.' But John Smith in System A with jsmith@gmail.com might be the same person as Jonathan Smith in System B with john.s@work.com. Or they might be completely different people.
You can't email the same customer twice with conflicting offers. You can't have two sales reps calling the same account. You need to match and merge - but getting it wrong destroys customer trust.
Matching finds pairs of records that represent the same entity. Merging combines them without losing information that either side had.
LAYER 1 - Record matching/merging creates unified records from fragmented sources.
Record matching is the process of identifying which records from different sources represent the same real-world entity. It goes beyond exact matches - it handles variations in names, addresses, typos, and incomplete data. A good matching algorithm says 'these two records are 94% likely to be the same person.'
Record merging is what happens next. Once you've identified a match, you need to combine the records intelligently. Which email address is more recent? Which phone number is the primary? Do you keep both addresses or pick one? Merging creates a single 'golden record' that contains the best information from all sources.
The goal is to go from 'Customer A in System 1, Customer B in System 2' to 'This is one customer, and here's everything we know about them.'
Record matching/merging solves the universal problem of data fragmentation: how do you unify scattered information about the same entity?
Define matching criteria (name similarity, email overlap, address proximity). Score pairs of records on likelihood of being matches. Set a threshold for 'definite match' vs 'needs review.' For confirmed matches, apply merge rules to create a golden record. Track the source of each field.
Exact matches on key fields
Match when specific fields are identical: same email address, same SSN, same phone number. Simple and fast. Works when you have reliable unique identifiers. Misses matches when data has typos or variations.
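Here is a minimal sketch of exact matching, assuming each record is a dict with hypothetical `id` and `email` fields. It indexes one system by the key field and looks up each record from the other, so it avoids comparing every pair.

```python
from collections import defaultdict

def exact_match_pairs(system_a, system_b, key="email"):
    """Return (id_a, id_b) pairs whose `key` field is identical and non-empty."""
    # Index System B by the key field so each lookup is a dictionary access.
    index = defaultdict(list)
    for rec in system_b:
        value = rec.get(key)
        if value:  # skip missing or empty identifiers
            index[value].append(rec["id"])

    pairs = []
    for rec in system_a:
        value = rec.get(key)
        if value:
            for id_b in index.get(value, []):
                pairs.append((rec["id"], id_b))
    return pairs

# Illustrative records only; the field names are assumptions.
system_a = [
    {"id": "A1", "name": "John Smith", "email": "jsmith@gmail.com"},
    {"id": "A2", "name": "Ana Lopez", "email": "ana@example.com"},
]
system_b = [
    {"id": "B1", "name": "Jonathan Smith", "email": "john.s@work.com"},
    {"id": "B2", "name": "Ana Lopez", "email": "ana@example.com"},
]

pairs = exact_match_pairs(system_a, system_b)
print(pairs)
```

Only the records sharing an email address are paired. John Smith and Jonathan Smith are not, which is exactly the kind of match this approach misses.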
Scoring based on multiple signals
Score potential matches across multiple fields using similarity metrics. 'Name is 85% similar, email domain matches, same city' might score 92%. Set thresholds for auto-match, auto-reject, and manual review. More flexible but requires tuning.
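A rough sketch of multi-signal scoring might look like the following. It uses `difflib.SequenceMatcher` from the Python standard library as the similarity metric. The weights, the two thresholds, and the `name`, `email`, and `city` fields are all assumptions chosen for illustration, not recommended values.

```python
from difflib import SequenceMatcher

# Hypothetical weights and thresholds; tune these on your own data.
WEIGHTS = {"name": 0.5, "email_domain": 0.2, "city": 0.3}
AUTO_MATCH = 0.85   # at or above this score: treat as a match
AUTO_REJECT = 0.50  # below this score: treat as different entities

def similarity(a, b):
    """String similarity in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def score_pair(rec_a, rec_b):
    """Weighted combination of name, email-domain, and city signals."""
    name_sim = similarity(rec_a["name"], rec_b["name"])
    domain_a = rec_a["email"].split("@")[-1].lower()
    domain_b = rec_b["email"].split("@")[-1].lower()
    domain_sim = 1.0 if domain_a == domain_b else 0.0
    city_sim = similarity(rec_a["city"], rec_b["city"])
    return (WEIGHTS["name"] * name_sim
            + WEIGHTS["email_domain"] * domain_sim
            + WEIGHTS["city"] * city_sim)

def classify(score):
    """Map a score to auto-match, auto-reject, or manual review."""
    if score >= AUTO_MATCH:
        return "match"
    if score < AUTO_REJECT:
        return "reject"
    return "review"

a = {"name": "John Smith", "email": "jsmith@gmail.com", "city": "New York"}
b = {"name": "Jonathan Smith", "email": "john.s@work.com", "city": "New York"}
c = {"name": "Mary Jones", "email": "mary@example.org", "city": "Denver"}

print(classify(score_pair(a, a)))  # identical records
print(classify(score_pair(a, b)))  # similar name, same city, different email
print(classify(score_pair(a, c)))  # unrelated records
```

With these assumed weights, the John Smith and Jonathan Smith pair lands in the review band rather than being auto-matched, which is where an ambiguous pair like this belongs.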
Trained on your labeled matches
Train a model on examples of known matches and non-matches from your data. The model learns which field combinations indicate matches in your specific domain. Most accurate but requires training data and model maintenance.
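To make the idea concrete, here is a toy sketch: a tiny logistic regression written in plain Python and trained on a handful of made-up labeled pairs. The three features (name similarity, email equality, city equality), the training data, and the hyperparameters are all assumptions for illustration. A real project would likely use an established machine learning library and far more labeled data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Estimated probability that a pair with these features is a match."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def train(feature_rows, labels, epochs=2000, learning_rate=0.5):
    """Fit logistic regression with batch gradient descent."""
    n_features = len(feature_rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(feature_rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(feature_rows, labels):
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

# Each row: [name_similarity, email_is_equal, city_is_equal] (hypothetical).
training_features = [
    [0.95, 1, 1], [0.85, 0, 1], [0.90, 1, 0], [0.80, 1, 1],  # known matches
    [0.30, 0, 0], [0.90, 0, 0], [0.20, 0, 1], [0.40, 0, 0],  # known non-matches
]
training_labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train(training_features, training_labels)

likely_match = predict(weights, bias, [0.95, 1, 1])
likely_different = predict(weights, bias, [0.30, 0, 0])
print(round(likely_match, 3), round(likely_different, 3))
```

Notice the labeled non-match `[0.90, 0, 0]`: a very similar name with a different email and city. Examples like that are how the model learns that name similarity alone is weak evidence in your data.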
After acquiring a competitor, the combined customer databases had massive overlap. Record matching identified 18,000 pairs that were the same customer across systems. Merging created golden records with the best data from both sources. No more duplicate outreach or conflicting account histories.
You tried to match 'John Smith' to 'JOHN SMITH' and they didn't match because the comparison was case-sensitive. Now you have two records for the same person, and they're getting duplicate emails. The sales team just contacted the same lead twice with different pricing.
Instead: Always normalize data before matching: lowercase, trim whitespace, standardize formats. The matching step should compare normalized values, not raw input.
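A normalization step can be sketched like this, assuming records with hypothetical `name`, `email`, and `phone` fields. The exact rules (for example, reducing phone numbers to digits only) are assumptions you would adapt to your own formats.

```python
import re
import unicodedata

def normalize_text(value):
    """Unify Unicode forms, fold case, and collapse runs of whitespace."""
    value = unicodedata.normalize("NFKC", value)
    return " ".join(value.casefold().split())

def normalize_phone(value):
    """Keep digits only so '(555) 123-4567' and '555.123.4567' compare equal."""
    return re.sub(r"\D", "", value)

def normalize_record(rec):
    """Return a normalized copy for matching; the raw record is left untouched."""
    return {
        "name": normalize_text(rec.get("name", "")),
        "email": rec.get("email", "").strip().lower(),
        "phone": normalize_phone(rec.get("phone", "")),
    }

raw = {"name": "  JOHN   Smith ", "email": " JSmith@Gmail.com", "phone": "(555) 123-4567"}
clean = normalize_record(raw)
print(clean)
```

The function returns a new dict instead of modifying the input, so the original values stay available for display and auditing while the matcher compares only the normalized copies.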
You merged two customer records and kept only the 'newer' address. Turns out that was the customer's vacation home - you just lost their primary shipping address. Now packages are going to the wrong place and the original address is gone forever.
Instead: Preserve source data. Keep a link to original records. Store all values with timestamps and sources. Let business rules decide which to display, but never throw away data during merge.
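One way to sketch a non-destructive merge is shown below. The record shape (`source`, `id`, `updated`, `fields`) and the "most recently updated value wins" display rule are assumptions for illustration. The point is that the display rule only picks what to show, while every value stays in the history with its source and date.

```python
from datetime import date

def merge_records(records):
    """Build a golden record that keeps every value with its source and date."""
    golden = {
        "sources": [(rec["source"], rec["id"]) for rec in records],  # links back
        "history": {},  # field -> list of every non-empty value seen
        "display": {},  # field -> value chosen by the business rule
    }
    for rec in records:
        for field, value in rec["fields"].items():
            if value in (None, ""):
                continue  # nothing to preserve for empty values
            golden["history"].setdefault(field, []).append(
                {"value": value, "source": rec["source"], "updated": rec["updated"]}
            )
    # Hypothetical display rule: the most recently updated value wins.
    for field, entries in golden["history"].items():
        golden["display"][field] = max(entries, key=lambda e: e["updated"])["value"]
    return golden

record_a = {
    "source": "System A", "id": "A1", "updated": date(2023, 3, 1),
    "fields": {"name": "John Smith", "address": "12 Main St", "phone": "5551234567"},
}
record_b = {
    "source": "System B", "id": "B7", "updated": date(2024, 6, 15),
    "fields": {"name": "Jonathan Smith", "address": "8 Lake Rd", "phone": ""},
}

golden = merge_records([record_a, record_b])
print(golden["display"]["address"])
print(len(golden["history"]["address"]))
```

If the newer address turns out to be a vacation home, the older one is still in `history`, so you can change the display rule or correct the record without having lost anything.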
You set the match threshold to 80% without testing. Now you're auto-matching 'John Smith in NYC' with 'John Smith in LA' - different people, same common name. Your single customer view is actually multiple customers mashed together. Marketing is sending personalized emails with the wrong purchase history.
Instead: Sample potential matches at different thresholds. Review false positives and false negatives. Tune thresholds based on your data, not defaults. Consider a manual review queue for borderline cases.
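Threshold testing can be sketched as follows, assuming you have a small hand-labeled sample of scored pairs. The scores and labels here are made up for illustration. For each candidate threshold, the sketch reports precision (how many auto-matches were correct) and recall (how many true matches were caught).

```python
def precision_recall(scored_pairs, threshold):
    """scored_pairs is a list of (score, is_true_match) from a labeled sample."""
    tp = sum(1 for score, is_match in scored_pairs if score >= threshold and is_match)
    fp = sum(1 for score, is_match in scored_pairs if score >= threshold and not is_match)
    fn = sum(1 for score, is_match in scored_pairs if score < threshold and is_match)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

# Hypothetical labeled sample: (match score, reviewer says it is the same entity).
labeled_sample = [
    (0.97, True), (0.93, True), (0.88, False), (0.86, True),
    (0.82, False), (0.78, True), (0.65, False), (0.40, False),
]

for threshold in (0.80, 0.85, 0.90, 0.95):
    precision, recall = precision_recall(labeled_sample, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this made-up sample, a threshold of 0.80 auto-matches two pairs that are really different people, while 0.90 avoids those false positives at the cost of missing half the true matches. Pairs scoring in between are natural candidates for the manual review queue.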
You've learned how to identify and combine records that represent the same entity. The natural next step is deduplication - systematically removing duplicates to maintain data quality at scale.