Entity & Identity includes five components: entity resolution for identifying when different records refer to the same thing, deduplication for removing duplicate records, record matching and merging for combining records intelligently, master data management for establishing single sources of truth, and relationship mapping for connecting entities together. The right combination depends on your data quality, system count, and whether you need to preserve relationships. Most organizations start with deduplication, then add entity resolution as they scale.
The same customer appears three times in your CRM with slightly different names. Your finance team spends hours matching invoices to accounts.
Marketing says you have 15,000 customers. Sales says 12,000. Finance says 14,200. Everyone is looking at the same data.
Nobody knows which number is right because nobody knows how many duplicates exist.
Your data is not wrong. It is just fragmented into pieces that do not know they belong together.
Part of Layer 1: Data Infrastructure - Where raw data becomes usable.
Entity & Identity is about recognizing that different records represent the same real-world thing and unifying them. Without it, you have scattered data about scattered versions of the same customers, vendors, and products. With it, you have a single, authoritative view.
These components build on each other. Deduplication cleans obvious duplicates. Entity resolution matches across systems. Record merging combines matched records. Master data management governs the result. Relationship mapping connects everything together.
These components form a progression: from finding duplicates to creating unified, connected entities.
| | Deduplication | Entity Resolution | Matching/Merging | MDM | Relationships |
|---|---|---|---|---|---|
| Primary Function | Remove duplicates within a dataset | | | | |
| Input | Single dataset with potential duplicates | | | | |
| Output | Clean dataset without duplicates | | | | |
| When to Add | First: clean existing data | | | | |
Different symptoms point to different components. Identify what is breaking to know where to focus.
“The same customer has multiple records in my CRM”
Start with deduplication to clean obvious duplicates within a single system.
“I need to match customers across my CRM, billing system, and support platform”
Entity resolution handles matching across systems with different formats.
“I found matches but do not know how to combine them”
Record merging creates golden records from matched pairs.
“Different departments report different customer counts”
MDM establishes one authoritative source everyone references.
“I need to know how customers connect to each other”
Relationship mapping builds the graph of connections between entities.
Entity identity is not about databases. It is about recognizing that the same real-world thing can appear in many forms and unifying those appearances into one truth.
The same entity exists in multiple forms or systems
Match records, merge them intelligently, establish authority, map connections
One answer to every question about that entity
When different teams report different customer counts from the same data...
That's a master data problem - no single source of truth, so everyone counts differently.
When sales calls a lead that support is already working with...
That's an entity resolution problem - the same person exists as separate records in different systems.
When reconciliation requires manually matching invoices to accounts...
That's a record matching problem - transactions need to link to master records.
When you cannot tell if two vendor records are the same company...
That's a deduplication and relationship mapping problem - fragmented records hide connections.
Which of these sounds most like your current situation?
These mistakes compound. One wrong merge or missed duplicate pollutes everything downstream.
Move fast. Structure data “good enough.” Scale up. Data becomes messy. Painful migration later. The fix is simple: think about access patterns upfront. It takes an hour now. It saves weeks later.
Deduplication removes exact or near-exact duplicate records within a single system. Entity resolution identifies when different records across multiple systems refer to the same real-world entity, even when the data looks completely different. Deduplication is simpler and faster. Entity resolution handles more complex matching across systems with different formats and identifiers.
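The single-system case can be sketched in a few lines. This is a minimal illustration, not a production approach: the field names (`name`, `email`), the normalization rules, and the "keep the first record seen" policy are all assumptions made for the example.

```python
import re

def normalize_key(record):
    """Build a comparison key: lowercase the name, drop punctuation, collapse spaces."""
    name = re.sub(r"[^a-z0-9 ]", "", record["name"].lower())
    name = re.sub(r"\s+", " ", name).strip()
    email = record["email"].strip().lower()
    return (name, email)

def deduplicate(records):
    """Keep the first record seen for each normalized key (an assumed policy)."""
    seen = {}
    for record in records:
        seen.setdefault(normalize_key(record), record)
    return list(seen.values())

crm = [
    {"name": "Acme, Inc.", "email": "ops@acme.example"},
    {"name": "ACME Inc", "email": "Ops@Acme.example "},
    {"name": "Globex", "email": "hello@globex.example"},
]
unique = deduplicate(crm)  # the two Acme rows collapse into one
```

This only catches near-exact duplicates that normalize to the same key. Records that look genuinely different across systems need the scoring approach that entity resolution uses.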
A golden record is the single, authoritative version of an entity created by merging data from multiple sources. When you have customer data in your CRM, billing system, and support platform, the golden record combines the best information from each: the most accurate email from one, the billing address from another, the support history from a third. All systems then reference this master record.
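One way to picture a golden record is a field-by-field pick across sources. The sketch below assumes a simple survivorship rule (take the most recently updated non-empty value) and hypothetical field names. Real rules for "best information" are usually defined per field and per source.

```python
from datetime import date

def build_golden_record(source_records, fields):
    """Pick, per field, the newest non-empty value and remember where it came from."""
    golden, provenance = {}, {}
    newest_first = sorted(source_records, key=lambda r: r["updated"], reverse=True)
    for field in fields:
        for record in newest_first:
            value = record.get(field)
            if value:  # skip missing or empty values
                golden[field] = value
                provenance[field] = record["source"]
                break
    return golden, provenance

records = [
    {"source": "crm", "updated": date(2024, 3, 1),
     "email": "bob@acme.example", "billing_address": ""},
    {"source": "billing", "updated": date(2024, 1, 15),
     "email": "robert@old.example", "billing_address": "1 Main St"},
    {"source": "support", "updated": date(2023, 11, 2),
     "email": "", "last_ticket": "T-1042"},
]
golden, provenance = build_golden_record(
    records, ["email", "billing_address", "last_ticket"]
)
```

Keeping `provenance` next to the golden record means every merged value can be traced back to the system that supplied it.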
Use master data management when multiple systems create and update the same entities and you need consistent data across the organization. Signs you need MDM: different departments report different customer counts, the same entity has conflicting data in different systems, or nobody knows which system has the authoritative information. Start with your most critical entity type.
Use probabilistic matching with multiple attributes. Compare names using fuzzy matching algorithms like Jaro-Winkler. Match addresses after standardization. Combine scores across fields: name 85% similar plus same city plus similar phone number might score 90% overall. Set thresholds for auto-match, auto-reject, and manual review. The key is weighting fields by how uniquely they identify entities.
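The scoring idea can be sketched as follows. Two things here are assumptions for illustration: the weights and thresholds are made-up numbers you would tune on your own data, and `difflib.SequenceMatcher` from the Python standard library stands in for Jaro-Winkler, which it is not.

```python
from difflib import SequenceMatcher

# Hypothetical weights: fields that identify an entity more uniquely count more.
WEIGHTS = {"name": 0.5, "city": 0.2, "phone": 0.3}
AUTO_MATCH, AUTO_REJECT = 0.85, 0.50  # illustrative thresholds

def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for Jaro-Winkler."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def digits(phone):
    """Standardize a phone number by keeping digits only."""
    return "".join(ch for ch in phone if ch.isdigit())

def match_score(a, b):
    """Weighted sum of per-field similarity scores."""
    scores = {
        "name": similarity(a["name"], b["name"]),
        "city": 1.0 if a["city"].lower() == b["city"].lower() else 0.0,
        "phone": similarity(digits(a["phone"]), digits(b["phone"])),
    }
    return sum(WEIGHTS[field] * scores[field] for field in WEIGHTS)

def decide(score):
    """Route a score to auto-match, auto-reject, or manual review."""
    if score >= AUTO_MATCH:
        return "auto-match"
    if score < AUTO_REJECT:
        return "auto-reject"
    return "manual-review"

a = {"name": "Robert Smith", "city": "Austin", "phone": "(512) 555-0142"}
b = {"name": "Rob Smith", "city": "austin", "phone": "512-555-0142"}
decision = decide(match_score(a, b))
```

The band between the two thresholds is where human review earns its cost. Widen it when wrong merges are expensive, narrow it when review capacity is the bottleneck.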
Relationship mapping connects entities to each other through typed relationships. A customer WORKS_AT a company. A company ACQUIRED another company. A contact REPORTS_TO a manager. Without relationship mapping, you know entities exist but not how they connect. With it, you can answer questions like "show me all customers where our main contact recently changed jobs."
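Typed relationships can be modeled as labeled edges between entity IDs. The class below is a minimal in-memory sketch. The `RelationshipGraph` name, its methods, and the ID format are hypothetical, and a real deployment would more likely sit on a graph database.

```python
from collections import defaultdict

class RelationshipGraph:
    """Stores typed, directed edges: (source, relationship type) -> targets."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, source, rel_type, target):
        self._edges[(source, rel_type)].add(target)

    def targets(self, source, rel_type):
        """Everything `source` points to through `rel_type`."""
        return set(self._edges.get((source, rel_type), set()))

    def sources(self, rel_type, target):
        """Everything that points to `target` through `rel_type`."""
        return {
            src
            for (src, rel), targets in self._edges.items()
            if rel == rel_type and target in targets
        }

graph = RelationshipGraph()
graph.add("contact:ana", "WORKS_AT", "company:acme")
graph.add("contact:raj", "WORKS_AT", "company:acme")
graph.add("contact:raj", "REPORTS_TO", "contact:ana")
graph.add("company:acme", "ACQUIRED", "company:initech")

employees = graph.sources("WORKS_AT", "company:acme")
```

Because edges are typed, "who works at Acme" and "what did Acme acquire" are separate lookups over the same set of unified entities.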
The biggest mistakes: over-merging records that should stay separate (two John Smiths become one), under-merging records that are the same entity (Bob and Robert stay separate), matching on unstable attributes like phone numbers that change frequently, and not tracking the sources of merged data. Test matching rules on known duplicates before running at scale.
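Testing a rule on known duplicates can be as simple as counting its hits and misses on hand-labeled pairs. In this sketch the labeled pairs and the deliberately naive `same_name` rule are invented for illustration. Low precision signals over-merging, low recall signals under-merging.

```python
def evaluate(rule, labeled_pairs):
    """Return (precision, recall) of a match rule on pairs labeled same/different."""
    tp = fp = fn = 0
    for a, b, is_same in labeled_pairs:
        predicted = rule(a, b)
        if predicted and is_same:
            tp += 1
        elif predicted and not is_same:
            fp += 1  # over-merge: two different people treated as one
        elif not predicted and is_same:
            fn += 1  # under-merge: the same person left as two records
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

def same_name(a, b):
    """A naive rule: match when names are equal ignoring case."""
    return a["name"].lower() == b["name"].lower()

labeled = [
    ({"name": "John Smith", "email": "js@a.example"},
     {"name": "john smith", "email": "js@a.example"}, True),
    ({"name": "John Smith", "email": "js@a.example"},
     {"name": "John Smith", "email": "john@b.example"}, False),
    ({"name": "Bob Lee", "email": "bl@c.example"},
     {"name": "Robert Lee", "email": "bl@c.example"}, True),
]
precision, recall = evaluate(same_name, labeled)
```

On these three pairs the name-only rule merges the two different John Smiths and misses Bob and Robert, so both numbers suffer.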
Deduplication focuses on finding duplicates. Record merging focuses on combining them. Deduplication decides "these two records are the same person." Record merging decides "which email to keep, which address is more recent, how to combine purchase history." You need both. Finding duplicates without a merge strategy leaves you with a list of problems. Merging without deduplication means missing duplicates.
Use deterministic matching when you have reliable unique identifiers like email addresses or account numbers, and when you need to explain every match decision for compliance. Use probabilistic matching when data quality varies, identifiers are incomplete, or you need to catch fuzzy matches. Many systems use both: deterministic for high-confidence matches, probabilistic for the rest.
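A combined approach might look like the sketch below: try the exact identifier first, fall back to fuzzy comparison, and return a reason with every decision so it can be explained later. The choice of email as the identifier, the 0.8 threshold, and the use of `difflib.SequenceMatcher` for the fuzzy step are assumptions for the example.

```python
from difflib import SequenceMatcher

def deterministic_match(a, b):
    """Exact match on a reliable identifier (here: a non-empty email)."""
    email_a = a.get("email", "").strip().lower()
    email_b = b.get("email", "").strip().lower()
    return bool(email_a) and email_a == email_b

def probabilistic_match(a, b, threshold=0.8):
    """Fuzzy match on name similarity; the threshold is illustrative."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold

def match(a, b):
    """Return (is_match, reason) so each decision can be audited."""
    if deterministic_match(a, b):
        return True, "deterministic: email equal"
    if probabilistic_match(a, b):
        return True, "probabilistic: name similarity"
    return False, "no match"

result = match(
    {"name": "R. Meyer", "email": "JM@acme.example"},
    {"name": "Jonathan Meyer", "email": "jm@acme.example"},
)
```

Running the deterministic check first keeps the explainable matches separate from the fuzzy ones, which is useful when compliance reviews ask why two records were linked.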
Run deduplication at ingest, not just as a periodic cleanup. When new records enter, check against existing data before creating new entities. Set up blocking rules to quickly identify potential matches. Monitor duplicate rates as a data quality metric. If duplicates keep appearing, trace them back to the source system or process creating them.
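Here is a rough sketch of checking at ingest with a blocking rule. The `EntityStore` class, the blocking key (first letter of the name plus postal code), and the 0.85 threshold are hypothetical, and `difflib.SequenceMatcher` is just a convenient standard-library similarity measure.

```python
from collections import defaultdict
from difflib import SequenceMatcher

class EntityStore:
    """Checks each incoming record against its block before creating an entity."""

    def __init__(self, threshold=0.85):
        self.blocks = defaultdict(list)
        self.threshold = threshold
        self.created = 0
        self.duplicates = 0

    @staticmethod
    def block_key(record):
        # Assumed blocking rule: only records sharing this key are compared.
        return (record["name"].strip().lower()[:1], record["postal_code"])

    def ingest(self, record):
        """Return the existing entity if one matches, else store the new record."""
        candidates = self.blocks[self.block_key(record)]
        for existing in candidates:
            ratio = SequenceMatcher(
                None, existing["name"].lower(), record["name"].lower()
            ).ratio()
            if ratio >= self.threshold:
                self.duplicates += 1
                return existing
        candidates.append(record)
        self.created += 1
        return record

    def duplicate_rate(self):
        """Share of ingested records that matched an existing entity."""
        total = self.created + self.duplicates
        return self.duplicates / total if total else 0.0

store = EntityStore()
first = store.ingest({"name": "Acme Corp", "postal_code": "10001"})
second = store.ingest({"name": "Acme Corp.", "postal_code": "10001"})
third = store.ingest({"name": "Apex Ltd", "postal_code": "10001"})
```

Blocking trades a little recall for speed: a record is compared only with others in its block, so a typo in the blocking field itself can hide a duplicate. Tracking `duplicate_rate()` over time gives you the data quality metric to monitor.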
Start with deduplication to clean existing data. Add entity resolution when you need to match across systems. Implement record merging to combine matched records. Add master data management when you need governance and a single source of truth. Finish with relationship mapping to connect your unified entities. Each layer builds on the previous one.