You're merging customer lists from three systems. One has 'United States', another has 'US', and the third has 'USA'. Phone numbers come as '(555) 123-4567', '555-123-4567', and '5551234567'.
You try to find duplicates, but 'John Smith' in one system doesn't match 'JOHN SMITH' in another.
Your 'unified' customer database is a mess of inconsistent formats that can't be searched, compared, or trusted.
Data that means the same thing should look the same thing.
LAYER 1 - Normalization makes data comparable by enforcing consistent formats.
Normalization is the process of transforming data into a standard format. Dates become ISO 8601. Phone numbers become E.164. Country names become ISO codes. Text gets consistent casing. The chaos of different source systems becomes order.
It's not about changing what the data means - it's about changing how it's represented. 'USA', 'United States', and 'US' all mean the same country. Normalization picks one representation and converts everything to match.
The goal is interoperability: data from any source can be compared, merged, searched, and processed using the same logic because it follows the same format.
Normalization solves a universal problem: how do you make data from different sources speak the same language?
Identify the field type (date, phone, address, currency). Apply the appropriate standard format. Handle edge cases and invalid values gracefully. Store both the original and normalized values when auditing matters. This pattern applies whether you're normalizing names, addresses, currencies, or any structured data.
These 6 customer records look different. Enable normalization to see how many are actually the same person.
| Source | Name | Phone | Country | |
|---|---|---|---|---|
| CRM A | John Smith | (555) 123-4567 | United States | John.Smith@email.com |
| CRM B | JOHN SMITH | 555-123-4567 | USA | john.smith@email.com |
| CRM C | john smith | +1 5551234567 | US | JOHN.SMITH@EMAIL.COM |
| CRM A | Jane Doe | 555.987.6543 | United States of America | jane@company.org |
| CRM B | Jane DOE | (555) 987-6543 | U.S.A. | JANE@company.org |
| CRM A | Bob Wilson | 555-555-1234 | Canada | bob@wilson.ca |
Fix the format, not the meaning
Applies consistent formatting rules: lowercase text, remove extra whitespace, standardize punctuation. Phone numbers become digits only. Dates become ISO format. Quick, deterministic, and handles most cases.
Map values to canonical forms
Uses lookup tables to map variants to standard values. 'USA', 'U.S.A.', 'United States' all become 'US'. Requires maintaining reference data but catches semantic equivalents that format rules miss.
Handle messy, ambiguous data
Uses machine learning or fuzzy matching for data that's too messy for rules or lookups. Handles typos, abbreviations, and creative spellings. 'Califrnia' becomes 'California'. More powerful but less predictable.
Sales teams from three acquired companies used different CRMs. A naive merge found zero duplicates across 50,000 contacts - impossible. After normalization, 12,000 duplicates emerged: same people, different formats. Now the unified database is clean and searchable.
Hover over any component to see what it does and why it's neededTap any component to see what it does and why it's needed
Animated lines show direct connections · Hover for detailsTap for details · Click to learn more
You normalized 'Bob Smith Jr.' to 'bob smith jr' and threw away the original. Now you can't tell if it was 'Jr.', 'Jr', or 'Junior'. You can't regenerate the proper display format. And if your normalization was wrong, the original is gone forever.
Instead: Store both original and normalized values. Normalize on read or in a separate column. Never destroy source data.
You normalized phone numbers on input, stripping the country code because 'all our customers are in the US'. Then you expanded internationally. Now you have millions of phone numbers with no country code, and no way to know which country they're from.
Instead: Keep data in its richest form as long as possible. Normalize at the point of use, not on ingestion. Preserve context.
You normalized dates to MM/DD/YYYY because that's what your US system uses. Then EU data arrived with DD/MM/YYYY. '03/04/2024' - is that March 4th or April 3rd? You don't know, and now neither does your database.
Instead: Use unambiguous formats (ISO 8601 for dates). Capture timezone and locale with the data. When in doubt, ask the source.
You've learned how to standardize data formats. The natural next step is validation - checking that normalized data meets your quality requirements before it enters your systems.