Your marketing team has campaign data in Google Sheets. Sales has pipeline exports in CSV files. Customer success has survey responses in yet another format.
Someone asks 'can we see which campaigns drove our best customers?'
Nobody even knows where all the data lives, let alone how to connect it.
You need one place where everything lands first.
FOUNDATIONAL - The raw material that feeds everything downstream.
A data lake is storage that doesn't care what you throw into it. CSV files, JSON exports, images, PDFs, log files, API responses. It all goes in exactly as it arrived, in its original format. No transformation, no schema, no questions asked.
The difference from a database: you're not structuring the data when you store it. You're preserving it. The structure comes later, when you actually need to use it. This means you never lose fidelity, and you can always go back to the original.
Store everything raw. Transform it when you need it. This way you never throw away data you might need later, and you can always reprocess when requirements change.
Data lakes solve a universal problem: how do you collect information from everywhere without forcing it all into the same shape upfront?
Ingest raw, store raw, transform on read. Data arrives in whatever format it comes in. Storage preserves the original. Processing happens only when someone actually needs to use it.
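To make "ingest raw, store raw" concrete, here is a minimal Python sketch of an ingestion step: each incoming file is copied into a raw zone byte for byte, organized by source system and arrival date. The datalake/raw folder and the ingest helper are illustrative stand-ins for whatever object storage and tooling you actually use.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

RAW_ZONE = Path("datalake/raw")  # stand-in for your object storage bucket

def ingest(source_file: str, source_system: str) -> Path:
    """Copy a file into the raw zone exactly as it arrived: no parsing, no schema."""
    arrived = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target_dir = RAW_ZONE / source_system / arrived
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source_file).name
    shutil.copy2(source_file, target)  # byte-for-byte copy preserves the original
    return target

# ingest("campaign_export.csv", source_system="hubspot")
# lands the file untouched under datalake/raw/hubspot/<today>/campaign_export.csv
```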
Each source has different fields. Compare the two approaches below: the "forced schema" approach loses data, while the data lake keeps everything.
Trying to fit everything into: id, name, email, source, date, amount
| id | name | email | source | date | amount | Lost ⚠️ |
|---|---|---|---|---|---|---|
| Q3 Webinar | Q3 Webinar | NULL | HubSpot Campaigns | 2024-09-15 | 4500 | 2 fields |
utm_source, metadata, tags, satisfaction_rating... gone forever.
Store raw, apply schema when needed
{
"campaign_name": "Q3 Webinar",
"utm_source": "linkedin",
"clicks": 2847,
"spend": 4500,
"date": "2024-09-15"
}
Every field preserved. Query what you need, when you need it.
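The same comparison in a few lines of Python. The column mapping below is illustrative; the point is that raw fields with no column to land in are silently dropped, while the lake keeps the record intact and lets you re-derive any view later.

```python
# The raw HubSpot record from the example above
raw_record = {
    "campaign_name": "Q3 Webinar",
    "utm_source": "linkedin",
    "clicks": 2847,
    "spend": 4500,
    "date": "2024-09-15",
}

# Forced schema: rename what maps cleanly onto id/name/email/source/date/amount, drop the rest
column_map = {"campaign_name": "name", "spend": "amount", "date": "date"}
forced_row = {column_map[k]: v for k, v in raw_record.items() if k in column_map}
lost_fields = set(raw_record) - set(column_map)

print(forced_row)   # {'name': 'Q3 Webinar', 'amount': 4500, 'date': '2024-09-15'}
print(lost_fields)  # {'utm_source', 'clicks'}: unrecoverable unless you kept the original

# Data lake approach: store raw_record exactly as it arrived (see the ingest sketch earlier)
# and derive forced_row, or any other view, only when you need it.
```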
From raw to refined
Most data lakes use zones: Raw (exactly as received), Cleaned (validated and deduplicated), and Curated (transformed for specific uses). Data moves through zones as it gets processed.
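A sketch of promoting data from the raw zone to the cleaned zone, assuming the file landed as CSV; the paths and the cleaning rules (drop blank and duplicate rows) are illustrative. The important property is that the raw copy is never touched, so you can always reprocess it.

```python
import csv
from pathlib import Path

RAW_FILE = Path("datalake/raw/hubspot/2024-09-15/campaign_export.csv")    # illustrative paths
CLEANED_FILE = Path("datalake/cleaned/hubspot/2024-09-15/campaigns.csv")

def promote_to_cleaned(raw_path: Path, cleaned_path: Path) -> None:
    """Read from the raw zone, drop blank and duplicate rows, write to the cleaned zone."""
    cleaned_path.parent.mkdir(parents=True, exist_ok=True)
    seen = set()
    with raw_path.open(newline="") as src, cleaned_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            key = tuple(row.values())
            if not any(row.values()) or key in seen:  # skip empty rows and exact duplicates
                continue
            seen.add(key)
            writer.writerow(row)

# promote_to_cleaned(RAW_FILE, CLEANED_FILE)
```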
Finding what you stored
Without a catalog, your data lake becomes a data swamp. Metadata tracks what each file is, where it came from, when it arrived, and who owns it. Think of it as the library's card catalog.
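A catalog can start as small as one table that gets a row for every file that lands. The SQLite file and columns below are assumptions to keep the sketch self-contained; production lakes usually rely on a dedicated catalog service such as AWS Glue or a Hive Metastore.

```python
import sqlite3
from datetime import datetime, timezone

catalog = sqlite3.connect("catalog.db")  # illustrative: one table, one row per file
catalog.execute("""
    CREATE TABLE IF NOT EXISTS files (
        path TEXT PRIMARY KEY,
        source TEXT,
        owner TEXT,
        description TEXT,
        arrived_at TEXT
    )
""")

def register(path: str, source: str, owner: str, description: str) -> None:
    """Record what a file is, where it came from, and who owns it, at ingestion time."""
    catalog.execute(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)",
        (path, source, owner, description, datetime.now(timezone.utc).isoformat()),
    )
    catalog.commit()

# register("datalake/raw/hubspot/2024-09-15/campaign_export.csv",
#          source="HubSpot Campaigns", owner="marketing",
#          description="Daily campaign performance export")
```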
Structure when you need it
Unlike databases that enforce schema when writing, data lakes apply schema when reading. The same raw file can be read with different schemas for different purposes. You're not locked into one interpretation.
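Schema-on-read in practice: the same raw file is read twice, and each reader pulls out only the fields it cares about. The path is hypothetical; the field names come from the HubSpot record shown earlier.

```python
import json

# Schema is applied when the file is read, not when it was written
with open("datalake/raw/hubspot/2024-09-15/q3_webinar.json") as f:  # hypothetical path
    record = json.load(f)

# Finance reads it as spend over time
finance_view = {
    "campaign": record["campaign_name"],
    "spend_usd": record["spend"],
    "date": record["date"],
}

# Marketing attribution reads the very same file as channel performance
attribution_view = {
    "campaign": record["campaign_name"],
    "channel": record["utm_source"],
    "clicks": record["clicks"],
}
```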
Your CEO asks the question from the opening scenario at the quarterly meeting: which campaigns drove our best customers? Campaign data is in HubSpot, customer info in Salesforce, retention metrics in your product database. Without a central data lake, answering takes weeks. With one, you have the answer by lunch.
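With everything already landed in the lake, the answer reduces to a join across curated extracts. A sketch assuming pandas is available; the file paths, column names, and the thresholds that define a "best" customer are all placeholders you would replace with your own.

```python
import pandas as pd  # assumes pandas is installed

# Curated-zone extracts; each landed in the lake as raw exports first
campaigns = pd.read_csv("datalake/curated/hubspot/campaign_touches.csv")  # customer_id, campaign
customers = pd.read_csv("datalake/curated/salesforce/customers.csv")      # customer_id, annual_value
retention = pd.read_csv("datalake/curated/product/retention.csv")         # customer_id, months_retained

# "Best customers" however your business defines it; these thresholds are placeholders
best = customers.merge(retention, on="customer_id")
best = best[(best["annual_value"] > 50_000) & (best["months_retained"] >= 12)]

# Rank campaigns by how many best customers they touched
answer = (
    campaigns.merge(best, on="customer_id")
    .groupby("campaign")["customer_id"]
    .nunique()
    .sort_values(ascending=False)
)
print(answer.head(10))
```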
In the full architecture diagram, the data lake sits at the foundation layer: it has no upstream dependencies, and everything downstream connects back to it.
You dump 50 CSV exports into a folder. Six months later, nobody remembers what 'export_final_v3_fixed.csv' contains or where it came from. Now you have 50 mystery files that might be important or might be garbage.
Instead: Tag every file with source, date, owner, and description when it lands. Automate this at ingestion.
Someone needs real-time dashboards, so you point Tableau directly at the data lake. Performance tanks because data lakes are optimized for batch processing, not interactive queries. Now everyone blames the data lake.
Instead: Data lakes are for storage and batch processing. Serve analytics from a data warehouse or specialized database.
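In practice that usually means a batch job that periodically loads a curated extract from the lake into a query-optimized store, and the BI tool points at that store. The sketch below uses SQLite only to stay self-contained; in production this is typically a warehouse such as Snowflake, BigQuery, or Redshift. The paths and columns are illustrative.

```python
import csv
import sqlite3

# Batch load: lake (curated zone) -> small analytics database that dashboards query
conn = sqlite3.connect("analytics.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS campaign_daily (campaign TEXT, date TEXT, spend REAL, clicks INTEGER)"
)

with open("datalake/curated/hubspot/campaign_daily.csv", newline="") as f:
    rows = [
        (r["campaign"], r["date"], float(r["spend"]), int(r["clicks"]))
        for r in csv.DictReader(f)
    ]

conn.executemany("INSERT INTO campaign_daily VALUES (?, ?, ?, ?)", rows)
conn.commit()
conn.close()

# Tableau (or any BI tool) now queries analytics.db interactively; the lake stays batch-only
```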
Marketing, sales, and engineering all have write access. After a year, you have 47 different date formats, files with no naming convention, and duplicate datasets that nobody knows are duplicates.
Instead: Define ingestion standards. Use automated validation. Have clear ownership for each data domain.
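Validation does not have to be elaborate to prevent the 47-date-format problem. A sketch, assuming the team has agreed on a short list of accepted input formats and normalizes everything to ISO 8601 at ingestion:

```python
from datetime import datetime

ACCEPTED_DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]  # illustrative agreed-upon list

def normalize_date(value: str) -> str:
    """Accept only the agreed formats and normalize to ISO 8601; reject everything else."""
    for fmt in ACCEPTED_DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

# normalize_date("09/15/2024")  -> "2024-09-15"
# normalize_date("2024.09.15")  -> ValueError, flagged at ingestion instead of rotting in the lake
```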
You've learned how raw data storage works and why schema-on-read matters. The natural next step is understanding how data actually gets into the lake and how to process it once it's there.