Your marketing team has campaign data in Google Sheets. Sales has pipeline exports in CSV files. Customer success has survey responses in yet another format.
Someone asks 'can we see which campaigns drove our best customers?'
Nobody even knows where all the data lives, let alone how to connect it.
You need one place where everything lands first.
FOUNDATIONAL - The raw material that feeds everything downstream.
A data lake is storage that doesn't care what you throw into it. CSV files, JSON exports, images, PDFs, log files, API responses. It all goes in exactly as it arrived, in its original format. No transformation, no schema, no questions asked.
The difference from a database: you're not structuring the data when you store it. You're preserving it. The structure comes later, when you actually need to use it. This means you never lose fidelity, and you can always go back to the original.
Store everything raw. Transform it when you need it. This way you never throw away data you might need later, and you can always reprocess when requirements change.
Data lakes solve a universal problem: how do you collect information from everywhere without forcing it all into the same shape upfront?
Ingest raw, store raw, transform on read. Data arrives in whatever format it comes in. Storage preserves the original. Processing happens only when someone actually needs to use it.
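To make "ingest raw, store raw" concrete, here is a minimal Python sketch of an ingestion step: each incoming file is copied into a raw zone byte for byte, organized by source system and arrival date. The datalake/raw folder and the ingest helper are illustrative stand-ins for whatever object storage and tooling you actually use.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

RAW_ZONE = Path("datalake/raw")  # stand-in for your object storage bucket

def ingest(source_file: str, source_system: str) -> Path:
    """Copy a file into the raw zone exactly as it arrived: no parsing, no schema."""
    arrived = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target_dir = RAW_ZONE / source_system / arrived
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source_file).name
    shutil.copy2(source_file, target)  # byte-for-byte copy preserves the original
    return target

# ingest("campaign_export.csv", source_system="hubspot")
# lands the file untouched under datalake/raw/hubspot/<today>/campaign_export.csv
```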
Each source has different fields. Compare the two approaches below: the "forced schema" approach loses data, while the data lake keeps everything.
Trying to fit everything into: id, name, email, source, date, amount
| id | name | email | source | date | amount | Lost ⚠️ |
|---|---|---|---|---|---|---|
| Q3 Webinar | Q3 Webinar | NULL | HubSpot Campaigns | 2024-09-15 | 4500 | 2 fields |
utm_source, metadata, tags, satisfaction_rating... gone forever.
Store raw, apply schema when needed
{
"campaign_name": "Q3 Webinar",
"utm_source": "linkedin",
"clicks": 2847,
"spend": 4500,
"date": "2024-09-15"
}
Every field preserved. Query what you need, when you need it.
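The same comparison in a few lines of Python. The column mapping below is illustrative; the point is that raw fields with no column to land in are silently dropped, while the lake keeps the record intact and lets you re-derive any view later.

```python
# The raw HubSpot record from the example above
raw_record = {
    "campaign_name": "Q3 Webinar",
    "utm_source": "linkedin",
    "clicks": 2847,
    "spend": 4500,
    "date": "2024-09-15",
}

# Forced schema: rename what maps cleanly onto id/name/email/source/date/amount, drop the rest
column_map = {"campaign_name": "name", "spend": "amount", "date": "date"}
forced_row = {column_map[k]: v for k, v in raw_record.items() if k in column_map}
lost_fields = set(raw_record) - set(column_map)

print(forced_row)   # {'name': 'Q3 Webinar', 'amount': 4500, 'date': '2024-09-15'}
print(lost_fields)  # {'utm_source', 'clicks'}: unrecoverable unless you kept the original

# Data lake approach: store raw_record exactly as it arrived (see the ingest sketch earlier)
# and derive forced_row, or any other view, only when you need it.
```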
From raw to refined
Most data lakes use zones: Raw (exactly as received), Cleaned (validated and deduplicated), and Curated (transformed for specific uses). Data moves through zones as it gets processed.
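A sketch of promoting data from the raw zone to the cleaned zone, assuming the file landed as CSV; the paths and the cleaning rules (drop blank and duplicate rows) are illustrative. The important property is that the raw copy is never touched, so you can always reprocess it.

```python
import csv
from pathlib import Path

RAW_FILE = Path("datalake/raw/hubspot/2024-09-15/campaign_export.csv")    # illustrative paths
CLEANED_FILE = Path("datalake/cleaned/hubspot/2024-09-15/campaigns.csv")

def promote_to_cleaned(raw_path: Path, cleaned_path: Path) -> None:
    """Read from the raw zone, drop blank and duplicate rows, write to the cleaned zone."""
    cleaned_path.parent.mkdir(parents=True, exist_ok=True)
    seen = set()
    with raw_path.open(newline="") as src, cleaned_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            key = tuple(row.values())
            if not any(row.values()) or key in seen:  # skip empty rows and exact duplicates
                continue
            seen.add(key)
            writer.writerow(row)

# promote_to_cleaned(RAW_FILE, CLEANED_FILE)
```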
Finding what you stored
Without a catalog, your data lake becomes a data swamp. Metadata tracks what each file is, where it came from, when it arrived, and who owns it. Think of it as the library's card catalog.
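A catalog can start as small as one table that gets a row for every file that lands. The SQLite file and columns below are assumptions to keep the sketch self-contained; production lakes usually rely on a dedicated catalog service such as AWS Glue or a Hive Metastore.

```python
import sqlite3
from datetime import datetime, timezone

catalog = sqlite3.connect("catalog.db")  # illustrative: one table, one row per file
catalog.execute("""
    CREATE TABLE IF NOT EXISTS files (
        path TEXT PRIMARY KEY,
        source TEXT,
        owner TEXT,
        description TEXT,
        arrived_at TEXT
    )
""")

def register(path: str, source: str, owner: str, description: str) -> None:
    """Record what a file is, where it came from, and who owns it, at ingestion time."""
    catalog.execute(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)",
        (path, source, owner, description, datetime.now(timezone.utc).isoformat()),
    )
    catalog.commit()

# register("datalake/raw/hubspot/2024-09-15/campaign_export.csv",
#          source="HubSpot Campaigns", owner="marketing",
#          description="Daily campaign performance export")
```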
Structure when you need it
Unlike databases that enforce schema when writing, data lakes apply schema when reading. The same raw file can be read with different schemas for different purposes. You're not locked into one interpretation.
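Schema-on-read in practice: the same raw file is read twice, and each reader pulls out only the fields it cares about. The path is hypothetical; the field names come from the HubSpot record shown earlier.

```python
import json

# Schema is applied when the file is read, not when it was written
with open("datalake/raw/hubspot/2024-09-15/q3_webinar.json") as f:  # hypothetical path
    record = json.load(f)

# Finance reads it as spend over time
finance_view = {
    "campaign": record["campaign_name"],
    "spend_usd": record["spend"],
    "date": record["date"],
}

# Marketing attribution reads the very same file as channel performance
attribution_view = {
    "campaign": record["campaign_name"],
    "channel": record["utm_source"],
    "clicks": record["clicks"],
}
```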
Your CEO asks the question from the opening scenario at the quarterly meeting: which campaigns drove our best customers? Campaign data is in HubSpot, customer info in Salesforce, retention metrics in your product database. Without a central data lake, answering takes weeks. With one, you have the answer by lunch.
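With everything already landed in the lake, the answer reduces to a join across curated extracts. A sketch assuming pandas is available; the file paths, column names, and the thresholds that define a "best" customer are all placeholders you would replace with your own.

```python
import pandas as pd  # assumes pandas is installed

# Curated-zone extracts; each landed in the lake as raw exports first
campaigns = pd.read_csv("datalake/curated/hubspot/campaign_touches.csv")  # customer_id, campaign
customers = pd.read_csv("datalake/curated/salesforce/customers.csv")      # customer_id, annual_value
retention = pd.read_csv("datalake/curated/product/retention.csv")         # customer_id, months_retained

# "Best customers" however your business defines it; these thresholds are placeholders
best = customers.merge(retention, on="customer_id")
best = best[(best["annual_value"] > 50_000) & (best["months_retained"] >= 12)]

# Rank campaigns by how many best customers they touched
answer = (
    campaigns.merge(best, on="customer_id")
    .groupby("campaign")["customer_id"]
    .nunique()
    .sort_values(ascending=False)
)
print(answer.head(10))
```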
In the full architecture diagram, the data lake sits at the foundation layer: it has no upstream dependencies, and everything downstream connects back to it.
You dump 50 CSV exports into a folder. Six months later, nobody remembers what 'export_final_v3_fixed.csv' contains or where it came from. Now you have 50 mystery files that might be important or might be garbage.
Instead: Tag every file with source, date, owner, and description when it lands. Automate this at ingestion.
Someone needs real-time dashboards, so you point Tableau directly at the data lake. Performance tanks because data lakes are optimized for batch processing, not interactive queries. Now everyone blames the data lake.
Instead: Data lakes are for storage and batch processing. Serve analytics from a data warehouse or specialized database.
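In practice that usually means a batch job that periodically loads a curated extract from the lake into a query-optimized store, and the BI tool points at that store. The sketch below uses SQLite only to stay self-contained; in production this is typically a warehouse such as Snowflake, BigQuery, or Redshift. The paths and columns are illustrative.

```python
import csv
import sqlite3

# Batch load: lake (curated zone) -> small analytics database that dashboards query
conn = sqlite3.connect("analytics.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS campaign_daily (campaign TEXT, date TEXT, spend REAL, clicks INTEGER)"
)

with open("datalake/curated/hubspot/campaign_daily.csv", newline="") as f:
    rows = [
        (r["campaign"], r["date"], float(r["spend"]), int(r["clicks"]))
        for r in csv.DictReader(f)
    ]

conn.executemany("INSERT INTO campaign_daily VALUES (?, ?, ?, ?)", rows)
conn.commit()
conn.close()

# Tableau (or any BI tool) now queries analytics.db interactively; the lake stays batch-only
```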
Marketing, sales, and engineering all have write access. After a year, you have 47 different date formats, files with no naming convention, and duplicate datasets that nobody knows are duplicates.
Instead: Define ingestion standards. Use automated validation. Have clear ownership for each data domain.
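Validation does not have to be elaborate to prevent the 47-date-format problem. A sketch, assuming the team has agreed on a short list of accepted input formats and normalizes everything to ISO 8601 at ingestion:

```python
from datetime import datetime

ACCEPTED_DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]  # illustrative agreed-upon list

def normalize_date(value: str) -> str:
    """Accept only the agreed formats and normalize to ISO 8601; reject everything else."""
    for fmt in ACCEPTED_DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

# normalize_date("09/15/2024")  -> "2024-09-15"
# normalize_date("2024.09.15")  -> ValueError, flagged at ingestion instead of rotting in the lake
```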
You've learned how raw data storage works and why schema-on-read matters. The natural next step is understanding how data actually gets into the lake and how to process it once it's there.