File Storage

You have contracts as PDFs, product photos as JPEGs, and customer uploads scattered across email attachments, Dropbox folders, and that one guy's desktop.

Someone asks for the signed contract from last March. You spend 20 minutes searching through folders named 'Contracts_Final_v2_REAL'.

That file should be one click away, linked to the customer record.

8 min read

beginner

Relevant If You're

Storing documents, images, videos, or any binary data

Needing files accessible across your team or systems

Linking files to database records (contracts to customers)

FOUNDATIONAL - Every system that handles documents, images, or uploads needs file storage.

Where This Sits

Category 0.1: Data Storage & Persistence

Layer 0

Foundation

Databases (Relational) · Databases (Document/NoSQL) · File Storage · Data Lakes

What It Is

A place for files that databases can't handle well

Databases store structured data: names, dates, numbers. But a signed PDF, a product photo, or a 50MB design file? Those fit poorly in database columns. They need somewhere else to live.

File storage is that somewhere. It holds the actual bytes of your files while your database holds metadata about them. Customer record #47 has a 'contract_url' field pointing to the PDF in storage. The database stays fast. The file stays accessible.
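To make the pointer idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the customer name, and the s3://example-bucket path are made up for illustration; only the 'contract_url' field comes from the description above.

```python
import sqlite3

# In-memory database so the sketch runs anywhere; a real system would use its own DB.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id           INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        contract_url TEXT            -- pointer to the file, not the file itself
    )
""")

# Hypothetical record: the row holds a short string, the PDF bytes live in storage.
conn.execute(
    "INSERT INTO customers (id, name, contract_url) VALUES (?, ?, ?)",
    (47, "Acme Corp", "s3://example-bucket/contracts/47/signed.pdf"),
)

contract_url = conn.execute(
    "SELECT contract_url FROM customers WHERE id = ?", (47,)
).fetchone()[0]
print(contract_url)
```

The row stays a few dozen bytes no matter how large the PDF is. Your application fetches the file from storage only when someone actually opens it.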

Every AI system that works with documents, images, or media needs file storage. It's where the raw material lives before processing turns it into something useful.

The Lego Block Principle

File storage solves a universal problem: how do you store large, unstructured blobs so they're retrievable without slowing down everything else?

The core pattern:

Store the blob separately. Keep a reference (URL, path, or ID) in your structured data. Fetch the blob only when needed. This pattern works whether you're storing a 10KB icon or a 10GB video.

Where else this applies:

CDN delivery - Assets stored once, served from edge locations worldwide.

Email attachments - Message metadata in DB, actual files in blob storage.

Version control - Git stores file content as blobs and references them by SHA hash.

Caching - Expensive computations stored as files, retrieved by key.
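The version control and caching examples share one trick: the reference is derived from the content itself. Here is a rough sketch of that idea on a local folder. It uses SHA-256 and a two-character directory prefix as assumptions; it is not Git's actual object format, and put_blob and get_blob are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def put_blob(root: Path, data: bytes) -> str:
    """Store data under its SHA-256 digest and return the digest as the reference."""
    digest = hashlib.sha256(data).hexdigest()
    path = root / digest[:2] / digest[2:]
    if not path.exists():  # identical content is written only once
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return digest


def get_blob(root: Path, digest: str) -> bytes:
    """Fetch the blob only when it is needed, using the stored reference."""
    return (root / digest[:2] / digest[2:]).read_bytes()


root = Path(tempfile.mkdtemp())
ref = put_blob(root, b"signed contract bytes")
same_ref = put_blob(root, b"signed contract bytes")  # same content, same reference
print(ref == same_ref)
```

Because the reference depends only on the bytes, uploading the same file twice costs no extra storage, and a changed file always gets a new reference.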

Interactive demo: Upload Files, Watch the Difference

Each click simulates uploading a contract, photo, or document, and the demo compares two setups. With files stored in the database as BLOBs, every file bloats the database and backups get slower. With separate file storage, the documents table holds only id, file_url, and customer_id while the files sit in an S3 bucket, so the database stays tiny and files scale independently.

How It Works

Three approaches, different trade-offs

Cloud Object Storage

S3, GCS, Azure Blob - effectively unlimited scale, pay per use

Upload files to a cloud bucket. Get a URL back. Files are replicated automatically, typically across multiple data centers. You pay for what you store and what you transfer. This is the common default for AI systems.

Pro: Scales with almost no practical limit, highly durable, no hardware to maintain

Con: Egress costs add up, vendor lock-in risk
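The upload step can be sketched like this. The client is passed in so the example runs without cloud credentials: InMemoryBucketClient is a hypothetical stand-in, and upload_file, the bucket name, and the key layout are assumptions. Its put_object keyword arguments mirror the ones boto3's S3 client accepts, so a real client could be swapped in.

```python
import mimetypes
import tempfile
import uuid
from pathlib import Path


class InMemoryBucketClient:
    """Hypothetical stand-in for a cloud client so the sketch runs offline."""

    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body, ContentType):
        self.objects[(Bucket, Key)] = (Body.read(), ContentType)


def upload_file(client, bucket: str, local_path: Path, prefix: str) -> str:
    """Upload one file and return the object key to save in the database."""
    # A random key avoids collisions and keeps object names unguessable.
    key = f"{prefix}/{uuid.uuid4().hex}{local_path.suffix}"
    content_type = mimetypes.guess_type(local_path.name)[0] or "application/octet-stream"
    with local_path.open("rb") as body:
        client.put_object(Bucket=bucket, Key=key, Body=body, ContentType=content_type)
    return key


client = InMemoryBucketClient()
local_file = Path(tempfile.mkdtemp()) / "signed.pdf"
local_file.write_bytes(b"%PDF-1.7 example bytes")

key = upload_file(client, "example-bucket", local_file, prefix="contracts/47")
file_url = f"s3://example-bucket/{key}"  # this string is what the database stores
print(file_url)
```

Storing the key (or URL) rather than the original filename means two customers can both upload "contract.pdf" without overwriting each other.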

Local/Network File System

Traditional folders on servers or NAS

Store files on your own servers or network-attached storage. You control the hardware. Good for sensitive data that can't leave your network or when you need very low latency.

Pro: Full control, no per-request costs, low latency

Con: You handle backups, scaling, and hardware failures
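Handling failures yourself starts with not leaving half-written files behind. A minimal sketch, assuming a single storage root directory: write to a temporary file, then rename it into place. save_file and the example paths are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def save_file(storage_root: Path, relative_path: str, data: bytes) -> Path:
    """Write data under storage_root so readers never see a partial file."""
    target = (storage_root / relative_path).resolve()
    # Reject paths such as "../etc/passwd" that would land outside the root.
    if storage_root.resolve() not in target.parents:
        raise ValueError("path escapes the storage root")
    target.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, target)
    except BaseException:
        os.unlink(tmp_name)  # clean up the temp file if anything failed
        raise
    return target


root = Path(tempfile.mkdtemp())
saved = save_file(root, "contracts/47/signed.pdf", b"signed contract bytes")
print(saved.read_bytes())
```

This covers only the write path. Backups, replication, and disk monitoring are still on you with this approach.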

Database BLOBs

Store files directly in database columns

Some databases let you store binary data directly. Simple for small files since everything is in one place. But your database backups balloon and queries slow down as files grow.

Pro: Simple, single system, transactional with other data

Con: Kills database performance at scale
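You can see the size difference with a small experiment. This sketch builds two SQLite databases with the same 50 rows, one with the file bytes in a BLOB column and one with only the URL. The row count, file size, and table layout are arbitrary choices for illustration.

```python
import os
import sqlite3
import tempfile


def build_db(path: str, store_blob: bool, file_count: int = 50, file_size: int = 100_000) -> int:
    """Create a documents table, insert rows, and return the database file size in bytes."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, file_url TEXT, data BLOB)"
    )
    payload = os.urandom(file_size)  # stand-in for a real file's bytes
    for i in range(file_count):
        conn.execute(
            "INSERT INTO documents (file_url, data) VALUES (?, ?)",
            (f"s3://example-bucket/doc-{i}.pdf", payload if store_blob else None),
        )
    conn.commit()
    conn.close()
    return os.path.getsize(path)


workdir = tempfile.mkdtemp()
blob_bytes = build_db(os.path.join(workdir, "with_blobs.db"), store_blob=True)
meta_bytes = build_db(os.path.join(workdir, "metadata_only.db"), store_blob=False)
print(f"with BLOBs: {blob_bytes:,} bytes, metadata only: {meta_bytes:,} bytes")
```

Everything that copies or scans the database (backups, replication, restores) now has to move those extra megabytes too.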

Connection Explorer

"Find every contract we signed with Acme Corp"

Your account manager asks this before a renewal meeting. Without organized file storage, you're searching email, Dropbox, and desktop folders. This flow returns every document in seconds, with preview links and signing dates.
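Once every file has a row linking it to a customer, that question becomes a single query. A minimal sketch with sqlite3 follows. The schema, the doc_type and signed_on columns, and all sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE documents (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    doc_type    TEXT NOT NULL,
    file_url    TEXT NOT NULL,
    signed_on   TEXT
);
INSERT INTO customers VALUES (1, 'Acme Corp'), (2, 'Globex');
INSERT INTO documents (customer_id, doc_type, file_url, signed_on) VALUES
    (1, 'contract', 's3://example-bucket/acme/msa.pdf',     '2023-03-14'),
    (1, 'contract', 's3://example-bucket/acme/renewal.pdf', '2024-03-02'),
    (1, 'photo',    's3://example-bucket/acme/logo.jpg',    NULL),
    (2, 'contract', 's3://example-bucket/globex/msa.pdf',   '2023-07-01');
""")


def contracts_for(conn, customer_name):
    """Return (file_url, signed_on) for every contract of one customer, newest first."""
    return conn.execute(
        """
        SELECT d.file_url, d.signed_on
        FROM documents d
        JOIN customers c ON c.id = d.customer_id
        WHERE c.name = ? AND d.doc_type = 'contract'
        ORDER BY d.signed_on DESC
        """,
        (customer_name,),
    ).fetchall()


acme_contracts = contracts_for(conn, "Acme Corp")
for file_url, signed_on in acme_contracts:
    print(signed_on, file_url)
```

The query touches only small metadata rows. The PDFs themselves are fetched from storage when someone clicks a preview link.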


Upstream (Requires)

Foundation layer - no upstream dependencies

Downstream (Enables)

Ingestion Patterns · OCR/Document Parsing · Embedding Generation

Common Mistakes

What breaks when file storage goes wrong

Don't store files in database columns at scale

It works fine with 100 small images. Then you have 10,000 product photos and your database backup takes 8 hours. Every query gets slower because the database is shuffling gigabytes of blob data.

Instead: Use object storage for files. Store only the URL in your database.

Don't use predictable public URLs for sensitive files

You store contracts at /files/contract-{id}.pdf. Someone guesses IDs and downloads every contract. You just leaked sensitive customer data because the URLs were public and predictable.

Instead: Use signed URLs with expiration. Or store files privately and serve through authenticated endpoints.
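To show what "signed with expiration" means, here is a sketch built on Python's hmac module. SECRET_KEY, sign_url, verify_url, and the query parameter names are assumptions for illustration. Cloud providers ship their own version of this (boto3 exposes generate_presigned_url for S3, for example), and you would normally use theirs.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; keep it out of source control


def _signature(path: str, expires: int) -> str:
    """HMAC over the path and expiry, so neither can be changed without detection."""
    return hmac.new(SECRET_KEY, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()


def sign_url(path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return the path with an expiry timestamp and signature appended."""
    current = time.time() if now is None else now
    expires = int(current + ttl_seconds)
    query = urlencode({"expires": expires, "signature": _signature(path, expires)})
    return f"{path}?{query}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Accept the URL only if it has not expired and the signature matches."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires:
        return False
    return hmac.compare_digest(signature, _signature(parsed.path, expires))


url = sign_url("/files/contract-47.pdf", ttl_seconds=300, now=1000)
print(verify_url(url, now=1200))
```

Guessing the next ID no longer works: changing contract-47 to contract-48 invalidates the signature, and a leaked link stops working once it expires.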

Don't forget to handle file deletion

Customer deletes their account. You remove the database record. But their uploaded files stay in storage forever. Storage costs climb. You might be violating GDPR.

Instead: Implement cascade deletion. When a record is deleted, queue deletion of associated files.
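One way to sketch the queue approach: in the same transaction that removes the records, copy their file URLs into a deletion queue, then let a background job drain it. The table names, delete_customer, drain_queue, and the sample rows are all hypothetical, and delete_file stands in for whatever call removes an object from your storage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE documents (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    file_url    TEXT NOT NULL
);
CREATE TABLE file_deletion_queue (file_url TEXT PRIMARY KEY);
INSERT INTO customers VALUES (47, 'Acme Corp'), (48, 'Globex');
INSERT INTO documents (customer_id, file_url) VALUES
    (47, 's3://example-bucket/47/contract.pdf'),
    (47, 's3://example-bucket/47/photo.jpg'),
    (48, 's3://example-bucket/48/contract.pdf');
""")


def delete_customer(conn, customer_id):
    """Remove a customer and queue their files for deletion in one transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute(
            "INSERT INTO file_deletion_queue (file_url) "
            "SELECT file_url FROM documents WHERE customer_id = ?",
            (customer_id,),
        )
        conn.execute("DELETE FROM documents WHERE customer_id = ?", (customer_id,))
        conn.execute("DELETE FROM customers WHERE id = ?", (customer_id,))


def drain_queue(conn, delete_file):
    """Delete each queued file, removing its queue entry only after success."""
    urls = [row[0] for row in conn.execute(
        "SELECT file_url FROM file_deletion_queue ORDER BY file_url"
    )]
    for url in urls:
        delete_file(url)  # if this raises, the entry stays queued for a retry
        with conn:
            conn.execute("DELETE FROM file_deletion_queue WHERE file_url = ?", (url,))
    return urls


delete_customer(conn, 47)
deleted = []
drain_queue(conn, deleted.append)  # deleted.append stands in for a real storage call
print(deleted)
```

Queueing instead of deleting inline means a storage outage can't block the account deletion, and nothing is forgotten: the queue is the record of what still has to go.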

What's Next

Now that you understand file storage

You've learned how files live separately from your database and why that matters. The natural next step is understanding how to get content out of those files.

Recommended Next

OCR/Document Parsing

How to extract text and structure from PDFs, images, and documents
