Design a Distributed File Storage System — Complete System Design Walkthrough
File storage systems like Dropbox, Google Drive, and OneDrive are among the most popular system design interview questions. They test your ability to reason about large binary data, synchronization across devices, conflict resolution, and storage efficiency at massive scale. In this walkthrough, we will design a distributed file storage system from scratch, following a structured 6-stage approach that mirrors how top candidates tackle these problems in real interviews.
Stage 1: Requirements Gathering
Start by clarifying scope. File storage systems can range from simple blob stores to full collaboration platforms. Spending 3-5 minutes on requirements shows maturity and prevents wasted effort on features that are out of scope.
Functional Requirements
- File upload and download — Users can upload files from any device and download them later. Support files up to 10 GB.
- Automatic sync — Changes made on one device are automatically propagated to all other devices linked to the same account.
- File sharing — Users can share files or folders with other users via link or direct permissions (view, edit).
- File versioning — The system maintains a history of file versions. Users can view and restore previous versions.
- Offline support — The desktop/mobile sync client works offline and reconciles changes when connectivity is restored.
Non-Functional Requirements
- Scale: 500 million registered users, 100 million daily active users.
- Storage: Average 200 files per user, 10 GB average storage per user. Total: ~5 exabytes of logical storage.
- Upload throughput: Assume 10% of DAU upload at least one file per day. At an average file size of 2 MB, that is 10M uploads/day, or ~115 uploads/second average (~350/s peak).
- Download throughput: Reads dominate writes. Assume 5x read-to-write ratio: ~1,750 downloads/second at peak.
- Availability: 99.9% — users tolerate brief sync delays but not data loss.
- Durability: 99.999999999% (11 nines) — file loss is unacceptable.
- Consistency: Eventual consistency for sync across devices is acceptable, but a single device must see its own writes immediately (read-your-writes consistency).
Interview tip: Explicitly computing upload/download QPS from DAU shows the interviewer you can translate business requirements into engineering constraints. This is one of the strongest signals in a system design interview.
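For reference, here is the same back-of-envelope arithmetic as a runnable snippet (the 10% uploader fraction, ~3x peak factor, and 5x read-to-write ratio are the assumptions stated above):

```python
DAU = 100_000_000          # daily active users
UPLOADER_FRACTION = 0.10   # 10% of DAU upload at least one file per day
PEAK_FACTOR = 3            # assumed peak-to-average ratio
READ_WRITE_RATIO = 5       # downloads per upload
SECONDS_PER_DAY = 86_400

uploads_per_day = DAU * UPLOADER_FRACTION               # 10M uploads/day
avg_upload_qps = uploads_per_day / SECONDS_PER_DAY      # ~115 uploads/s
peak_upload_qps = avg_upload_qps * PEAK_FACTOR          # ~350 uploads/s
peak_download_qps = peak_upload_qps * READ_WRITE_RATIO  # ~1,750 downloads/s
```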
Stage 2: API Design
The API separates metadata operations (lightweight JSON) from data operations (large binary uploads/downloads). This separation is fundamental because metadata requests hit a database while data requests go to blob storage.
Core Endpoints
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /v1/files/upload/init | Initialize a chunked upload. Returns uploadId and pre-signed URLs for each chunk. |
| PUT | /v1/files/upload/{uploadId}/chunks/{index} | Upload a single chunk (4 MB default). Client uploads directly to blob storage via pre-signed URL. |
| POST | /v1/files/upload/{uploadId}/complete | Finalize upload. Server assembles metadata, verifies chunk checksums, creates file record. |
| GET | /v1/files/{fileId}/download | Returns a pre-signed CDN URL for the file (or individual chunks for resumable download). |
| GET | /v1/files/{fileId}/versions | List all versions of a file with timestamps, sizes, and who made the change. |
| POST | /v1/files/{fileId}/versions/{versionId}/restore | Restore a previous version (creates a new version pointing to old chunks). |
| GET | /v1/sync?cursor={cursor} | Long-poll or SSE endpoint. Returns all changes since the cursor (new, modified, deleted files). |
| POST | /v1/shares | Share a file/folder. Body: { fileId, targetUserId, permission: "view" | "edit" } |
Key Design Decisions
Pre-signed URLs for data transfer. Clients upload chunks directly to blob storage (S3) using pre-signed URLs generated by the API server. This offloads bandwidth from our servers entirely — the API server only handles lightweight metadata requests. Downloads similarly go through a CDN backed by blob storage.
Chunked uploads. Every file, regardless of size, is split into fixed-size chunks (4 MB). This enables resumable uploads (only re-upload failed chunks), parallel uploads (multiple chunks simultaneously), and deduplication (identical chunks across files are stored once).
Design decision: Why 4 MB chunks? Smaller chunks increase metadata overhead and HTTP request count. Larger chunks reduce deduplication effectiveness and make retries more expensive. 4 MB is a widely used sweet spot (Dropbox uses 4 MB, Google Drive uses similar sizes internally).
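As a concrete illustration, here is a minimal client-side sketch of fixed-size chunking with per-chunk SHA-256 hashes (function and constant names are illustrative; Deep Dive 2 later refines this into content-defined chunking):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB fixed-size chunks

def chunk_and_hash(path: str) -> list[tuple[str, bytes]]:
    """Split a file into 4 MB chunks and return (sha256_hex, chunk_bytes) pairs."""
    chunks = []
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunks.append((hashlib.sha256(data).hexdigest(), data))
    return chunks

# The client sends only the ordered hash list to /v1/files/upload/init;
# the chunk bytes are uploaded later via pre-signed URLs.
```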
Stage 3: Data Model
The data model has two distinct layers: a metadata database for file hierarchy, ownership, and versioning, and content-addressable blob storage for the actual file data.
Metadata Database (PostgreSQL)
Files Table
| Column | Type | Description |
|---|---|---|
| file_id | UUID (PK) | Unique identifier |
| owner_id | UUID (FK) | User who owns the file |
| parent_folder_id | UUID (FK, nullable) | Parent folder (null for root) |
| name | varchar(255) | File or folder name |
| is_folder | boolean | Whether this entry is a folder |
| latest_version_id | UUID (FK) | Points to current version |
| size_bytes | bigint | Current file size |
| status | enum | active, deleted, uploading |
| created_at | timestamp | |
| updated_at | timestamp | Used for sync cursor |
File Versions Table
| Column | Type | Description |
|---|---|---|
| version_id | UUID (PK) | Unique version identifier |
| file_id | UUID (FK) | Parent file |
| version_number | integer | Monotonically increasing per file |
| chunk_hashes | text[] | Ordered array of chunk SHA-256 hashes |
| size_bytes | bigint | Total size of this version |
| modified_by | UUID (FK) | User who created this version |
| created_at | timestamp | |
Chunks Table
| Column | Type | Description |
|---|---|---|
| chunk_hash | char(64) (PK) | SHA-256 hash of chunk content |
| size_bytes | integer | Chunk size |
| blob_key | varchar(255) | Key in blob storage (S3) |
| reference_count | integer | Number of file versions referencing this chunk |
| created_at | timestamp | |
Content-Addressable Storage for Deduplication
This is one of the most important design decisions in a file storage system. Instead of storing files by name or ID, we store chunks by their content hash (SHA-256). Two identical chunks — whether from the same user editing a file or from two different users uploading the same document — map to the same hash and are stored only once.
The deduplication flow works as follows:
- Client computes SHA-256 of each chunk before uploading.
- Client sends the list of chunk hashes to `/v1/files/upload/init`.
- Server checks which hashes already exist in the Chunks table.
- Server returns pre-signed URLs only for chunks that do not yet exist.
- Client skips uploading duplicate chunks entirely.
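Below is a minimal sketch of how the `/v1/files/upload/init` handler could implement this flow; the `db` and `presign_upload_url` helpers are hypothetical stand-ins for the metadata store and the S3 client.

```python
def init_upload(file_meta: dict, chunk_hashes: list[str], db, presign_upload_url) -> dict:
    """Return pre-signed upload URLs only for chunks not already stored."""
    # Batch lookup: which of the declared hashes already exist in the Chunks table?
    existing = db.existing_chunk_hashes(chunk_hashes)      # returns a set of hashes
    missing = [h for h in chunk_hashes if h not in existing]

    upload_id = db.create_file_record(file_meta, chunk_hashes, status="uploading")
    return {
        "uploadId": upload_id,
        # Client uploads only the missing chunks; duplicates are skipped entirely.
        "uploadUrls": {h: presign_upload_url(h) for h in missing},
    }
```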
At enterprise scale, deduplication typically saves 30-50% of raw storage. With 5 exabytes of logical data, this translates to saving 1.5-2.5 exabytes of physical storage — a massive cost reduction.
The reference_count field enables garbage collection: when a file version is deleted, we decrement the reference count for each of its chunks. Chunks with zero references are eligible for deletion (typically run as a background job with a grace period to handle race conditions).
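A sketch of the reference-count bookkeeping and the background garbage-collection job, assuming a PostgreSQL driver such as psycopg2; the `zero_since` timestamp column used to enforce the grace period is an assumption, not part of the schema above.

```python
import datetime

GRACE_PERIOD = datetime.timedelta(days=7)  # assumed grace period before physical deletion

def release_chunks(conn, chunk_hashes: list[str]) -> None:
    """Decrement reference counts for a deleted version and record when a chunk hits zero."""
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE chunks "
            "SET reference_count = reference_count - 1, "
            "    zero_since = CASE WHEN reference_count - 1 = 0 THEN now() ELSE zero_since END "
            "WHERE chunk_hash = ANY(%s)",
            (chunk_hashes,),
        )
    conn.commit()

def collect_garbage(conn, blob_store) -> None:
    """Background job: delete blobs whose chunks have been unreferenced past the grace period."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT chunk_hash, blob_key FROM chunks "
            "WHERE reference_count = 0 AND zero_since < now() - %s",
            (GRACE_PERIOD,),
        )
        for chunk_hash, blob_key in cur.fetchall():
            blob_store.delete(blob_key)  # hypothetical wrapper around S3 DeleteObject
            cur.execute("DELETE FROM chunks WHERE chunk_hash = %s", (chunk_hash,))
    conn.commit()
```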
Stage 4: High-Level Architecture
Let us trace the three core flows: upload, sync, and download.
Chunked Upload Flow
- The Client Sync Agent detects a file change (via filesystem watcher on desktop, or explicit upload on web).
- Client splits the file into 4 MB chunks and computes SHA-256 for each chunk.
- Client calls `POST /v1/files/upload/init` with the file metadata and chunk hash list.
- The Metadata Service checks the Chunks table for existing hashes. For new chunks, it generates pre-signed S3 upload URLs. It creates a file record with status `uploading`.
- Client uploads new chunks in parallel (up to 4 concurrent uploads) directly to Blob Storage using the pre-signed URLs.
- Client calls `POST /v1/files/upload/complete`. The Metadata Service verifies all chunks are present via S3 HEAD requests, creates a new file version record, updates the file status to `active`, and increments reference counts for all chunks.
- The Metadata Service publishes a `file_changed` event to the Notification Service via the message queue.
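To make the parallel chunk-upload step concrete, here is a minimal client-side sketch that PUTs the missing chunks to their pre-signed URLs (uses the `requests` library; the retry count is an arbitrary assumption):

```python
from concurrent.futures import ThreadPoolExecutor
import requests

def upload_chunk(url: str, data: bytes, retries: int = 3) -> None:
    """PUT one chunk to its pre-signed URL, retrying on transient failures."""
    for _ in range(retries):
        try:
            resp = requests.put(url, data=data, timeout=60)
            if resp.ok:
                return
        except requests.RequestException:
            pass  # network hiccup: fall through and retry
    raise RuntimeError(f"chunk upload failed after {retries} attempts")

def upload_missing_chunks(upload_urls: dict[str, str], chunks: dict[str, bytes]) -> None:
    """Upload only the chunks the server asked for, up to 4 at a time."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(upload_chunk, url, chunks[chunk_hash])
            for chunk_hash, url in upload_urls.items()
        ]
        for f in futures:
            f.result()  # propagate any failure so the client can resume later
```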
Sync Flow
- The Client Sync Agent maintains a persistent connection to the Notification Service via long-polling or Server-Sent Events (SSE).
- When the Notification Service receives a `file_changed` event from the queue, it checks which users and devices need to be notified (based on file ownership and sharing permissions).
- The Notification Service pushes the change event to the appropriate clients.
- Each client calls `GET /v1/sync?cursor={cursor}` to fetch the full list of changes since its last sync point. The cursor is a timestamp-based token.
- For each changed file, the client downloads only the chunks it does not already have locally (delta sync).
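Putting these steps together, a simplified client sync loop might look like the sketch below; the response field names (`changes`, `nextCursor`) and the `wait_for_change_event` / `apply_change` callbacks are illustrative assumptions, not a fixed protocol.

```python
import requests

API_BASE = "https://api.example.com"  # placeholder base URL

def sync_loop(session: requests.Session, cursor: str,
              wait_for_change_event, apply_change) -> None:
    """Run forever: wait for a push notification, then pull changes since the cursor."""
    while True:
        # Block on the long-poll / SSE channel until the server signals a change,
        # with a timeout so we also catch notifications missed while disconnected.
        wait_for_change_event(timeout=30)

        resp = session.get(f"{API_BASE}/v1/sync", params={"cursor": cursor})
        resp.raise_for_status()
        payload = resp.json()

        for change in payload["changes"]:
            apply_change(change)          # download missing chunks, update local metadata

        cursor = payload["nextCursor"]    # persist this so a restarted client resumes cleanly
```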
Download Flow
- Client requests `GET /v1/files/{fileId}/download`.
- The Metadata Service retrieves the latest version's chunk list and generates pre-signed CDN URLs for each chunk.
- Client downloads chunks in parallel from the CDN.
- Client reassembles the file locally from its chunks.
For small files (under one chunk), the flow simplifies to a single download. For large files, chunked parallel downloads significantly improve throughput. The CDN caches popular files at edge locations, reducing latency for frequently accessed shared files.
Stage 5: Deep Dive
We will go deep on two critical subsystems: sync conflict resolution and chunked upload with deduplication.
Deep Dive 1: Sync Conflict Resolution
Conflicts occur when two devices modify the same file while one or both are offline, or when edits happen concurrently before sync completes. This is one of the hardest problems in file storage systems.
Detection
Each file has a latest_version_id. When a client uploads a new version, it includes the expected_version_id (the version it based its edit on). The server performs an optimistic concurrency check:
- If `expected_version_id == latest_version_id`: no conflict. Accept the upload and increment the version.
- If `expected_version_id != latest_version_id`: conflict detected. Another device uploaded a change first.
This is essentially optimistic locking on the file version, similar to an HTTP If-Match header with ETags.
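Expressed against the Files table above, the check can be written as a single conditional UPDATE so the compare-and-set is atomic on the primary shard (a sketch; driver details and helper names are illustrative):

```python
def commit_new_version(conn, file_id, expected_version_id, new_version_id) -> bool:
    """Atomically advance latest_version_id only if it still matches what the
    client based its edit on. Returns False when a conflict is detected."""
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE files SET latest_version_id = %s, updated_at = now() "
            "WHERE file_id = %s AND latest_version_id = %s",
            (new_version_id, file_id, expected_version_id),
        )
        accepted = cur.rowcount == 1  # 0 rows updated means another device won the race
    conn.commit()
    return accepted
```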
Resolution Strategy
There are three common strategies:
- Last-writer-wins (LWW): Simple but causes data loss. Not acceptable for user files.
- Automatic merge: Possible for some file types (text files via operational transforms or CRDTs), but impractical for binary files like images or PDFs.
- Conflict copy (our approach): When a conflict is detected, the server accepts the new upload as a separate "conflict copy" with a name like `Budget (conflicted copy - Alice's MacBook - 2026-03-12).xlsx`. Both versions are preserved. The user resolves the conflict manually.
This is the same strategy used by Dropbox and is the safest general-purpose approach. The user never loses data, and the resolution is explicit rather than implicit.
Implementation Details
- Server detects conflict via version mismatch.
- Server creates a new file entry (not a new version of the same file) with the conflict naming convention.
- Server notifies both devices: the original device sees the remote change; the conflicting device sees its file renamed.
- The Notification Service sends a special `conflict` event type so the client can display a user-facing notification.
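A small sketch of the conflict-copy naming step; the helper name is illustrative, and the format simply mirrors the example above:

```python
import datetime
import os

def conflict_copy_name(original_name: str, device_name: str,
                       when: datetime.date | None = None) -> str:
    """Build a conflicted-copy filename, preserving the original extension."""
    when = when or datetime.date.today()
    stem, ext = os.path.splitext(original_name)
    return f"{stem} (conflicted copy - {device_name} - {when.isoformat()}){ext}"

# conflict_copy_name("Budget.xlsx", "Alice's MacBook", datetime.date(2026, 3, 12))
# -> "Budget (conflicted copy - Alice's MacBook - 2026-03-12).xlsx"
```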
Reducing Conflicts
Most conflicts can be avoided by keeping the sync interval tight. Our sync client uses a combination of filesystem watchers for immediate detection and a 30-second polling fallback. The faster we sync, the smaller the window for concurrent edits. Additionally, we can implement advisory locks: when a user opens a file for editing, the client requests a soft lock. Other clients see a "currently being edited by Alice" indicator, discouraging concurrent edits without preventing them.
Deep Dive 2: Chunked Upload and Deduplication
Chunked upload and deduplication are deeply intertwined. Let us walk through the end-to-end implementation in detail.
Client-Side Chunking
The sync client uses a rolling hash (Rabin fingerprint) for content-defined chunking rather than fixed-size splits. With fixed 4 MB boundaries, inserting a single byte at the beginning of a file shifts every chunk boundary, invalidating all hashes. Content-defined chunking uses the data itself to determine boundaries:
- Slide a window over the file computing a rolling hash.
- When the hash modulo a target value equals zero, mark a chunk boundary.
- This produces variable-size chunks (average 4 MB, min 1 MB, max 8 MB).
The benefit: inserting or modifying data in the middle of a file only affects 1-2 chunks. All other chunks retain their original hashes, so they do not need to be re-uploaded. For a 1 GB file where the user edits a paragraph in the middle, we might re-upload only 4-8 MB instead of the entire file.
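For illustration, here is a simplified content-defined chunker using a windowed polynomial rolling hash (a stand-in for a true Rabin fingerprint; the window size, hash parameters, and boundary mask are assumptions chosen to target a ~4 MB average, and byte-at-a-time Python is for clarity rather than performance):

```python
import hashlib

WINDOW = 48                     # sliding-window size in bytes
BASE, MOD = 257, (1 << 61) - 1  # polynomial rolling-hash parameters
BASE_POW = pow(BASE, WINDOW, MOD)
MIN_SIZE = 1 * 1024 * 1024      # 1 MB minimum chunk
MAX_SIZE = 8 * 1024 * 1024      # 8 MB maximum chunk
AVG_MASK = (1 << 22) - 1        # boundary when the low 22 bits are zero (~4 MB average)

def content_defined_chunks(data: bytes) -> list[tuple[str, bytes]]:
    """Split data at content-defined boundaries, returning (sha256_hex, chunk) pairs."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        # Slide the window: add the incoming byte, drop the byte leaving the window.
        rolling = (rolling * BASE + byte) % MOD
        if i >= WINDOW:
            rolling = (rolling - data[i - WINDOW] * BASE_POW) % MOD
        size = i - start + 1
        if (size >= MIN_SIZE and (rolling & AVG_MASK) == 0) or size >= MAX_SIZE:
            chunk = data[start:i + 1]
            chunks.append((hashlib.sha256(chunk).hexdigest(), chunk))
            start = i + 1
    if start < len(data):  # trailing bytes form the final chunk
        chunk = data[start:]
        chunks.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return chunks
```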
Server-Side Deduplication Check
When the client sends its chunk hash list to /v1/files/upload/init, the server performs a batch lookup against the Chunks table:
```sql
SELECT chunk_hash FROM chunks WHERE chunk_hash IN ($1, $2, ..., $n)
```

Any hash found in the database represents a chunk that already exists in blob storage. The server returns pre-signed URLs only for missing chunks. This single query can eliminate a large percentage of uploads, especially for:
- File versioning: Editing a document typically changes only a few chunks.
- Cross-user dedup: Common files (OS images, popular libraries, shared templates) are stored once.
- Backup scenarios: Re-syncing a device reuses all existing chunks.
Integrity Verification
After all chunks are uploaded, the complete endpoint verifies integrity:
- For each chunk, compute SHA-256 from the S3 object and compare with the declared hash.
- Verify total reassembled size matches the declared file size.
- If verification fails, mark the upload as failed and return an error. The client retries only the corrupted chunks.
This end-to-end integrity check ensures that network corruption, S3 write failures, or client bugs cannot result in silently corrupted files.
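A sketch of that verification using boto3; re-reading each chunk from S3 to hash it is shown for clarity, though in practice the checksum can be captured at upload time (the bucket name is a placeholder, and the blob key is assumed to be the content hash):

```python
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "file-chunks"  # placeholder bucket name

def verify_upload(chunk_hashes: list[str], declared_size: int) -> bool:
    """Confirm every declared chunk exists, matches its hash, and sizes add up."""
    total = 0
    for declared_hash in chunk_hashes:
        obj = s3.get_object(Bucket=BUCKET, Key=declared_hash)  # blob key = content hash
        body = obj["Body"].read()
        if hashlib.sha256(body).hexdigest() != declared_hash:
            return False                                       # corrupted or missing chunk
        total += len(body)
    return total == declared_size
```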
Stage 6: Scaling and Trade-Offs
Storage Cost Optimization
At 5 exabytes of logical storage, cost is a first-order concern:
- Deduplication saves 30-50%, bringing physical storage to ~3 EB.
- Tiered storage: Files not accessed in 30 days move to S3 Infrequent Access (40% cheaper). Files not accessed in 90 days move to S3 Glacier (80% cheaper). With typical access patterns, 70-80% of files are cold, reducing effective storage cost dramatically.
- Compression: Compressible file types (text, code, documents) are compressed at the chunk level before storage. Binary files (images, videos) that do not compress well are stored as-is. Average 15-20% additional savings on compressible content.
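The tiering policy maps naturally onto an S3 lifecycle configuration, sketched below with boto3 (bucket name and rule ID are placeholders). Note that standard lifecycle rules transition objects by age since creation; tiering by last access would instead use S3 Intelligent-Tiering or an application-level job.

```python
import boto3

s3 = boto3.client("s3")

# Move chunks to cheaper storage classes as they age.
s3.put_bucket_lifecycle_configuration(
    Bucket="file-chunks",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-cold-chunks",
                "Filter": {"Prefix": ""},  # apply to every object
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access tier
                    {"Days": 90, "StorageClass": "GLACIER"},      # archive tier
                ],
            }
        ]
    },
)
```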
Metadata Database Scaling
With 100 billion files (500M users x 200 files), the metadata DB is under significant pressure. Scaling strategies:
- Sharding by user_id: Each user's file tree lives on a single shard, keeping folder listing queries local. We use consistent hashing with virtual nodes across 256 PostgreSQL shards; a routing sketch follows this list.
- Read replicas: Sync polling queries (`GET /v1/sync`) hit read replicas. Since we accept eventual consistency for cross-device sync, replica lag of a few seconds is acceptable.
- Caching: Redis caches the file tree for active users. Cache invalidation is straightforward because we control all write paths: every metadata write invalidates the relevant cache entries.
- Separate the Chunks table: The Chunks table has a different access pattern (hash lookups, reference count updates) than the Files table (tree traversals, sync queries). It can live on a separate set of shards optimized for point lookups.
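Here is a minimal sketch of the user-to-shard routing using consistent hashing with virtual nodes (constants and naming are illustrative; production systems usually also persist the shard map so it can be rebalanced):

```python
import bisect
import hashlib

NUM_SHARDS = 256
VNODES_PER_SHARD = 100  # virtual nodes smooth out the key distribution

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

# Build the ring once: each shard is placed at many positions.
RING = sorted(
    (_hash(f"shard-{shard}-vnode-{v}"), shard)
    for shard in range(NUM_SHARDS)
    for v in range(VNODES_PER_SHARD)
)
RING_KEYS = [pos for pos, _ in RING]

def shard_for_user(user_id: str) -> int:
    """Route all of a user's metadata to one shard so folder listings stay local."""
    idx = bisect.bisect_right(RING_KEYS, _hash(user_id)) % len(RING)
    return RING[idx][1]
```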
Consistency vs. Availability for Sync
This is the central trade-off in a distributed file storage system:
| Aspect | Our Choice | Rationale |
|---|---|---|
| Single-device writes | Strong consistency | A user must see their own upload immediately. Write to primary, read from primary for the uploading device. |
| Cross-device sync | Eventual consistency | A few seconds of delay before a file appears on another device is acceptable. This lets us use read replicas and async notification. |
| Sharing | Eventual consistency | When Alice shares a file with Bob, a 5-10 second delay before Bob sees it is fine. |
| Conflict detection | Strong consistency | The version check on upload must be linearizable to prevent silent data loss. This is a single-row operation on the primary shard. |
Availability During Partitions
If the metadata DB is unreachable, the system degrades gracefully:
- Downloads continue: The CDN serves cached files. Clients with locally cached metadata can still download chunks directly from S3.
- Uploads queue locally: The sync client queues changes locally and retries when connectivity is restored. This is identical to the offline support flow.
- Sync pauses: Clients continue working with their last-known state and reconcile when the metadata DB recovers.
Trade-Off Summary
| Decision | Choice | Trade-Off |
|---|---|---|
| Chunk size | Content-defined, ~4 MB avg | Better dedup and delta sync, but more complex client logic and higher metadata overhead than fixed-size chunks. |
| Dedup scope | Global (cross-user) | Maximum storage savings, but requires a global chunk index and careful garbage collection with reference counting. |
| Conflict resolution | Conflict copy | Zero data loss, but creates user-facing noise. Better than LWW (data loss) or merge (complexity, binary-incompatible). |
| Metadata store | Sharded PostgreSQL | Strong consistency for writes, familiar operational model. Horizontal scaling requires application-level sharding logic. |
| Blob storage | S3 with tiered lifecycle | 11 nines durability, near-infinite scale, low cost for cold data. Retrieval latency for Glacier-tier files is minutes, not milliseconds. |
| Sync mechanism | Long-poll + push notification | Near-real-time sync without persistent WebSocket overhead. Slightly higher latency than WebSocket but simpler to scale. |
Scoring Tips
To score well on a file storage system design question, focus on these areas:
- Nail the chunking rationale. Interviewers expect you to explain why files are chunked (resumability, parallelism, deduplication, delta sync). Mentioning content-defined chunking over fixed-size demonstrates senior-level knowledge.
- Show the dedup math. Calculate storage savings. If you estimate 30-50% dedup ratio on 5 EB of data, state the dollar impact. This connects engineering decisions to business value.
- Do not hand-wave sync conflicts. This is where most candidates fall short. Describe the version-based conflict detection mechanism, explain why last-writer-wins is insufficient, and propose a concrete resolution strategy.
- Separate metadata from data paths. Making this architectural split early shows you understand that a 4 MB chunk upload and a file listing query have fundamentally different performance characteristics and should not compete for the same resources.
- Discuss storage tiers proactively. At exabyte scale, hot/warm/cold tiering is not optional — it is existential for the business. Mentioning lifecycle policies and their cost impact signals operational maturity.
- Address failure modes. What happens if S3 is temporarily unavailable? (Client retries, local queue.) What if a chunk upload partially fails? (Resume from the last successful chunk.) What if the metadata DB loses a shard? (Replica promotion, reconciliation.) Production thinking wins interviews.
Practice delivering this architecture in 35 minutes, spending roughly 5 minutes per stage. If you want to simulate the experience with real-time feedback on your pacing, structure, and technical depth, Hoppers AI offers mock system design interviews that follow this exact 6-stage framework.