Design a Video Streaming Platform — Complete System Design Walkthrough
Video streaming is one of the most demanding system design problems you will encounter in interviews. It combines massive storage requirements, compute-intensive processing pipelines, real-time delivery under strict latency constraints, and a global CDN strategy. In this walkthrough, we will design a platform similar to YouTube or Netflix from scratch, covering each stage of a structured system design interview.
Stage 1: Requirements Gathering
Start by defining scope with your interviewer. Video streaming platforms have enormous surface area, so narrowing early is essential. Spend 3-5 minutes here.
Functional Requirements
- Video upload — Creators upload videos of varying sizes (up to 10 GB). The system accepts the raw file, validates it, and prepares it for streaming.
- Video transcoding — Convert uploaded videos into multiple resolutions (240p, 360p, 480p, 720p, 1080p, 4K) and codecs (H.264, H.265/HEVC, VP9, AV1) for broad device compatibility.
- Adaptive bitrate streaming — Deliver video using HLS or DASH so the player can dynamically switch quality based on the viewer's network conditions.
- Video search — Users can search for videos by title, description, tags, and creator name.
- Recommendations — Personalized video feed based on watch history, likes, and trending content.
- Video playback — Low-latency start (under 2 seconds), seek support, and smooth playback across devices (web, mobile, smart TV).
Non-Functional Requirements
- Scale: 2 billion total users, 500 million DAU, 500 hours of video uploaded per minute, 1 billion video views per day.
- Latency: Video playback start under 2 seconds. Upload acknowledgment (not processing) within seconds.
- Availability: 99.99% for playback. Upload pipeline can tolerate slightly lower availability (99.9%) with retry semantics.
- Durability: Zero data loss on uploaded videos. Once a creator uploads a video, it must never be lost.
- Storage: At 500 hours/minute with an average of 1 GB per hour of raw video, that is 500 GB of raw uploads per minute — roughly 720 TB per day. After transcoding to multiple resolutions, storage multiplies by 5-8x.
Interview tip: Convert your scale numbers into actionable metrics early. 1 billion views/day equals roughly 11,500 video starts per second on average, with peaks of 3-5x during prime time. These numbers directly inform your CDN capacity and origin server sizing.
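These conversions are worth being fluent with; a quick back-of-envelope sketch using the numbers above (the 4x peak multiplier is an assumption within the 3-5x range mentioned):

```python
# Back-of-envelope capacity math from the stated requirements.
VIEWS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400

avg_starts_per_sec = VIEWS_PER_DAY / SECONDS_PER_DAY    # ~11,574 starts/sec
peak_starts_per_sec = avg_starts_per_sec * 4            # assumed 4x prime-time peak

# Raw upload volume: 500 hours of video per minute, ~1 GB per hour of video.
upload_gb_per_min = 500 * 1
raw_upload_tb_per_day = upload_gb_per_min * 60 * 24 / 1000   # 720 TB/day

# Transcoded variants multiply storage by roughly 5-8x.
total_tb_low = raw_upload_tb_per_day * 5
total_tb_high = raw_upload_tb_per_day * 8

print(round(avg_starts_per_sec), round(peak_starts_per_sec))  # 11574 46296
print(raw_upload_tb_per_day, total_tb_low, total_tb_high)     # 720.0 3600.0 5760.0
```

Numbers like these directly size the CDN fleet (peak starts/sec) and the storage budget (TB/day).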
Stage 2: API Design
A video streaming platform exposes REST APIs for upload and metadata management, and uses streaming protocols (HLS/DASH) for video delivery.
Upload API
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/videos/upload/init | Initialize multipart upload. Returns uploadId and pre-signed URLs for each chunk. |
| PUT | /v1/videos/upload/{uploadId}/parts/{partNumber} | Upload a single chunk (pre-signed URL to object storage). |
| POST | /v1/videos/upload/{uploadId}/complete | Finalize upload. Triggers transcoding pipeline. Body: { title, description, tags[], categoryId, visibility } |
Multipart upload is essential for large files. The client uploads directly to object storage using pre-signed URLs, bypassing our application servers entirely. This keeps upload bandwidth off our compute layer.
Video Metadata API
| Method | Endpoint | Description |
|---|---|---|
| GET | /v1/videos/{videoId} | Get video metadata (title, description, view count, manifest URL, thumbnails). |
| GET | /v1/videos/{videoId}/stream | Returns the HLS/DASH manifest URL (redirects to CDN). Player uses this to begin adaptive playback. |
| GET | /v1/search?q={query}&cursor=&limit=20 | Search videos by text query. Returns paginated results with cursor. |
| GET | /v1/feed?cursor=&limit=20 | Personalized recommendation feed for authenticated user. |
| POST | /v1/videos/{videoId}/views | Record a view event. Fire-and-forget from the client. |
Design decision: Why pre-signed URLs for upload? Uploading multi-gigabyte files through our API servers would consume enormous bandwidth and create a bottleneck. Pre-signed URLs let clients upload directly to S3 (or GCS), and we only handle the lightweight metadata requests. This pattern scales independently of upload volume.
Stage 3: Data Model
A video streaming platform demands polyglot persistence — no single database handles all access patterns optimally.
Video Metadata (PostgreSQL / Vitess)
| Column | Type | Notes |
|---|---|---|
| video_id | UUID (PK) | Globally unique identifier |
| creator_id | UUID (FK) | References users table |
| title | varchar(500) | Searchable |
| description | text | Searchable |
| status | enum | uploading, processing, ready, failed, removed |
| visibility | enum | public, unlisted, private |
| duration_seconds | int | Set after transcoding completes |
| manifest_path | varchar | S3 path to HLS master playlist |
| thumbnail_urls | jsonb | Auto-generated + custom thumbnails |
| tags | text[] | Used for search and recommendations |
| created_at | timestamp | |
| updated_at | timestamp | |
User Data (PostgreSQL)
| Column | Type | Notes |
|---|---|---|
| user_id | UUID (PK) | |
| username | varchar (unique) | |
| email | varchar (unique) | |
| subscriber_count | bigint | Denormalized counter |
| created_at | timestamp | |
View Counts (Redis + Cassandra)
View counting is a special problem at this scale. We use a two-tier approach:
- Redis — Real-time counter. Each view event increments a counter via `INCR views:{video_id}`. The value displayed to users reads from Redis.
- Cassandra — Durable event log. Every view event is written to a Cassandra table partitioned by `video_id` and bucketed by date. This feeds analytics, monetization, and reconciliation jobs that periodically sync Redis counters.
Comments (Cassandra)
| Column | Type | Role |
|---|---|---|
| video_id | UUID | Partition key |
| comment_id | TimeUUID | Clustering key (DESC) |
| user_id | UUID | |
| content | text | |
| parent_comment_id | UUID (nullable) | For threaded replies |
| created_at | timestamp | |
Search Index (Elasticsearch)
Video metadata (title, description, tags, creator name) is indexed in Elasticsearch. A CDC pipeline (Debezium or application-level dual writes) keeps the search index in sync with the primary PostgreSQL store. Elasticsearch handles full-text search, fuzzy matching, and relevance scoring.
Storage Choice Rationale
- PostgreSQL for metadata: Relational integrity for users and video metadata. At hundreds of millions of rows (not billions), sharded PostgreSQL (Vitess) handles the load. Strong consistency for ownership and visibility controls.
- Cassandra for views and comments: Append-heavy, partition-friendly access patterns. Views are partitioned by video_id with date bucketing. Comments are partitioned by video_id for co-located reads.
- Redis for real-time counters: Sub-millisecond reads for view counts displayed on every page load. Eventual consistency with Cassandra is acceptable — a count being off by a few hundred on a video with millions of views is invisible to users.
- Elasticsearch for search: Full-text search with relevance ranking, autocomplete, and typo tolerance are all native capabilities.
- Object storage (S3/GCS) for video files: The actual video segments, manifests, and thumbnails. Virtually unlimited capacity with 11 nines of durability.
Stage 4: High-Level Architecture
The architecture splits cleanly into two major flows: upload and processing (write path) and playback (read path).
Upload Pipeline
- Client initiates upload by calling `/v1/videos/upload/init`. The Upload Service creates a video record with status `uploading` in PostgreSQL and returns pre-signed URLs for multipart upload directly to S3.
- Client uploads chunks directly to object storage using the pre-signed URLs. Each chunk is typically 5-10 MB. The client can upload multiple chunks in parallel.
- Client finalizes by calling `/v1/videos/upload/complete`. The Upload Service verifies all parts are present, assembles the object in S3, updates the video status to `processing`, and publishes a message to the Transcoding Queue (SQS or Kafka).
- Transcoding Pipeline (described in detail in Stage 5) consumes the message, transcodes the video into multiple resolutions, generates HLS manifests and thumbnails, and writes all output to S3.
- On completion, the pipeline updates the video record to status `ready` with the manifest path. The video is now playable.
Playback Flow
- Client requests video metadata via `/v1/videos/{videoId}`. The API returns the manifest URL pointing to the CDN.
- Client fetches the HLS master manifest from the CDN edge. This manifest lists available quality levels (resolutions and bitrates).
- The video player selects an initial quality based on estimated bandwidth and requests the corresponding media playlist (a list of 2-10 second segment URLs).
- The player downloads segments sequentially from the CDN. If bandwidth changes, the player switches to a different quality level seamlessly — this is adaptive bitrate streaming.
- If a segment is not cached at the CDN edge, the CDN fetches it from the origin (S3) and caches it for subsequent requests.
Transcoding DAG
Transcoding is not a single operation but a directed acyclic graph of tasks:
- Probe — Inspect the input file (codec, resolution, duration, audio channels).
- Split — Divide the video into segments (typically 2-10 seconds each) for parallel processing.
- Transcode (video) — For each segment, encode into each target resolution and codec. This is the most compute-intensive step and runs in parallel across segments and resolutions.
- Transcode (audio) — Extract and encode audio tracks (AAC, Opus) at multiple bitrates.
- Generate thumbnails — Extract representative frames at regular intervals for the timeline scrubber and poster images.
- Package — Assemble HLS/DASH manifests that reference the transcoded segments. Write the master manifest linking all quality levels.
- Validate — Run quality checks (duration matches, no corrupted segments, manifest parseable).
- Publish — Update video status to `ready` and push manifests to the CDN for cache warm-up.
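The Package step above can be sketched as a function that emits the HLS master playlist from the encoding ladder. Variant names and bandwidth values below are illustrative:

```python
def build_master_manifest(variants: list[dict]) -> str:
    """Emit an HLS master playlist linking one media playlist per variant."""
    lines = ["#EXTM3U"]
    for v in variants:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={v['bandwidth']},"
            f"RESOLUTION={v['width']}x{v['height']}"
        )
        lines.append(f"{v['name']}/playlist.m3u8")
    return "\n".join(lines) + "\n"

ladder = [
    {"name": "240p", "bandwidth": 300_000, "width": 426, "height": 240},
    {"name": "480p", "bandwidth": 1_200_000, "width": 854, "height": 480},
    {"name": "1080p", "bandwidth": 6_000_000, "width": 1920, "height": 1080},
]
print(build_master_manifest(ladder))
```

Each referenced media playlist is generated the same way from the segment list for that rendition.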
Stage 5: Deep Dive
We will dive deep into two critical subsystems: the transcoding pipeline and adaptive bitrate streaming.
Deep Dive 1: Transcoding Pipeline
At 500 hours of video uploaded per minute, the transcoding system must handle massive throughput while remaining cost-efficient and fault-tolerant.
Architecture
The pipeline is orchestrated by a workflow engine (such as AWS Step Functions, Temporal, or Apache Airflow). Each uploaded video triggers a workflow instance that manages the DAG of transcoding tasks.
- Workers are stateless containers running FFmpeg. They pull tasks from a queue, process a single segment at a specific resolution, and write the output to S3.
- Parallelism is the key to performance. A 10-minute video split into 2-second segments produces 300 segments. Each segment is transcoded into 6 resolutions independently, yielding 1,800 tasks that can run in parallel.
- Spot/preemptible instances reduce cost by 60-80%. Transcoding tasks are idempotent — if a spot instance is reclaimed, the task is simply retried on another worker.
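A sketch of the per-segment fan-out: each task is identified by a (video, segment, rendition) triple and writes to a deterministic output path, which is what makes retries on reclaimed spot instances safe. The path layout and the FFmpeg invocation in the comment are illustrative, not a fixed convention:

```python
from itertools import product

RENDITIONS = ["240p", "360p", "480p", "720p", "1080p", "4k"]

def plan_tasks(video_id: str, num_segments: int) -> list[dict]:
    """Fan a video out into one task per (segment, rendition) pair."""
    return [
        {
            "video_id": video_id,
            "segment": seg,
            "rendition": r,
            # Deterministic output key: re-running a task overwrites the
            # same object, so retries after spot reclamation are idempotent.
            "output_key": f"videos/{video_id}/{r}/{seg:05d}.ts",
        }
        for seg, r in product(range(num_segments), RENDITIONS)
    ]

# A worker pops one task and shells out to FFmpeg, roughly:
#   ffmpeg -i {input_segment} -vf scale=-2:720 -c:v libx264 -b:v 3M {output}
# (flags illustrative; real ladders pin codec profiles and keyframe alignment)

tasks = plan_tasks("abc123", num_segments=300)
print(len(tasks))  # 300 segments x 6 renditions = 1800 parallel tasks
```

This is the 10-minute-video example from above: 300 two-second segments times 6 renditions yields 1,800 independent tasks.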
Encoding Ladder
The encoding ladder defines which resolution-bitrate combinations to produce:
| Resolution | Bitrate (H.264) | Bitrate (H.265) | Use Case |
|---|---|---|---|
| 240p | 300 kbps | 150 kbps | Very slow mobile connections |
| 360p | 600 kbps | 300 kbps | Mobile on 3G |
| 480p | 1.2 Mbps | 600 kbps | Standard mobile |
| 720p | 3 Mbps | 1.5 Mbps | Desktop / tablet |
| 1080p | 6 Mbps | 3 Mbps | HD desktop / smart TV |
| 4K | 16 Mbps | 8 Mbps | 4K displays |
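The ladder also drives storage math: summing the per-rendition bitrates gives the transcoded footprint per hour of content. A quick sketch using the H.264 column above:

```python
# H.264 bitrates from the encoding ladder, in kbps.
LADDER_KBPS = {"240p": 300, "360p": 600, "480p": 1200,
               "720p": 3000, "1080p": 6000, "4k": 16000}

def transcoded_gb_per_hour(ladder_kbps: dict[str, int]) -> float:
    """Total storage for one hour of content across every rendition."""
    total_kbits = sum(ladder_kbps.values()) * 3600   # kbps * seconds in an hour
    return total_kbits / 8 / 1_000_000               # kbits -> GB (decimal)

gb = transcoded_gb_per_hour(LADDER_KBPS)
print(gb)  # ~12.2 GB per source hour for the six H.264 renditions alone
```

The exact multiplier over raw storage depends on source bitrate, audio tracks, and how many codec families (H.265, VP9, AV1) you also produce — which is why the 5-8x figure in Stage 1 is a planning estimate, not a constant.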
A more advanced system uses per-title encoding: analyzing the content complexity of each video to determine optimal bitrates. An animation with flat colors compresses far better than a fast-action sports clip at the same resolution. Netflix pioneered this approach, saving 20% bandwidth without perceptible quality loss.
Fault Tolerance
- Idempotent tasks: Each task writes to a deterministic S3 path (`videos/{video_id}/{resolution}/{segment_number}.ts`). Re-running produces identical output.
- Dead letter queue: Tasks that fail after 3 retries are sent to a DLQ for manual investigation. The video status transitions to `failed` with a diagnostic error code.
- Partial availability: If only some resolutions succeed, the system can publish the video with available resolutions and re-queue the failed ones. A video with 480p and 720p is better than no video at all.
Deep Dive 2: Adaptive Bitrate Streaming (HLS/DASH)
Adaptive bitrate (ABR) streaming is the mechanism that lets a video player switch quality levels mid-stream based on real-time network conditions.
How HLS Works
HTTP Live Streaming (HLS) uses a two-level manifest structure:
- Master Manifest (`master.m3u8`) — Lists all available quality variants with their resolution, bitrate, and codec. The player downloads this first.

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=300000,RESOLUTION=426x240
240p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1200000,RESOLUTION=854x480
480p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=6000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
```

- Media Playlist (`480p/playlist.m3u8`) — Lists the actual video segment URLs with their durations. The player downloads segments sequentially.

```
#EXTM3U
#EXT-X-TARGETDURATION:4
#EXTINF:4.0,
segment_001.ts
#EXTINF:4.0,
segment_002.ts
#EXTINF:3.8,
segment_003.ts
```

ABR Algorithm
The player's ABR algorithm is the brain of the streaming experience. It must balance three competing goals:
- Maximize quality — Play at the highest resolution the network can sustain.
- Minimize rebuffering — Never let the playback buffer run empty (causes stalling).
- Minimize startup time — Start playback fast, even if it means beginning at a lower resolution.
A simplified ABR strategy:
- Start at the lowest quality for fast first-frame rendering (under 2 seconds).
- Measure download throughput for each segment and apply a 0.8 safety margin to the estimate. If the last 3 segments downloaded at 5 Mbps, usable bandwidth is 5 × 0.8 = 4 Mbps — enough to switch up to 720p (3 Mbps) but not 1080p (6 Mbps).
- Monitor the buffer level. If the buffer drops below 5 seconds, immediately drop to a lower quality regardless of throughput estimates.
- Use a ramp-up delay — do not jump from 240p to 4K in one step. Increase one quality level per segment to avoid overshooting.
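The strategy above can be condensed into one selection function — safety margin on measured throughput, a buffer panic threshold, and one-level ramp-up. This is a minimal sketch, not a production ABR controller; thresholds and names are illustrative:

```python
BITRATE_KBPS = [300, 600, 1200, 3000, 6000, 16000]  # 240p ... 4K ladder

def next_level(current: int, throughput_kbps: float,
               buffer_s: float, safety: float = 0.8,
               panic_buffer_s: float = 5.0) -> int:
    """Pick the quality level index for the upcoming segment."""
    if buffer_s < panic_buffer_s and current > 0:
        return current - 1                      # buffer panic: drop a level
    usable = throughput_kbps * safety           # hedge against estimate error
    if current + 1 < len(BITRATE_KBPS) and BITRATE_KBPS[current + 1] <= usable:
        return current + 1                      # ramp up one level at a time
    if BITRATE_KBPS[current] > usable and current > 0:
        return current - 1                      # current level unsustainable
    return current

# 5 Mbps measured: usable = 4 Mbps, enough to hold 720p but not reach 1080p.
print(next_level(current=3, throughput_kbps=5000, buffer_s=20.0))  # 3
```

Real players (dash.js, hls.js, ExoPlayer) layer smoothing, buffer-based rules, and sometimes learned policies on top of this basic shape.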
HLS vs. DASH
| Feature | HLS | DASH |
|---|---|---|
| Developed by | Apple | MPEG (industry standard) |
| Container | .ts (MPEG-TS) or .fmp4 | .mp4 (fragmented) |
| Manifest format | .m3u8 (text-based) | .mpd (XML-based) |
| Browser support | Native on Safari; MSE-based on others | MSE-based on all browsers |
| DRM support | FairPlay | Widevine, PlayReady |
| Codec flexibility | Good (CMAF adds parity) | Excellent |
In practice, most platforms produce CMAF (Common Media Application Format) — fragmented MP4 segments that are compatible with both HLS and DASH manifests. This means you encode once and generate two manifest formats, avoiding duplicate storage.
Segment Size Trade-Off
- Shorter segments (2 seconds): Faster quality switching, lower latency for live streams, but more HTTP requests and higher CDN overhead (more objects to cache).
- Longer segments (10 seconds): Better compression efficiency and fewer requests, but slower adaptation to bandwidth changes and higher startup latency.
- The sweet spot for VOD is typically 4-6 seconds. This balances compression efficiency with responsive adaptation.
Stage 6: Scaling and Trade-Offs
CDN Strategy
At 1 billion views per day, the CDN is the most critical infrastructure component. Without it, origin servers would be crushed under the load.
- Multi-CDN: Use multiple CDN providers (CloudFront, Akamai, Fastly) and route viewers to the best-performing CDN based on real-time latency data. This also provides failover if one CDN experiences an outage.
- Tiered caching: CDN edges (hundreds of PoPs globally) cache popular segments. Regional mid-tier caches sit between edges and the origin. A cache miss at the edge checks the mid-tier before hitting origin. This reduces origin load by 95%+ for popular content.
- Popularity-based pre-warming: For trending or newly released videos, proactively push segments to edge caches in regions with high expected viewership. A video from a creator with 50 million subscribers should be pre-cached before the notification goes out.
- Long-tail optimization: 80% of views go to 20% of videos. The long tail of rarely watched content will always be a cache miss. For these, serve from a single origin region and accept higher latency rather than polluting cache with rarely accessed segments.
View Counting at Scale
1 billion views per day is approximately 11,500 view events per second. Naively incrementing a database counter per view would create a massive write hotspot for popular videos.
- Client-side batching: The player sends a view event after 30 seconds of watch time (to filter out accidental clicks). This reduces total events by approximately 40%.
- Write buffering in Redis: View events are batched in Redis (`INCRBY views:{video_id} 1`). A background job flushes accumulated counts to Cassandra every 60 seconds.
- Approximate counts: For real-time display, the Redis counter is accurate enough. For analytics and monetization, the Cassandra event log provides exact counts after reconciliation.
- Sharded counters: For viral videos with millions of concurrent viewers, a single Redis key becomes a hotspot. Use N sharded keys (`views:{video_id}:{shard}`) and sum them on read. This trades read complexity for write throughput.
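The sharded-counter pattern can be sketched with a plain dict standing in for Redis; in production the same logic would be `INCRBY` on write and `MGET` across the shard keys on read. Key layout and shard count are illustrative:

```python
import random

NUM_SHARDS = 16
store: dict[str, int] = {}   # stands in for Redis in this sketch

def record_view(video_id: str) -> None:
    """Spread increments across N keys so no single key is a write hotspot."""
    shard = random.randrange(NUM_SHARDS)
    key = f"views:{video_id}:{shard}"
    store[key] = store.get(key, 0) + 1          # Redis: INCRBY key 1

def read_views(video_id: str) -> int:
    """Sum all shards on read (Redis: MGET over the N shard keys)."""
    return sum(store.get(f"views:{video_id}:{s}", 0) for s in range(NUM_SHARDS))

for _ in range(10_000):
    record_view("viral42")
print(read_views("viral42"))  # 10000 — no increments lost across shards
```

Picking N is a tuning knob: more shards means more write throughput but a wider fan-in on every read, so only the hottest keys should be sharded aggressively.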
Storage Tiers: Hot, Warm, Cold
With roughly 720 TB of raw uploads per day multiplied by 5-8x for transcoded variants, storage costs grow rapidly. A tiered approach keeps costs manageable:
| Tier | Storage Class | Content | Access Pattern |
|---|---|---|---|
| Hot | S3 Standard | Videos viewed in the last 7 days, trending content | Frequent reads via CDN origin pulls |
| Warm | S3 Infrequent Access | Videos with 1-100 views/month | Occasional reads, slightly higher retrieval latency |
| Cold | S3 Glacier Instant Retrieval | Videos with less than 1 view/month | Rare reads, millisecond retrieval on demand |
S3 Lifecycle policies automatically transition objects between tiers based on access patterns. The master copy (highest quality transcode) is always retained. Lower resolutions of cold content can be deleted and re-transcoded on demand if a viewer requests them — this saves significant storage at the cost of occasional transcoding latency.
Copyright Detection (Content ID)
A video streaming platform at scale must detect copyrighted content to avoid legal liability and protect creators.
- Audio fingerprinting: Extract a perceptual hash of the audio track and compare against a database of copyrighted works. Algorithms like Chromaprint produce compact fingerprints that are robust to compression, pitch shifts, and background noise.
- Video fingerprinting: Extract visual fingerprints from keyframes. Compare against a reference database using similarity search (approximate nearest neighbors via FAISS or ScaNN).
- Pipeline integration: Content ID runs as a step in the transcoding DAG. After transcoding completes, the fingerprinting step runs before the video is published. If a match is found, the video is flagged for review or automatically handled per the copyright holder's policy (block, monetize, or track).
- Scale: With 500 hours uploaded per minute, the fingerprinting system must process content faster than real-time. Batch processing on GPU clusters (for video fingerprinting) and CPU workers (for audio) handles the throughput.
Scoring Tips
To score well on a video streaming design question, keep these principles in mind:
- Separate upload from playback. These are fundamentally different systems with different latency requirements, scale characteristics, and failure modes. Interviewers expect you to treat them as independent flows.
- Explain the transcoding pipeline in detail. This is where most of the technical complexity lives. Show you understand the DAG structure, parallelization strategy, encoding ladders, and fault tolerance. Mentioning per-title encoding or CMAF demonstrates depth.
- Know adaptive bitrate streaming cold. Be able to explain the two-level manifest structure, how the ABR algorithm selects quality, and the segment size trade-off. This is the core technology that makes streaming work.
- Quantify your CDN strategy. Do not just say "use a CDN." Explain tiered caching, cache hit ratios, pre-warming for popular content, and the long-tail problem. Show you understand that the CDN is not a magic box — it has capacity limits and cache eviction policies.
- Address the hard scaling problems proactively. View counting at scale, storage tiering, and content moderation are areas where interviewers probe for production-level thinking. A candidate who brings up sharded counters and hot/warm/cold storage unprompted stands out.
- Show cost awareness. Storage and CDN bandwidth are the two largest cost drivers. Mentioning spot instances for transcoding, S3 lifecycle policies, and multi-CDN cost optimization signals that you think about systems holistically — not just correctness and performance.
Practice delivering this architecture end-to-end in under 35 minutes. Focus on smooth transitions between stages — requirements should naturally motivate your API design, which should inform your data model, which feeds into your architecture. If you can walk through each stage while fielding follow-up questions confidently, you are well-prepared. Tools like Hoppers AI can help you rehearse this flow with real-time feedback on structure, pacing, and technical depth.