Introduction
Recommendation engines power some of the most valuable products on the internet. Netflix estimates its recommendation system saves over $1 billion per year in reduced churn. YouTube's recommendations drive more than 70% of total watch time. Amazon attributes 35% of its revenue to personalized suggestions. When an interviewer asks you to design a recommendation engine, they are testing your ability to combine distributed systems, machine learning infrastructure, and data engineering into a cohesive architecture that serves personalized results at massive scale.
This walkthrough follows a structured 6-stage approach: Requirements, API Design, Data Model, High-Level Architecture, Deep Dive, and Scaling. Each stage mirrors how a strong candidate would navigate the problem in a 45-minute system design interview.
Stage 1: Requirements
Before drawing a single box, spend 3-5 minutes aligning with the interviewer on scope. A recommendation engine can mean vastly different things depending on the product. Pin down the domain, then extract functional and non-functional requirements.
Functional Requirements
- Personalized recommendations: Given a user, return a ranked list of items tailored to their preferences and behavior history.
- Real-time signals: Incorporate recent user actions (clicks, purchases, views) within seconds, not just historical batch data.
- Batch model training: Periodically retrain recommendation models on the full interaction dataset to capture long-term preference shifts.
- A/B testing: Support running multiple recommendation algorithms simultaneously and measuring their impact on engagement metrics.
- Cold start handling: Provide reasonable recommendations for brand-new users (no history) and newly added items (no interactions).
Non-Functional Requirements
- 300 million registered users, 50 million daily active users.
- Sub-100ms p99 latency for serving recommendations — any slower and users notice the delay.
- 1 billion interaction events per day (views, clicks, purchases, ratings).
- High availability: 99.99% uptime — recommendations are on the critical path of the product experience.
- Eventual consistency is acceptable: A user who just purchased an item might still see it recommended for a few seconds. This is fine.
Back-of-Envelope Estimates
| Metric | Value |
|---|---|
| DAU | 50M |
| Recommendation requests/day | ~500M (10 per session avg) |
| Peak QPS | ~15,000 (~2.5x the ~5,800/sec average, bursty) |
| Events ingested/day | 1B |
| Event ingestion rate (avg) | ~12,000/sec |
| Item catalog size | 10M items |
| User embedding size | ~256 floats = 1 KB per user |
| Total user embeddings | 300M users x 1 KB = ~300 GB |
These numbers tell us that the serving layer needs to be heavily cached and read-optimized, the event pipeline must handle sustained high throughput, and user/item embeddings are too large for a single machine's memory.
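These estimates are simple arithmetic, and it is worth sanity-checking them. A quick sketch, using only the numbers stated above:

```python
# Sanity-check the back-of-envelope estimates using only the stated inputs.
SECONDS_PER_DAY = 86_400

requests_per_day = 50_000_000 * 10            # 50M DAU x ~10 requests each
avg_qps = requests_per_day / SECONDS_PER_DAY  # ~5,800/sec
peak_qps = avg_qps * 2.5                      # ~15,000/sec, assuming bursty peaks

event_rate = 1_000_000_000 / SECONDS_PER_DAY  # ~12,000 events/sec

embedding_kb = 256 * 4 / 1024                 # 256 float32 values = 1 KB
total_gb = 300_000_000 * embedding_kb / 1e6   # ~300 GB of user embeddings

print(f"avg ~{avg_qps:,.0f} QPS, peak ~{peak_qps:,.0f} QPS, "
      f"events ~{event_rate:,.0f}/sec, embeddings ~{total_gb:,.0f} GB")
```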
Stage 2: API Design
Define the external-facing contracts. Keep them simple — recommendation APIs should be thin wrappers over the ranking pipeline.
Get Recommendations
GET /v1/recommendations?user_id={uid}&context={ctx}&count=20&page_token={token}
Response:
- items: array of { item_id, score, reason, metadata }
- experiment_id: which A/B variant served this response
- request_id: for debugging and attribution
- next_page_token: cursor for pagination
The context parameter captures where the recommendation is being shown (home feed, product page, search results) because different surfaces may use different models or candidate pools.
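As a concrete illustration, a client call against this contract might look like the following sketch. The endpoint shape follows the spec above; the host and all field values are hypothetical:

```python
import requests

resp = requests.get(
    "https://api.example.com/v1/recommendations",  # hypothetical host
    params={"user_id": "u-123", "context": "home_feed", "count": 20},
    timeout=0.2,  # fail fast; the service targets sub-100ms p99
)
body = resp.json()
# Expected shape, per the contract above:
# {
#   "items": [{"item_id": "...", "score": 0.93, "reason": "...", "metadata": {...}}],
#   "experiment_id": "exp-42",
#   "request_id": "req-abc",
#   "next_page_token": "..."
# }
top_item = body["items"][0]["item_id"]
```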
Record Interaction
POST /v1/interactions
Body: { user_id, item_id, action, timestamp, context, metadata }
Actions: view, click, purchase, rating, dismiss. This is a fire-and-forget endpoint — the client does not wait for processing. It feeds the event pipeline that updates features and retrains models.
A/B Experiment Configuration
POST /v1/experiments
Body: { name, variants: [{ model_id, traffic_percentage }], metrics: ["ctr", "conversion_rate"], duration_days }
This is an internal API for the ML team. It configures which models serve which traffic segments. The recommendation endpoint reads experiment assignments at request time.
Stage 3: Data Model
The data model underpins everything. Get this right and the architecture follows naturally.
User Profiles
| Field | Type | Notes |
|---|---|---|
| user_id | string (UUID) | Primary key |
| demographics | JSON | Age bucket, country, language |
| preferences | JSON | Explicit preferences (categories, brands) |
| embedding | float[256] | Learned user representation |
| last_active | timestamp | For recency-weighted features |
Item Catalog
| Field | Type | Notes |
|---|---|---|
| item_id | string (UUID) | Primary key |
| title | string | Display name |
| category | string[] | Hierarchical categories |
| content_features | JSON | Tags, description embedding, price |
| embedding | float[256] | Learned item representation |
| popularity_score | float | Time-decayed global popularity |
| created_at | timestamp | For cold-start detection |
Interaction Events
| Field | Type | Notes |
|---|---|---|
| event_id | string (UUID) | Dedup key |
| user_id | string | Foreign key |
| item_id | string | Foreign key |
| action | enum | view, click, purchase, rating, dismiss |
| timestamp | int64 | Unix millis |
| context | string | Surface where interaction occurred |
| experiment_id | string | Which variant was shown |
Feature Store Schema
The feature store bridges raw data and model serving. It maintains two layers:
- Offline features (batch-computed, refreshed hourly/daily): user lifetime purchase count, item average rating, user-category affinity matrix, item co-occurrence scores.
- Online features (real-time, updated within seconds): user's last 20 viewed items, session click count, trending items in user's region.
Each feature is stored as a key-value pair: (entity_id, feature_name) → feature_value. The online store uses Redis or DynamoDB for sub-millisecond lookups. The offline store uses a columnar format (Parquet on S3) for efficient batch reads during model training.
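As a sketch of the online-store access pattern using redis-py (the key naming convention here is an assumption, not a standard):

```python
import redis

r = redis.Redis(host="feature-store.internal", port=6379)  # hypothetical host

# Write path (speed layer): keep the user's last 20 viewed items fresh.
def record_view(user_id: str, item_id: str) -> None:
    key = f"user:{user_id}:last_viewed"
    pipe = r.pipeline()
    pipe.lpush(key, item_id)
    pipe.ltrim(key, 0, 19)        # cap at the 20 most recent items
    pipe.expire(key, 7 * 86_400)  # drop stale entries after a week
    pipe.execute()

# Read path (serving): one round-trip for the user's online features.
def get_recent_views(user_id: str) -> list[str]:
    return [v.decode() for v in r.lrange(f"user:{user_id}:last_viewed", 0, -1)]
```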
Stage 4: High-Level Architecture
The architecture follows a well-established pattern: a two-stage retrieval pipeline for serving, combined with a Lambda architecture (batch + real-time) for data processing.
Two-Stage Retrieval Pipeline
Scoring every item in a 10M catalog against a user model at request time is computationally infeasible within 100ms. The industry-standard solution splits retrieval into two stages:
Stage 1 — Candidate Generation: Multiple lightweight retrieval models each produce a few hundred candidates from the full catalog. This narrows the field from millions to hundreds. Common approaches:
- Collaborative filtering (ANN): Find the nearest items to the user's embedding using approximate nearest neighbor search (HNSW, ScaNN, or FAISS). Retrieves items that similar users have interacted with.
- Content-based filtering: Match user preference vectors against item content embeddings. Retrieves items similar to what the user has liked before.
- Popularity-based: Return globally or regionally trending items. Acts as a fallback and diversity injector.
- Co-occurrence: "Users who bought X also bought Y." Precomputed item-to-item similarity scores.
Stage 2 — Ranking: A heavier ML model (gradient-boosted trees or a neural network) scores each candidate using a rich feature set: user features, item features, cross features (user-item affinity), and contextual features (time of day, device, session depth). The model predicts the probability of the desired action (click, purchase). The top-N scored items are returned.
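In code, the two-stage split is a parallel fan-out followed by a single scoring pass. A minimal skeleton, where the generator and ranker interfaces are assumed for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def recommend(user, generators, ranker, k=20, per_source=200):
    # Stage 1: run lightweight candidate generators in parallel, union results.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(g.retrieve, user, per_source) for g in generators]
        candidates = {item for f in futures for item in f.result()}  # dedupe

    # Stage 2: one heavy model scores the few hundred survivors.
    scored = ranker.score(user, list(candidates))  # [(item, p_action), ...]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]
```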
Lambda Architecture for Data Processing
The system needs both real-time responsiveness and batch accuracy:
Batch layer: A Spark or Flink batch job runs daily/hourly. It processes the full interaction history, retrains embeddings (matrix factorization, two-tower neural models), computes aggregate features (item popularity, user-category affinities), and writes results to the offline feature store and the ANN index.
Speed layer: A Kafka-backed streaming pipeline (Flink or Kafka Streams) processes interaction events in near-real-time. It updates the online feature store (recent views, session-level signals) and publishes updated trending scores. These real-time features are merged with batch features at serving time.
Serving layer: The recommendation service reads from both the online and offline feature stores, runs the two-stage retrieval pipeline, and returns ranked results. It is stateless and horizontally scalable behind a load balancer.
Component Interaction
A request flows through the system as follows:
- Client calls GET /v1/recommendations with user context.
- The API gateway routes to the recommendation service.
- The service looks up the user's experiment assignment to determine which model variant to use.
- Candidate generation retrieves ~500 candidates from multiple sources in parallel (ANN index, co-occurrence table, popularity cache).
- The ranking model fetches features from the online feature store (sub-ms Redis reads) and scores all candidates.
- Re-ranking applies business rules: remove already-purchased items, enforce diversity constraints (no more than 3 items from the same category), apply content policy filters (sketched after this list).
- The top-N results are returned with metadata and the experiment ID for attribution.
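The re-ranking step reduces to a single ordered pass over the scored candidates. A sketch, where the threshold and field names are illustrative assumptions:

```python
from collections import Counter

MAX_PER_CATEGORY = 3  # diversity constraint from the flow above

def rerank(scored_items, purchased_ids, blocked_ids):
    """Apply business rules to model-scored candidates, preserving score order."""
    per_category = Counter()
    results = []
    for item in scored_items:  # assumed sorted by score, descending
        if item.item_id in purchased_ids or item.item_id in blocked_ids:
            continue  # already purchased, or removed by content policy
        if per_category[item.category] >= MAX_PER_CATEGORY:
            continue  # enforce category diversity
        per_category[item.category] += 1
        results.append(item)
    return results
```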
Stage 5: Deep Dive
In a real interview, the interviewer will ask you to go deep on one or two components. The two most common deep dives for recommendation engines are candidate generation algorithms and the cold start problem.
Deep Dive 1: Candidate Generation Algorithms
Candidate generation is the most architecturally interesting component because it must search millions of items in single-digit milliseconds.
Collaborative Filtering via Matrix Factorization: The classical approach decomposes the user-item interaction matrix into two lower-rank matrices — user embeddings and item embeddings. The dot product of a user embedding and an item embedding approximates the predicted interaction score. Training uses Alternating Least Squares (ALS) or Stochastic Gradient Descent (SGD) on implicit feedback (views, clicks) rather than explicit ratings.
At serving time, finding the items whose embeddings are closest to the user's embedding is an approximate nearest neighbor (ANN) search. Libraries like FAISS (Facebook) and ScaNN (Google), or vector databases like Pinecone and Milvus, build specialized index structures (IVF, HNSW, product quantization) that cut search from brute-force O(n) to roughly O(log n).
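With FAISS, for instance, building and querying an HNSW index takes only a few lines. A sketch with random vectors standing in for trained embeddings:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 256  # embedding dimension
item_embeddings = np.random.rand(1_000_000, d).astype("float32")  # stand-in data

# Inner-product metric matches the dot-product scoring of matrix factorization.
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)  # 32 graph neighbors
index.hnsw.efSearch = 64  # query-time recall/latency knob
index.add(item_embeddings)

user_embedding = np.random.rand(1, d).astype("float32")
scores, item_ids = index.search(user_embedding, 100)  # top-100 candidate ids
```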
Two-Tower Neural Model: The modern evolution of matrix factorization. A user tower and an item tower are trained jointly as a deep neural network. The user tower ingests user features (demographics, recent interactions, session context) and outputs a user embedding. The item tower ingests item features (category, content embedding, popularity) and outputs an item embedding. Training maximizes the dot-product similarity between positive user-item pairs while minimizing it for negative (sampled) pairs.
The key advantage over classical matrix factorization is that the towers can incorporate arbitrary features beyond just interaction history. This is critical for handling cold-start items — a new item with rich content features gets a meaningful embedding even before any user interacts with it.
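A minimal two-tower sketch in PyTorch follows. The feature dimensions and tower widths are illustrative; production towers would also embed categorical and sequence features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    """Each tower maps raw features to a 256-d embedding; similarity is a dot product."""
    def __init__(self, user_feat_dim: int, item_feat_dim: int, dim: int = 256):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_feat_dim, 512), nn.ReLU(), nn.Linear(512, dim))
        self.item_tower = nn.Sequential(
            nn.Linear(item_feat_dim, 512), nn.ReLU(), nn.Linear(512, dim))

    def forward(self, user_feats, item_feats):
        u = F.normalize(self.user_tower(user_feats), dim=-1)
        v = F.normalize(self.item_tower(item_feats), dim=-1)
        return u @ v.T  # similarity of every user to every item in the batch

# In-batch softmax loss: each user's positive item is the diagonal entry,
# and the other items in the batch serve as sampled negatives.
def in_batch_loss(model, user_feats, pos_item_feats):
    logits = model(user_feats, pos_item_feats)
    labels = torch.arange(logits.size(0))
    return F.cross_entropy(logits, labels)
```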
Retrieval architecture for ANN at scale:
- Item embeddings are precomputed during batch training and loaded into an ANN index.
- The index is sharded across multiple machines if the catalog exceeds single-machine memory.
- Each shard returns its top-K candidates; a merge step selects the global top-K.
- Index rebuilds happen daily; delta updates (new items) are appended to a secondary index and merged during the next full rebuild.
- Typical ANN latency: 2-5ms for 10M items with HNSW (recall@100 > 95%).
Multiple retrieval channels: Production systems run 4-6 candidate generators in parallel and union their results. This improves diversity and coverage. The ranking model then re-scores the merged candidate set, so even if individual generators have imperfect recall, the overall system captures diverse user interests.
Deep Dive 2: The Cold Start Problem
Cold start is the Achilles heel of recommendation systems. There are two variants, and each requires a different solution.
New User Cold Start: A user with no interaction history has no learned embedding. Strategies:
- Onboarding survey: Ask the user to select a few preferred categories or items during signup. Map these selections to a rough preference vector.
- Demographic-based priors: Use the user's country, language, age bucket, and device type to assign them to a demographic cluster. Serve that cluster's average recommendation set until enough interactions accumulate.
- Popularity fallback: Serve globally popular or trending items. These have high baseline engagement rates and generate initial interactions quickly.
- Exploration/exploitation: Use a multi-armed bandit (Thompson sampling or epsilon-greedy) to explore diverse items for new users, as sketched after this list. As interactions accumulate, shift toward exploitation (serving predicted-best items).
- Transition strategy: After ~20-30 interactions, the user has enough signal for the collaborative filtering model to produce personalized results. Smoothly blend from cold-start strategies to the full model over the first few sessions.
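The bandit mentioned above can be as simple as per-item Beta posteriors over click probability. A non-contextual Thompson sampling sketch, for illustration only:

```python
import numpy as np

class ThompsonBandit:
    """Per-item Beta(successes, failures) posteriors over click probability.
    A cold-start exploration sketch; real systems would contextualize this."""
    def __init__(self, item_ids):
        self.alpha = {i: 1.0 for i in item_ids}  # prior: Beta(1, 1)
        self.beta = {i: 1.0 for i in item_ids}

    def pick(self, k: int = 10):
        # Sample a plausible CTR per item, serve the k highest samples.
        draws = {i: np.random.beta(self.alpha[i], self.beta[i]) for i in self.alpha}
        return sorted(draws, key=draws.get, reverse=True)[:k]

    def update(self, item_id, clicked: bool):
        if clicked:
            self.alpha[item_id] += 1
        else:
            self.beta[item_id] += 1
```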
New Item Cold Start: An item with no interactions has no collaborative signal. Strategies:
- Content-based embedding: The two-tower model's item tower generates an embedding from content features alone (title, description, category, image embeddings). This provides a reasonable starting point.
- Boosted exploration: Inject new items into a small percentage of recommendation responses to gather initial interaction data. Measure engagement and update the item's collaborative signal.
- Similar-item transfer: Find the most similar existing items (by content embedding) and bootstrap the new item's interaction statistics from theirs (sketched after this list).
- Publisher/creator reputation: If a trusted creator releases a new item, borrow engagement priors from their other items.
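Similar-item transfer is a nearest-neighbor lookup in content-embedding space. A sketch, where the array shapes and statistic names are assumptions:

```python
import numpy as np

def bootstrap_stats(new_emb, catalog_embs, catalog_stats, k=10):
    """Initialize a new item's stats from its k nearest content neighbors
    (cosine similarity). catalog_stats maps row index -> stats dict."""
    sims = catalog_embs @ new_emb / (
        np.linalg.norm(catalog_embs, axis=1) * np.linalg.norm(new_emb) + 1e-9)
    neighbors = np.argsort(sims)[-k:]  # indices of the k most similar items
    return {
        "ctr": float(np.mean([catalog_stats[i]["ctr"] for i in neighbors])),
        "avg_rating": float(np.mean([catalog_stats[i]["avg_rating"] for i in neighbors])),
    }
```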
Interview tip: Interviewers love to hear you discuss cold start because it reveals whether you understand the limitations of pure collaborative filtering. Always mention both variants and at least two mitigation strategies for each. Connecting cold start to the two-tower model architecture shows deep understanding.
Stage 6: Scaling
With the architecture established, the interviewer will push you on how this system scales to 300M users and sub-100ms latency. Focus on three areas: the feature store, model serving, and A/B testing infrastructure.
Scaling the Feature Store
The feature store is the most latency-sensitive component in the serving path. Every recommendation request requires 500+ feature lookups (one per candidate, plus user features).
- Online store: Use Redis Cluster with read replicas. Partition features by entity type (user features on one cluster, item features on another). Pre-join cross-features during batch processing to avoid joins at serving time.
- Batch reads: Fetch user features once per request (1 Redis call), then batch-fetch item features for all candidates in a single MGET call (1-2 Redis calls). This keeps total Redis round-trips to 2-3, regardless of candidate count (sketched after this list).
- Feature caching: Cache frequently accessed item features (popular items) in application memory with a 5-minute TTL. This can eliminate 60-70% of Redis reads because item feature access follows a power law.
- Feature freshness SLAs: Online features (last-viewed items) must be fresh within 5 seconds. Offline features (user-category affinity) can be up to 6 hours stale. Set TTLs and refresh cadences accordingly.
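The batched read path might look like this sketch (redis-py, with the same illustrative key convention as earlier):

```python
import json
import redis

r = redis.Redis(host="feature-store.internal")  # hypothetical host

def fetch_features(user_id: str, candidate_ids: list[str]):
    # One round-trip for user features...
    user_feats = json.loads(r.get(f"user:{user_id}:features") or "{}")
    # ...and one round-trip for all candidate item features via MGET.
    raw = r.mget([f"item:{i}:features" for i in candidate_ids])
    item_feats = [json.loads(v) if v else {} for v in raw]
    return user_feats, item_feats
```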
Model Serving Infrastructure
The ranking model must score 500 candidates within 20-35ms (leaving the remaining budget for network, feature reads, and candidate generation).
- Model format: Export trained models to optimized formats (ONNX, TensorRT, or TensorFlow SavedModel). Use hardware-accelerated inference (GPU for neural models, CPU for gradient-boosted trees).
- Batched inference: Score all 500 candidates in a single forward pass rather than 500 individual predictions. This amortizes model loading and GPU kernel launch overhead (see the sketch after this list).
- Model versioning: Maintain a model registry (MLflow, internal tooling) that tracks model lineage, training data, and evaluation metrics. Roll out new models gradually using A/B tests.
- Horizontal scaling: Model serving pods are stateless — scale them with CPU/GPU utilization. Use Kubernetes HPA or AWS Auto Scaling Groups. Typical sizing: 20-50 inference pods for 15K QPS, depending on model complexity.
- Shadow mode: Deploy new models in shadow mode first — they receive production traffic and score candidates, but their results are logged (not returned to users). Compare offline metrics before promoting to live traffic.
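Batched scoring with ONNX Runtime is a few lines once the model is exported. A sketch, where the model file, feature dimension, and single-output assumption are all hypothetical:

```python
import numpy as np
import onnxruntime as ort  # pip install onnxruntime

sess = ort.InferenceSession("ranker.onnx")  # hypothetical exported ranking model
input_name = sess.get_inputs()[0].name

def score_candidates(feature_matrix: np.ndarray) -> np.ndarray:
    """One forward pass over all candidates: shape (n_candidates, n_features)."""
    (scores,) = sess.run(None, {input_name: feature_matrix.astype(np.float32)})
    return scores.ravel()

# 500 candidates scored in a single batched call, not 500 round-trips.
scores = score_candidates(np.random.rand(500, 128))
```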
A/B Testing at Scale
A/B testing is non-negotiable for recommendation systems. Without it, you cannot know if a model change actually improves user engagement.
- User-level assignment: Hash the user ID with the experiment ID to deterministically assign users to variants. This ensures a user always sees the same variant within an experiment (no flickering) and allows multiple concurrent experiments via layered hashing (a sketch follows this list).
- Metrics pipeline: Every recommendation response logs the experiment ID and variant. The event pipeline joins recommendation logs with subsequent interaction events (clicks, purchases) to compute per-variant metrics: CTR, conversion rate, revenue per user, session duration.
- Statistical rigor: Use sequential testing (not fixed-horizon t-tests) to allow early stopping when a variant is clearly winning or losing. Account for multiple comparisons when running many experiments simultaneously. Minimum detectable effect: 1% relative change in CTR with 95% confidence, requiring ~2M users per variant for a week.
- Guardrail metrics: Beyond the primary metric, monitor guardrails: recommendation diversity (Gini coefficient), coverage (percentage of catalog surfaced), latency percentiles, and user complaint rates. A model that improves CTR but tanks diversity or latency should not be promoted.
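Deterministic assignment is a hash plus a bucket lookup. A sketch, where the variant structure mirrors the experiments API above:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, variants) -> str:
    """Deterministic bucketing: the same user + experiment always maps to the
    same variant, and different experiments hash independently (layering).
    'variants' is [(variant_name, traffic_percentage), ...] summing to 100."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    cumulative = 0
    for name, pct in variants:
        cumulative += pct
        if bucket < cumulative:
            return name
    return variants[-1][0]  # guard against rounding gaps

assign_variant("u-123", "ranker-v2-test", [("control", 50), ("treatment", 50)])
```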
Putting the serving path together, the end-to-end latency budget breaks down as follows:
| Component | Technology | Latency Budget |
|---|---|---|
| API Gateway / Load Balancer | Envoy / ALB | 2-3ms |
| Candidate Generation (ANN) | FAISS / ScaNN | 5-10ms |
| Feature Store Reads | Redis Cluster | 3-5ms |
| Ranking Model Inference | ONNX Runtime / TensorRT | 20-35ms |
| Re-ranking / Business Rules | Application logic | 2-5ms |
| Serialization + Network | gRPC / HTTP | 5-10ms |
| Total | | 37-68ms (within the 100ms budget) |
Scoring Tips
Interviewers evaluate recommendation engine designs on several dimensions. Here is how to maximize your score in each:
- Requirement scoping: Always clarify the domain (e-commerce, media, social), the interaction type (implicit vs. explicit feedback), and the latency/throughput requirements before designing. This shows product sense.
- Two-stage retrieval: Mentioning the candidate generation + ranking split is practically a requirement. Candidates who try to rank the entire catalog in one pass reveal a lack of industry knowledge.
- Cold start awareness: Proactively bringing up cold start — before the interviewer asks — demonstrates depth. Connect your solution to the architecture (content-based tower, exploration budget).
- Feature engineering: Discuss specific feature categories (user history, item attributes, cross features, contextual signals). This shows you understand what makes ML models performant, not just the infrastructure.
- Latency budgeting: Break down the 100ms budget across components. Interviewers want to see that you can reason about end-to-end performance, not just individual services.
- A/B testing discipline: Mention that no model change goes live without an A/B test. Discuss guardrail metrics and statistical methodology. This signals ML engineering maturity.
- Trade-off articulation: At every decision point, name the trade-off explicitly. "I chose HNSW over brute-force because we need sub-5ms retrieval, accepting a ~5% recall loss at top-100." Strong candidates narrate their reasoning; weak candidates just state choices.
Common pitfalls to avoid: Do not spend 15 minutes on the data model — it is important but not where you differentiate. Do not propose a single monolithic model that scores all items — always split into retrieval and ranking. Do not ignore real-time signals — batch-only systems feel outdated. Do not forget to discuss how you measure success (A/B testing, offline evaluation metrics like NDCG and MAP).
A recommendation engine design tests breadth (distributed systems, ML infrastructure, data engineering) and depth (ANN algorithms, cold start strategies, feature stores). Nail the two-stage retrieval architecture, demonstrate awareness of cold start challenges, and budget your latency carefully — and you will deliver a compelling design.
Want to practice this question with real-time feedback on your structure, trade-off analysis, and communication clarity? Hoppers AI offers AI-powered mock system design interviews that evaluate your response across all six stages and help you identify exactly where to improve.