
    Design a Distributed Cache

    Hoppers AI Team · March 12, 2026 · 12 min read

    Design a Distributed Cache — Complete System Design Walkthrough

    Distributed caches are the backbone of modern high-performance systems. Every major tech company — from Amazon to Netflix — relies on in-memory caching layers to achieve the sub-millisecond response times users expect. Designing one from scratch is a classic system design interview question because it tests your understanding of hashing, replication, memory management, and failure handling. In this walkthrough, we will design a distributed cache system end-to-end, following a structured 6-stage approach.

    Figure: Distributed Cache — High-Level Architecture. Clients A, B, and C send requests through a cache proxy (router / load balancer), which maps keys onto a consistent hash ring of virtual nodes (VN1–VN6) backed by three primary cache nodes; each primary replicates asynchronously to a replica pool.

    Stage 1: Requirements Gathering

    Begin every system design interview by clarifying scope. Spend 3-5 minutes establishing what you are building and at what scale. This is where strong candidates separate themselves.

    Functional Requirements

    • Core operations — GET(key), SET(key, value, ttl), DELETE(key). These are the primitives every cache must support.
    • TTL (Time-To-Live) — Keys expire automatically after a configurable duration. Clients can set per-key TTLs on write.
    • Eviction policies — When memory is full, the cache must evict entries. Support at least LRU (Least Recently Used) and LFU (Least Frequently Used).
    • Batch operations — MGET and MSET for retrieving or writing multiple keys in a single round-trip to reduce network overhead.
    • Key enumeration — Ability to scan keys by prefix or pattern (non-blocking, cursor-based).

    Non-Functional Requirements

    • Throughput: ~1 million operations per second across the cluster.
    • Latency: Sub-millisecond p99 for single-key GET/SET operations within a data center.
    • Availability: 99.99% uptime. Cache unavailability cascades into database overload.
    • Scalability: Horizontally scalable — add nodes to increase capacity without downtime.
    • Consistency: Eventual consistency for replicas is acceptable. Single-key operations on the primary are linearizable.
    • Memory efficiency: Minimize overhead per key. With 100 million keys averaging 500 bytes each, we need approximately 50 GB of usable cache memory across the cluster.
    Interview tip: Explicitly state your scale numbers. Saying "1 million ops/sec" gives you concrete throughput targets to reference when sizing nodes later. It also signals to the interviewer that you reason from constraints, not intuition.

    Stage 2: API Design

    A distributed cache exposes a simple interface, but the API must account for batch operations, conditional writes, and configuration. Define the external contract before thinking about internals.

    Core Cache Operations

    Operation | Signature | Description
    --- | --- | ---
    GET | GET key | Returns value or null if key does not exist or has expired.
    SET | SET key value [EX seconds] [PX milliseconds] [NX|XX] | Sets key to value with optional TTL. NX: only if key does not exist. XX: only if key exists.
    DELETE | DELETE key | Removes the key and its value.
    MGET | MGET key1 key2 ... keyN | Returns values for multiple keys in a single round-trip.
    MSET | MSET key1 value1 key2 value2 ... | Sets multiple key-value pairs atomically on each node.
    TTL | TTL key | Returns remaining time-to-live in seconds (-1 if no TTL, -2 if key missing).
    SCAN | SCAN cursor [MATCH pattern] [COUNT hint] | Cursor-based key enumeration. Non-blocking, returns a batch of keys and the next cursor.

    Conditional Writes

    The NX and XX flags on SET enable two important patterns:

    • Distributed locking — SET lock_key owner_id EX 30 NX acquires a lock only if it does not already exist, with automatic expiry as a deadlock safety net (sketched below).
    • Cache refresh — SET key new_value XX updates an existing entry without creating one if the key has been evicted.
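
    The locking pattern above maps onto a handful of client calls. Below is a minimal sketch assuming a redis-py-style client; the hostname, key names, and the non-atomic release are illustrative simplifications (a production release would use a Lua script or equivalent atomic compare-and-delete).

    ```python
    import uuid
    import redis  # assuming a redis-py-style client; any RESP client exposes the same options

    r = redis.Redis(host="cache.internal", port=6379)  # placeholder address

    def acquire_lock(name: str, ttl_seconds: int = 30) -> str | None:
        """Try to acquire a distributed lock; returns an owner token on success."""
        owner = str(uuid.uuid4())
        # SET lock:<name> <owner> NX EX <ttl> -- succeeds only if the key does not
        # already exist; the TTL is the deadlock safety net if the holder crashes.
        if r.set(f"lock:{name}", owner, nx=True, ex=ttl_seconds):
            return owner
        return None

    def release_lock(name: str, owner: str) -> None:
        """Release only if we still hold the lock (compare, then delete)."""
        # Two commands for brevity; a Lua script would make the check-and-delete atomic.
        if r.get(f"lock:{name}") == owner.encode():
            r.delete(f"lock:{name}")
    ```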

    Client Configuration

    Clients need to know the cluster topology. A configuration endpoint or service discovery mechanism provides the list of cache nodes, their hash ranges, and health status. On topology changes (node added/removed), the client library receives updated mappings and re-routes requests accordingly.

    Design decision: We use a text-based protocol (like Redis's RESP) rather than HTTP. HTTP headers add hundreds of bytes of overhead per request. For a cache that serves millions of sub-millisecond requests per second, protocol overhead directly impacts throughput. RESP is simple, parseable, and adds minimal bytes per command.

    Stage 3: Data Model

    The internal data structures of a cache node determine its performance characteristics. Every microsecond matters when targeting sub-millisecond latency.

    Key-Value Storage

    Each cache node stores entries in a hash table as the primary index. The hash table maps keys to value pointers and metadata:

    Field | Type | Size | Description
    --- | --- | --- | ---
    key | byte[] | variable | The cache key (max 512 bytes)
    value | byte[] | variable | The cached value (max 1 MB)
    ttl_expiry | uint64 | 8 bytes | Absolute expiry timestamp in milliseconds (0 = no expiry)
    last_accessed | uint64 | 8 bytes | Timestamp of last GET/SET (used for LRU)
    access_count | uint32 | 4 bytes | Approximate access frequency (used for LFU)
    size | uint32 | 4 bytes | Total bytes consumed (key + value + metadata)

    The per-entry metadata overhead is approximately 24 bytes. With 10 million keys per node, metadata accounts for ~240 MB — a reasonable trade-off for supporting both LRU and LFU eviction.
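
    To make the table concrete, here is a rough sketch of a per-entry record with those fields; the names and Python types are illustrative stand-ins for the packed uint32/uint64 layout a real node would use.

    ```python
    from dataclasses import dataclass

    METADATA_BYTES = 24  # two uint64 + two uint32 fields, as in the table above

    @dataclass
    class CacheEntry:
        key: bytes                 # up to 512 bytes
        value: bytes               # up to 1 MB
        ttl_expiry_ms: int = 0     # absolute expiry in ms since epoch; 0 = no expiry
        last_accessed_ms: int = 0  # updated on every GET/SET, drives LRU
        access_count: int = 0      # approximate frequency, drives LFU

        @property
        def size(self) -> int:
            """Total bytes charged against the node's memory budget."""
            return len(self.key) + len(self.value) + METADATA_BYTES

        def is_expired(self, now_ms: int) -> bool:
            return self.ttl_expiry_ms != 0 and now_ms >= self.ttl_expiry_ms
    ```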

    Hash Slots

    The total keyspace is divided into a fixed number of hash slots (e.g., 16,384, as in Redis Cluster). Each key is mapped to a slot via CRC16(key) mod 16384. Slots are then assigned to nodes. This indirection layer makes rebalancing possible: moving a slot from one node to another is a bounded operation that affects only the keys in that slot.
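
    A sketch of the key-to-slot mapping, assuming the CRC16 (XMODEM) variant used by Redis Cluster; any fast, stable checksum works as long as every client and node agrees on it.

    ```python
    NUM_SLOTS = 16_384

    def crc16_xmodem(data: bytes) -> int:
        """Bit-by-bit CRC16 (polynomial 0x1021, initial value 0) -- slow but clear;
        real implementations use a 256-entry lookup table."""
        crc = 0
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
        return crc

    def key_to_slot(key: str) -> int:
        # Every client and node must compute the same slot for routing to work.
        return crc16_xmodem(key.encode()) % NUM_SLOTS
    ```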

    Eviction Data Structures

    • LRU: A doubly-linked list ordered by access time, combined with hash table pointers. On each access, the entry is moved to the head. Eviction removes from the tail. O(1) for both operations (a minimal sketch follows this list).
    • LFU: A min-heap or frequency-bucketed linked list. Redis uses a probabilistic counter (8-bit logarithmic counter with decay) to approximate frequency without maintaining exact counts.
    • TTL expiry: A min-heap (priority queue) keyed by expiry timestamp. A background thread samples entries periodically and evicts expired ones. Lazy expiry also checks TTL on every GET and returns null if expired.
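
    A minimal sketch of the LRU structure from the first bullet above. Python's OrderedDict already pairs a hash map with a doubly-linked list, so it stands in for the two structures a real node would maintain by hand.

    ```python
    from collections import OrderedDict

    class LRUCache:
        def __init__(self, max_entries: int):
            self.max_entries = max_entries
            self._entries: OrderedDict[str, bytes] = OrderedDict()

        def get(self, key: str) -> bytes | None:
            if key not in self._entries:
                return None
            self._entries.move_to_end(key)         # O(1): promote to most-recently-used
            return self._entries[key]

        def set(self, key: str, value: bytes) -> None:
            if key in self._entries:
                self._entries.move_to_end(key)
            self._entries[key] = value
            if len(self._entries) > self.max_entries:
                self._entries.popitem(last=False)  # O(1): evict the least-recently-used entry
    ```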

    Memory Allocator

    General-purpose allocators (like glibc malloc) suffer from fragmentation with cache workloads. Use a slab allocator that pre-allocates memory in fixed-size classes (64B, 128B, 256B, 1KB, 4KB, etc.). Each value is stored in the smallest slab class that fits. This trades some internal fragmentation for deterministic allocation performance and reduced external fragmentation.
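
    The size-class lookup at the heart of a slab allocator is a single binary search. The class list below is illustrative; real allocators grow classes by a fixed factor and route oversized values to a separate large-object path.

    ```python
    import bisect

    SLAB_CLASSES = [64, 128, 256, 512, 1_024, 4_096, 16_384, 65_536]  # bytes, illustrative

    def slab_class_for(size: int) -> int | None:
        """Smallest slab class that fits `size`; None means use the large-object path."""
        idx = bisect.bisect_left(SLAB_CLASSES, size)
        return SLAB_CLASSES[idx] if idx < len(SLAB_CLASSES) else None

    # A 300-byte value lands in the 512-byte class: ~212 bytes of internal
    # fragmentation traded for O(1) allocation from a pre-carved slab.
    assert slab_class_for(300) == 512
    ```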

    Stage 4: High-Level Architecture

    The architecture must distribute data across multiple nodes while maintaining sub-millisecond access. Three core problems drive the design: partitioning, routing, and replication.

    Consistent Hashing

    We use consistent hashing to assign hash slots to nodes. Nodes are placed on a hash ring based on their identifier. Each key hashes to a position on the ring and is stored on the first node encountered clockwise from that position.

    Why consistent hashing over simple modular hashing (hash(key) % N)? When a node is added or removed, modular hashing remaps nearly every key, causing a cache avalanche. Consistent hashing only remaps keys that belong to the affected portion of the ring — roughly K/N keys, where K is the total number of keys and N is the number of nodes.

    Partition Assignment

    The 16,384 hash slots are distributed evenly across nodes. With 5 nodes, each node owns approximately 3,277 slots. The slot-to-node mapping is stored in a cluster configuration that all clients and nodes share. When a node joins or leaves, slots are redistributed and the configuration is propagated.

    Replication

    Each primary node has one or more replica nodes. Replication is asynchronous — the primary acknowledges writes immediately and streams updates to replicas in the background. This preserves sub-millisecond write latency at the cost of potential data loss on primary failure (a small window of unreplicated writes).

    Replicas serve two purposes:

    • Failover: If a primary dies, a replica is promoted. The cluster remains available with minimal disruption.
    • Read scaling: Clients can optionally read from replicas for read-heavy workloads, accepting slightly stale data.

    Request Routing

    There are three common routing strategies:

    1. Client-side routing: The client library knows the slot map and sends each request directly to the correct node. Lowest latency (one hop), but requires smart clients.
    2. Proxy-based routing: A stateless proxy layer accepts all requests, hashes the key, and forwards to the correct node. Simpler clients, but adds a network hop (~0.1-0.3ms).
    3. Redirect-based: The client sends to any node. If that node does not own the slot, it returns a redirect. The client retries at the correct node. Simple but adds latency on misses.

    For our design, we use client-side routing as the primary mode (for performance-sensitive applications) with a proxy layer available for simpler clients. This mirrors how Redis Cluster operates in practice.
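
    A sketch of what the client-side routing path looks like. The connection objects and their send method are placeholders; the point is that routing a request is one local slot computation followed by a single network hop to the owning node.

    ```python
    class ClusterClient:
        """Client-side routing: the client holds the slot -> node map published by
        the cluster and sends each request straight to the owning node."""

        def __init__(self, slot_map: list[str], connections: dict, key_to_slot):
            self.slot_map = slot_map        # index = slot number, value = node address
            self.connections = connections  # node address -> open connection (placeholder object)
            self.key_to_slot = key_to_slot  # e.g. the CRC16-based mapping sketched in Stage 3

        def get(self, key: str):
            slot = self.key_to_slot(key)    # local computation -- no extra round-trip
            node = self.slot_map[slot]
            return self.connections[node].send("GET", key)

        def refresh_topology(self, new_slot_map: list[str]) -> None:
            """Applied when the cluster pushes an updated slot assignment."""
            self.slot_map = new_slot_map
    ```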

    Stage 5: Deep Dive

    We will go deep on two critical topics: consistent hashing with virtual nodes and hot key handling with thundering herd prevention.

    Deep Dive 1: Consistent Hashing with Virtual Nodes

    Naive consistent hashing — placing each physical node at a single point on the ring — leads to uneven data distribution. If nodes happen to cluster together on the ring, some nodes store far more keys than others. This problem worsens as nodes are added and removed.

    Virtual Nodes

    The solution is virtual nodes (vnodes). Each physical node is mapped to multiple points on the ring — typically 100 to 200 virtual nodes per physical node. The key is hashed to a position on the ring and assigned to the nearest virtual node clockwise, which maps back to a physical node.
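
    A sketch of a ring with virtual nodes. The hash function here is truncated MD5 purely for brevity; per the hash-function discussion below, a real ring would use xxHash or MurmurHash3.

    ```python
    import bisect
    import hashlib

    class HashRing:
        def __init__(self, nodes: list[str], vnodes_per_node: int = 150):
            points = []
            for node in nodes:
                for i in range(vnodes_per_node):
                    points.append((self._hash(f"{node}#vn{i}"), node))  # one ring point per vnode
            points.sort()
            self._points = [p for p, _ in points]
            self._owners = [n for _, n in points]

        @staticmethod
        def _hash(s: str) -> int:
            # Truncated MD5 for brevity; use a fast non-cryptographic hash in practice.
            return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

        def node_for(self, key: str) -> str:
            idx = bisect.bisect_right(self._points, self._hash(key))
            return self._owners[idx % len(self._owners)]  # wrap around the ring

    ring = HashRing(["node-a", "node-b", "node-c"])
    owner = ring.node_for("user:42:profile")  # first vnode clockwise from the key's position
    ```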

    Benefits of virtual nodes:

    • Uniform distribution: With 150+ vnodes per node, the standard deviation of keys per node drops to under 5% of the mean.
    • Heterogeneous hardware: A node with 64 GB of RAM can be assigned more vnodes than a node with 32 GB, achieving proportional data distribution without special logic.
    • Smoother rebalancing: When a node leaves, its vnodes are scattered across the ring. The keys from each vnode are picked up by different successor nodes, spreading the load increase evenly rather than dumping everything onto a single neighbor.

    Hash Function Choice

    The hash function must be fast and produce uniform distribution. xxHash or MurmurHash3 are common choices — both compute in under 100 nanoseconds for typical key sizes and produce well-distributed 64-bit hashes. Avoid cryptographic hashes (SHA-256) — they are 10x slower and the security properties are unnecessary for slot assignment.

    Handling Node Addition

    When a new node joins the cluster:

    1. The new node is assigned its set of vnodes on the ring.
    2. For each vnode, the new node becomes the owner of some keys that previously belonged to the next node clockwise.
    3. A background slot migration process transfers the affected keys from the old owner to the new node.
    4. During migration, the old node continues serving reads for migrating slots and redirects writes to the new owner. This ensures zero downtime.
    5. Once migration is complete, the cluster configuration is updated and propagated.

    Deep Dive 2: Hot Key Handling and Thundering Herd

    In real-world systems, key access is rarely uniform. A viral tweet, a flash sale product, or a popular configuration value can concentrate millions of requests per second on a single key — a hot key. Since one key maps to one slot on one node, that node becomes a bottleneck.

    Detecting Hot Keys

    Each node maintains a probabilistic frequency counter (similar to a Count-Min Sketch) that tracks access frequency per key with minimal memory. Keys exceeding a configurable threshold (e.g., 5,000 requests/second) are flagged as hot. The proxy or client library can also track request distribution and identify hot keys client-side.
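
    A sketch of the counting side using a tiny Count-Min Sketch: a few hash rows of small counters whose minimum gives a frequency estimate that never undercounts. The width, depth, blake2b-based hashing, and threshold check are illustrative; a real node would also reset or decay the counters at the start of each sampling window.

    ```python
    import hashlib

    class CountMinSketch:
        def __init__(self, width: int = 2048, depth: int = 4):
            self.width, self.depth = width, depth
            self.rows = [[0] * width for _ in range(depth)]

        def _buckets(self, key: str):
            for i in range(self.depth):
                digest = hashlib.blake2b(key.encode(), digest_size=8, salt=bytes([i])).digest()
                yield i, int.from_bytes(digest, "big") % self.width

        def add(self, key: str) -> None:
            for row, col in self._buckets(key):
                self.rows[row][col] += 1

        def estimate(self, key: str) -> int:
            # The true count is at most this value; hash collisions can only inflate it.
            return min(self.rows[row][col] for row, col in self._buckets(key))

    HOT_THRESHOLD = 5_000  # accesses per sampling window, per the threshold above

    def is_hot(sketch: CountMinSketch, key: str) -> bool:
        return sketch.estimate(key) >= HOT_THRESHOLD
    ```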

    Mitigating Hot Keys

    • Local caching: The client library or proxy maintains a small in-process cache (L1 cache) for hot keys. The L1 cache has a short TTL (1-5 seconds) and absorbs most reads without hitting the cache node. This is the most effective mitigation — it eliminates network round-trips entirely for hot keys.
    • Key replication across slots: Internally, the system can replicate a hot key's value to multiple nodes. The client library appends a random suffix (e.g., key#1, key#2, ..., key#8) and distributes reads across these copies. This spreads load across nodes but increases write complexity.
    • Read from replicas: For hot keys that are read-heavy, route reads to replicas instead of the primary. This scales read capacity linearly with the number of replicas.

    Thundering Herd (Cache Stampede)

    When a popular key expires, hundreds of threads may simultaneously discover the cache miss and all attempt to regenerate the value from the database — a thundering herd. The database sees a sudden spike that can cascade into an outage.

    Three complementary solutions:

    1. Lock-based recomputation: When a cache miss occurs, the first requestor acquires a distributed lock (SET regen:key owner NX EX 10). Subsequent requestors wait briefly and retry the cache read. Only one thread hits the database. This is the most common approach.
    2. Probabilistic early expiry (XFetch): Before the TTL expires, each request has a small probability of triggering a background refresh, and that probability rises as expiry approaches: shouldRefresh = currentTime - delta * beta * log(rand()) >= expiry, where delta is the time it takes to recompute the value and beta (typically 1) tunes eagerness. Because log(rand()) is negative, the check almost never fires far from expiry, so roughly one requestor refreshes the key just before it expires while all others continue reading the existing value (sketched after this list).
    3. Stale-while-revalidate: The cache stores a soft_ttl and a hard_ttl. After soft_ttl, the value is considered stale but still serveable. One background thread refreshes the value. The hard_ttl is the true expiry. This guarantees that the cache always has a value, even if slightly stale.
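
    A sketch of the XFetch check from option 2 above; the function name and the way the background refresh would be triggered are illustrative.

    ```python
    import math
    import random
    import time

    def should_refresh_early(expiry: float, delta: float, beta: float = 1.0) -> bool:
        """Return True if this request should kick off a background refresh.

        expiry -- absolute expiry time in seconds since epoch
        delta  -- measured time it takes to recompute the value
        beta   -- > 1 refreshes more eagerly, < 1 less eagerly
        """
        u = 1.0 - random.random()  # uniform in (0, 1], avoids log(0)
        # log(u) is negative, so the term pushes "now" forward in time: far from
        # expiry the inequality almost never holds, close to expiry it holds with
        # rapidly rising probability.
        return time.time() - delta * beta * math.log(u) >= expiry

    # On every cache hit: serve the cached value either way, and only occasionally
    # (near expiry) schedule the recomputation in the background.
    # if should_refresh_early(entry_expiry_ts, observed_recompute_seconds):
    #     schedule_background_refresh(key)   # hypothetical helper
    ```
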
    Interview tip: Naming the thundering herd problem and offering multiple mitigation strategies signals depth. Most candidates mention locking. Fewer mention XFetch or stale-while-revalidate. Covering all three and discussing trade-offs (latency vs. staleness vs. complexity) is a strong differentiator.

    Stage 6: Scaling and Operational Concerns

    Adding and Removing Nodes

    The cluster must scale without downtime. The process for adding a node:

    1. The new node registers with the cluster coordinator (a lightweight metadata service or gossip protocol).
    2. The coordinator computes a new slot assignment that rebalances load.
    3. Slot migration begins: for each slot moving to the new node, the source node streams all key-value pairs in that slot to the new node.
    4. During migration, reads for migrating slots are served by the source node. Writes go to both the source and the new node (dual-write) to prevent data loss.
    5. Once migration is complete, the configuration is updated atomically and propagated to all clients.

    Removing a node follows the same process in reverse: its slots are redistributed among remaining nodes before it is decommissioned.

    Memory Management

    Memory is the scarcest resource in a cache. Effective management requires multiple strategies:

    • Max memory policy: Each node has a configured maximum memory limit. When reached, the eviction policy (LRU, LFU, or random) kicks in. Never let the OS start swapping — swapped cache pages defeat the purpose of in-memory caching.
    • Memory fragmentation monitoring: Track the ratio of RSS (Resident Set Size) to actual data stored. A ratio above 1.5 indicates fragmentation. Trigger a background defragmentation pass or restart the node with data reloaded from replicas.
    • Large key detection: Keys with values exceeding 10 KB should be flagged and monitored. Large keys cause uneven memory distribution and can block the event loop during serialization. Consider compressing values above a threshold or splitting them into chunks.
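
    A sketch of the compress-above-threshold idea from the last bullet, assuming a one-byte flag is prefixed to every stored value so readers can tell compressed and raw payloads apart; the flag values and zlib level are illustrative choices.

    ```python
    import zlib

    COMPRESS_THRESHOLD = 10 * 1024      # matches the 10 KB large-value threshold above
    RAW, COMPRESSED = b"\x00", b"\x01"  # one-byte flag prefixed to every stored value

    def encode_value(value: bytes) -> bytes:
        """Compress large values before they are written to the cache node."""
        if len(value) >= COMPRESS_THRESHOLD:
            return COMPRESSED + zlib.compress(value, level=1)  # level 1: cheap and fast
        return RAW + value

    def decode_value(stored: bytes) -> bytes:
        flag, payload = stored[:1], stored[1:]
        return zlib.decompress(payload) if flag == COMPRESSED else payload
    ```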

    Failure Detection and Recovery

    The cluster uses a gossip protocol for failure detection. Each node periodically pings a random subset of other nodes. If a node fails to respond after multiple attempts, it is marked as PFAIL (possibly failed). When a majority of nodes agree on the failure, the node is marked as FAIL and failover begins.

    Failover process:

    1. The replica with the most up-to-date replication offset is elected as the new primary.
    2. The new primary takes ownership of all slots that belonged to the failed node.
    3. Clients are notified of the topology change and update their routing tables.
    4. Total failover time: typically 1-3 seconds (detection) + ~1 second (promotion) = 2-4 seconds of degraded availability.

    Cross-Data-Center Replication

    For global applications, deploy independent cache clusters in each data center with asynchronous cross-region replication. Each cluster is authoritative for its region. Writes are replicated to other regions with a delay (50-200ms depending on network distance). Reads are always served locally for sub-millisecond latency.

    This introduces the possibility of conflicts (two regions write the same key simultaneously). Resolution strategy: last-writer-wins using a logical timestamp (Lamport clock or hybrid logical clock). For cache data, this is acceptable — the worst case is a slightly stale value that will be refreshed on the next cache miss.
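
    A sketch of the last-writer-wins merge applied when a replicated write arrives for a key the local region has also written. The field names are illustrative, and the region id acts as a deterministic tie-breaker so all regions converge to the same value.

    ```python
    from dataclasses import dataclass

    @dataclass
    class VersionedValue:
        value: bytes
        hlc_ts: int     # hybrid logical clock (or Lamport) timestamp attached at write time
        region_id: int  # tie-breaker when two regions produce the same timestamp

    def merge_lww(local: VersionedValue, incoming: VersionedValue) -> VersionedValue:
        """Keep the write with the higher (timestamp, region) pair."""
        if (incoming.hlc_ts, incoming.region_id) > (local.hlc_ts, local.region_id):
            return incoming
        return local
    ```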

    Capacity Planning

    Parameter | Value | Rationale
    --- | --- | ---
    Keys | 100 million | Typical for a large-scale web application
    Average value size | 500 bytes | JSON fragments, serialized objects
    Raw data | ~50 GB | 100M x 500B
    Metadata overhead | ~2.4 GB | 100M x 24B per entry
    Memory per node | 16 GB usable | 64 GB physical, ~25% usable after OS + fragmentation buffer
    Node count | 4 primaries + 4 replicas | ~13 GB data per primary, with headroom for growth
    Throughput per node | 250K ops/sec | Single-threaded event loop with io_uring or epoll
    Total throughput | 1M ops/sec | 4 primary nodes x 250K ops/sec

    Scoring Tips

    To score well on a distributed cache design question, keep these principles in mind:

    • Start with requirements, not solutions. State throughput targets (1M ops/sec), latency constraints (sub-ms p99), and memory budget before proposing any architecture. This demonstrates that your design is driven by constraints.
    • Explain consistent hashing deeply. Do not just name it — walk through what happens when a node is added. Mention virtual nodes and explain why they solve the non-uniformity problem. This is the core algorithm of the system.
    • Address eviction with specifics. Saying "use LRU" is surface-level. Explain the doubly-linked list + hash map implementation, its O(1) complexity, and compare with LFU for workloads with frequency skew. Bonus: mention Redis's approximate LRU (sampled eviction) as a practical optimization.
    • Proactively discuss hot keys. Most candidates forget about non-uniform access patterns. Naming the hot key problem and offering local caching, key replication, and read-from-replica as mitigations shows production awareness.
    • Name the thundering herd. This is a classic failure mode that separates textbook answers from real-world experience. Offer the locking approach as a baseline and XFetch or stale-while-revalidate as advanced alternatives.
    • Discuss trade-offs explicitly. Async replication trades durability for latency. Eventual consistency trades correctness for availability. Slab allocation trades internal fragmentation for allocation speed. Interviewers want to see that you understand these are deliberate choices, not oversights.
    • Mention operational concerns. Failure detection (gossip protocol), memory fragmentation monitoring, and graceful node decommissioning show you think about Day 2 operations, not just Day 1 design.

    Practice delivering this design end-to-end in under 35 minutes. Focus on smooth transitions between stages and be ready for deep-dive follow-ups on any subsystem. Tools like Hoppers AI let you rehearse system design interviews with real-time feedback on structure, depth, and pacing — helping you refine your delivery before the real thing.