
    Design a Distributed Cache

    Hoppers AI Team · March 12, 2026 · 12 min read

    Design a Distributed Cache — Complete System Design Walkthrough

    Distributed caches are the backbone of modern high-performance systems. Every major tech company — from Amazon to Netflix — relies on in-memory caching layers to achieve the sub-millisecond response times users expect. Designing one from scratch is a classic system design interview question because it tests your understanding of hashing, replication, memory management, and failure handling. In this walkthrough, we will design a distributed cache system end-to-end, following a structured 6-stage approach.

    Figure: Distributed Cache — High-Level Architecture. Clients A, B, and C send requests through a cache proxy (router / load balancer), which maps keys onto a consistent hash ring of virtual nodes (VN1–VN6) backed by three primary cache nodes; each primary replicates asynchronously to a replica pool.

    Stage 1: Requirements Gathering

    Begin every system design interview by clarifying scope. Spend 3-5 minutes establishing what you are building and at what scale. This is where strong candidates separate themselves.

    Functional Requirements

    • Core operations — GET(key), SET(key, value, ttl), DELETE(key). These are the primitives every cache must support.
    • TTL (Time-To-Live) — Keys expire automatically after a configurable duration. Clients can set per-key TTLs on write.
    • Eviction policies — When memory is full, the cache must evict entries. Support at least LRU (Least Recently Used) and LFU (Least Frequently Used).
    • Batch operations — MGET and MSET for retrieving or writing multiple keys in a single round-trip to reduce network overhead.
    • Key enumeration — Ability to scan keys by prefix or pattern (non-blocking, cursor-based).

    Non-Functional Requirements

    • Throughput: ~1 million operations per second across the cluster.
    • Latency: Sub-millisecond p99 for single-key GET/SET operations within a data center.
    • Availability: 99.99% uptime. Cache unavailability cascades into database overload.
    • Scalability: Horizontally scalable — add nodes to increase capacity without downtime.
    • Consistency: Eventual consistency for replicas is acceptable. Single-key operations on the primary are linearizable.
    • Memory efficiency: Minimize overhead per key. With 100 million keys averaging 500 bytes each, we need approximately 50 GB of usable cache memory across the cluster.
    Interview tip: Explicitly state your scale numbers. Saying "1 million ops/sec" gives you concrete throughput targets to reference when sizing nodes later. It also signals to the interviewer that you reason from constraints, not intuition.

    Stage 2: API Design

    A distributed cache exposes a simple interface, but the API must account for batch operations, conditional writes, and configuration. Define the external contract before thinking about internals.

    Core Cache Operations

    Operation | Signature | Description
    --- | --- | ---
    GET | GET key | Returns value or null if key does not exist or has expired.
    SET | SET key value [EX seconds] [PX milliseconds] [NX|XX] | Sets key to value with optional TTL. NX: only if key does not exist. XX: only if key exists.
    DELETE | DELETE key | Removes the key and its value.
    MGET | MGET key1 key2 ... keyN | Returns values for multiple keys in a single round-trip.
    MSET | MSET key1 value1 key2 value2 ... | Sets multiple key-value pairs atomically on each node.
    TTL | TTL key | Returns remaining time-to-live in seconds (-1 if no TTL, -2 if key missing).
    SCAN | SCAN cursor [MATCH pattern] [COUNT hint] | Cursor-based key enumeration. Non-blocking, returns a batch of keys and the next cursor.

    Conditional Writes

    The NX and XX flags on SET enable two important patterns:

    • Distributed locking — SET lock_key owner_id EX 30 NX acquires a lock only if it does not already exist, with automatic expiry as a deadlock safety net (sketched below).
    • Cache refresh — SET key new_value XX updates an existing entry without creating one if the key has been evicted.
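
    The locking pattern above maps onto a handful of client calls. Below is a minimal sketch assuming a redis-py-style client; the hostname, key names, and the non-atomic release are illustrative simplifications (a production release would use a Lua script or equivalent atomic compare-and-delete).

    ```python
    import uuid
    import redis  # assuming a redis-py-style client; any RESP client exposes the same options

    r = redis.Redis(host="cache.internal", port=6379)  # placeholder address

    def acquire_lock(name: str, ttl_seconds: int = 30) -> str | None:
        """Try to acquire a distributed lock; returns an owner token on success."""
        owner = str(uuid.uuid4())
        # SET lock:<name> <owner> NX EX <ttl> -- succeeds only if the key does not
        # already exist; the TTL is the deadlock safety net if the holder crashes.
        if r.set(f"lock:{name}", owner, nx=True, ex=ttl_seconds):
            return owner
        return None

    def release_lock(name: str, owner: str) -> None:
        """Release only if we still hold the lock (compare, then delete)."""
        # Two commands for brevity; a Lua script would make the check-and-delete atomic.
        if r.get(f"lock:{name}") == owner.encode():
            r.delete(f"lock:{name}")
    ```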

    Client Configuration

    Clients need to know the cluster topology. A configuration endpoint or service discovery mechanism provides the list of cache nodes, their hash ranges, and health status. On topology changes (node added/removed), the client library receives updated mappings and re-routes requests accordingly.

    Design decision: We use a text-based protocol (like Redis's RESP) rather than HTTP. HTTP headers add hundreds of bytes of overhead per request. For a cache that serves millions of sub-millisecond requests per second, protocol overhead directly impacts throughput. RESP is simple, parseable, and adds minimal bytes per command.

    Stage 3: Data Model

    The internal data structures of a cache node determine its performance characteristics. Every microsecond matters when targeting sub-millisecond latency.

    Key-Value Storage

    Each cache node stores entries in a hash table as the primary index. The hash table maps keys to value pointers and metadata:

    Field | Type | Size | Description
    --- | --- | --- | ---
    key | byte[] | variable | The cache key (max 512 bytes)
    value | byte[] | variable | The cached value (max 1 MB)
    ttl_expiry | uint64 | 8 bytes | Absolute expiry timestamp in milliseconds (0 = no expiry)
    last_accessed | uint64 | 8 bytes | Timestamp of last GET/SET (used for LRU)
    access_count | uint32 | 4 bytes | Approximate access frequency (used for LFU)
    size | uint32 | 4 bytes | Total bytes consumed (key + value + metadata)

    The per-entry metadata overhead is approximately 24 bytes. With 10 million keys per node, metadata accounts for ~240 MB — a reasonable trade-off for supporting both LRU and LFU eviction.
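
    To make the table concrete, here is a rough sketch of a per-entry record with those fields; the names and Python types are illustrative stand-ins for the packed uint32/uint64 layout a real node would use.

    ```python
    from dataclasses import dataclass

    METADATA_BYTES = 24  # two uint64 + two uint32 fields, as in the table above

    @dataclass
    class CacheEntry:
        key: bytes                 # up to 512 bytes
        value: bytes               # up to 1 MB
        ttl_expiry_ms: int = 0     # absolute expiry in ms since epoch; 0 = no expiry
        last_accessed_ms: int = 0  # updated on every GET/SET, drives LRU
        access_count: int = 0      # approximate frequency, drives LFU

        @property
        def size(self) -> int:
            """Total bytes charged against the node's memory budget."""
            return len(self.key) + len(self.value) + METADATA_BYTES

        def is_expired(self, now_ms: int) -> bool:
            return self.ttl_expiry_ms != 0 and now_ms >= self.ttl_expiry_ms
    ```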

    Hash Slots

    The total keyspace is divided into a fixed number of hash slots (e.g., 16,384, as in Redis Cluster). Each key is mapped to a slot via CRC16(key) mod 16384. Slots are then assigned to nodes. This indirection layer makes rebalancing possible: moving a slot from one node to another is a bounded operation that affects only the keys in that slot.
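
    A sketch of the key-to-slot mapping, assuming the CRC16 (XMODEM) variant used by Redis Cluster; any fast, stable checksum works as long as every client and node agrees on it.

    ```python
    NUM_SLOTS = 16_384

    def crc16_xmodem(data: bytes) -> int:
        """Bit-by-bit CRC16 (polynomial 0x1021, initial value 0) -- slow but clear;
        real implementations use a 256-entry lookup table."""
        crc = 0
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
        return crc

    def key_to_slot(key: str) -> int:
        # Every client and node must compute the same slot for routing to work.
        return crc16_xmodem(key.encode()) % NUM_SLOTS
    ```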

    Eviction Data Structures

    • LRU: A doubly-linked list ordered by access time, combined with hash table pointers. On each access, the entry is moved to the head. Eviction removes from the tail. O(1) for both operations (a minimal sketch follows this list).
    • LFU: A min-heap or frequency-bucketed linked list. Redis uses a probabilistic counter (8-bit logarithmic counter with decay) to approximate frequency without maintaining exact counts.
    • TTL expiry: A min-heap (priority queue) keyed by expiry timestamp. A background thread samples entries periodically and evicts expired ones. Lazy expiry also checks TTL on every GET and returns null if expired.
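
    A minimal sketch of the LRU structure from the first bullet above. Python's OrderedDict already pairs a hash map with a doubly-linked list, so it stands in for the two structures a real node would maintain by hand.

    ```python
    from collections import OrderedDict

    class LRUCache:
        def __init__(self, max_entries: int):
            self.max_entries = max_entries
            self._entries: OrderedDict[str, bytes] = OrderedDict()

        def get(self, key: str) -> bytes | None:
            if key not in self._entries:
                return None
            self._entries.move_to_end(key)         # O(1): promote to most-recently-used
            return self._entries[key]

        def set(self, key: str, value: bytes) -> None:
            if key in self._entries:
                self._entries.move_to_end(key)
            self._entries[key] = value
            if len(self._entries) > self.max_entries:
                self._entries.popitem(last=False)  # O(1): evict the least-recently-used entry
    ```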

    Memory Allocator

    General-purpose allocators (like glibc malloc) suffer from fragmentation with cache workloads. Use a slab allocator that pre-allocates memory in fixed-size classes (64B, 128B, 256B, 1KB, 4KB, etc.). Each value is stored in the smallest slab class that fits. This trades some internal fragmentation for deterministic allocation performance and reduced external fragmentation.
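
    The size-class lookup at the heart of a slab allocator is a single binary search. The class list below is illustrative; real allocators grow classes by a fixed factor and route oversized values to a separate large-object path.

    ```python
    import bisect

    SLAB_CLASSES = [64, 128, 256, 512, 1_024, 4_096, 16_384, 65_536]  # bytes, illustrative

    def slab_class_for(size: int) -> int | None:
        """Smallest slab class that fits `size`; None means use the large-object path."""
        idx = bisect.bisect_left(SLAB_CLASSES, size)
        return SLAB_CLASSES[idx] if idx < len(SLAB_CLASSES) else None

    # A 300-byte value lands in the 512-byte class: ~212 bytes of internal
    # fragmentation traded for O(1) allocation from a pre-carved slab.
    assert slab_class_for(300) == 512
    ```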

    Stage 4: High-Level Architecture

    The architecture must distribute data across multiple nodes while maintaining sub-millisecond access. Three core problems drive the design: partitioning, routing, and replication.

    Consistent Hashing

    We use consistent hashing to assign hash slots to nodes. Nodes are placed on a hash ring based on their identifier. Each key hashes to a position on the ring and is stored on the first node encountered clockwise from that position.

    Why consistent hashing over simple modular hashing (hash(key) % N)? When a node is added or removed, modular hashing remaps nearly every key, causing a cache avalanche. Consistent hashing only remaps keys that belong to the affected portion of the ring — roughly K/N keys, where K is the total number of keys and N is the number of nodes.

    Partition Assignment

    The 16,384 hash slots are distributed evenly across nodes. With 5 nodes, each node owns approximately 3,277 slots. The slot-to-node mapping is stored in a cluster configuration that all clients and nodes share. When a node joins or leaves, slots are redistributed and the configuration is propagated.

    Replication

    Each primary node has one or more replica nodes. Replication is asynchronous — the primary acknowledges writes immediately and streams updates to replicas in the background. This preserves sub-millisecond write latency at the cost of potential data loss on primary failure (a small window of unreplicated writes).

    Replicas serve two purposes:

    • Failover: If a primary dies, a replica is promoted. The cluster remains available with minimal disruption.
    • Read scaling: Clients can optionally read from replicas for read-heavy workloads, accepting slightly stale data.

    Request Routing

    There are three common routing strategies:

    1. Client-side routing: The client library knows the slot map and sends each request directly to the correct node. Lowest latency (one hop), but requires smart clients.
    2. Proxy-based routing: A stateless proxy layer accepts all requests, hashes the key, and forwards to the correct node. Simpler clients, but adds a network hop (~0.1-0.3ms).
    3. Redirect-based: The client sends to any node. If that node does not own the slot, it returns a redirect. The client retries at the correct node. Simple but adds latency on misses.

    For our design, we use client-side routing as the primary mode (for performance-sensitive applications) with a proxy layer available for simpler clients. This mirrors how Redis Cluster operates in practice.
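
    A sketch of what the client-side routing path looks like. The connection objects and their send method are placeholders; the point is that routing a request is one local slot computation followed by a single network hop to the owning node.

    ```python
    class ClusterClient:
        """Client-side routing: the client holds the slot -> node map published by
        the cluster and sends each request straight to the owning node."""

        def __init__(self, slot_map: list[str], connections: dict, key_to_slot):
            self.slot_map = slot_map        # index = slot number, value = node address
            self.connections = connections  # node address -> open connection (placeholder object)
            self.key_to_slot = key_to_slot  # e.g. the CRC16-based mapping sketched in Stage 3

        def get(self, key: str):
            slot = self.key_to_slot(key)    # local computation -- no extra round-trip
            node = self.slot_map[slot]
            return self.connections[node].send("GET", key)

        def refresh_topology(self, new_slot_map: list[str]) -> None:
            """Applied when the cluster pushes an updated slot assignment."""
            self.slot_map = new_slot_map
    ```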

    Stage 5: Deep Dive

    We will go deep on two critical topics: consistent hashing with virtual nodes and hot key handling with thundering herd prevention.

    Deep Dive 1: Consistent Hashing with Virtual Nodes

    Naive consistent hashing — placing each physical node at a single point on the ring — leads to uneven data distribution. If nodes happen to cluster together on the ring, some nodes store far more keys than others. This problem worsens as nodes are added and removed.

    Virtual Nodes

    The solution is virtual nodes (vnodes). Each physical node is mapped to multiple points on the ring — typically 100 to 200 virtual nodes per physical node. The key is hashed to a position on the ring and assigned to the nearest virtual node clockwise, which maps back to a physical node.
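
    A sketch of a ring with virtual nodes. The hash function here is truncated MD5 purely for brevity; per the hash-function discussion below, a real ring would use xxHash or MurmurHash3.

    ```python
    import bisect
    import hashlib

    class HashRing:
        def __init__(self, nodes: list[str], vnodes_per_node: int = 150):
            points = []
            for node in nodes:
                for i in range(vnodes_per_node):
                    points.append((self._hash(f"{node}#vn{i}"), node))  # one ring point per vnode
            points.sort()
            self._points = [p for p, _ in points]
            self._owners = [n for _, n in points]

        @staticmethod
        def _hash(s: str) -> int:
            # Truncated MD5 for brevity; use a fast non-cryptographic hash in practice.
            return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

        def node_for(self, key: str) -> str:
            idx = bisect.bisect_right(self._points, self._hash(key))
            return self._owners[idx % len(self._owners)]  # wrap around the ring

    ring = HashRing(["node-a", "node-b", "node-c"])
    owner = ring.node_for("user:42:profile")  # first vnode clockwise from the key's position
    ```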

    Benefits of virtual nodes:

    • Uniform distribution: With 150+ vnodes per node, the standard deviation of keys per node drops to under 5% of the mean.
    • Heterogeneous hardware: A node with 64 GB of RAM can be assigned more vnodes than a node with 32 GB, achieving proportional data distribution without special logic.
    • Smoother rebalancing: When a node leaves, its vnodes are scattered across the ring. The keys from each vnode are picked up by different successor nodes, spreading the load increase evenly rather than dumping everything onto a single neighbor.

    Hash Function Choice

    The hash function must be fast and produce uniform distribution. xxHash or MurmurHash3 are common choices — both compute in under 100 nanoseconds for typical key sizes and produce well-distributed 64-bit hashes. Avoid cryptographic hashes (SHA-256) — they are 10x slower and the security properties are unnecessary for slot assignment.

    Handling Node Addition

    When a new node joins the cluster:

    1. The new node is assigned its set of vnodes on the ring.
    2. For each vnode, the new node becomes the owner of some keys that previously belonged to the next node clockwise.
    3. A background slot migration process transfers the affected keys from the old owner to the new node.
    4. During migration, the old node continues serving reads for migrating slots and redirects writes to the new owner. This ensures zero downtime.
    5. Once migration is complete, the cluster configuration is updated and propagated.

    Deep Dive 2: Hot Key Handling and Thundering Herd

    In real-world systems, key access is rarely uniform. A viral tweet, a flash sale product, or a popular configuration value can concentrate millions of requests per second on a single key — a hot key. Since one key maps to one slot on one node, that node becomes a bottleneck.

    Detecting Hot Keys

    Each node maintains a probabilistic frequency counter (similar to a Count-Min Sketch) that tracks access frequency per key with minimal memory. Keys exceeding a configurable threshold (e.g., 5,000 requests/second) are flagged as hot. The proxy or client library can also track request distribution and identify hot keys client-side.
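
    A sketch of the counting side using a tiny Count-Min Sketch: a few hash rows of small counters whose minimum gives a frequency estimate that never undercounts. The width, depth, blake2b-based hashing, and threshold check are illustrative; a real node would also reset or decay the counters at the start of each sampling window.

    ```python
    import hashlib

    class CountMinSketch:
        def __init__(self, width: int = 2048, depth: int = 4):
            self.width, self.depth = width, depth
            self.rows = [[0] * width for _ in range(depth)]

        def _buckets(self, key: str):
            for i in range(self.depth):
                digest = hashlib.blake2b(key.encode(), digest_size=8, salt=bytes([i])).digest()
                yield i, int.from_bytes(digest, "big") % self.width

        def add(self, key: str) -> None:
            for row, col in self._buckets(key):
                self.rows[row][col] += 1

        def estimate(self, key: str) -> int:
            # The true count is at most this value; hash collisions can only inflate it.
            return min(self.rows[row][col] for row, col in self._buckets(key))

    HOT_THRESHOLD = 5_000  # accesses per sampling window, per the threshold above

    def is_hot(sketch: CountMinSketch, key: str) -> bool:
        return sketch.estimate(key) >= HOT_THRESHOLD
    ```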

    Mitigating Hot Keys

    • Local caching: The client library or proxy maintains a small in-process cache (L1 cache) for hot keys. The L1 cache has a short TTL (1-5 seconds) and absorbs most reads without hitting the cache node. This is the most effective mitigation — it eliminates network round-trips entirely for hot keys.
    • Key replication across slots: Internally, the system can replicate a hot key's value to multiple nodes. The client library appends a random suffix (e.g., key#1, key#2, ..., key#8) and distributes reads across these copies. This spreads load across nodes but increases write complexity.
    • Read from replicas: For hot keys that are read-heavy, route reads to replicas instead of the primary. This scales read capacity linearly with the number of replicas.

    Thundering Herd (Cache Stampede)

    When a popular key expires, hundreds of threads may simultaneously discover the cache miss and all attempt to regenerate the value from the database — a thundering herd. The database sees a sudden spike that can cascade into an outage.

    Three complementary solutions:

    1. Lock-based recomputation: When a cache miss occurs, the first requestor acquires a distributed lock (SET regen:key owner NX EX 10). Subsequent requestors wait briefly and retry the cache read. Only one thread hits the database. This is the most common approach.
    2. Probabilistic early expiry (XFetch): Before the TTL expires, each request has a small probability of triggering a background refresh, and that probability rises as expiry approaches: shouldRefresh = currentTime - delta * beta * log(rand()) >= expiry, where delta is the time it takes to recompute the value and beta (typically 1) tunes eagerness. Because log(rand()) is negative, the check almost never fires far from expiry, so roughly one requestor refreshes the key just before it expires while all others continue reading the existing value (sketched after this list).
    3. Stale-while-revalidate: The cache stores a soft_ttl and a hard_ttl. After soft_ttl, the value is considered stale but still serveable. One background thread refreshes the value. The hard_ttl is the true expiry. This guarantees that the cache always has a value, even if slightly stale.
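
    A sketch of the XFetch check from option 2 above; the function name and the way the background refresh would be triggered are illustrative.

    ```python
    import math
    import random
    import time

    def should_refresh_early(expiry: float, delta: float, beta: float = 1.0) -> bool:
        """Return True if this request should kick off a background refresh.

        expiry -- absolute expiry time in seconds since epoch
        delta  -- measured time it takes to recompute the value
        beta   -- > 1 refreshes more eagerly, < 1 less eagerly
        """
        u = 1.0 - random.random()  # uniform in (0, 1], avoids log(0)
        # log(u) is negative, so the term pushes "now" forward in time: far from
        # expiry the inequality almost never holds, close to expiry it holds with
        # rapidly rising probability.
        return time.time() - delta * beta * math.log(u) >= expiry

    # On every cache hit: serve the cached value either way, and only occasionally
    # (near expiry) schedule the recomputation in the background.
    # if should_refresh_early(entry_expiry_ts, observed_recompute_seconds):
    #     schedule_background_refresh(key)   # hypothetical helper
    ```
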
    Interview tip: Naming the thundering herd problem and offering multiple mitigation strategies signals depth. Most candidates mention locking. Fewer mention XFetch or stale-while-revalidate. Covering all three and discussing trade-offs (latency vs. staleness vs. complexity) is a strong differentiator.

    Stage 6: Scaling and Operational Concerns

    Adding and Removing Nodes

    The cluster must scale without downtime. The process for adding a node:

    1. The new node registers with the cluster coordinator (a lightweight metadata service or gossip protocol).
    2. The coordinator computes a new slot assignment that rebalances load.
    3. Slot migration begins: for each slot moving to the new node, the source node streams all key-value pairs in that slot to the new node.
    4. During migration, reads for migrating slots are served by the source node. Writes go to both the source and the new node (dual-write) to prevent data loss.
    5. Once migration is complete, the configuration is updated atomically and propagated to all clients.

    Removing a node follows the same process in reverse: its slots are redistributed among remaining nodes before it is decommissioned.

    Memory Management

    Memory is the scarcest resource in a cache. Effective management requires multiple strategies:

    • Max memory policy: Each node has a configured maximum memory limit. When reached, the eviction policy (LRU, LFU, or random) kicks in. Never let the OS start swapping — swapped cache pages defeat the purpose of in-memory caching.
    • Memory fragmentation monitoring: Track the ratio of RSS (Resident Set Size) to actual data stored. A ratio above 1.5 indicates fragmentation. Trigger a background defragmentation pass or restart the node with data reloaded from replicas.
    • Large key detection: Keys with values exceeding 10 KB should be flagged and monitored. Large keys cause uneven memory distribution and can block the event loop during serialization. Consider compressing values above a threshold or splitting them into chunks.
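
    A sketch of the compress-above-threshold idea from the last bullet, assuming a one-byte flag is prefixed to every stored value so readers can tell compressed and raw payloads apart; the flag values and zlib level are illustrative choices.

    ```python
    import zlib

    COMPRESS_THRESHOLD = 10 * 1024      # matches the 10 KB large-value threshold above
    RAW, COMPRESSED = b"\x00", b"\x01"  # one-byte flag prefixed to every stored value

    def encode_value(value: bytes) -> bytes:
        """Compress large values before they are written to the cache node."""
        if len(value) >= COMPRESS_THRESHOLD:
            return COMPRESSED + zlib.compress(value, level=1)  # level 1: cheap and fast
        return RAW + value

    def decode_value(stored: bytes) -> bytes:
        flag, payload = stored[:1], stored[1:]
        return zlib.decompress(payload) if flag == COMPRESSED else payload
    ```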

    Failure Detection and Recovery

    The cluster uses a gossip protocol for failure detection. Each node periodically pings a random subset of other nodes. If a node fails to respond after multiple attempts, it is marked as PFAIL (possibly failed). When a majority of nodes agree on the failure, the node is marked as FAIL and failover begins.

    Failover process:

    1. The replica with the most up-to-date replication offset is elected as the new primary.
    2. The new primary takes ownership of all slots that belonged to the failed node.
    3. Clients are notified of the topology change and update their routing tables.
    4. Total failover time: typically 1-3 seconds (detection) + ~1 second (promotion) = 2-4 seconds of degraded availability.

    Cross-Data-Center Replication

    For global applications, deploy independent cache clusters in each data center with asynchronous cross-region replication. Each cluster is authoritative for its region. Writes are replicated to other regions with a delay (50-200ms depending on network distance). Reads are always served locally for sub-millisecond latency.

    This introduces the possibility of conflicts (two regions write the same key simultaneously). Resolution strategy: last-writer-wins using a logical timestamp (Lamport clock or hybrid logical clock). For cache data, this is acceptable — the worst case is a slightly stale value that will be refreshed on the next cache miss.
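
    A sketch of the last-writer-wins merge applied when a replicated write arrives for a key the local region has also written. The field names are illustrative, and the region id acts as a deterministic tie-breaker so all regions converge to the same value.

    ```python
    from dataclasses import dataclass

    @dataclass
    class VersionedValue:
        value: bytes
        hlc_ts: int     # hybrid logical clock (or Lamport) timestamp attached at write time
        region_id: int  # tie-breaker when two regions produce the same timestamp

    def merge_lww(local: VersionedValue, incoming: VersionedValue) -> VersionedValue:
        """Keep the write with the higher (timestamp, region) pair."""
        if (incoming.hlc_ts, incoming.region_id) > (local.hlc_ts, local.region_id):
            return incoming
        return local
    ```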

    Capacity Planning

    Parameter | Value | Rationale
    --- | --- | ---
    Keys | 100 million | Typical for a large-scale web application
    Average value size | 500 bytes | JSON fragments, serialized objects
    Raw data | ~50 GB | 100M x 500B
    Metadata overhead | ~2.4 GB | 100M x 24B per entry
    Memory per node | 16 GB usable | 64 GB physical, ~25% usable after OS + fragmentation buffer
    Node count | 4 primaries + 4 replicas | ~13 GB data per primary, with headroom for growth
    Throughput per node | 250K ops/sec | Single-threaded event loop with io_uring or epoll
    Total throughput | 1M ops/sec | 4 primary nodes x 250K ops/sec

    Scoring Tips

    To score well on a distributed cache design question, keep these principles in mind:

    • Start with requirements, not solutions. State throughput targets (1M ops/sec), latency constraints (sub-ms p99), and memory budget before proposing any architecture. This demonstrates that your design is driven by constraints.
    • Explain consistent hashing deeply. Do not just name it — walk through what happens when a node is added. Mention virtual nodes and explain why they solve the non-uniformity problem. This is the core algorithm of the system.
    • Address eviction with specifics. Saying "use LRU" is surface-level. Explain the doubly-linked list + hash map implementation, its O(1) complexity, and compare with LFU for workloads with frequency skew. Bonus: mention Redis's approximate LRU (sampled eviction) as a practical optimization.
    • Proactively discuss hot keys. Most candidates forget about non-uniform access patterns. Naming the hot key problem and offering local caching, key replication, and read-from-replica as mitigations shows production awareness.
    • Name the thundering herd. This is a classic failure mode that separates textbook answers from real-world experience. Offer the locking approach as a baseline and XFetch or stale-while-revalidate as advanced alternatives.
    • Discuss trade-offs explicitly. Async replication trades durability for latency. Eventual consistency trades correctness for availability. Slab allocation trades internal fragmentation for allocation speed. Interviewers want to see that you understand these are deliberate choices, not oversights.
    • Mention operational concerns. Failure detection (gossip protocol), memory fragmentation monitoring, and graceful node decommissioning show you think about Day 2 operations, not just Day 1 design.

    Practice delivering this design end-to-end in under 35 minutes. Focus on smooth transitions between stages and be ready for deep-dive follow-ups on any subsystem. Tools like Hoppers AI let you rehearse system design interviews with real-time feedback on structure, depth, and pacing — helping you refine your delivery before the real thing.