Why Databases Matter in System Design Interviews
Every system design interview eventually arrives at the same question: where does the data live? Your choice of database is not a throwaway detail. It shapes your entire architecture — query performance, consistency guarantees, operational complexity, and how gracefully the system scales. Interviewers use your data model to gauge whether you truly understand the system you are designing or are just sketching boxes on a whiteboard.
This guide gives you a decision framework you can apply to any problem. We will cover the major database families, schema design trade-offs, indexing strategies, consistency models, and how to present all of it clearly during an interview.
SQL vs NoSQL: A Decision Framework
The SQL-versus-NoSQL debate is not about which technology is "better." It is about which set of trade-offs aligns with your workload. Here is a framework you can articulate in an interview:
Choose a Relational (SQL) Database When
- Your data has complex relationships — joins across multiple entities are frequent and essential (e.g., an e-commerce order system linking users, products, inventory, and payments).
- You need strong consistency — financial transactions, booking systems, or anything where two concurrent writes must never produce an inconsistent state.
- Your schema is well-defined and unlikely to change dramatically — relational databases reward upfront schema design with powerful query flexibility.
- You need ad-hoc queries — SQL's declarative nature lets analysts and developers ask questions of the data that were not anticipated at design time.
Choose a NoSQL Database When
- Your access patterns are known and narrow — you query by a primary key or a small set of keys, and you rarely need ad-hoc joins.
- You need horizontal scalability — the dataset will grow beyond what a single node can handle, and you want the database to shard transparently.
- Your schema evolves rapidly — document stores handle heterogeneous records gracefully.
- Latency at scale matters more than transactional guarantees — a social media feed can tolerate a few seconds of staleness, but it cannot tolerate 500ms reads.
Interview tip: Never say "I'd use NoSQL because it's faster." That is too vague. Instead say: "Our access pattern is a single-key lookup by userId, we expect 50 million users, and we do not need cross-entity joins — so a key-value store like DynamoDB gives us single-digit millisecond reads at any scale without managing sharding ourselves."
Database Types at a Glance
| Type | Examples | Data Model | Best For | Watch Out For |
|---|---|---|---|---|
| Relational | PostgreSQL, MySQL, Aurora | Tables, rows, foreign keys | Complex queries, ACID transactions, reporting | Vertical scaling limits, schema migrations at scale |
| Key-Value | Redis, DynamoDB, Memcached | Key → value (opaque blob or structured) | Session stores, caches, user profiles, high-throughput lookups | No joins, limited query flexibility |
| Document | MongoDB, Firestore, CouchDB | JSON-like documents in collections | Content management, catalogs, rapidly evolving schemas | Lack of joins can lead to data duplication |
| Wide-Column | Cassandra, HBase, ScyllaDB | Row key → column families | Time-series, IoT telemetry, write-heavy workloads | Requires careful partition key design, eventual consistency by default |
| Graph | Neo4j, Amazon Neptune, Dgraph | Nodes and edges with properties | Social networks, recommendation engines, fraud detection | Not suited for bulk analytics or simple key lookups |
| Search / Full-Text | Elasticsearch, OpenSearch | Inverted index over documents | Full-text search, log analytics, faceted filtering | Not a source of truth — sync lag, eventual consistency |
Schema Design: Normalization vs Denormalization
Normalization
Normalization eliminates data redundancy by splitting data into related tables. A normalized e-commerce schema might have separate tables for users, orders, order_items, and products, linked by foreign keys.
Pros: No data duplication, easier updates (change a product name in one place), strong referential integrity.
Cons: Reads require joins. At massive scale, joins across sharded tables become expensive or impractical.
Denormalization
Denormalization intentionally duplicates data to avoid joins at read time. Instead of joining orders with products, you store the product name and price directly on the order item record.
Pros: Reads are fast — a single query fetches everything. Works naturally with NoSQL databases and sharded systems.
Cons: Writes become more complex (you must update every copy), and storage usage increases.
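The trade-off is easy to see in code. Below is a minimal sketch using Python's built-in sqlite3 module with a hypothetical e-commerce schema: the normalized read needs a join, while the denormalized variant answers the same question from a single table.

```python
import sqlite3

# Hypothetical schema contrasting a normalized read (join) with a
# denormalized read (single-table lookup). Illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE order_items (
        order_id INTEGER, product_id INTEGER REFERENCES products(id), qty INTEGER
    );
    -- Denormalized variant: product name and price copied onto the item row.
    CREATE TABLE order_items_denorm (
        order_id INTEGER, product_name TEXT, unit_price REAL, qty INTEGER
    );
    INSERT INTO products VALUES (1, 'Widget', 9.99);
    INSERT INTO order_items VALUES (100, 1, 2);
    INSERT INTO order_items_denorm VALUES (100, 'Widget', 9.99, 2);
""")

# Normalized read: correct by construction, but requires a join.
normalized = conn.execute("""
    SELECT p.name, p.price, oi.qty FROM order_items oi
    JOIN products p ON p.id = oi.product_id WHERE oi.order_id = 100
""").fetchone()

# Denormalized read: one table, no join -- but renaming the product now
# means updating every duplicated row.
denormalized = conn.execute(
    "SELECT product_name, unit_price, qty FROM order_items_denorm WHERE order_id = 100"
).fetchone()

assert normalized == denormalized == ("Widget", 9.99, 2)
```

At small scale the join is cheap; the denormalized form earns its keep once `products` and `order_items` live on different shards and the join crosses the network.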
Real-world example: Instagram's feed is heavily denormalized. When a user posts a photo, the metadata is fanned out to every follower's feed. This makes reads trivially fast (one key lookup per user), even though writes are amplified.
The Single-Table Design Pattern
In DynamoDB and similar key-value stores, a powerful technique is the single-table design: storing multiple entity types in one table, differentiated by key prefixes. For example, a table might hold users (PK=USER#123, SK=PROFILE), their orders (PK=USER#123, SK=ORDER#2024-01-15), and order items (PK=ORDER#abc, SK=ITEM#1) all in the same table. This eliminates cross-table joins entirely and allows you to fetch a user and all their recent orders in a single query. The trade-off is complexity: your application code must understand the key structure, and ad-hoc queries become difficult without Global Secondary Indexes (GSIs).
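The key mechanics can be sketched in a few lines. This is a toy in-memory stand-in for a DynamoDB-style table (the entity names match the example above; the query helper is a simplification of DynamoDB's partition-key + sort-key-prefix Query):

```python
# Toy single-table layout: every item is keyed by (partition key, sort key).
# Sorted storage mimics how items within a partition are ordered by sort key.
table = sorted([
    (("ORDER#abc", "ITEM#1"),          {"sku": "WIDGET", "qty": 2}),
    (("USER#123", "ORDER#2024-01-15"), {"total": 42.50}),
    (("USER#123", "ORDER#2024-02-03"), {"total": 17.00}),
    (("USER#123", "PROFILE"),          {"name": "Ada"}),
])

def query(pk, sk_prefix=""):
    """Simplified Query: all items in one partition whose sort key starts with a prefix."""
    return [item for (p, s), item in table if p == pk and s.startswith(sk_prefix)]

# One query fetches the user's profile AND their orders -- no join needed.
everything = query("USER#123")
orders_only = query("USER#123", sk_prefix="ORDER#")

assert len(everything) == 3
assert [o["total"] for o in orders_only] == [42.50, 17.00]
```

Note how the `ORDER#` sort-key prefix carves the orders out of the partition, and how ISO-date sort keys keep them chronologically ordered for free.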
Access-Pattern-Driven Design
The golden rule of schema design — especially for NoSQL — is: start with your access patterns, not your entities. List every query your application will make, then design your tables, keys, and indexes to serve those queries efficiently.
For a chat application, you might identify these access patterns:
- Get all messages in a conversation, ordered by time.
- Get all conversations for a user, ordered by last activity.
- Get unread message count per conversation.
A DynamoDB single-table design might use PK=CONV#{convId}, SK=MSG#{timestamp} for messages, and PK=USER#{userId}, SK=CONV#{convId} for the user's conversation list — with a lastActivity attribute to enable sorting.
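One subtlety worth calling out: the base table's sort key is `CONV#{convId}`, not `lastActivity`, so "ordered by last activity" needs either a GSI whose sort key is `lastActivity` or a client-side sort. A small sketch with hypothetical records:

```python
from operator import itemgetter

# Hypothetical items from the user's conversation list (PK=USER#..., SK=CONV#...).
conversations = [
    {"PK": "USER#42", "SK": "CONV#a", "lastActivity": "2024-03-01T09:00Z", "unread": 2},
    {"PK": "USER#42", "SK": "CONV#b", "lastActivity": "2024-03-02T08:30Z", "unread": 0},
    {"PK": "USER#42", "SK": "CONV#c", "lastActivity": "2024-02-28T12:15Z", "unread": 5},
]

# The base table sorts by convId, not recency. To sort by lastActivity,
# either project it into a GSI sort key or sort client-side as here.
by_recency = sorted(conversations, key=itemgetter("lastActivity"), reverse=True)

assert [c["SK"] for c in by_recency] == ["CONV#b", "CONV#a", "CONV#c"]
```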
Indexes: B-Tree, Hash, and LSM
An index is a data structure that trades write overhead and storage for faster reads. Choosing the right index type is a subtle but important detail in interviews.
B-Tree Indexes
The default index type in most relational databases. B-trees maintain sorted order, so they excel at range queries (WHERE created_at BETWEEN ...), prefix searches, and ordered scans.
- Read performance: O(log n) lookups, efficient range scans.
- Write overhead: Every insert or update must maintain the tree structure, which involves random I/O.
- Best for: OLTP workloads with a mix of point lookups and range queries.
Hash Indexes
Hash indexes map a key directly to a location via a hash function. They provide O(1) point lookups but cannot serve range queries at all.
- Read performance: O(1) for exact-match lookups.
- Write overhead: Low — just compute the hash and write.
- Best for: Equality checks on high-cardinality columns (e.g., looking up a session by token).
LSM-Tree Indexes
Log-Structured Merge trees are the backbone of write-optimized stores like Cassandra, RocksDB, and LevelDB. Writes go to an in-memory buffer (memtable), which is periodically flushed to sorted on-disk files (SSTables) and compacted in the background.
- Read performance: Slower than B-trees for point reads (may check multiple SSTables), mitigated by bloom filters.
- Write performance: Excellent — sequential writes only, no random I/O.
- Best for: Write-heavy workloads like event logging, time-series ingestion, and messaging systems.
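The memtable/SSTable structure is compact enough to sketch. This toy version (not any real engine's implementation) shows why writes are cheap and why a point read may have to consult several sorted runs:

```python
import bisect

# Toy LSM sketch: writes land in an in-memory memtable; when it fills, it is
# flushed as an immutable sorted run (an "SSTable"); reads check the memtable
# first, then runs from newest to oldest.
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []           # newest run first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # cheap write path: no disk seeks
        if len(self.memtable) >= self.limit:
            # Flush: in a real store this is one sequential disk write.
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:    # may touch several runs; real stores use
            i = bisect.bisect_left(run, (key,))  # bloom filters to skip most
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM()
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    db.put(k, v)

assert db.get("a") == 1   # found in a flushed SSTable
assert db.get("c") == 3   # found in the memtable
assert db.get("z") is None
```

Background compaction (merging runs to bound read amplification) is the missing piece here, and a good detail to mention when discussing Cassandra or RocksDB tuning.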
Interview tip: If your system is write-heavy (event streaming, analytics ingestion), mention LSM-trees and explain why sequential writes outperform random I/O. This demonstrates depth beyond "just add an index."
ACID vs BASE: Consistency Models
ACID (Atomicity, Consistency, Isolation, Durability)
ACID transactions guarantee that a group of operations either all succeed or all fail, the database moves from one valid state to another, concurrent transactions do not interfere, and committed data survives crashes.
Relational databases provide ACID by default, though the strength of the Isolation guarantee varies by isolation level:
- Read Uncommitted: Fastest, but allows dirty reads.
- Read Committed: Prevents dirty reads. Default in PostgreSQL.
- Repeatable Read: Prevents non-repeatable reads. Default in MySQL/InnoDB.
- Serializable: Strongest. Transactions behave as if executed one at a time. Highest contention cost.
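Atomicity is the easiest of the four to demonstrate concretely. A minimal sketch with Python's built-in sqlite3 (using a hypothetical accounts table): a transfer either fully commits or fully rolls back, so a failed overdraft leaves both balances untouched.

```python
import sqlite3

# Hypothetical two-account ledger; the CHECK constraint forbids overdrafts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(src, dst, amount):
    try:
        with conn:  # transaction scope: commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        return True
    except sqlite3.IntegrityError:  # CHECK fires on overdraft -> whole txn rolls back
        return False

assert transfer("alice", "bob", 30) is True
assert transfer("alice", "bob", 500) is False   # would overdraw alice

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
assert balances == {"alice": 70, "bob": 80}     # the failed transfer left no trace
```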
BASE (Basically Available, Soft state, Eventually consistent)
Distributed NoSQL systems often adopt BASE semantics. The system prioritizes availability and partition tolerance (the "AP" side of the CAP theorem) and accepts that reads may return stale data for a short window.
When eventual consistency is acceptable:
- Social media feeds — a post appearing 2 seconds late is invisible to the user.
- Product catalog — a price update propagating over a few seconds is fine.
- Analytics counters — approximate counts are sufficient.
When it is not:
- Bank transfers — double-spending is catastrophic.
- Inventory reservation — overselling physical goods is expensive.
- Distributed locking — stale reads break mutual exclusion.
Real-world example: Amazon's shopping cart (the original use case for Dynamo, the internal system that preceded DynamoDB) uses eventual consistency for availability. If two data centers conflict, the cart merges both versions so items are never silently lost — "add to cart" always wins.
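The merge rule itself is a one-liner worth knowing: treat each replica's cart as a set and take the union on conflict. The sketch below shows the idea and its documented downside (a removed item can reappear if another replica still has it):

```python
# Two replicas diverged while partitioned; each saw a different "add to cart".
replica_a = {"book", "lamp"}
replica_b = {"book", "headphones"}

# Union merge: no add is ever lost. The cost is that deletes may "resurrect"
# if the other replica never saw the removal -- the classic downside of this rule.
merged = replica_a | replica_b

assert merged == {"book", "lamp", "headphones"}
```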
Access Patterns: Let Them Drive Your Design
Before choosing any database, enumerate your access patterns explicitly. This is the single most impactful habit you can build for system design interviews.
Step-by-Step Process
- List all read and write operations — e.g., "Get user profile by userId," "List orders by userId sorted by date," "Increment view count for a video."
- Classify each by frequency and latency requirement — a user profile lookup happens on every page load (high frequency, low latency). A monthly report runs once (low frequency, latency tolerant).
- Identify the dominant pattern — if 90% of operations are single-key lookups, a key-value store is the natural fit. If most queries involve joins across three tables, a relational database is more appropriate.
- Design keys and indexes to serve the top patterns — in DynamoDB, this means choosing partition keys that distribute load evenly and sort keys that enable the range queries you need. In PostgreSQL, this means creating composite indexes that cover your most frequent WHERE clauses.
Example: URL Shortener
Access patterns for a URL shortener:
- Write: Create short URL → store {shortCode, longUrl, userId, createdAt, clickCount}.
- Read (hot path): Resolve short URL → look up by shortCode (millions of times per day).
- Read (cold path): List all URLs for a user → query by userId, sorted by createdAt.
- Write (hot path): Increment click count on every redirect.
The dominant pattern is a point read by shortCode. A key-value store (Redis for the hot cache, DynamoDB for durable storage) is the natural choice. The click counter is write-heavy — an atomic counter or a separate write-optimized store avoids contention on the main record.
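A compact sketch of that hot path, with plain dicts standing in for the durable key-value store and the cache (the code-generation scheme and helper names are illustrative, not a prescribed design):

```python
import hashlib
import string
from collections import Counter

ALPHABET = string.ascii_letters + string.digits  # base-62

def short_code(long_url, length=7):
    """Derive a deterministic base-62 code from a hash of the URL (one simple scheme)."""
    n = int.from_bytes(hashlib.sha256(long_url.encode()).digest()[:8], "big")
    code = ""
    for _ in range(length):
        n, r = divmod(n, 62)
        code += ALPHABET[r]
    return code

store, cache, clicks = {}, {}, Counter()   # stand-ins for DynamoDB / Redis / counter store

def create(long_url):
    code = short_code(long_url)
    store[code] = long_url
    return code

def resolve(code):
    if code not in cache:   # read-through: a miss falls back to the durable store
        cache[code] = store[code]
    clicks[code] += 1       # write-heavy counter kept off the main record
    return cache[code]

code = create("https://example.com/some/long/path")
assert resolve(code) == "https://example.com/some/long/path"
assert clicks[code] == 1
```

Keeping `clicks` separate from `store` mirrors the point above: the counter absorbs every redirect's write without contending on the URL record itself.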
Data Lifecycle: TTL, Archival, and Tiered Storage
Production systems generate data continuously, and not all of it remains equally valuable. A mature data model accounts for the full lifecycle.
TTL (Time to Live)
Many databases support automatic expiration. DynamoDB deletes items with expired TTL attributes in the background. Redis evicts keys past their TTL. Use TTL for:
- Session tokens and OTP codes (expire after 15 minutes).
- Rate-limiting counters (reset every hour).
- Cache entries (invalidate after a configurable window).
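The mechanics reduce to storing an expiry timestamp alongside each value. A minimal sketch (lazy eviction on read; production stores like Redis also sweep expired keys in the background):

```python
import time

# Minimal TTL store: each entry carries an absolute expiry time and is
# lazily evicted when read after it has expired.
class TTLStore:
    def __init__(self):
        self._data = {}

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]   # lazy eviction on access
            return None
        return value

s = TTLStore()
s.set("otp:42", "913804", ttl_seconds=0.05)
assert s.get("otp:42") == "913804"   # still fresh
time.sleep(0.06)
assert s.get("otp:42") is None       # expired and evicted
```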
Hot / Warm / Cold Storage
Not every record needs single-digit millisecond access. A common pattern:
- Hot: Recent data in a fast store (Redis, DynamoDB). Serves real-time queries.
- Warm: Older data in a cost-effective database (S3 + Athena, Aurora). Serves dashboards and reports.
- Cold: Archived data in deep storage (S3 Glacier, long-term backups). Serves compliance and disaster recovery.
Real-world example: Uber moves completed trip data from their operational database to a data lake after 48 hours. Real-time systems only query recent trips, while analytics teams query the full history in the lake. This keeps the operational database small and fast.
Archival Strategies
When designing a system, mention how old data moves between tiers:
- DynamoDB Streams + Lambda: Capture deletes or TTL expirations and write to S3 for archival.
- Change Data Capture (CDC): PostgreSQL logical replication or Debezium streams changes to a data lake.
- Scheduled jobs: A nightly batch job moves records older than N days to cold storage.
Presenting Your Data Model in an Interview
In a structured system design interview, Stage 3 (data model) is where you translate requirements into concrete storage decisions. Here is how to present it effectively:
1. Start with Entities and Relationships
List the core entities and how they relate. For a ride-sharing system: User, Driver, Ride, Payment, Location. Sketch the relationships briefly — a Ride belongs to one User and one Driver.
2. State Your Access Patterns
Before choosing a database, explicitly say: "The dominant access patterns are: (1) look up a driver's current location by driverId, (2) find available drivers within a geographic radius, (3) retrieve ride history for a user sorted by date." This grounds your decision in requirements, not preference.
3. Choose and Justify Your Database
Map each access pattern to a storage solution:
- Driver locations → Redis with geospatial indexes (GEOADD/GEOSEARCH) for real-time proximity queries.
- Ride history → DynamoDB with PK=USER#{userId}, SK=RIDE#{timestamp} for efficient time-ordered retrieval.
- Payments → PostgreSQL for ACID guarantees on financial transactions.
4. Show Your Key Schema
Write out a few example records with actual keys and attributes. This makes your design tangible and demonstrates that you have thought through the details.
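For the ride-sharing design above, that might look like the following (attribute names and values are hypothetical):

```python
# Illustrative items for the ride-history table (PK=USER#..., SK=RIDE#...).
ride_history = [
    {"PK": "USER#u17", "SK": "RIDE#2024-03-02T18:04Z",
     "driverId": "d91", "fare": 18.75, "status": "COMPLETED"},
    {"PK": "USER#u17", "SK": "RIDE#2024-03-05T08:41Z",
     "driverId": "d33", "fare": 9.20, "status": "COMPLETED"},
]

# The ISO-8601 timestamp in the sort key sorts lexicographically, so
# "ordered by SK" is "ordered by time" with no extra index.
latest = max(ride_history, key=lambda r: r["SK"])
assert latest["driverId"] == "d33"
```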
5. Address Scale and Lifecycle
Mention how the data grows, how you will handle hot/cold separation, and what gets TTL'd. This shows production-level thinking that separates strong candidates from average ones.
Interview tip: Interviewers are impressed when you explicitly trade off two viable options. For example: "We could use PostgreSQL with PostGIS for location queries, but at 1 million active drivers updating every 5 seconds, the write volume favors Redis with its in-memory geospatial index. We accept the trade-off of eventual consistency on location data — a driver's position being 5 seconds stale is acceptable for matching."
Replication and Partitioning
Understanding how databases distribute data is essential when your system outgrows a single machine.
Replication
Replication copies data across multiple nodes for fault tolerance and read scalability. The three main strategies are:
- Single-leader: One node accepts writes, replicas serve reads. Simple, but the leader is a write bottleneck and a single point of failure until failover completes. Used by PostgreSQL streaming replication and MySQL.
- Multi-leader: Multiple nodes accept writes, changes are replicated asynchronously. Useful for multi-region deployments but introduces write conflicts that must be resolved (last-write-wins, application-level merge, or CRDTs).
- Leaderless: Any node accepts reads and writes. Quorum-based consistency (read from R nodes, write to W nodes, where R + W > N). Used by Cassandra and DynamoDB. Highly available but requires careful tuning of consistency levels.
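The quorum condition is worth being able to state precisely: with N replicas, a read of R nodes is guaranteed to overlap a write of W nodes whenever R + W > N, so at least one replica in every read set holds the latest write. A trivial check:

```python
# Quorum overlap rule: a read set of R and a write set of W drawn from N
# replicas must intersect whenever R + W > N (pigeonhole argument).
def quorums_overlap(n, r, w):
    return r + w > n

assert quorums_overlap(n=3, r=2, w=2)       # classic strict quorum
assert not quorums_overlap(n=3, r=1, w=1)   # fast, but reads may be stale
```

Tuning R and W per operation is how Cassandra and DynamoDB let you trade latency against consistency on a per-query basis.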
Partitioning (Sharding)
Partitioning splits data across nodes so no single node holds everything. The two main approaches are:
- Hash partitioning: A hash function maps keys to partitions. Distributes load evenly but destroys key ordering, making range queries expensive. Used by DynamoDB and Cassandra (with virtual nodes).
- Range partitioning: Keys are divided into contiguous ranges assigned to partitions. Preserves ordering for efficient range scans but risks hot spots if access patterns cluster around certain key ranges (e.g., time-based keys where all recent writes hit the same partition).
In an interview, mention your partitioning strategy and explain how your partition key distributes load. A common mistake is choosing a low-cardinality partition key (like country) that creates hot partitions.
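The routing logic behind hash partitioning fits in a few lines. One caveat the sketch encodes: use a stable digest rather than Python's built-in hash(), which is salted per process and would route the same key differently across restarts.

```python
import hashlib

# Hash partitioning sketch: a stable hash of the key selects the partition.
def partition_for(key, num_partitions=8):
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always routes to the same partition; distinct keys spread out.
assert partition_for("user#123") == partition_for("user#123")
assert 0 <= partition_for("user#456") < 8
```

Note the modulo scheme reshuffles nearly every key when `num_partitions` changes; real systems avoid that with consistent hashing or pre-split virtual nodes.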
Common Mistakes to Avoid
- Choosing a database before stating access patterns. This makes your decision look arbitrary. Always lead with the workload.
- Using a single database for everything. Real systems are polyglot. A cache, a primary store, and a search index often coexist.
- Ignoring write amplification in denormalized designs. If you denormalize, explain how you handle updates to duplicated data (async fan-out, eventual consistency, or accepting staleness).
- Forgetting about secondary indexes. If you need to query DynamoDB by a non-key attribute, you need a GSI. If you need full-text search over PostgreSQL, you need GIN indexes or an external search engine.
- Skipping data lifecycle. An ever-growing table is a ticking time bomb. Mention TTL, archival, or partitioning to show operational maturity.
Key Takeaways
- Start every data model discussion with access patterns, not technology preferences.
- Use the SQL vs NoSQL decision framework to justify your choice with specific trade-offs.
- Understand index types (B-tree, hash, LSM) well enough to explain when each shines.
- Know when ACID is non-negotiable and when eventual consistency is the pragmatic choice.
- Design for the full data lifecycle — not just day one, but year three.
- Present your data model concretely — show keys, attributes, and example records.
Databases and data modeling are foundational to every system design problem. Master the frameworks in this guide, and you will handle Stage 3 of any interview with confidence. For hands-on practice with system design interviews — including real-time feedback on your data modeling decisions — try a mock session on Hoppers AI.