Design a Job Scheduler — Complete System Design Walkthrough
Job schedulers are the backbone of many large-scale systems. From sending millions of reminder emails at precise times, to running nightly ETL pipelines, to retrying failed payments on a schedule — a reliable, distributed job scheduler is critical infrastructure. This question tests your ability to design for time-based triggers, distributed coordination, exactly-once semantics, and graceful failure handling. In this walkthrough, we will design one from scratch across all six stages of a structured system design interview.
Stage 1: Requirements Gathering
Start by clarifying scope. A job scheduler can mean many things — from a simple cron replacement to a full workflow orchestration engine. Pin down the boundaries with your interviewer before drawing anything.
Functional Requirements
- Create and schedule jobs — Users can submit a job with a handler (HTTP callback URL or function identifier), a payload, and a trigger time.
- One-time and recurring jobs — Support both single-fire jobs ("run at 3pm tomorrow") and cron-style recurring schedules ("every Monday at 9am").
- Cancel jobs — Users can cancel a pending or recurring job before its next execution.
- Job priorities — Support at least three priority levels (high, normal, low). Higher-priority jobs are dequeued before lower-priority ones when worker capacity is contended.
- Job dependencies — A job can declare dependencies on other jobs. It will not execute until all dependencies have completed successfully, forming a directed acyclic graph (DAG).
- Execution status — Users can query the current state of a job (pending, running, completed, failed).
- Retry on failure — Failed jobs are retried with exponential backoff up to a configurable maximum retry count.
Non-Functional Requirements
- Scale: 10 million jobs scheduled per day, with up to 100,000 executing concurrently at peak.
- Trigger accuracy: Jobs must fire within 1 second of their scheduled time under normal load.
- Exactly-once execution: A job must not be executed more than once per trigger (duplicate execution could mean double-charging a customer or sending duplicate emails).
- Durability: Scheduled jobs must survive server crashes. No job should be silently lost.
- Availability: 99.99% uptime — the scheduler is critical infrastructure. If it goes down, downstream systems fail silently.
Interview tip: Stating "10 million jobs per day" translates to roughly 115 jobs per second on average, but job schedules are bursty — expect peaks of 5-10x at the top of every hour and minute. Call this out explicitly to show you understand real-world traffic patterns.
Stage 2: API Design
The Job Scheduler exposes a REST API for job management. Keep it simple and resource-oriented.
Core Endpoints
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /v1/jobs | Create and schedule a new job |
| GET | /v1/jobs?status=pending&limit=50&cursor= | List jobs with filtering and pagination |
| GET | /v1/jobs/{jobId} | Get job details and current status |
| DELETE | /v1/jobs/{jobId} | Cancel a pending or recurring job |
| GET | /v1/jobs/{jobId}/executions | Get execution history for a job |
Create Job Request
The create endpoint is the most important. Here is a representative payload:
| Field | Type | Description |
|---|---|---|
name | string | Human-readable job name |
handler | string | Callback URL or registered function ID |
payload | JSON object | Arbitrary data passed to the handler at execution time |
scheduleAt | ISO 8601 timestamp | When to first trigger the job (for one-time jobs) |
cronExpression | string (optional) | Cron expression for recurring jobs (e.g., 0 9 * * MON) |
priority | enum | high, normal, low (default: normal) |
maxRetries | integer | Maximum retry attempts on failure (default: 3) |
dependsOn | array of jobIds | Jobs that must complete before this one can run |
timeoutSeconds | integer | Maximum execution duration before the job is killed (default: 300) |
The response returns a jobId, the computed nextRunAt timestamp, and the initial status of pending.
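As a concrete illustration, here is what a client might send, along with a small client-side validation helper. The field values, and the `validate_create_request` helper itself, are illustrative assumptions, not part of the API contract.

```python
# Representative create-job payload, mirroring the field table above.
# The values and the validation helper are illustrative, not normative.
create_request = {
    "name": "send-weekly-digest",
    "handler": "https://notifications.internal/hooks/digest",
    "payload": {"segment": "active_users"},
    "cronExpression": "0 9 * * MON",   # recurring: every Monday at 9am
    "priority": "normal",
    "maxRetries": 3,
    "timeoutSeconds": 300,
}

def validate_create_request(req: dict) -> list[str]:
    """Client-side sanity checks before POST /v1/jobs."""
    errors = []
    for field in ("name", "handler"):
        if not req.get(field):
            errors.append(f"missing required field: {field}")
    # Exactly one trigger spec: a one-time timestamp or a cron expression.
    if bool(req.get("scheduleAt")) == bool(req.get("cronExpression")):
        errors.append("provide exactly one of scheduleAt or cronExpression")
    if req.get("priority", "normal") not in ("high", "normal", "low"):
        errors.append("priority must be high, normal, or low")
    return errors

print(validate_create_request(create_request))  # [] -> valid
```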
Design decision: We use a callback-based execution model (the scheduler calls a URL when the job fires) rather than embedding execution logic. This keeps the scheduler generic — it does not need to know what jobs do. The handler could be a Lambda function, a Kubernetes pod, or a simple webhook endpoint.
Stage 3: Data Model
We need two primary tables and a well-defined state machine for job lifecycle management.
Jobs Table
| Column | Type | Purpose |
|---|---|---|
job_id | UUID (PK) | Unique identifier |
name | string | Human-readable name |
handler | string | Callback URL or function ID |
payload | JSON | Data passed to handler |
schedule_type | enum | one_time or recurring |
cron_expression | string (nullable) | Cron expression for recurring jobs |
next_run_at | timestamp | Next scheduled execution time (indexed) |
status | enum | pending, running, completed, failed, cancelled |
priority | integer | 0 (high), 1 (normal), 2 (low) |
max_retries | integer | Maximum retry attempts |
retry_count | integer | Current retry attempt number |
timeout_seconds | integer | Execution timeout |
depends_on | array of UUIDs | Parent job IDs (DAG edges) |
partition_key | integer | Scheduler partition assignment (0..N-1) |
locked_by | string (nullable) | Scheduler instance ID holding the lock |
locked_until | timestamp (nullable) | Lock expiration (lease-based) |
created_at | timestamp | |
updated_at | timestamp |
Execution History Table
| Column | Type | Purpose |
|---|---|---|
execution_id | UUID (PK) | Unique execution identifier |
job_id | UUID (FK, indexed) | Parent job |
attempt | integer | Which attempt (1, 2, 3...) |
status | enum | running, completed, failed, timed_out |
started_at | timestamp | |
finished_at | timestamp (nullable) | |
error_message | text (nullable) | Failure reason |
worker_id | string | Which worker executed this attempt |
State Machine
Every job follows a strict state machine that prevents invalid transitions:
- pending — Job is scheduled and waiting for its trigger time. Transitions to `running` when picked up by a worker, or `cancelled` if the user cancels it.
- running — Job is actively being executed by a worker. Transitions to `completed` on success, `failed` on error, or back to `pending` on timeout (if retries remain).
- completed — Terminal state for one-time jobs. For recurring jobs, a new execution cycle begins: `next_run_at` is recomputed from the cron expression, and status resets to `pending`.
- failed — If `retry_count < max_retries`, transitions back to `pending` with an exponential backoff delay applied to `next_run_at`. Otherwise, terminal — the job is moved to the dead letter queue.
- cancelled — Terminal state. No further executions.
Key Indexes
- Composite index on `(partition_key, status, next_run_at)` — This is the critical query path. Each scheduler partition polls for jobs where `status = 'pending'` and `next_run_at <= NOW()` within its assigned partition. The composite index makes this a fast range scan.
- Index on `job_id` in the execution history table — For querying execution history of a specific job.
Interview tip: Mentioning the composite index proactively shows the interviewer you are thinking about query performance at scale, not just logical correctness.
Stage 4: High-Level Architecture
The system has four major components: the Job API, the Scheduler, the Task Queue, and the Worker Pool. Let us trace the lifecycle of a job through each.
Job API
A stateless REST service behind a load balancer. It validates requests, writes jobs to the Job Store (PostgreSQL), and returns immediately. The API does not execute jobs — it only manages their metadata. For recurring jobs, it parses the cron expression and computes the initial next_run_at.
Scheduler (Time-Based Trigger)
The Scheduler is the heart of the system. It is responsible for detecting when a job's trigger time has arrived and enqueuing it for execution. There are two architectural approaches:
Approach 1: Polling. Each Scheduler instance periodically queries the Job Store for jobs where next_run_at <= NOW() and status = 'pending'. Polling interval is typically 500ms to 1s. This is simple, battle-tested, and works well up to moderate scale. The downside is wasted queries when no jobs are due, and a worst-case delay equal to the polling interval.
Approach 2: Delay queue / event-driven. When a job is created, the API pushes it into a delay queue (e.g., Redis sorted set with score = trigger timestamp, or a message broker with delayed delivery). The Scheduler consumes from this queue, receiving jobs exactly when they are due. This eliminates polling overhead but introduces a dependency on the delay queue's reliability and ordering guarantees.
We choose a hybrid approach: use a delay queue (Redis sorted set with ZRANGEBYSCORE) as the primary trigger mechanism, with periodic polling as a safety net to catch any jobs that might have been missed due to queue failures. This gives us sub-second trigger accuracy with a durable fallback.
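A minimal in-memory sketch of the delay-queue semantics, using a heap as a stand-in for the Redis sorted set (score = trigger timestamp). In production, `schedule` and `pop_due` would be ZADD and ZRANGEBYSCORE followed by removal against Redis; the class below just illustrates the contract.

```python
import heapq

class DelayQueue:
    """In-memory stand-in for the Redis sorted set (score = trigger time).

    schedule() mirrors ZADD; pop_due() mirrors ZRANGEBYSCORE plus removal,
    returning every job whose trigger time has arrived.
    """
    def __init__(self):
        self._heap = []  # (trigger_ts, job_id), ordered by trigger time

    def schedule(self, job_id: str, trigger_ts: float) -> None:
        heapq.heappush(self._heap, (trigger_ts, job_id))

    def pop_due(self, now: float) -> list[str]:
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, job_id = heapq.heappop(self._heap)
            due.append(job_id)
        return due

q = DelayQueue()
q.schedule("job-a", trigger_ts=100.0)
q.schedule("job-b", trigger_ts=105.0)
print(q.pop_due(now=101.0))  # ['job-a'] -- job-b is not yet due
```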
Task Queue
Once the Scheduler determines a job is ready, it enqueues it into a priority task queue (e.g., RabbitMQ with priority queues, or a multi-queue setup with separate queues per priority level). The queue decouples the triggering decision from actual execution, allowing the worker pool to scale independently.
Worker Pool
Workers consume from the task queue, execute the job (by calling the handler URL with the payload), and report the result back. Workers are stateless and horizontally scalable. Each worker:
- Dequeues a job from the task queue.
- Creates an execution record with status `running`.
- Calls the handler URL with the job payload (HTTP POST with a timeout).
- On success: updates execution to `completed`, updates job status. For recurring jobs, computes the next `next_run_at`.
- On failure: increments `retry_count`. If retries remain, resets status to `pending` with a backoff delay. Otherwise, marks the job as `failed` and pushes it to the Dead Letter Queue.
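The backoff delay applied on failure might be computed as in the sketch below. The base delay, cap, and full-jitter policy are assumptions; the requirements only mandate exponential backoff up to `max_retries`.

```python
import random

def backoff_delay(retry_count: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter: the raw delay doubles per
    attempt (base * 2^retry_count), is capped, and then a uniform random
    fraction of it is taken so that simultaneous retries spread out."""
    raw = min(cap, base * (2 ** retry_count))
    return random.uniform(0, raw)

# The worker would apply it as: next_run_at = now + backoff_delay(job.retry_count)
for attempt in range(4):
    delay = backoff_delay(attempt)
    print(f"attempt {attempt}: delay up to {min(300.0, 2.0 * 2 ** attempt):.0f}s, got {delay:.2f}s")
```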
Stage 5: Deep Dive
We will dive deep into two critical challenges: distributed locking for exactly-once execution and job dependencies with DAG execution.
Deep Dive 1: Distributed Locking for Exactly-Once Execution
With multiple Scheduler instances polling for due jobs, the same job could be picked up by two instances simultaneously, leading to duplicate execution. This is unacceptable for operations like payment processing or email delivery.
Lease-Based Locking
We use a lease-based locking mechanism directly in the database. When a Scheduler picks up a job, it performs an atomic conditional update:
```sql
UPDATE jobs
SET status = 'running',
    locked_by = 'scheduler-3',
    locked_until = NOW() + INTERVAL '30 seconds'
WHERE job_id = ?
  AND status = 'pending'
  AND (locked_by IS NULL OR locked_until < NOW());
```
This query succeeds for exactly one Scheduler instance due to the WHERE clause acting as an optimistic lock. If another instance tries the same update, the status is no longer pending and the update affects zero rows.
Why Lease-Based, Not Permanent Locks?
If a Scheduler acquires a lock and then crashes before enqueuing the job, a permanent lock would leave the job stranded forever. The lease (a time-bounded lock via locked_until) ensures that after 30 seconds, another Scheduler can reclaim the job. The lease duration should be significantly longer than the time needed to enqueue the job (typically under 100ms) but short enough to limit delay on recovery.
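The claim-or-skip behavior can be demonstrated with a runnable sketch. Here `sqlite3` stands in for PostgreSQL, and the interval arithmetic moves into Python since SQLite lacks `INTERVAL`; the point is that the conditional UPDATE succeeds for at most one caller.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
    job_id TEXT PRIMARY KEY,
    status TEXT,
    locked_by TEXT,
    locked_until REAL)""")
conn.execute("INSERT INTO jobs VALUES ('job-1', 'pending', NULL, NULL)")

def try_claim(instance_id: str, now: float, lease_seconds: float = 30.0) -> bool:
    """Atomic conditional update: the WHERE clause acts as an optimistic
    lock, so the row is claimed by exactly one scheduler instance."""
    cur = conn.execute(
        """UPDATE jobs SET status = 'running', locked_by = ?, locked_until = ?
           WHERE job_id = 'job-1' AND status = 'pending'
             AND (locked_by IS NULL OR locked_until < ?)""",
        (instance_id, now + lease_seconds, now))
    conn.commit()
    return cur.rowcount == 1

now = time.time()
print(try_claim("scheduler-3", now))  # True  -- first claim wins
print(try_claim("scheduler-7", now))  # False -- status is no longer 'pending'
```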
Redis as an Alternative Lock Store
For higher throughput, you can use Redis with the Redlock algorithm. The Scheduler calls SET job:{jobId}:lock scheduler-3 NX PX 30000 before processing. This is faster than a database round-trip and avoids contention on the jobs table. However, it introduces Redis as a critical dependency and requires careful handling of clock skew across Redis nodes. For most use cases, the database-level lease is simpler and sufficient.
End-to-End Exactly-Once Guarantee
The lock prevents duplicate triggering, but we also need to prevent duplicate execution at the worker level. The task queue should use acknowledgment-based delivery: a job is invisible to other workers while one is processing it, and only permanently removed from the queue when the worker sends an explicit ACK. If the worker crashes, the visibility timeout expires and the job becomes available for another worker. Combined with an idempotency key (the execution_id), the handler can detect and deduplicate retries.
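On the handler side, deduplication via the `execution_id` idempotency key might look like the following sketch. The in-memory `processed` set is an assumption standing in for durable storage such as a unique-keyed table or Redis `SETNX`.

```python
# In production this set would be a unique-keyed table or Redis SETNX;
# an in-memory set is used here only to illustrate the dedup logic.
processed: set[str] = set()

def handle(execution_id: str, payload: dict) -> str:
    """Deduplicate redeliveries: if this execution_id was already processed,
    acknowledge without re-running the side effect."""
    if execution_id in processed:
        return "duplicate-ignored"
    processed.add(execution_id)
    # ... perform the actual side effect (charge, email, etc.) ...
    return "executed"

print(handle("exec-42", {"amount": 100}))  # executed
print(handle("exec-42", {"amount": 100}))  # duplicate-ignored (redelivery)
```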
Deep Dive 2: Job Dependencies and DAG Execution
Supporting job dependencies means allowing users to define a directed acyclic graph where a job only runs after all its upstream dependencies have completed successfully.
DAG Representation
Each job stores a depends_on array of parent job IDs. This implicitly defines a DAG. Before a job can transition from pending to running, we verify that all parent jobs have status completed. The Scheduler performs this check when evaluating a job:
```sql
SELECT COUNT(*) FROM jobs
WHERE job_id = ANY(current_job.depends_on)
  AND status != 'completed';
```
If the count is zero, all dependencies are satisfied and the job can proceed. Otherwise, it remains in pending.
Triggering Downstream Jobs
When a job completes, we need to check whether any downstream jobs are now unblocked. We maintain a reverse dependency index — either as a separate table or as a query on the depends_on column:
```sql
SELECT job_id FROM jobs
WHERE ? = ANY(depends_on)
  AND status = 'pending';
```
For each returned job, we re-evaluate its full dependency set. If all parents are completed, we update its next_run_at to NOW() so the Scheduler picks it up immediately.
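The completion hook can be sketched over an in-memory job map standing in for the two queries above; `on_job_completed` is a hypothetical helper name.

```python
def on_job_completed(jobs: dict, completed_id: str) -> list[str]:
    """Reverse-dependency pass: find pending jobs that listed completed_id
    as a parent, re-check their full dependency set, and return the ones
    that are now unblocked (whose next_run_at would be set to NOW())."""
    jobs[completed_id]["status"] = "completed"
    unblocked = []
    for job_id, job in jobs.items():
        if job["status"] != "pending" or completed_id not in job["depends_on"]:
            continue
        if all(jobs[p]["status"] == "completed" for p in job["depends_on"]):
            unblocked.append(job_id)
    return unblocked

jobs = {
    "A": {"status": "running", "depends_on": []},
    "B": {"status": "completed", "depends_on": []},
    "D": {"status": "pending", "depends_on": ["A", "B"]},
}
print(on_job_completed(jobs, "A"))  # ['D'] -- both parents now completed
```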
Cycle Detection
When a job with dependencies is created, the API must validate that adding this edge does not create a cycle. We perform a depth-first traversal from the new job's dependencies, following each parent's depends_on chain. If we encounter the new job's ID during traversal, a cycle exists and we reject the request with a 400 Bad Request.
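The traversal can be sketched as follows, over a hypothetical adjacency map of `depends_on` edges. It applies whenever dependency edges are added or changed for a job.

```python
def would_create_cycle(depends_on: dict, job_id: str, new_deps: list) -> bool:
    """Depth-first traversal from the proposed dependencies, following each
    parent's depends_on chain. Reaching job_id means the new edges would
    close a cycle, so the request should be rejected (400 Bad Request)."""
    stack = list(new_deps)
    seen = set()
    while stack:
        current = stack.pop()
        if current == job_id:
            return True
        if current in seen:
            continue  # already explored this ancestor
        seen.add(current)
        stack.extend(depends_on.get(current, []))
    return False

existing = {"B": ["A"], "C": ["B"]}              # C depends on B depends on A
print(would_create_cycle(existing, "A", ["C"]))  # True: A -> C -> B -> A
print(would_create_cycle(existing, "D", ["C"]))  # False: D just extends chain
```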
Failure Propagation
When a job in a DAG fails permanently (exhausts all retries), downstream jobs cannot proceed. We propagate the failure by transitioning all reachable downstream jobs to a blocked state (a sub-state of pending) and notifying the user. The user can then fix the root cause, manually retry the failed job, and the DAG resumes automatically.
Interview tip: When discussing DAG execution, draw a small example (A → B → D, A → C → D) on the whiteboard. It makes the dependency resolution algorithm concrete and shows the interviewer you can communicate abstract concepts visually.
Stage 6: Scaling and Trade-Offs
Partitioned Schedulers
A single Scheduler instance polling the entire jobs table will not scale to 10 million jobs per day. We partition the job space across multiple Scheduler instances:
- Each job is assigned a `partition_key` (computed as `hash(job_id) % N`, where N is the number of partitions).
- Each Scheduler instance owns a range of partitions and only polls for jobs within its assigned range.
- Partition assignment is managed by a coordination service (ZooKeeper or etcd). When a Scheduler instance joins or leaves, partitions are rebalanced.
With 16 partitions and 4 Scheduler instances, each instance handles approximately 25% of the job space. If one instance fails, its partitions are redistributed to the survivors. This is similar to how Kafka consumer groups manage partition assignment.
Worker Autoscaling
Worker demand is inherently bursty. At the top of every hour, thousands of cron jobs fire simultaneously. We implement autoscaling based on queue depth:
- Scale-up trigger: When the task queue depth exceeds a threshold (e.g., 1,000 pending items) for more than 30 seconds, add worker instances.
- Scale-down trigger: When workers are idle (no jobs dequeued for 5 minutes), remove instances gradually.
- Pre-scaling: For predictable traffic patterns (top-of-hour spikes), preemptively scale up at :58 every hour based on historical data.
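The three rules above can be condensed into a single decision function. The thresholds match the text; the string return values are illustrative signals for whatever autoscaler consumes them.

```python
def scaling_decision(queue_depth: int, seconds_over_threshold: float,
                     idle_seconds: float, minute_of_hour: int) -> str:
    """Queue-depth autoscaling policy: pre-scale at :58, scale up after a
    sustained backlog (>1,000 items for >30s), scale down after 5 idle
    minutes, otherwise hold."""
    if minute_of_hour == 58:
        return "pre-scale-up"        # anticipate the top-of-hour burst
    if queue_depth > 1_000 and seconds_over_threshold > 30:
        return "scale-up"
    if idle_seconds > 5 * 60:
        return "scale-down"
    return "hold"

print(scaling_decision(5_000, 45, 0, 20))  # scale-up
print(scaling_decision(0, 0, 600, 20))     # scale-down
print(scaling_decision(0, 0, 0, 58))       # pre-scale-up
```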
If running on Kubernetes, use KEDA (Kubernetes Event-Driven Autoscaling) to scale worker deployments based on queue length. On AWS, Lambda functions triggered by SQS provide automatic scaling with minimal management overhead.
Failure Handling
There are three categories of failure, each handled differently:
Worker failure mid-execution. The task queue's visibility timeout ensures the job reappears after the timeout expires. Another worker picks it up. The retry_count increments, and the execution record shows a timed_out entry for the original attempt.
Scheduler failure. Partition reassignment happens within seconds via the coordination service. The lease-based lock on any in-progress jobs expires, and the new partition owner re-evaluates them. Worst case, a job fires 30 seconds late (the lease duration).
Database failure. This is the most severe. The Scheduler and API are both unavailable. Mitigation: use a multi-AZ PostgreSQL deployment with automatic failover (RDS Multi-AZ, or Patroni for self-managed). Failover typically completes in under 30 seconds. During the outage, the task queue continues processing already-enqueued jobs, limiting the blast radius.
Dead Letter Queue
Jobs that exhaust all retry attempts are moved to a Dead Letter Queue (DLQ). The DLQ serves multiple purposes:
- Visibility: Operators can inspect failed jobs, understand why they failed, and decide whether to fix and retry or discard.
- Alerting: A monitoring system watches the DLQ depth. If it exceeds a threshold, it fires an alert. A growing DLQ often indicates a systemic issue (downstream service outage, misconfigured handler).
- Manual retry: An admin API endpoint allows replaying jobs from the DLQ back into the main queue after the root cause is fixed.
Capacity Estimation
- Job storage: 10M jobs/day, average 1 KB per job = ~10 GB/day of new data. With 90-day retention, that is ~900 GB. PostgreSQL handles this comfortably with proper indexing and partitioning by `created_at`.
- Queue throughput: Peak of ~1,000 jobs/second (10x average). RabbitMQ or SQS handle this easily. Kafka can handle orders of magnitude more if needed.
- Worker count: With up to 100K jobs in flight and I/O-bound handler calls (average execution time ~5 seconds), one async worker instance can multiplex roughly 200 concurrent executions, so 100K / 200 = ~500 worker instances at peak. With autoscaling, baseline can be 50 workers scaling up to 500.
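These estimates can be reproduced with back-of-envelope arithmetic. The 200-concurrent-calls-per-worker figure is an assumption: handler calls are I/O-bound HTTP requests, so one instance can multiplex many of them.

```python
# Back-of-envelope numbers from the capacity estimates above.
jobs_per_day = 10_000_000
avg_rate = jobs_per_day / 86_400                 # ~116 jobs/s average
peak_rate = avg_rate * 10                        # ~1,157 jobs/s at a 10x burst

storage_per_day_gb = jobs_per_day * 1_024 / 1e9  # 1 KB/job -> ~10 GB/day
retention_gb = storage_per_day_gb * 90           # ~920 GB at 90-day retention

concurrent_jobs = 100_000
conns_per_worker = 200                           # assumed async concurrency per instance
workers_at_peak = concurrent_jobs / conns_per_worker

print(f"{avg_rate:.0f}/s avg, {peak_rate:.0f}/s peak, "
      f"{retention_gb:.0f} GB retained, {workers_at_peak:.0f} workers")
```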
Trade-Off Summary
| Decision | Choice | Trade-Off |
|---|---|---|
| Trigger mechanism | Redis sorted set + DB polling fallback | Sub-second accuracy with durability guarantee. Redis failure degrades to polling (1s delay), not outage. |
| Locking | Database lease-based | Simpler than Redlock, but adds a write per job trigger. Acceptable at 1K/s peak. |
| Queue | RabbitMQ with priority queues | Native priority support. Less throughput than Kafka, but sufficient for our scale and simpler operationally. |
| Job store | PostgreSQL | Strong consistency, transactions for state machine transitions. Shard if we exceed 100K writes/second. |
| DAG execution | Reverse dependency lookup on completion | Adds query overhead on job completion, but keeps the data model simple. For complex DAGs (1000+ nodes), consider a dedicated graph store. |
| Worker scaling | Queue-depth autoscaling | Reactive, so there is a brief lag on sudden spikes. Pre-scaling at known peak times compensates. |
Scoring Tips
To score well on a job scheduler design question, keep these principles in mind:
- Clarify job types early. One-time vs. recurring vs. DAG-based jobs have very different implications for the data model and scheduler logic. Show the interviewer you recognize this.
- Exactly-once is the key challenge. Interviewers will probe on duplicate execution. Explain the lease-based lock mechanism clearly, including what happens when a lock holder crashes. Mention the visibility timeout on the task queue as the second layer of protection.
- Distinguish triggering from execution. The Scheduler decides when to run a job. The Worker decides how. Keeping these separate enables independent scaling and simpler failure handling.
- Address the bursty traffic pattern. Cron jobs cluster at the top of every hour and minute. If you do not mention this, the interviewer will ask. Pre-scaling and queue buffering are the standard answers.
- Show operational awareness. Dead letter queues, monitoring, alerting, and manual retry capabilities signal production experience. These are not afterthoughts — they are essential for a system that other teams depend on.
- Know your numbers. 10M jobs/day is ~115/s average, ~1K/s peak. 100K concurrent executions at ~200 in-flight handler calls per worker instance need ~500 workers. Having these numbers ready shows quantitative thinking.
Practice walking through this architecture end-to-end in under 35 minutes. Focus on the trade-offs at each decision point rather than listing features. If you can explain why you chose lease-based locking over Redlock, or why polling is a good fallback for the delay queue, you will demonstrate the kind of engineering judgment interviewers are looking for. Tools like Hoppers AI can help you rehearse this walkthrough with real-time feedback on structure, depth, and pacing.