    Design a Job Scheduler

    Hoppers AI Team · March 12, 2026 · 12 min read

    Design a Job Scheduler — Complete System Design Walkthrough

    Job schedulers are the backbone of every large-scale system. From sending millions of reminder emails at precise times, to running nightly ETL pipelines, to retrying failed payments on a schedule — a reliable, distributed job scheduler is critical infrastructure. This question tests your ability to design for time-based triggers, distributed coordination, exactly-once semantics, and graceful failure handling. In this walkthrough, we will design one from scratch across all six stages of a structured system design interview.

    [Figure: Job Scheduler — High-Level Architecture. Core path: Job API → Job Store (Database) → Scheduler (Time Trigger) → poll/enqueue → Task Queue → dequeue → Worker Pool. Async/failure path: retries back through the Task Queue, Dead Letter Queue once max retries are exceeded, Distributed Lock (Redis / ZooKeeper), Monitoring (Metrics/Alerts).]

    Stage 1: Requirements Gathering

    Start by clarifying scope. A job scheduler can mean many things — from a simple cron replacement to a full workflow orchestration engine. Pin down the boundaries with your interviewer before drawing anything.

    Functional Requirements

    • Create and schedule jobs — Users can submit a job with a handler (HTTP callback URL or function identifier), a payload, and a trigger time.
    • One-time and recurring jobs — Support both single-fire jobs ("run at 3pm tomorrow") and cron-style recurring schedules ("every Monday at 9am").
    • Cancel jobs — Users can cancel a pending or recurring job before its next execution.
    • Job priorities — Support at least three priority levels (high, normal, low). Higher-priority jobs are dequeued before lower-priority ones when worker capacity is under contention.
    • Job dependencies — A job can declare dependencies on other jobs. It will not execute until all dependencies have completed successfully, forming a directed acyclic graph (DAG).
    • Execution status — Users can query the current state of a job (pending, running, completed, failed).
    • Retry on failure — Failed jobs are retried with exponential backoff up to a configurable maximum retry count.

    Non-Functional Requirements

    • Scale: 10 million jobs scheduled per day, with up to 100,000 executing concurrently at peak.
    • Trigger accuracy: Jobs must fire within 1 second of their scheduled time under normal load.
    • Exactly-once execution: A job must not be executed more than once per trigger (duplicate execution could mean double-charging a customer or sending duplicate emails).
    • Durability: Scheduled jobs must survive server crashes. No job should be silently lost.
    • Availability: 99.99% uptime — the scheduler is critical infrastructure. If it goes down, downstream systems fail silently.
    Interview tip: Stating "10 million jobs per day" translates to roughly 115 jobs per second on average, but job schedules are bursty — expect peaks of 5-10x at the top of every hour and minute. Call this out explicitly to show you understand real-world traffic patterns.

    Stage 2: API Design

    The Job Scheduler exposes a REST API for job management. Keep it simple and resource-oriented.

    Core Endpoints

    Method | Endpoint | Purpose
    POST | /v1/jobs | Create and schedule a new job
    GET | /v1/jobs?status=pending&limit=50&cursor= | List jobs with filtering and pagination
    GET | /v1/jobs/{jobId} | Get job details and current status
    DELETE | /v1/jobs/{jobId} | Cancel a pending or recurring job
    GET | /v1/jobs/{jobId}/executions | Get execution history for a job

    Create Job Request

    The create endpoint is the most important. Here is a representative payload:

    Field | Type | Description
    name | string | Human-readable job name
    handler | string | Callback URL or registered function ID
    payload | JSON object | Arbitrary data passed to the handler at execution time
    scheduleAt | ISO 8601 timestamp | When to first trigger the job (for one-time jobs)
    cronExpression | string (optional) | Cron expression for recurring jobs (e.g., 0 9 * * MON)
    priority | enum | high, normal, low (default: normal)
    maxRetries | integer | Maximum retry attempts on failure (default: 3)
    dependsOn | array of jobIds | Jobs that must complete before this one can run
    timeoutSeconds | integer | Maximum execution duration before the job is killed (default: 300)

    The response returns a jobId, the computed nextRunAt timestamp, and the initial status of pending.

    Design decision: We use a callback-based execution model (the scheduler calls a URL when the job fires) rather than embedding execution logic. This keeps the scheduler generic — it does not need to know what jobs do. The handler could be a Lambda function, a Kubernetes pod, or a simple webhook endpoint.
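
    To make the contract concrete, here is a minimal sketch of creating a one-time job against this API with Python's requests library. The base URL is a hypothetical placeholder; the field and response names follow the tables above.

    import requests

    # Hypothetical base URL for the scheduler service (placeholder, not part of the design).
    BASE_URL = "https://scheduler.example.com"

    job = {
        "name": "send-reminder-email",
        "handler": "https://notifications.internal/send",   # callback URL invoked at trigger time
        "payload": {"userId": "u_123", "template": "reminder"},
        "scheduleAt": "2026-03-13T15:00:00Z",                # one-time trigger
        "priority": "high",
        "maxRetries": 5,
        "timeoutSeconds": 120,
    }

    resp = requests.post(f"{BASE_URL}/v1/jobs", json=job, timeout=5)
    resp.raise_for_status()

    created = resp.json()
    print(created["jobId"], created["nextRunAt"], created["status"])   # status starts as "pending"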

    Stage 3: Data Model

    We need two primary tables and a well-defined state machine for job lifecycle management.

    Jobs Table

    Column | Type | Purpose
    job_id | UUID (PK) | Unique identifier
    name | string | Human-readable name
    handler | string | Callback URL or function ID
    payload | JSON | Data passed to handler
    schedule_type | enum | one_time or recurring
    cron_expression | string (nullable) | Cron expression for recurring jobs
    next_run_at | timestamp | Next scheduled execution time (indexed)
    status | enum | pending, running, completed, failed, cancelled
    priority | integer | 0 (high), 1 (normal), 2 (low)
    max_retries | integer | Maximum retry attempts
    retry_count | integer | Current retry attempt number
    timeout_seconds | integer | Execution timeout
    depends_on | array of UUIDs | Parent job IDs (DAG edges)
    partition_key | integer | Scheduler partition assignment (0..N-1)
    locked_by | string (nullable) | Scheduler instance ID holding the lock
    locked_until | timestamp (nullable) | Lock expiration (lease-based)
    created_at | timestamp | Creation time
    updated_at | timestamp | Last update time

    Execution History Table

    Column | Type | Purpose
    execution_id | UUID (PK) | Unique execution identifier
    job_id | UUID (FK, indexed) | Parent job
    attempt | integer | Which attempt (1, 2, 3...)
    status | enum | running, completed, failed, timed_out
    started_at | timestamp | When the attempt started
    finished_at | timestamp (nullable) | When the attempt finished
    error_message | text (nullable) | Failure reason
    worker_id | string | Which worker executed this attempt

    State Machine

    Every job follows a strict state machine that prevents invalid transitions:

    • pending — Job is scheduled and waiting for its trigger time. Transitions to running when picked up by a worker, or cancelled if the user cancels it.
    • running — Job is actively being executed by a worker. Transitions to completed on success, failed on error, or back to pending on timeout (if retries remain).
    • completed — Terminal state. For recurring jobs, a new execution cycle begins: next_run_at is recomputed from the cron expression, and status resets to pending.
    • failed — If retry_count < max_retries, transitions back to pending with an exponential backoff delay applied to next_run_at. Otherwise, terminal — the job is moved to the dead letter queue.
    • cancelled — Terminal state. No further executions.
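
    A compact way to enforce these rules in the job service is a transition table consulted before any status update. This is a minimal sketch of the states listed above; the function and dictionary names are illustrative, not from a specific framework.

    # Allowed job status transitions, mirroring the state machine described above.
    ALLOWED_TRANSITIONS = {
        "pending":   {"running", "cancelled"},
        "running":   {"completed", "failed", "pending"},   # back to pending on timeout if retries remain
        "completed": {"pending"},                           # recurring jobs re-arm for the next cron fire
        "failed":    {"pending"},                           # retried with backoff while retry_count < max_retries
        "cancelled": set(),                                 # terminal
    }

    def transition(current: str, target: str) -> str:
        """Return the new status, or raise if the transition is not allowed."""
        if target not in ALLOWED_TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current} -> {target}")
        return target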

    Key Indexes

    • Composite index on (partition_key, status, next_run_at) — This is the critical query path. Each scheduler partition polls for jobs where status = 'pending' and next_run_at <= NOW() within its assigned partition. The composite index makes this a fast range scan.
    • Index on job_id in the execution history table — For querying execution history of a specific job.
    Interview tip: Mentioning the composite index proactively shows the interviewer you are thinking about query performance at scale, not just logical correctness.

    Stage 4: High-Level Architecture

    The system has four major components: the Job API, the Scheduler, the Task Queue, and the Worker Pool. Let us trace the lifecycle of a job through each.

    Job API

    A stateless REST service behind a load balancer. It validates requests, writes jobs to the Job Store (PostgreSQL), and returns immediately. The API does not execute jobs — it only manages their metadata. For recurring jobs, it parses the cron expression and computes the initial next_run_at.

    Scheduler (Time-Based Trigger)

    The Scheduler is the heart of the system. It is responsible for detecting when a job's trigger time has arrived and enqueuing it for execution. There are two architectural approaches:

    Approach 1: Polling. Each Scheduler instance periodically queries the Job Store for jobs where next_run_at <= NOW() and status = 'pending'. Polling interval is typically 500ms to 1s. This is simple, battle-tested, and works well up to moderate scale. The downside is wasted queries when no jobs are due, and a worst-case delay equal to the polling interval.

    Approach 2: Delay queue / event-driven. When a job is created, the API pushes it into a delay queue (e.g., Redis sorted set with score = trigger timestamp, or a message broker with delayed delivery). The Scheduler consumes from this queue, receiving jobs exactly when they are due. This eliminates polling overhead but introduces a dependency on the delay queue's reliability and ordering guarantees.

    We choose a hybrid approach: use a delay queue (Redis sorted set with ZRANGEBYSCORE) as the primary trigger mechanism, with periodic polling as a safety net to catch any jobs that might have been missed due to queue failures. This gives us sub-second trigger accuracy with a durable fallback.
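
    A sketch of that hybrid trigger loop with redis-py is shown below. The sorted-set key name, the 500ms sleep, and the reconcile_from_db fallback hook are illustrative assumptions; the score is the job's trigger time in epoch seconds.

    import time
    import redis

    r = redis.Redis()
    DUE_SET = "scheduler:due_jobs"   # assumed key: member = job_id, score = trigger time (epoch seconds)

    def register(job_id: str, run_at_epoch: float) -> None:
        # Called by the Job API when a job is created or a recurring job is re-armed.
        r.zadd(DUE_SET, {job_id: run_at_epoch})

    def trigger_loop(enqueue, reconcile_from_db):
        while True:
            now = time.time()
            # Primary path: everything whose trigger time (score) has passed is due.
            for member in r.zrangebyscore(DUE_SET, 0, now):
                if r.zrem(DUE_SET, member):          # only the instance that removes the member enqueues it
                    enqueue(member.decode())
            # Safety net: re-scan the Job Store for due pending jobs that never
            # reached the sorted set (for example, Redis was unavailable at creation time).
            reconcile_from_db(now)
            time.sleep(0.5)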

    Task Queue

    Once the Scheduler determines a job is ready, it enqueues it into a priority task queue (e.g., RabbitMQ with priority queues, or a multi-queue setup with separate queues per priority level). The queue decouples the triggering decision from actual execution, allowing the worker pool to scale independently.
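
    If you go with RabbitMQ priority queues, the setup is small. A sketch with pika follows; the queue name and the 0-9 priority range are illustrative choices rather than part of the design.

    import pika

    # Assumed local broker; x-max-priority turns this queue into a RabbitMQ priority queue.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="jobs.tasks", durable=True,
                          arguments={"x-max-priority": 10})

    def enqueue(job_id: str, priority: int) -> None:
        # The Scheduler maps high/normal/low onto this numeric range before publishing.
        channel.basic_publish(
            exchange="",
            routing_key="jobs.tasks",
            body=job_id.encode(),
            properties=pika.BasicProperties(priority=priority, delivery_mode=2),   # persistent message
        )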

    Worker Pool

    Workers consume from the task queue, execute the job (by calling the handler URL with the payload), and report the result back. Workers are stateless and horizontally scalable. Each worker:

    1. Dequeues a job from the task queue.
    2. Creates an execution record with status running.
    3. Calls the handler URL with the job payload (HTTP POST with a timeout).
    4. On success: updates execution to completed, updates job status. For recurring jobs, computes next next_run_at.
    5. On failure: increments retry_count. If retries remain, resets status to pending with a backoff delay. Otherwise, marks the job as failed and pushes it to the Dead Letter Queue.
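
    A sketch of that loop, including the exponential backoff used for retries, looks like this. The start_execution, finish_execution, update_job, and push_to_dlq helpers are placeholders for the queue and database layer, not a specific library.

    import random
    import time
    import requests

    def backoff_seconds(retry_count: int, base: float = 2.0, cap: float = 300.0) -> float:
        # Exponential backoff with full jitter: ~2s, 4s, 8s, ... capped at 5 minutes.
        return random.uniform(0, min(cap, base * (2 ** retry_count)))

    def run_one(job, start_execution, finish_execution, update_job, push_to_dlq):
        execution_id = start_execution(job["job_id"])            # step 2: execution record, status=running
        try:
            resp = requests.post(                                 # step 3: call the handler with the payload
                job["handler"],
                json={"executionId": execution_id, "payload": job["payload"]},
                timeout=job["timeout_seconds"],
            )
            resp.raise_for_status()
        except Exception as exc:
            finish_execution(execution_id, status="failed", error=str(exc))
            if job["retry_count"] < job["max_retries"]:           # step 5: retry with backoff if attempts remain
                update_job(job["job_id"], status="pending",
                           retry_count=job["retry_count"] + 1,
                           next_run_at=time.time() + backoff_seconds(job["retry_count"]))
            else:
                update_job(job["job_id"], status="failed")
                push_to_dlq(job)
            return
        finish_execution(execution_id, status="completed")        # step 4: success path
        update_job(job["job_id"], status="completed")             # recurring jobs recompute next_run_at here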

    Stage 5: Deep Dive

    We will dive deep into two critical challenges: distributed locking for exactly-once execution and job dependencies with DAG execution.

    Deep Dive 1: Distributed Locking for Exactly-Once Execution

    With multiple Scheduler instances polling for due jobs, the same job could be picked up by two instances simultaneously, leading to duplicate execution. This is unacceptable for operations like payment processing or email delivery.

    Lease-Based Locking

    We use a lease-based locking mechanism directly in the database. When a Scheduler picks up a job, it performs an atomic conditional update:

    UPDATE jobs
    SET status = 'running', locked_by = 'scheduler-3', locked_until = NOW() + INTERVAL '30 seconds'
    WHERE job_id = ?
      AND status = 'pending'
      AND (locked_by IS NULL OR locked_until < NOW())

    This query succeeds for exactly one Scheduler instance due to the WHERE clause acting as an optimistic lock. If another instance tries the same update, the status is no longer pending and the update affects zero rows.
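
    In code, "did I win the lease" is just an affected-row check. A sketch with psycopg2, assuming the connection and the scheduler instance ID come from elsewhere:

    LEASE_SQL = """
        UPDATE jobs
        SET status = 'running', locked_by = %s, locked_until = NOW() + INTERVAL '30 seconds'
        WHERE job_id = %s
          AND status = 'pending'
          AND (locked_by IS NULL OR locked_until < NOW())
    """

    def try_acquire(conn, instance_id: str, job_id: str) -> bool:
        # Only the single instance whose UPDATE matched a row gets rowcount == 1.
        with conn.cursor() as cur:
            cur.execute(LEASE_SQL, (instance_id, job_id))
            acquired = cur.rowcount == 1
        conn.commit()
        return acquired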

    Why Lease-Based, Not Permanent Locks?

    If a Scheduler acquires a lock and then crashes before enqueuing the job, a permanent lock would leave the job stranded forever. The lease (a time-bounded lock via locked_until) ensures that after 30 seconds, another Scheduler can reclaim the job. The lease duration should be significantly longer than the time needed to enqueue the job (typically under 100ms) but short enough to limit delay on recovery.

    Redis as an Alternative Lock Store

    For higher throughput, you can move the lock to Redis. In the simplest form, the Scheduler calls SET job:{jobId}:lock scheduler-3 NX PX 30000 against a single Redis instance before processing; the Redlock algorithm extends this pattern across several independent Redis nodes for stronger guarantees. Either way, it is faster than a database round-trip and avoids contention on the jobs table, but it introduces Redis as a critical dependency and requires careful handling of clock skew. For most use cases, the database-level lease is simpler and sufficient.
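
    For reference, the single-node Redis variant is one call in redis-py; the key format mirrors the command above and the 30-second TTL matches the lease duration.

    import redis

    r = redis.Redis()

    def try_acquire_redis(job_id: str, instance_id: str) -> bool:
        # SET key value NX PX 30000: succeeds only if the lock key does not already exist.
        return bool(r.set(f"job:{job_id}:lock", instance_id, nx=True, px=30_000))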

    End-to-End Exactly-Once Guarantee

    The lock prevents duplicate triggering, but we also need to prevent duplicate execution at the worker level. The task queue should use acknowledgment-based delivery: a job is invisible to other workers while one is processing it, and only permanently removed from the queue when the worker sends an explicit ACK. If the worker crashes, the visibility timeout expires and the job becomes available for another worker. Combined with an idempotency key (the execution_id), the handler can detect and deduplicate retries.
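
    On the handler side, the execution_id doubles as an idempotency key. A sketch of a webhook handler that ignores redeliveries, assuming a Flask app and a Redis set-if-absent guard (any shared store with an atomic conditional write works):

    import redis
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    seen = redis.Redis()

    def do_the_actual_work(payload):
        # Placeholder for the job's real logic (sending the email, running the ETL step, ...).
        print("processing", payload)

    @app.route("/send", methods=["POST"])
    def handle_job():
        body = request.get_json()
        execution_id = body["executionId"]
        # Set-if-absent guard: only the first delivery of this execution does real work.
        first_delivery = seen.set(f"handled:{execution_id}", 1, nx=True, ex=24 * 3600)
        if not first_delivery:
            return jsonify({"status": "duplicate_ignored"}), 200
        do_the_actual_work(body["payload"])
        return jsonify({"status": "ok"}), 200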

    Deep Dive 2: Job Dependencies and DAG Execution

    Supporting job dependencies means allowing users to define a directed acyclic graph where a job only runs after all its upstream dependencies have completed successfully.

    DAG Representation

    Each job stores a depends_on array of parent job IDs. This implicitly defines a DAG. Before a job can transition from pending to running, we verify that all parent jobs have status completed. The Scheduler performs this check when evaluating a job:

    SELECT COUNT(*) FROM jobs WHERE job_id = ANY(current_job.depends_on) AND status != 'completed'

    If the count is zero, all dependencies are satisfied and the job can proceed. Otherwise, it remains in pending.

    Triggering Downstream Jobs

    When a job completes, we need to check whether any downstream jobs are now unblocked. We maintain a reverse dependency index — either as a separate table or as a query on the depends_on column:

    SELECT job_id FROM jobs WHERE ? = ANY(depends_on) AND status = 'pending'

    For each returned job, we re-evaluate its full dependency set. If all parents are completed, we update its next_run_at to NOW() so the Scheduler picks it up immediately.
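
    Putting the two queries together, the completion hook might look like the following sketch with psycopg2; the helper name and transaction handling are illustrative.

    def on_job_completed(conn, completed_job_id: str) -> None:
        with conn.cursor() as cur:
            # Reverse lookup: pending jobs that list the completed job as a dependency.
            cur.execute(
                "SELECT job_id, depends_on FROM jobs "
                "WHERE %s = ANY(depends_on) AND status = 'pending'",
                (completed_job_id,),
            )
            for job_id, depends_on in cur.fetchall():
                # Re-check the candidate's full dependency set.
                cur.execute(
                    "SELECT COUNT(*) FROM jobs WHERE job_id = ANY(%s) AND status != 'completed'",
                    (depends_on,),
                )
                (unmet,) = cur.fetchone()
                if unmet == 0:
                    # All parents completed: make the job due now so the Scheduler picks it up immediately.
                    cur.execute(
                        "UPDATE jobs SET next_run_at = NOW() WHERE job_id = %s AND status = 'pending'",
                        (job_id,),
                    )
        conn.commit()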

    Cycle Detection

    When a job with dependencies is created, the API must validate that adding this edge does not create a cycle. We perform a depth-first traversal from the new job's dependencies, following each parent's depends_on chain. If we encounter the new job's ID during traversal, a cycle exists and we reject the request with a 400 Bad Request.
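
    The check itself is a depth-first search over the parent edges, rooted at the new job's declared dependencies. A minimal sketch, with get_depends_on standing in for a lookup against the jobs table:

    def creates_cycle(new_job_id: str, new_deps: list, get_depends_on) -> bool:
        """Return True if creating new_job_id with dependencies new_deps would form a cycle."""
        stack = list(new_deps)
        visited = set()
        while stack:
            current = stack.pop()
            if current == new_job_id:
                return True                         # reached ourselves by following parent edges: cycle
            if current in visited:
                continue
            visited.add(current)
            stack.extend(get_depends_on(current))   # follow this parent's own depends_on chain
        return False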

    Failure Propagation

    When a job in a DAG fails permanently (exhausts all retries), downstream jobs cannot proceed. We propagate the failure by transitioning all reachable downstream jobs to a blocked state (a sub-state of pending) and notifying the user. The user can then fix the root cause, manually retry the failed job, and the DAG resumes automatically.
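
    Propagation is a breadth-first walk over the reverse dependency edges. A sketch, with get_dependents and mark_blocked as placeholders for the reverse lookup and the status update:

    def propagate_failure(failed_job_id: str, get_dependents, mark_blocked) -> None:
        # Walk the reverse edges, marking every reachable downstream job as blocked.
        frontier = [failed_job_id]
        seen = set()
        while frontier:
            current = frontier.pop()
            for dependent in get_dependents(current):    # jobs whose depends_on contains current
                if dependent not in seen:
                    seen.add(dependent)
                    mark_blocked(dependent)              # blocked sub-state of pending, plus user notification
                    frontier.append(dependent)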

    Interview tip: When discussing DAG execution, draw a small example (A → B → D, A → C → D) on the whiteboard. It makes the dependency resolution algorithm concrete and shows the interviewer you can communicate abstract concepts visually.

    Stage 6: Scaling and Trade-Offs

    Partitioned Schedulers

    A single Scheduler instance polling the entire jobs table will not scale to 10 million jobs per day. We partition the job space across multiple Scheduler instances:

    • Each job is assigned a partition_key (computed as hash(job_id) % N where N is the number of partitions).
    • Each Scheduler instance owns a range of partitions and only polls for jobs within its assigned range.
    • Partition assignment is managed by a coordination service (ZooKeeper or etcd). When a Scheduler instance joins or leaves, partitions are rebalanced.

    With 16 partitions and 4 Scheduler instances, each instance handles approximately 25% of the job space. If one instance fails, its partitions are redistributed to the survivors. This is similar to how Kafka consumer groups manage partition assignment.
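
    Two small details matter in practice: the partition key must come from a hash that is stable across processes (Python's built-in hash() is salted per process), and each instance's poll query filters on its owned partitions. A sketch, with the partition count and query shape as illustrative choices:

    import zlib

    NUM_PARTITIONS = 16

    def partition_key(job_id: str) -> int:
        # CRC32 is stable across processes and restarts, unlike Python's built-in hash().
        return zlib.crc32(job_id.encode()) % NUM_PARTITIONS

    # Each Scheduler instance polls only the partitions assigned to it by ZooKeeper/etcd,
    # hitting the (partition_key, status, next_run_at) composite index from Stage 3.
    POLL_SQL = """
        SELECT job_id FROM jobs
        WHERE partition_key = ANY(%s)
          AND status = 'pending'
          AND next_run_at <= NOW()
        ORDER BY next_run_at
        LIMIT 100
    """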

    Worker Autoscaling

    Worker demand is inherently bursty. At the top of every hour, thousands of cron jobs fire simultaneously. We implement autoscaling based on queue depth:

    • Scale-up trigger: When the task queue depth exceeds a threshold (e.g., 1,000 pending items) for more than 30 seconds, add worker instances.
    • Scale-down trigger: When workers are idle (no jobs dequeued for 5 minutes), remove instances gradually.
    • Pre-scaling: For predictable traffic patterns (top-of-hour spikes), preemptively scale up at :58 every hour based on historical data.

    If running on Kubernetes, use KEDA (the Kubernetes Event-Driven Autoscaler) to scale worker deployments based on queue length. On AWS, Lambda functions triggered by SQS provide automatic scaling with zero management overhead.

    Failure Handling

    There are three categories of failure, each handled differently:

    Worker failure mid-execution. The task queue's visibility timeout ensures the job reappears after the timeout expires. Another worker picks it up. The retry_count increments, and the execution record shows a timed_out entry for the original attempt.

    Scheduler failure. Partition reassignment happens within seconds via the coordination service. The lease-based lock on any in-progress jobs expires, and the new partition owner re-evaluates them. Worst case, a job fires 30 seconds late (the lease duration).

    Database failure. This is the most severe. The Scheduler and API are both unavailable. Mitigation: use a multi-AZ PostgreSQL deployment with automatic failover (RDS Multi-AZ, or Patroni for self-managed). Failover typically completes in under 30 seconds. During the outage, the task queue continues processing already-enqueued jobs, limiting the blast radius.

    Dead Letter Queue

    Jobs that exhaust all retry attempts are moved to a Dead Letter Queue (DLQ). The DLQ serves multiple purposes:

    • Visibility: Operators can inspect failed jobs, understand why they failed, and decide whether to fix and retry or discard.
    • Alerting: A monitoring system watches the DLQ depth. If it exceeds a threshold, it fires an alert. A growing DLQ often indicates a systemic issue (downstream service outage, misconfigured handler).
    • Manual retry: An admin API endpoint allows replaying jobs from the DLQ back into the main queue after the root cause is fixed.

    Capacity Estimation

    • Job storage: 10M jobs/day, average 1 KB per job = 10 GB/day of new data. With 90-day retention, that is ~900 GB. PostgreSQL handles this comfortably with proper indexing and partitioning by created_at.
    • Queue throughput: Peak of ~1,000 jobs/second (10x average). RabbitMQ or SQS handles this easily; Kafka can handle orders of magnitude more if needed.
    • Worker count: With up to 100K jobs in flight at peak and handlers that are I/O-bound HTTP calls, a single worker can multiplex many executions. Assuming ~200 concurrent executions per worker (see the sketch after this list), that is ~500 workers at peak. With autoscaling, baseline can be 50 workers scaling up to 500.
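
    The arithmetic behind these numbers, with the per-worker concurrency figure called out explicitly as an assumption:

    jobs_per_day = 10_000_000
    avg_rps = jobs_per_day / 86_400                   # ~115 jobs/second on average
    peak_rps = avg_rps * 10                           # ~1,150/second during bursty peaks

    storage_per_day_gb = jobs_per_day * 1_024 / 1e9   # ~10 GB/day at ~1 KB per job
    retention_gb = storage_per_day_gb * 90            # ~900 GB at 90-day retention

    peak_concurrent = 100_000
    concurrent_per_worker = 200                       # assumption: I/O-bound handlers on async workers
    workers_at_peak = peak_concurrent / concurrent_per_worker   # ~500 workers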

    Trade-Off Summary

    Decision | Choice | Trade-Off
    Trigger mechanism | Redis sorted set + DB polling fallback | Sub-second accuracy with durability guarantee. Redis failure degrades to polling (1s delay), not outage.
    Locking | Database lease-based | Simpler than Redlock, but adds a write per job trigger. Acceptable at 1K/s peak.
    Queue | RabbitMQ with priority queues | Native priority support. Less throughput than Kafka, but sufficient for our scale and simpler operationally.
    Job store | PostgreSQL | Strong consistency, transactions for state machine transitions. Shard if we exceed 100K writes/second.
    DAG execution | Reverse dependency lookup on completion | Adds query overhead on job completion, but keeps the data model simple. For complex DAGs (1000+ nodes), consider a dedicated graph store.
    Worker scaling | Queue-depth autoscaling | Reactive, so there is a brief lag on sudden spikes. Pre-scaling at known peak times compensates.

    Scoring Tips

    To score well on a job scheduler design question, keep these principles in mind:

    • Clarify job types early. One-time vs. recurring vs. DAG-based jobs have very different implications for the data model and scheduler logic. Show the interviewer you recognize this.
    • Exactly-once is the key challenge. Interviewers will probe on duplicate execution. Explain the lease-based lock mechanism clearly, including what happens when a lock holder crashes. Mention the visibility timeout on the task queue as the second layer of protection.
    • Distinguish triggering from execution. The Scheduler decides when to run a job. The Worker decides how. Keeping these separate enables independent scaling and simpler failure handling.
    • Address the bursty traffic pattern. Cron jobs cluster at the top of every hour and minute. If you do not mention this, the interviewer will ask. Pre-scaling and queue buffering are the standard answers.
    • Show operational awareness. Dead letter queues, monitoring, alerting, and manual retry capabilities signal production experience. These are not afterthoughts — they are essential for a system that other teams depend on.
    • Know your numbers. 10M jobs/day is ~115/s average, ~1K/s peak. 100K concurrent executions, at ~200 in-flight handler calls per worker, works out to ~500 workers. Having these numbers ready shows quantitative thinking.

    Practice walking through this architecture end-to-end in under 35 minutes. Focus on the trade-offs at each decision point rather than listing features. If you can explain why you chose lease-based locking over Redlock, or why polling is a good fallback for the delay queue, you will demonstrate the kind of engineering judgment interviewers are looking for. Tools like Hoppers AI can help you rehearse this walkthrough with real-time feedback on structure, depth, and pacing.