Design a Distributed Job Scheduler

TL;DR

A distributed job scheduler accepts tasks (one-time or recurring), stores them durably, and executes them at the right time on a pool of workers -- exactly once if possible, at-least-once if necessary. The hard part is not firing a job at the right second. The hard part is making sure it fires exactly once when 10 scheduler nodes are all looking at the same job table, handling the case where a worker crashes mid-execution, and deciding who creates the next instance of a recurring job. Celery Beat is a single point of failure. Cron on a single machine does not survive reboots. Real job schedulers (Uber's Cherami, Amazon SQS delay queues, Airflow) solve these coordination problems in different ways, and understanding the trade-offs matters.

The System

A job scheduler takes a request like "run this function at 3:00 AM tomorrow" or "run this function every 15 minutes" and makes it happen reliably. The jobs themselves can be anything: sending a batch of emails, generating nightly reports, cleaning up expired database records, syncing data between services, or triggering ML training pipelines.

Uber's internal scheduler handles millions of scheduled tasks per day -- trip reminders, driver payout calculations, surge pricing recalculations, and fraud detection sweeps. Amazon's internal scheduler orchestrates hundreds of millions of Lambda invocations on timers. Airflow, originally built at Airbnb, is now the most popular open-source job scheduler and handles DAGs (directed acyclic graphs) of dependent tasks. All of these systems must guarantee that a scheduled job actually runs, runs at approximately the right time, and does not run twice (or at least handles double-execution gracefully).

Requirements

Functional

Submit a job: Client provides a handler (function/endpoint to call), a payload, and an execution time (one-time) or a cron expression (recurring)
Cancel a job: Cancel a pending job before it executes
Job status: Query the status of a job (pending, running, completed, failed, cancelled)
Retry on failure: Configurable retry count and backoff strategy (immediate, linear, exponential)
Priority levels: High-priority jobs execute before low-priority ones when workers are contiguous
Overlapping execution policy: For recurring jobs, define behavior when the previous execution has not finished: skip, queue, or run in parallel

Non-Functional

At-least-once execution: Every submitted job must execute at least once. Missed executions are unacceptable (a missed payroll job costs real money)
Near-exact timing: Jobs execute within 5 seconds of their scheduled time (not millisecond-precision, but not "sometime this minute")
Throughput: Schedule and execute 100K jobs per minute, with spikes to 500K/min during batch windows (midnight UTC, end-of-month)
Durability: Scheduled jobs survive scheduler restarts, database failovers, and datacenter switchovers
Scalability: Adding more worker nodes should linearly increase execution capacity
Idempotency support: The system should support idempotency keys so that duplicate executions produce the same result as a single execution

Back-of-Envelope Math

Job volume:
  100K jobs/min = 1,667 jobs/sec steady state
  500K jobs/min peak = 8,333 jobs/sec

Job storage:
  Each job record: ~500 bytes (handler, payload, schedule, status, metadata)
  Jobs retained for 30 days: 100K/min * 60 * 24 * 30 = 4.32B records
  4.32B * 500 bytes = 2.16 TB
  Need sharding or a time-series-friendly store. Single Postgres won't hold this.

  But: most of these are completed jobs (historical). Active/pending jobs at any
  moment: ~1M (jobs scheduled within the next hour)
  1M * 500 bytes = 500 MB. Fits in memory.

Worker capacity:
  Average job duration: 2 seconds
  A single worker executes 30 jobs/min (60s / 2s)
  For 100K jobs/min: need 3,334 workers
  For 500K jobs/min peak: need 16,667 workers
  With container-based auto-scaling, this is feasible.

Database write load:
  Each job requires at minimum 3 writes:
    1. INSERT (job created)
    2. UPDATE (status -> running)
    3. UPDATE (status -> completed/failed)
  100K jobs/min * 3 writes = 300K writes/min = 5,000 writes/sec
  A single Postgres instance handles 10K+ writes/sec. Fits for steady state.

The Naive Design

┌──────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────┐
│  Client   │────>│  Scheduler   │────>│  PostgreSQL  │     │  Worker  │
│           │<────│  (single)    │<────│              │────>│  (single)│
└──────────┘     └──────────────┘     └──────────────┘     └──────────┘

-- Poll loop in the scheduler (runs every second):
SELECT * FROM jobs 
WHERE status = 'pending' AND execute_at <= NOW()
ORDER BY execute_at
LIMIT 100;

-- For each job found:
UPDATE jobs SET status = 'running' WHERE id = ?;
-- dispatch to worker

This is basically cron with a database. One scheduler polls the job table, one worker runs jobs. Fine for a startup running 50 jobs/hour.

Where Does This Break First?

The scheduler is a single point of failure. If it crashes, no jobs run until someone restarts it. But even before that, the polling loop has a subtler problem: if you run two scheduler instances for redundancy, they both SELECT the same jobs and dispatch them to different workers. Every job runs twice.

Where It Breaks

Problem 1: Dual execution with multiple schedulers. You need at least two scheduler instances for availability. Both poll the database every second. Both see the same pending jobs. Both dispatch them. Result: every job runs twice. This is the core problem of distributed job scheduling.

Problem 2: Worker crashes mid-execution. A worker picks up a job, starts processing, and crashes. The job's status is "running" but nobody is running it. Without a heartbeat or timeout mechanism, it stays "running" forever -- a zombie job that never completes and never retries.

Problem 3: Polling is wasteful and introduces latency. If the scheduler polls every 1 second, jobs are delayed by up to 1 second on average. If you poll every 100ms to reduce latency, you are running 10 queries/sec against the database. Multiply by 2 schedulers and that is 20 QPS of polling overhead, most of which returns zero rows.

Problem 4: Recurring job re-scheduling. A cron job runs every 15 minutes. After the 12:00 execution completes, who creates the 12:15 execution? If the worker creates it, and the worker crashes after executing but before creating the next instance, the 12:15 run never gets scheduled. If the scheduler creates it, and two schedulers are running, you get two 12:15 instances.

Problem 5: The midnight thundering herd. Many batch jobs are scheduled for midnight UTC (reports, cleanups, data syncs). At 11:59:59, your pending queue has 200 jobs. At 12:00:00, it has 50,000. Your polling query suddenly returns 50,000 rows and your scheduler chokes.

The Real Design

                    ┌──────────────────────────────────────┐
                    │          API Service (stateless)      │
                    │  POST /jobs, GET /jobs/:id            │
                    └────────────────┬─────────────────────┘
                                     │
                                     v
                    ┌──────────────────────────────────────┐
                    │          PostgreSQL (primary)         │
                    │  ┌──────────────────────────────┐    │
                    │  │ jobs table:                    │    │
                    │  │  id, handler, payload, status  │    │
                    │  │  execute_at, priority          │    │
                    │  │  idempotency_key, locked_by    │    │
                    │  │  locked_until, retry_count     │    │
                    │  └──────────────────────────────┘    │
                    └────────────────┬─────────────────────┘
                                     │
              ┌──────────────────────┼──────────────────────┐
              │                      │                      │
     ┌────────v──────┐      ┌────────v──────┐      ┌────────v──────┐
     │  Scheduler 1  │      │  Scheduler 2  │      │  Scheduler 3  │
     │  (leader or   │      │  (peer)       │      │  (peer)       │
     │   peer)       │      │               │      │               │
     └────────┬──────┘      └────────┬──────┘      └────────┬──────┘
              │                      │                      │
              v                      v                      v
     ┌──────────────────────────────────────────────────────────┐
     │                    Worker Pool (N workers)                │
     │  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐        │
     │  │Worker 1│  │Worker 2│  │Worker 3│  │Worker N│        │
     │  └────────┘  └────────┘  └────────┘  └────────┘        │
     └──────────────────────────────────────────────────────────┘

The Core Trick: SELECT FOR UPDATE SKIP LOCKED

This is the single most important technique in this entire design. PostgreSQL's SELECT ... FOR UPDATE SKIP LOCKED solves the dual-execution problem at the database level.

-- Each scheduler runs this query every second:
BEGIN;

SELECT id, handler, payload 
FROM jobs 
WHERE status = 'pending' 
  AND execute_at <= NOW()
ORDER BY priority DESC, execute_at ASC
LIMIT 50
FOR UPDATE SKIP LOCKED;

-- For each selected job:
UPDATE jobs 
SET status = 'running', 
    locked_by = 'scheduler-1',
    locked_until = NOW() + INTERVAL '5 minutes'
WHERE id = ?;

COMMIT;

Here is what SKIP LOCKED does: if Scheduler 1 is inside its transaction selecting jobs, those rows are locked. When Scheduler 2 runs the same query simultaneously, SKIP LOCKED tells Postgres to skip the rows that Scheduler 1 already locked, returning only unlocked rows. No duplicates. No waiting. No application-level locking.

Why this is better than application-level locking (Redis lock, ZooKeeper):

The lock is the database transaction itself. No separate lock service to fail.
If the scheduler crashes mid-transaction, Postgres automatically releases the row locks (transaction rollback).
No clock skew issues (no expiration time to calculate).
It is a single SQL query. No multi-step "acquire lock, read job, mark as taken, release lock" dance.

This was added in Postgres 9.5 specifically for job queue use cases. If you are designing a job scheduler and not using SKIP LOCKED, you are making your life harder than it needs to be.

Timer Wheel vs. Min-Heap for Scheduling

The scheduler needs an efficient data structure to track "which jobs are due now?"

Min-Heap (priority queue)

Store pending jobs in a min-heap keyed by execute_at. The job with the earliest execution time is always at the top. Peek: O(1). Insert: O(log N). Delete: O(log N).

Heap: [(12:00:00, job_A), (12:00:05, job_B), (12:00:10, job_C), ...]

Every second:
  while heap.peek().execute_at <= now():
      job = heap.pop()
      dispatch(job)

Pros: Simple, efficient for small-to-medium numbers of pending jobs (< 1M). Cons: At 10M pending jobs, each insert is O(log 10M) = ~23 comparisons. Not terrible, but not great for hot loops.

Timer Wheel (hashed timing wheel)

Divide time into fixed-width slots (e.g., 1 second each). A wheel has N slots. Job at execute_at = T goes into slot T % N. Each tick, the wheel advances and processes all jobs in the current slot.

Wheel with 3600 slots (1 hour at 1-second resolution):

Slot 0:   [job_A, job_D]     <- fires at T+0, T+3600, T+7200, ...
Slot 1:   [job_B]            <- fires at T+1, T+3601, ...
Slot 2:   []
...
Slot 59:  [job_C, job_E, job_F]
...

Insert: O(1) -- just append to the correct slot's list
Fire:   O(k) where k is the number of jobs in the current slot

For jobs scheduled more than N seconds in the future, use a hierarchical timer wheel: a coarse wheel (hour-resolution) that promotes jobs to the fine wheel (second-resolution) as they approach.

Netty (Java networking framework) uses a hashed timer wheel. Kafka uses a hierarchical timer wheel for delayed message delivery.

My recommendation: Use a min-heap if you have fewer than 1M active pending jobs (most systems). Use a timer wheel if you have 10M+ pending jobs or need O(1) insertion (Kafka-scale). In an interview, mention both and justify your choice based on the expected pending job count.

At-Least-Once with Idempotent Tasks

In a distributed system, guaranteeing exactly-once execution is impossible (see: Two Generals Problem). You either get at-most-once (job might not run) or at-least-once (job might run twice). For a job scheduler, at-least-once is the right choice -- a missed payroll is worse than a duplicate payroll (which can be deduplicated).

How duplicate execution happens even with SKIP LOCKED:

1. Scheduler picks job, dispatches to Worker A
2. Worker A starts executing
3. Worker A finishes but crashes before ACKing completion
4. Scheduler's lock timeout expires (5 minutes)
5. Job status is still "running" but locked_until has passed
6. Scheduler picks up the job again, dispatches to Worker B
7. Worker B executes the same job

The job ran twice. The scheduler did everything right -- the crash caused the duplicate.

Making tasks idempotent:

Every job submission includes an idempotency_key. The worker checks this key before side effects:

def execute_job(job):
    # Check if this exact execution already completed
    if db.exists("completed_idempotency_keys", job.idempotency_key):
        return  # Already done, skip

    # Execute the actual work
    result = job.handler(job.payload)

    # Record completion atomically
    db.insert("completed_idempotency_keys", 
              job.idempotency_key, 
              expires_at=now() + timedelta(days=7))

For financial operations (transfers, charges), the idempotency key is essential. Stripe requires an Idempotency-Key header on all POST requests for exactly this reason -- their internal job scheduler may retry a failed charge, and the idempotency key prevents double-charging.

Recurring Job Re-Scheduling

Who creates the next execution of a cron job? This is where Celery Beat gets it wrong.

Celery Beat (the SPOF approach): A single "beat" process wakes up every second, checks which recurring jobs are due, and enqueues them. If the beat process crashes, no recurring jobs run. If you run two beat processes, every recurring job runs twice. Celery's documentation explicitly warns: "you should only run one beat instance."

Better approach: worker creates next execution

1. Scheduler picks recurring job for 12:00 execution
2. Worker executes the job
3. Worker calculates next execution time: 12:15 (from cron expression)
4. Worker atomically: UPDATE jobs SET status='completed' WHERE id=current_id;
                      INSERT INTO jobs (..., execute_at='12:15') VALUES (...);
5. Both operations in the same transaction. Either both succeed or neither does.

Edge case: what if the worker crashes before creating the next execution?

The scheduler has a safety net: a separate "guardian" query that runs every minute:

-- Find recurring jobs whose last execution completed but no next execution exists
SELECT * FROM recurring_jobs r
WHERE NOT EXISTS (
    SELECT 1 FROM jobs j 
    WHERE j.recurring_id = r.id 
    AND j.status IN ('pending', 'running')
)
AND r.enabled = true;

If a recurring job has no pending or running instance, the guardian creates one based on the cron expression. This is a backup -- the worker path handles 99.99% of cases, and the guardian catches the rare crash scenarios.

Timing drift pitfall: If a cron job runs every 5 minutes but execution takes 10 minutes, creating the next execution entry only on completion causes schedule drift (the job runs every 15 minutes instead of 5). Fix: create the next scheduled execution at the intended time immediately when the current execution starts, not when it completes. If the previous run is still in progress when the next fires, apply the overlapping execution policy below (skip if idempotent, queue if every execution matters).

Overlapping Execution Policies

A recurring job runs every 15 minutes, but sometimes it takes 20 minutes to complete. What happens at 12:15 when the 12:00 execution is still running?

Policy 1: Skip -- Do not start the 12:15 execution. Wait for the 12:00 execution to finish, then schedule the next one. This is the safest default. Used by: Airflow (max_active_runs=1).

Policy 2: Queue -- Enqueue the 12:15 execution. It will start as soon as the 12:00 execution finishes. The 12:15 and 12:00 runs never overlap, but the 12:15 run starts late. Risk: if jobs consistently take longer than the interval, the queue grows unboundedly.

Policy 3: Allow parallel -- Start the 12:15 execution even though 12:00 is still running. Both run simultaneously. Only safe if the job is idempotent and does not have ordering dependencies (e.g., a job that sends "daily summary" emails -- two summaries is annoying but not catastrophic).

-- In the scheduler's pickup query:
-- Policy: Skip
SELECT * FROM jobs 
WHERE status = 'pending' 
  AND execute_at <= NOW()
  AND (overlap_policy != 'skip' 
       OR NOT EXISTS (
           SELECT 1 FROM jobs j2 
           WHERE j2.recurring_id = jobs.recurring_id 
           AND j2.status = 'running'
       ))
FOR UPDATE SKIP LOCKED
LIMIT 50;

Deep Dives

Job Scheduler — Job Scheduler High-Level Design

Deep Dive 1: Dead Letter Queues and Retry Strategies

A job fails. Now what?

Retry with exponential backoff:

Attempt 1: execute immediately
Attempt 2: retry after 30 seconds
Attempt 3: retry after 2 minutes
Attempt 4: retry after 8 minutes
Attempt 5: retry after 30 minutes

Add jitter (random delay of up to 25% of the backoff) to prevent thundering herd when many jobs fail simultaneously (e.g., a downstream service outage causes 10,000 jobs to fail at once, and they all retry at the same backoff interval).

def next_retry_delay(attempt):
    base_delay = min(30 * (2 ** attempt), 1800)  # max 30 minutes
    jitter = random.uniform(0, base_delay * 0.25)
    return base_delay + jitter

Dead letter queue (DLQ): After exhausting all retries (e.g., 5 attempts), move the job to a dead letter table. These jobs require human investigation -- the automated retry did not work.

-- DLQ table
CREATE TABLE dead_letter_jobs (
    id BIGINT PRIMARY KEY,
    original_job_id BIGINT REFERENCES jobs(id),
    handler TEXT,
    payload JSONB,
    failure_reason TEXT,
    failed_at TIMESTAMP,
    retry_count INT,
    last_error TEXT
);

An alerting system monitors the DLQ table. If more than N jobs land in the DLQ within M minutes, page the on-call engineer -- something systemic is wrong (database down, external API unreachable).

Deep Dive 2: Handling the Midnight Spike

At midnight UTC, 50,000 daily batch jobs become eligible simultaneously. The scheduler's polling query returns 50,000 rows. The worker pool gets overwhelmed.

Solution 1: Staggered scheduling with jitter

When a client submits a daily job scheduled for "midnight," the system adds random jitter of up to 30 minutes:

scheduled_time = midnight + random.randint(0, 1800)  # 0-30 min jitter

This spreads 50,000 jobs across a 30-minute window instead of a single second.

Solution 2: Priority-based execution with backpressure

Workers pull jobs, not push. Each worker has a concurrency limit (e.g., 10 parallel jobs). When all workers are saturated, pending jobs simply wait in the database. The ORDER BY priority DESC in the pickup query ensures high-priority jobs execute first.

High priority (P0): payment processing, SLA-bound reports
Medium priority (P1): analytics aggregation, cache warming
Low priority (P2): cleanup tasks, non-urgent notifications

During the midnight spike, P0 jobs execute immediately. P1 jobs wait 1-5 minutes. P2 jobs may wait 15-30 minutes. This is acceptable because P2 jobs by definition are not time-sensitive.

Solution 3: Auto-scaling workers

Monitor the pending job queue depth. When pending count exceeds a threshold (e.g., 10x normal), auto-scale the worker pool. Kubernetes HPA (Horizontal Pod Autoscaler) can scale based on a custom metric (pending job count from a Prometheus query).

# Kubernetes HPA
metrics:
  - type: External
    external:
      metric:
        name: pending_job_count
      target:
        type: Value
        value: 1000  # scale up when > 1000 pending jobs

Deep Dive 3: Observability and SLOs

A job scheduler without observability is a black box that silently drops jobs.

Key metrics to track:

1. job_schedule_lag: time between execute_at and actual execution start
   SLO: p99 < 5 seconds
   Alert: p99 > 30 seconds for 5 minutes

2. job_execution_duration: how long each job takes
   No SLO (varies by job type), but alert on > 2x historical p95

3. job_failure_rate: percentage of jobs that fail (after all retries)
   SLO: < 0.1% 
   Alert: > 1% for 5 minutes

4. dlq_depth: number of jobs in the dead letter queue
   Alert: > 100 new DLQ entries in 10 minutes

5. worker_utilization: percentage of worker capacity in use
   Alert: > 80% sustained for 10 minutes (need to scale)

6. pending_queue_depth: jobs waiting to be picked up
   Alert: growing for > 5 minutes (workers can't keep up)

Distributed tracing: Each job carries a trace ID. The scheduler's pickup, worker's execution, and any downstream service calls all log this trace ID. When a job fails, you can trace the entire lifecycle: submission -> scheduling -> dispatch -> execution -> failure.

Alternative Designs

Alternative 1: Message Queue-Based (SQS + Lambda)

Instead of polling a database, use a message queue with delay:

Client -> SQS (with DelaySeconds) -> Lambda (worker)

SQS supports message delays up to 15 minutes. For longer delays, re-enqueue with a new delay. AWS Step Functions support wait states of up to 1 year.

Alternative 2: Kafka + Scheduler Service

Jobs are Kafka messages. A scheduler service consumes from a "scheduled-jobs" topic and produces to an "execute-now" topic at the right time. Workers consume from "execute-now."

Alternative 3: ZooKeeper/etcd for Leader Election + Cron

One scheduler instance is the leader (via ZooKeeper leader election). Only the leader polls. If the leader dies, ZooKeeper elects a new leader within seconds.

Aspect	Postgres SKIP LOCKED	SQS + Lambda	Kafka + Scheduler	ZooKeeper Leader + Cron
At-least-once guarantee	Yes (row locks)	Yes (SQS guarantees)	Yes (offset management)	Yes (leader polls)
Exactly-once possible	With idempotency keys	With idempotency keys	Kafka transactions	With idempotency keys
Max delay	Unlimited	15 min (or re-enqueue)	Unlimited	Unlimited
Throughput	~10K jobs/sec	~3K invocations/sec	100K+ msgs/sec	~1K jobs/sec (single)
Operational complexity	Low (just Postgres)	Low (managed AWS)	High (Kafka cluster)	High (ZK cluster)
Recurring jobs	Custom guardian logic	EventBridge rules	Custom logic	Cron library
Job visibility	SQL queries	CloudWatch	Kafka consumer lag	Custom monitoring
Cost at 100K jobs/min	~$500/mo (DB + workers)	~$3K/mo (Lambda + SQS)	~$2K/mo (Kafka cluster)	~$800/mo (VMs)

I would default to Postgres with SKIP LOCKED for most interview answers. It is the simplest, requires no additional infrastructure (you already have Postgres), and handles 10K jobs/sec on a single instance. If the interviewer pushes beyond that scale, mention Kafka for throughput and SQS+Lambda for serverless simplicity.

Scaling Math Verification

100K jobs/min steady state (1,667 jobs/sec):

Postgres pickup query: 1,667 * 3 writes (insert + mark running + mark complete) = 5,000 writes/sec. Single Postgres handles 10K+. Fine.
SKIP LOCKED contention: with 3 schedulers, each grabbing 50 jobs per query (every second), contention is minimal. Rows are locked for < 100ms (time to mark as running).
Worker capacity: at 2 sec/job average, need 3,334 workers. With 10 containers each running 4 concurrent jobs, need ~84 containers.

500K jobs/min peak (8,333 jobs/sec):

Postgres writes: 25,000/sec. Need a beefier instance (r6g.2xlarge) or batch writes.
Workers: 16,667 needed. Auto-scale from 84 to ~420 containers. Kubernetes can handle this in 2-3 minutes with pre-warmed node pools.

Pending queue during midnight spike:

50,000 jobs become eligible at once. With 3 schedulers each grabbing 50 per second, drain rate = 150 jobs/sec. Time to clear: 50,000 / 150 = 333 seconds (~5.5 minutes). With staggered scheduling, spread across 30 minutes: ~28 jobs/sec average. Well within capacity.

Failure Analysis

Component	Current capacity	At 10x (1M jobs/min)	Breaks?	Fix
Postgres writes	10K writes/sec	50K writes/sec	Yes	Shard by job_id range, or move to Kafka
Postgres SKIP LOCKED	3 schedulers, 50/batch	10 schedulers, 500/batch	Maybe	Partition jobs table by execute_at range
Worker pool	84 containers	4,200 containers	Maybe	Kubernetes with spot instances, pre-warm nodes
Job history storage	2.16 TB/month	21.6 TB/month	Yes	Archive completed jobs to S3, TTL on jobs table
Scheduler polling	3 queries/sec	10 queries/sec	No	--
Network (dispatch)	1,667 dispatches/sec	16,667 dispatches/sec	No	--

The first thing to break at 10x is Postgres write throughput. At 50K writes/sec, you need to shard the jobs table or move job dispatch to Kafka (publish jobs as messages, workers consume). The Postgres table becomes the "source of truth" for scheduling, but Kafka handles the high-throughput dispatch path.

The second bottleneck is storage. At 21.6 TB/month of job history, you need an archival strategy: move completed jobs older than 7 days to S3 (Parquet files for queryability) and keep only active/recent jobs in Postgres.

What's Expected at Each Level

Aspect	Mid-Level	Senior	Staff+
Deduplication	"Use a lock"	Describes database-level row locking	`SELECT FOR UPDATE SKIP LOCKED`, explains why app locks are worse
Recurring jobs	"Use cron"	Worker creates next instance, mentions race condition	Guardian query, overlapping execution policies, Celery Beat SPOF
Failure handling	"Retry the job"	Exponential backoff with retry count	DLQ, idempotency keys, distinguishes transient vs permanent fail
Timing data structure	Not discussed	Min-heap for pending jobs	Timer wheel for high-volume, hierarchical wheel for long delays
Scaling	"Add more workers"	Auto-scaling based on queue depth	Partitioned job table, Kafka for dispatch, midnight spike mitigation
Observability	Job status endpoint	Metrics for lag and failure rate	SLOs with alerting, distributed tracing across job lifecycle
Real-world reference	"Like Airflow"	Celery Beat SPOF, SQS delay queues	Uber's scheduler, Postgres SKIP LOCKED, Kafka timer wheel

The single most important signal at any level: do you understand that the hard problem is not "trigger a function at time T" but "make sure exactly one scheduler picks up job X when multiple schedulers are competing for the same set of jobs"? The SKIP LOCKED answer demonstrates that you have thought about this concretely, not just hand-waved about distributed locking.

References from Our Courses

Kafka Partitions and Ordering — job dispatch and result collection via topics
ZooKeeper Primitives — leader election for scheduler HA
Delivery Guarantees — at-least-once execution with idempotency keys
Etcd and Alternatives — distributed locking for job deduplication

Red Team This Design

Ready to stress-test this architecture? The Attack companion tears apart every decision in this design — from hardware physics to security holes to what actually happens at 10x scale.

Attack: Design a Distributed Job Scheduler →