Design a Distributed Job Scheduler
TL;DR
A distributed job scheduler accepts tasks (one-time or recurring), stores them durably, and executes them at the right time on a pool of workers -- exactly once if possible, at-least-once if necessary. The hard part is not firing a job at the right second. The hard part is making sure it fires exactly once when 10 scheduler nodes are all looking at the same job table, handling the case where a worker crashes mid-execution, and deciding who creates the next instance of a recurring job. Celery Beat is a single point of failure. Cron on a single machine silently skips any job that comes due while the machine is down. Real job schedulers (Uber's Cherami, Amazon SQS delay queues, Airflow) solve these coordination problems in different ways, and understanding the trade-offs matters.
The System
A job scheduler takes a request like "run this function at 3:00 AM tomorrow" or "run this function every 15 minutes" and makes it happen reliably. The jobs themselves can be anything: sending a batch of emails, generating nightly reports, cleaning up expired database records, syncing data between services, or triggering ML training pipelines.
Uber's internal scheduler handles millions of scheduled tasks per day -- trip reminders, driver payout calculations, surge pricing recalculations, and fraud detection sweeps. Amazon's internal scheduler orchestrates hundreds of millions of Lambda invocations on timers. Airflow, originally built at Airbnb, is now one of the most widely used open-source workflow schedulers and handles DAGs (directed acyclic graphs) of dependent tasks. All of these systems must guarantee that a scheduled job actually runs, runs at approximately the right time, and does not run twice (or at least handles double-execution gracefully).
Requirements
Functional
- Submit a job: Client provides a handler (function/endpoint to call), a payload, and an execution time (one-time) or a cron expression (recurring)
- Cancel a job: Cancel a pending job before it executes
- Job status: Query the status of a job (pending, running, completed, failed, cancelled)
- Retry on failure: Configurable retry count and backoff strategy (immediate, linear, exponential)
- Priority levels: High-priority jobs execute before low-priority ones when workers are contended
- Overlapping execution policy: For recurring jobs, define behavior when the previous execution has not finished: skip, queue, or run in parallel
Non-Functional
- At-least-once execution: Every submitted job must execute at least once. Missed executions are unacceptable (a missed payroll job costs real money)
- Near-exact timing: Jobs execute within 5 seconds of their scheduled time (not millisecond-precision, but not "sometime this minute")
- Throughput: Schedule and execute 100K jobs per minute, with spikes to 500K/min during batch windows (midnight UTC, end-of-month)
- Durability: Scheduled jobs survive scheduler restarts, database failovers, and datacenter switchovers
- Scalability: Adding more worker nodes should linearly increase execution capacity
- Idempotency support: The system should support idempotency keys so that duplicate executions produce the same result as a single execution
Back-of-Envelope Math
Job volume:
100K jobs/min = 1,667 jobs/sec steady state
500K jobs/min peak = 8,333 jobs/sec
Job storage:
Each job record: ~500 bytes (handler, payload, schedule, status, metadata)
Jobs retained for 30 days: 100K/min * 60 * 24 * 30 = 4.32B records
4.32B * 500 bytes = 2.16 TB
Need sharding or a time-series-friendly store. Single Postgres won't hold this.
But: most of these are completed jobs (historical). Active/pending jobs at any
moment: ~1M (roughly the jobs due within the next 10 minutes at 100K/min)
1M * 500 bytes = 500 MB. Fits in memory.
Worker capacity:
Average job duration: 2 seconds
A single worker executes 30 jobs/min (60s / 2s)
For 100K jobs/min: need 3,334 workers
For 500K jobs/min peak: need 16,667 workers
With container-based auto-scaling, this is feasible.
Database write load:
Each job requires at minimum 3 writes:
1. INSERT (job created)
2. UPDATE (status -> running)
3. UPDATE (status -> completed/failed)
100K jobs/min * 3 writes = 300K writes/min = 5,000 writes/sec
A single Postgres instance handles 10K+ writes/sec. Fits for steady state.
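The arithmetic above can be re-derived in a few lines. This is only a sketch that restates the estimates already given (500-byte records, 30-day retention, 2-second jobs, 3 writes per job); the constant and function names are illustrative.

```python
import math

# Assumptions restated from the estimates above.
JOBS_PER_MIN_STEADY = 100_000
JOBS_PER_MIN_PEAK = 500_000
RECORD_BYTES = 500
RETENTION_DAYS = 30
AVG_JOB_SECONDS = 2
WRITES_PER_JOB = 3  # insert, mark running, mark completed/failed


def jobs_per_sec(jobs_per_min: int) -> float:
    return jobs_per_min / 60


def workers_needed(jobs_per_min: int) -> int:
    jobs_per_worker_per_min = 60 / AVG_JOB_SECONDS  # 30 jobs/min per worker
    return math.ceil(jobs_per_min / jobs_per_worker_per_min)


records_retained = JOBS_PER_MIN_STEADY * 60 * 24 * RETENTION_DAYS
storage_tb = records_retained * RECORD_BYTES / 1e12
writes_per_sec = jobs_per_sec(JOBS_PER_MIN_STEADY) * WRITES_PER_JOB

print(f"steady: {jobs_per_sec(JOBS_PER_MIN_STEADY):,.0f} jobs/sec")
print(f"peak: {jobs_per_sec(JOBS_PER_MIN_PEAK):,.0f} jobs/sec")
print(f"retained records: {records_retained:,} (~{storage_tb:.2f} TB)")
print(f"workers: {workers_needed(JOBS_PER_MIN_STEADY):,} steady, "
      f"{workers_needed(JOBS_PER_MIN_PEAK):,} peak")
print(f"writes/sec: {writes_per_sec:,.0f}")
```

Changing a single constant (for example, the average job duration) shows how sensitive the worker count is to that assumption.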
The Naive Design
┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────┐
│ Client │────>│ Scheduler │────>│ PostgreSQL │ │ Worker │
│ │<────│ (single) │<────│ │────>│ (single)│
└──────────┘ └──────────────┘ └──────────────┘ └──────────┘
-- Poll loop in the scheduler (runs every second):
SELECT * FROM jobs
WHERE status = 'pending' AND execute_at <= NOW()
ORDER BY execute_at
LIMIT 100;
-- For each job found:
UPDATE jobs SET status = 'running' WHERE id = ?;
-- dispatch to worker
This is basically cron with a database. One scheduler polls the job table, one worker runs jobs. Fine for a startup running 50 jobs/hour.
Where Does This Break First?
The scheduler is a single point of failure. If it crashes, no jobs run until someone restarts it. But even before that, the polling loop has a subtler problem: if you run two scheduler instances for redundancy, they both SELECT the same jobs and dispatch them to different workers. Every job runs twice.
Where It Breaks
Problem 1: Dual execution with multiple schedulers. You need at least two scheduler instances for availability. Both poll the database every second. Both see the same pending jobs. Both dispatch them. Result: every job runs twice. This is the core problem of distributed job scheduling.
Problem 2: Worker crashes mid-execution. A worker picks up a job, starts processing, and crashes. The job's status is "running" but nobody is running it. Without a heartbeat or timeout mechanism, it stays "running" forever -- a zombie job that never completes and never retries.
Problem 3: Polling is wasteful and introduces latency. If the scheduler polls every 1 second, a job waits up to 1 second (0.5 seconds on average) before it is picked up. If you poll every 100ms to reduce latency, you are running 10 queries/sec against the database. Multiply by 2 schedulers and that is 20 QPS of polling overhead, most of which returns zero rows.
Problem 4: Recurring job re-scheduling. A cron job runs every 15 minutes. After the 12:00 execution completes, who creates the 12:15 execution? If the worker creates it, and the worker crashes after executing but before creating the next instance, the 12:15 run never gets scheduled. If the scheduler creates it, and two schedulers are running, you get two 12:15 instances.
Problem 5: The midnight thundering herd. Many batch jobs are scheduled for midnight UTC (reports, cleanups, data syncs). At 23:59:59, your pending queue has 200 jobs. At 00:00:00, it has 50,000. Your polling query suddenly returns 50,000 rows and your scheduler chokes.
The Real Design
┌──────────────────────────────────────┐
│ API Service (stateless) │
│ POST /jobs, GET /jobs/:id │
└────────────────┬─────────────────────┘
│
v
┌──────────────────────────────────────┐
│ PostgreSQL (primary) │
│ ┌──────────────────────────────┐ │
│ │ jobs table: │ │
│ │ id, handler, payload, status │ │
│ │ execute_at, priority │ │
│ │ idempotency_key, locked_by │ │
│ │ locked_until, retry_count │ │
│ └──────────────────────────────┘ │
└────────────────┬─────────────────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │
┌────────v──────┐ ┌────────v──────┐ ┌────────v──────┐
│ Scheduler 1 │ │ Scheduler 2 │ │ Scheduler 3 │
│ (leader or │ │ (peer) │ │ (peer) │
│ peer) │ │ │ │ │
└────────┬──────┘ └────────┬──────┘ └────────┬──────┘
│ │ │
v v v
┌──────────────────────────────────────────────────────────┐
│ Worker Pool (N workers) │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Worker 1│ │Worker 2│ │Worker 3│ │Worker N│ │
│ └────────┘ └────────┘ └────────┘ └────────┘ │
└──────────────────────────────────────────────────────────┘
The Core Trick: SELECT FOR UPDATE SKIP LOCKED
This is the single most important technique in this entire design. PostgreSQL's SELECT ... FOR UPDATE SKIP LOCKED solves the dual-execution problem at the database level.
-- Each scheduler runs this query every second:
BEGIN;

SELECT id, handler, payload
FROM jobs
WHERE status = 'pending'
  AND execute_at <= NOW()
ORDER BY priority DESC, execute_at ASC
LIMIT 50
FOR UPDATE SKIP LOCKED;

-- For each selected job:
UPDATE jobs
SET status = 'running',
    locked_by = 'scheduler-1',
    locked_until = NOW() + INTERVAL '5 minutes'
WHERE id = ?;

COMMIT;
Here is what SKIP LOCKED does: if Scheduler 1 is inside its transaction selecting jobs, those rows are locked. When Scheduler 2 runs the same query simultaneously, SKIP LOCKED tells Postgres to skip the rows that Scheduler 1 already locked, returning only unlocked rows. No duplicates. No waiting. No application-level locking.
Why this is better than application-level locking (Redis lock, ZooKeeper):
- The lock is the database transaction itself. No separate lock service to fail.
- If the scheduler crashes mid-transaction, Postgres automatically releases the row locks (transaction rollback).
- No clock skew issues (no expiration time to calculate).
- It is a single SQL query. No multi-step "acquire lock, read job, mark as taken, release lock" dance.
This was added in Postgres 9.5 specifically for job queue use cases. If you are designing a job scheduler and not using SKIP LOCKED, you are making your life harder than it needs to be.
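To build intuition for the skip-instead-of-wait behaviour without a database, the sketch below imitates it in memory: every job carries its own lock, and a claimer uses a non-blocking acquire so a row held by another claimer is skipped rather than waited on. This is an analogy only. The `Job` class, `claim_batch`, and `run_scheduler` are hypothetical names, and real Postgres row locks are held until the transaction commits rather than released per row.

```python
import threading
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    id: int
    status: str = "pending"
    locked_by: Optional[str] = None
    lock: threading.Lock = field(default_factory=threading.Lock, repr=False)


def claim_batch(jobs: List[Job], scheduler_id: str, limit: int = 50) -> List[int]:
    """Claim up to `limit` pending jobs, skipping any row another claimer holds."""
    claimed = []
    for job in jobs:
        if len(claimed) >= limit:
            break
        # Non-blocking acquire: a locked row is skipped, never waited on.
        if not job.lock.acquire(blocking=False):
            continue
        try:
            if job.status == "pending":
                job.status = "running"
                job.locked_by = scheduler_id
                claimed.append(job.id)
        finally:
            job.lock.release()
    return claimed


def run_scheduler(jobs: List[Job], scheduler_id: str, claimed_ids: List[int]) -> None:
    """Keep claiming batches until a pass finds nothing left to claim."""
    while True:
        batch = claim_batch(jobs, scheduler_id)
        if not batch:
            return
        claimed_ids.extend(batch)


jobs = [Job(id=i) for i in range(1000)]
results = {name: [] for name in ("scheduler-1", "scheduler-2", "scheduler-3")}
threads = [
    threading.Thread(target=run_scheduler, args=(jobs, name, ids))
    for name, ids in results.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_claimed = [job_id for ids in results.values() for job_id in ids]
print(len(all_claimed), len(set(all_claimed)))
```

Three competing claimers drain the list and no job id appears twice, which is the property the SQL query provides across scheduler processes.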
Timer Wheel vs. Min-Heap for Scheduling
The scheduler needs an efficient data structure to track "which jobs are due now?"
Min-Heap (priority queue)
Store pending jobs in a min-heap keyed by execute_at. The job with the earliest execution time is always at the top. Peek: O(1). Insert: O(log N). Delete: O(log N).
Heap: [(12:00:00, job_A), (12:00:05, job_B), (12:00:10, job_C), ...]
Every second:
    while heap is not empty and heap.peek().execute_at <= now():
        job = heap.pop()
        dispatch(job)
Pros: Simple, efficient for small-to-medium numbers of pending jobs (< 1M). Cons: At 10M pending jobs, each insert is O(log 10M) = ~23 comparisons. Not terrible, but not great for hot loops.
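The loop above maps almost directly onto Python's `heapq` module. The following is a minimal sketch, assuming timestamps are plain numbers and jobs are identified by string ids; `HeapScheduler` and its method names are illustrative.

```python
import heapq
import itertools
from typing import List, Tuple


class HeapScheduler:
    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        # Sequence number breaks ties so equal timestamps pop in insertion order.
        self._seq = itertools.count()

    def add(self, execute_at: float, job_id: str) -> None:
        heapq.heappush(self._heap, (execute_at, next(self._seq), job_id))

    def pop_due(self, now: float) -> List[str]:
        """Remove and return every job whose execute_at is <= now, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, job_id = heapq.heappop(self._heap)
            due.append(job_id)
        return due


scheduler = HeapScheduler()
scheduler.add(10.0, "job_A")
scheduler.add(5.0, "job_B")
scheduler.add(20.0, "job_C")
print(scheduler.pop_due(now=10.0))
print(scheduler.pop_due(now=25.0))
```

The empty-heap check in `pop_due` matters in practice: peeking an empty heap is the most common bug in a hand-rolled version of this loop.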
Timer Wheel (hashed timing wheel)
Divide time into fixed-width slots (e.g., 1 second each). A wheel has N slots. Job at execute_at = T goes into slot T % N. Each tick, the wheel advances and processes all jobs in the current slot.
Wheel with 3600 slots (1 hour at 1-second resolution):
Slot 0: [job_A, job_D] <- fires at T+0, T+3600, T+7200, ...
Slot 1: [job_B] <- fires at T+1, T+3601, ...
Slot 2: []
...
Slot 59: [job_C, job_E, job_F]
...
Insert: O(1) -- just append to the correct slot's list
Fire: O(k) where k is the number of jobs in the current slot
For jobs scheduled more than N seconds in the future, use a hierarchical timer wheel: a coarse wheel (hour-resolution) that promotes jobs to the fine wheel (second-resolution) as they approach.
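A single-level wheel can be sketched as follows. Note the assumption: instead of the hierarchical variant described above, this version handles far-future jobs with a per-entry "rounds" counter (the number of full rotations to wait), which is a simpler alternative. Time is modelled as integer ticks, and `HashedTimerWheel` is an illustrative name.

```python
from typing import List


class HashedTimerWheel:
    def __init__(self, num_slots: int = 3600, start_tick: int = 0) -> None:
        self.num_slots = num_slots
        self.current_tick = start_tick
        self.slots: List[list] = [[] for _ in range(num_slots)]

    def add(self, execute_at_tick: int, job_id: str) -> None:
        """O(1) insert: append to the slot for execute_at_tick."""
        # A job that is already due is placed on the very next tick.
        ticks_ahead = max(execute_at_tick - self.current_tick, 1)
        target = self.current_tick + ticks_ahead
        # Full rotations the wheel must complete before this entry fires.
        rounds = (ticks_ahead - 1) // self.num_slots
        self.slots[target % self.num_slots].append([rounds, job_id])

    def tick(self) -> List[str]:
        """Advance one tick and return the jobs due in the new current slot."""
        self.current_tick += 1
        index = self.current_tick % self.num_slots
        due, remaining = [], []
        for entry in self.slots[index]:
            if entry[0] == 0:
                due.append(entry[1])
            else:
                entry[0] -= 1  # not yet: wait one more rotation
                remaining.append(entry)
        self.slots[index] = remaining
        return due


wheel = HashedTimerWheel(num_slots=4)
wheel.add(2, "job_A")
wheel.add(6, "job_B")  # same slot as job_A, one rotation later
wheel.add(4, "job_C")
fired = [wheel.tick() for _ in range(6)]
print(fired)
```

The trade-off against a hierarchical wheel is that every entry in a slot is visited on each rotation, so a slot crowded with far-future jobs costs more per tick.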
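Netty (Java networking framework) uses a hashed timer wheel (HashedWheelTimer). Kafka uses a hierarchical timing wheel to track delayed operations in its request purgatory (for example, produce and fetch requests waiting on a timeout).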
My recommendation: Use a min-heap if you have fewer than 1M active pending jobs (most systems). Use a timer wheel if you have 10M+ pending jobs or need O(1) insertion (Kafka-scale). In an interview, mention both and justify your choice based on the expected pending job count.
At-Least-Once with Idempotent Tasks
In a distributed system, guaranteeing exactly-once execution is impossible (see: Two Generals Problem). You either get at-most-once (job might not run) or at-least-once (job might run twice). For a job scheduler, at-least-once is the right choice -- a missed payroll is worse than a duplicate payroll (which can be deduplicated).
How duplicate execution happens even with SKIP LOCKED:
1. Scheduler picks job, dispatches to Worker A
2. Worker A starts executing
3. Worker A finishes but crashes before ACKing completion
4. Scheduler's lock timeout expires (5 minutes)
5. Job status is still "running" but locked_until has passed
6. Scheduler picks up the job again, dispatches to Worker B
7. Worker B executes the same job
The job ran twice. The scheduler did everything right -- the crash caused the duplicate.
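Step 6 implies a mechanism the pickup query does not show: that query only selects `pending` rows, so something must return an expired lease to `pending` first. The sketch below is one assumed way to do that, written against in-memory rows so it can run standalone; `JobRow`, `reclaim_expired`, and the `max_retries` default are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobRow:
    id: int
    status: str
    locked_until: Optional[float] = None  # lease expiry as a Unix timestamp
    retry_count: int = 0


def reclaim_expired(jobs: List[JobRow], now: float, max_retries: int = 5) -> List[int]:
    """Return 'running' jobs with an expired lease to the pending pool."""
    reclaimed = []
    for job in jobs:
        lease_expired = job.locked_until is not None and job.locked_until < now
        if job.status == "running" and lease_expired:
            job.retry_count += 1
            # Give up once the retry budget is spent instead of looping forever.
            job.status = "pending" if job.retry_count <= max_retries else "failed"
            job.locked_until = None
            reclaimed.append(job.id)
    return reclaimed


rows = [
    JobRow(id=1, status="running", locked_until=100.0),  # lease expired
    JobRow(id=2, status="running", locked_until=500.0),  # still leased
    JobRow(id=3, status="pending"),
]
print(reclaim_expired(rows, now=300.0))
```

A long-running job would need to extend `locked_until` periodically (a heartbeat); otherwise a healthy but slow worker looks identical to a crashed one.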
Making tasks idempotent:
Every job submission includes an idempotency_key. The worker checks this key before side effects:
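from datetime import timedelta

def execute_job(job):
    # `db` and `now` are placeholders for your storage client and clock.
    # Check whether this exact execution already completed
    if db.exists("completed_idempotency_keys", job.idempotency_key):
        return  # Already done, skip

    # Execute the actual work
    result = job.handler(job.payload)

    # Record completion. To be truly atomic, this write must share a
    # transaction with the handler's side effects; otherwise a crash
    # between the two steps still allows a duplicate run.
    db.insert("completed_idempotency_keys",
              job.idempotency_key,
              expires_at=now() + timedelta(days=7))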
For financial operations (transfers, charges), the idempotency key is essential. Stripe's API accepts an Idempotency-Key header on POST requests for exactly this reason -- a client or job that retries a charge after a timeout reuses the same key, and Stripe returns the original result instead of charging twice.
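When the side effect lives in the same database as the key table, the check and the write can be collapsed into one transaction, which closes the gap between "did the work" and "recorded the key". The following is a sketch under that assumption, using SQLite so it runs standalone; the `ledger` table and `pay_once` function are hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE completed_idempotency_keys (key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE ledger (account TEXT, amount INTEGER)")
    conn.commit()
    return conn


def pay_once(conn: sqlite3.Connection, idempotency_key: str,
             account: str, amount: int) -> bool:
    """Apply the payment at most once per key. Returns False for a duplicate."""
    with conn:  # one transaction: commit on success, roll back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO completed_idempotency_keys (key) VALUES (?)",
            (idempotency_key,),
        )
        if cur.rowcount == 0:
            return False  # key already present: duplicate delivery, do nothing
        conn.execute(
            "INSERT INTO ledger (account, amount) VALUES (?, ?)",
            (account, amount),
        )
        return True


conn = make_db()
print(pay_once(conn, "payroll-2024-01-alice", "alice", 100))
print(pay_once(conn, "payroll-2024-01-alice", "alice", 100))
```

If the side effect is an external call (an email, a third-party charge), this trick does not apply directly; the key then has to be passed through to the downstream service so it can deduplicate on its side.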
Recurring Job Re-Scheduling
Who creates the next execution of a cron job? This is where Celery Beat gets it wrong.
Celery Beat (the SPOF approach): A single "beat" process wakes up every second, checks which recurring jobs are due, and enqueues them. If the beat process crashes, no recurring jobs run. If you run two beat processes, every recurring job runs twice. Celery's documentation explicitly warns that only a single beat scheduler may run for a given schedule at a time; otherwise tasks are duplicated.
Better approach: worker creates next execution
1. Scheduler picks recurring job for 12:00 execution
2. Worker executes the job
3. Worker calculates next execution time: 12:15 (from cron expression)
4. Worker atomically: UPDATE jobs SET status='completed' WHERE id=current_id;
INSERT INTO jobs (..., execute_at='12:15') VALUES (...);
5. Both operations in the same transaction. Either both succeed or neither does.
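Steps 3 to 5 can be sketched with SQLite standing in for Postgres. Two assumptions to flag: a fixed `interval` replaces real cron-expression parsing (which needs a cron library), and the simplified `jobs` schema and `complete_and_schedule_next` function are illustrative.

```python
import sqlite3
from datetime import datetime, timedelta


def complete_and_schedule_next(conn: sqlite3.Connection, job_id: int,
                               interval: timedelta) -> datetime:
    """Mark the current run completed and insert the next run atomically."""
    with conn:  # both writes commit together or not at all
        recurring_id, execute_at = conn.execute(
            "SELECT recurring_id, execute_at FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
        # Base the next run on the scheduled time, not on the completion time.
        next_at = datetime.fromisoformat(execute_at) + interval
        conn.execute("UPDATE jobs SET status = 'completed' WHERE id = ?", (job_id,))
        conn.execute(
            "INSERT INTO jobs (recurring_id, status, execute_at) "
            "VALUES (?, 'pending', ?)",
            (recurring_id, next_at.isoformat()),
        )
    return next_at


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, recurring_id INTEGER, "
    "status TEXT, execute_at TEXT)"
)
conn.execute(
    "INSERT INTO jobs (recurring_id, status, execute_at) "
    "VALUES (7, 'running', '2024-01-01T12:00:00')"
)
conn.commit()

next_at = complete_and_schedule_next(conn, job_id=1, interval=timedelta(minutes=15))
print(next_at.isoformat())
print(conn.execute("SELECT id, status, execute_at FROM jobs ORDER BY id").fetchall())
```

Computing `next_at` from the scheduled time rather than from "now" keeps the schedule anchored to the clock even when a run finishes late.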
Edge case: what if the worker crashes before creating the next execution?
The scheduler has a safety net: a separate "guardian" query that runs every minute:
-- Find recurring jobs whose last execution completed but no next execution exists
SELECT * FROM recurring_jobs r
WHERE NOT EXISTS (
SELECT 1 FROM jobs j
WHERE j.recurring_id = r.id
AND j.status IN ('pending', 'running')
)
AND r.enabled = true;
If a recurring job has no pending or running instance, the guardian creates one based on the cron expression. This is a backup -- the worker path handles 99.99% of cases, and the guardian catches the rare crash scenarios.
Timing drift pitfall: If a cron job runs every 5 minutes but execution takes 10 minutes, creating the next execution entry only on completion causes schedule drift (the job runs every 15 minutes instead of 5). Fix: create the next scheduled execution at the intended time immediately when the current execution starts, not when it completes. If the previous run is still in progress when the next fires, apply the overlapping execution policy below (skip if a missed run is acceptable, queue if every execution matters).
Overlapping Execution Policies
A recurring job runs every 15 minutes, but sometimes it takes 20 minutes to complete. What happens at 12:15 when the 12:00 execution is still running?
Policy 1: Skip -- Do not start the 12:15 execution. Wait for the 12:00 execution to finish, then schedule the next one. This is the safest default. Used by: Airflow (max_active_runs=1).
Policy 2: Queue -- Enqueue the 12:15 execution. It will start as soon as the 12:00 execution finishes. The 12:15 and 12:00 runs never overlap, but the 12:15 run starts late. Risk: if jobs consistently take longer than the interval, the queue grows unboundedly.
Policy 3: Allow parallel -- Start the 12:15 execution even though 12:00 is still running. Both run simultaneously. Only safe if the job is idempotent and does not have ordering dependencies (e.g., a job that sends "daily summary" emails -- two summaries is annoying but not catastrophic).
-- In the scheduler's pickup query:
-- Policy: Skip
SELECT * FROM jobs
WHERE status = 'pending'
  AND execute_at <= NOW()
  AND (overlap_policy != 'skip'
       OR NOT EXISTS (
           SELECT 1 FROM jobs j2
           WHERE j2.recurring_id = jobs.recurring_id
             AND j2.status = 'running'
       ))
LIMIT 50
FOR UPDATE SKIP LOCKED;
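The three policies reduce to a small decision function. The sketch below is an illustration rather than part of the pickup query; the `OverlapPolicy` enum and the action names (`start`, `drop`, `wait`) are assumptions chosen for the example.

```python
from enum import Enum


class OverlapPolicy(Enum):
    SKIP = "skip"
    QUEUE = "queue"
    PARALLEL = "parallel"


def decide(policy: OverlapPolicy, previous_running: bool) -> str:
    """Return what to do with a run that has just become due."""
    if not previous_running:
        return "start"  # nothing overlaps, every policy starts the run
    if policy is OverlapPolicy.SKIP:
        return "drop"   # this occurrence is abandoned
    if policy is OverlapPolicy.QUEUE:
        return "wait"   # stays pending until the previous run finishes
    return "start"      # PARALLEL: run alongside the previous execution


for policy in OverlapPolicy:
    print(policy.value, decide(policy, previous_running=True))
```

Keeping the decision in one place makes it easy to audit which recurring jobs can pile up (QUEUE) and which can silently miss occurrences (SKIP).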
Deep Dives

Deep Dive 1: Dead Letter Queues and Retry Strategies
A job fails. Now what?
Retry with exponential backoff:
Attempt 1: execute immediately
Attempt 2: retry after 30 seconds
Attempt 3: retry after 2 minutes
Attempt 4: retry after 8 minutes
Attempt 5: retry after 30 minutes
Add jitter (random delay of up to 25% of the backoff) to prevent thundering herd when many jobs fail simultaneously (e.g., a downstream service outage causes 10,000 jobs to fail at once, and they all retry at the same backoff interval).
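import random

def next_retry_delay(attempt):
    # attempt = number of failed attempts so far (1 after the first failure)
    # Yields 30s, 2 min, 8 min, then the 30-minute cap, matching the schedule above
    base_delay = min(30 * (4 ** (attempt - 1)), 1800)  # max 30 minutes
    jitter = random.uniform(0, base_delay * 0.25)
    return base_delay + jitter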
Dead letter queue (DLQ): After exhausting all retries (e.g., 5 attempts), move the job to a dead letter table. These jobs require human investigation -- the automated retry did not work.
-- DLQ table
CREATE TABLE dead_letter_jobs (
id BIGINT PRIMARY KEY,
original_job_id BIGINT REFERENCES jobs(id),
handler TEXT,
payload JSONB,
failure_reason TEXT,
failed_at TIMESTAMP,
retry_count INT,
last_error TEXT
);
An alerting system monitors the DLQ table. If more than N jobs land in the DLQ within M minutes, page the on-call engineer -- something systemic is wrong (database down, external API unreachable).
Deep Dive 2: Handling the Midnight Spike
At midnight UTC, 50,000 daily batch jobs become eligible simultaneously. The scheduler's polling query returns 50,000 rows. The worker pool gets overwhelmed.
Solution 1: Staggered scheduling with jitter
When a client submits a daily job scheduled for "midnight," the system adds random jitter of up to 30 minutes.
This spreads 50,000 jobs across a 30-minute window instead of a single second.
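A minimal sketch of that jitter step, assuming it is applied once at submission time and only to jobs whose owners tolerate a late start; `with_jitter` and `MAX_JITTER` are illustrative names.

```python
import random
from datetime import datetime, timedelta

MAX_JITTER = timedelta(minutes=30)


def with_jitter(execute_at: datetime, rng: random.Random = random.Random()) -> datetime:
    """Shift execute_at later by a uniformly random offset of up to MAX_JITTER."""
    offset_seconds = rng.uniform(0, MAX_JITTER.total_seconds())
    return execute_at + timedelta(seconds=offset_seconds)


midnight = datetime(2024, 1, 1, 0, 0, 0)
samples = [with_jitter(midnight) for _ in range(5)]
for ts in samples:
    print(ts.isoformat())
```

The jitter only moves jobs later, never earlier, so a job never fires before the time the client asked for.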
Solution 2: Priority-based execution with backpressure
Workers pull jobs, not push. Each worker has a concurrency limit (e.g., 10 parallel jobs). When all workers are saturated, pending jobs simply wait in the database. The ORDER BY priority DESC in the pickup query ensures high-priority jobs execute first.
High priority (P0): payment processing, SLA-bound reports
Medium priority (P1): analytics aggregation, cache warming
Low priority (P2): cleanup tasks, non-urgent notifications
During the midnight spike, P0 jobs execute immediately. P1 jobs wait 1-5 minutes. P2 jobs may wait 15-30 minutes. This is acceptable because P2 jobs by definition are not time-sensitive.
Solution 3: Auto-scaling workers
Monitor the pending job queue depth. When pending count exceeds a threshold (e.g., 10x normal), auto-scale the worker pool. Kubernetes HPA (Horizontal Pod Autoscaler) can scale based on a custom metric (pending job count from a Prometheus query).
# Kubernetes HPA
metrics:
  - type: External
    external:
      metric:
        name: pending_job_count
      target:
        type: Value
        value: 1000  # scale up when > 1000 pending jobs
Deep Dive 3: Observability and SLOs
A job scheduler without observability is a black box that silently drops jobs.
Key metrics to track:
1. job_schedule_lag: time between execute_at and actual execution start
SLO: p99 < 5 seconds
Alert: p99 > 30 seconds for 5 minutes
2. job_execution_duration: how long each job takes
No SLO (varies by job type), but alert on > 2x historical p95
3. job_failure_rate: percentage of jobs that fail (after all retries)
SLO: < 0.1%
Alert: > 1% for 5 minutes
4. dlq_depth: number of jobs in the dead letter queue
Alert: > 100 new DLQ entries in 10 minutes
5. worker_utilization: percentage of worker capacity in use
Alert: > 80% sustained for 10 minutes (need to scale)
6. pending_queue_depth: jobs waiting to be picked up
Alert: growing for > 5 minutes (workers can't keep up)
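The schedule-lag SLO in item 1 comes down to a percentile over recent lag samples. In production this would normally come from the metrics system's histogram; the sketch below only illustrates the calculation with the nearest-rank method, and `percentile` and `lag_slo_met` are hypothetical helpers.

```python
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    # Integer form of ceil(pct / 100 * n), avoiding float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def lag_slo_met(lags_seconds: Sequence[float], slo_seconds: float = 5.0) -> bool:
    """True when p99 schedule lag is under the SLO threshold."""
    return percentile(lags_seconds, 99) < slo_seconds


healthy = [0.5] * 99 + [60.0]      # a single outlier in 100 samples
degraded = [0.5] * 98 + [60.0] * 2  # two outliers in 100 samples
print(lag_slo_met(healthy), lag_slo_met(degraded))
```

The two samples differ by a single slow job, which shows how sharp a p99 threshold is at small sample sizes; alerting on a sustained breach, as listed above, smooths that out.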
Distributed tracing: Each job carries a trace ID. The scheduler's pickup, worker's execution, and any downstream service calls all log this trace ID. When a job fails, you can trace the entire lifecycle: submission -> scheduling -> dispatch -> execution -> failure.
Alternative Designs
Alternative 1: Message Queue-Based (SQS + Lambda)
Instead of polling a database, use a message queue with a delivery delay.
SQS supports message delays up to 15 minutes. For longer delays, re-enqueue with a new delay. AWS Step Functions support wait states of up to 1 year.
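The re-enqueue approach is simple arithmetic: each hop waits at most 900 seconds (the 15-minute SQS limit), and the consumer re-sends the message until the remaining delay reaches zero. The sketch below models only that arithmetic and makes no AWS calls; `next_hop_delay` and `hops_required` are illustrative names.

```python
import math

SQS_MAX_DELAY_SECONDS = 900  # 15 minutes, the per-message delay limit


def next_hop_delay(seconds_until_due: int) -> int:
    """Delay to use for the next enqueue, clamped to the per-message limit."""
    return max(0, min(seconds_until_due, SQS_MAX_DELAY_SECONDS))


def hops_required(total_delay_seconds: int) -> int:
    """How many times the message is enqueued before it is finally due."""
    return max(1, math.ceil(total_delay_seconds / SQS_MAX_DELAY_SECONDS))


remaining = 3600  # a job due one hour from now
hops = []
while remaining > 0:
    delay = next_hop_delay(remaining)
    hops.append(delay)
    remaining -= delay
print(hops, hops_required(3600))
```

Each hop is a billed send and receive, so jobs scheduled days ahead are usually better kept in a database and moved into the queue only when they are within one hop of being due.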
Alternative 2: Kafka + Scheduler Service
Jobs are Kafka messages. A scheduler service consumes from a "scheduled-jobs" topic and produces to an "execute-now" topic at the right time. Workers consume from "execute-now."
Alternative 3: ZooKeeper/etcd for Leader Election + Cron
One scheduler instance is the leader (via ZooKeeper leader election). Only the leader polls. If the leader dies, ZooKeeper elects a new leader within seconds.
| Aspect | Postgres SKIP LOCKED | SQS + Lambda | Kafka + Scheduler | ZooKeeper Leader + Cron |
|---|---|---|---|---|
| At-least-once guarantee | Yes (row locks) | Yes (SQS guarantees) | Yes (offset management) | Yes (leader polls) |
| Exactly-once possible | With idempotency keys | With idempotency keys | Kafka transactions | With idempotency keys |
| Max delay | Unlimited | 15 min (or re-enqueue) | Unlimited | Unlimited |
| Throughput | ~10K jobs/sec | ~3K invocations/sec | 100K+ msgs/sec | ~1K jobs/sec (single) |
| Operational complexity | Low (just Postgres) | Low (managed AWS) | High (Kafka cluster) | High (ZK cluster) |
| Recurring jobs | Custom guardian logic | EventBridge rules | Custom logic | Cron library |
| Job visibility | SQL queries | CloudWatch | Kafka consumer lag | Custom monitoring |
| Cost at 100K jobs/min | ~$500/mo (DB + workers) | ~$3K/mo (Lambda + SQS) | ~$2K/mo (Kafka cluster) | ~$800/mo (VMs) |
I would default to Postgres with SKIP LOCKED for most interview answers. It is the simplest, requires no additional infrastructure (you already have Postgres), and handles 10K jobs/sec on a single instance. If the interviewer pushes beyond that scale, mention Kafka for throughput and SQS+Lambda for serverless simplicity.
Scaling Math Verification
100K jobs/min steady state (1,667 jobs/sec):
- Postgres pickup query: 1,667 * 3 writes (insert + mark running + mark complete) = 5,000 writes/sec. Single Postgres handles 10K+. Fine.
- SKIP LOCKED contention: with 3 schedulers, each grabbing 50 jobs per query (every second), contention is minimal. Rows are locked for < 100ms (time to mark as running).
- Worker capacity: at 2 sec/job average, need 3,334 workers. At 40 concurrent jobs per container (e.g., 10 worker processes running 4 jobs each), that is ~84 containers.
500K jobs/min peak (8,333 jobs/sec):
- Postgres writes: 25,000/sec. Need a beefier instance (r6g.2xlarge) or batch writes.
- Workers: 16,667 needed. Auto-scale from 84 to ~420 containers. Kubernetes can handle this in 2-3 minutes with pre-warmed node pools.
Pending queue during midnight spike:
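- 50,000 jobs become eligible at once. With 3 schedulers each grabbing 50 per second, drain rate = 150 jobs/sec. Time to clear: 50,000 / 150 = 333 seconds (~5.5 minutes). With staggered scheduling, spread across 30 minutes: ~28 jobs/sec average. Well within capacity. Note that 150 jobs/sec is also well below the 1,667 jobs/sec steady-state target, so at full load the batch size or polling frequency must be raised (about 556 jobs per scheduler per second with 3 schedulers).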
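The drain-time estimate generalizes to any combination of scheduler count, batch size, and polling frequency. The sketch below restates that arithmetic; `drain_seconds` and `batch_size_for` are illustrative helpers, and the model ignores query latency and lock contention.

```python
import math


def drain_seconds(backlog: int, schedulers: int, batch_size: int,
                  polls_per_sec: float = 1.0) -> float:
    """Time to clear a backlog when every poll returns a full batch."""
    drain_rate = schedulers * batch_size * polls_per_sec  # jobs/sec
    return backlog / drain_rate


def batch_size_for(target_jobs_per_sec: float, schedulers: int,
                   polls_per_sec: float = 1.0) -> int:
    """Smallest batch size that sustains the target pickup rate."""
    return math.ceil(target_jobs_per_sec / (schedulers * polls_per_sec))


print(f"midnight spike: {drain_seconds(50_000, 3, 50):.0f} s to drain")
print(f"batch size for 1,667 jobs/sec: {batch_size_for(1667, 3)}")
print(f"batch size for 8,333 jobs/sec: {batch_size_for(8333, 3)}")
```

Plugging in the steady-state and peak rates shows that `LIMIT 50` at one poll per second is sized for the spike example, not for the headline throughput targets.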
Failure Analysis
| Component | Current capacity | At 10x (1M jobs/min) | Breaks? | Fix |
|---|---|---|---|---|
| Postgres writes | 10K writes/sec | 50K writes/sec | Yes | Shard by job_id range, or move to Kafka |
| Postgres SKIP LOCKED | 3 schedulers, 50/batch | 10 schedulers, 500/batch | Maybe | Partition jobs table by execute_at range |
| Worker pool | 84 containers | 4,200 containers | Maybe | Kubernetes with spot instances, pre-warm nodes |
| Job history storage | 2.16 TB/month | 21.6 TB/month | Yes | Archive completed jobs to S3, TTL on jobs table |
| Scheduler polling | 3 queries/sec | 10 queries/sec | No | -- |
| Network (dispatch) | 1,667 dispatches/sec | 16,667 dispatches/sec | No | -- |
The first thing to break at 10x is Postgres write throughput. At 50K writes/sec, you need to shard the jobs table or move job dispatch to Kafka (publish jobs as messages, workers consume). The Postgres table becomes the "source of truth" for scheduling, but Kafka handles the high-throughput dispatch path.
The second bottleneck is storage. At 21.6 TB/month of job history, you need an archival strategy: move completed jobs older than 7 days to S3 (Parquet files for queryability) and keep only active/recent jobs in Postgres.
What's Expected at Each Level
| Aspect | Mid-Level | Senior | Staff+ |
|---|---|---|---|
| Deduplication | "Use a lock" | Describes database-level row locking | SELECT FOR UPDATE SKIP LOCKED, explains why app locks are worse |
| Recurring jobs | "Use cron" | Worker creates next instance, mentions race condition | Guardian query, overlapping execution policies, Celery Beat SPOF |
| Failure handling | "Retry the job" | Exponential backoff with retry count | DLQ, idempotency keys, distinguishes transient vs permanent fail |
| Timing data structure | Not discussed | Min-heap for pending jobs | Timer wheel for high-volume, hierarchical wheel for long delays |
| Scaling | "Add more workers" | Auto-scaling based on queue depth | Partitioned job table, Kafka for dispatch, midnight spike mitigation |
| Observability | Job status endpoint | Metrics for lag and failure rate | SLOs with alerting, distributed tracing across job lifecycle |
| Real-world reference | "Like Airflow" | Celery Beat SPOF, SQS delay queues | Uber's scheduler, Postgres SKIP LOCKED, Kafka timer wheel |
The single most important signal at any level: do you understand that the hard problem is not "trigger a function at time T" but "make sure exactly one scheduler picks up job X when multiple schedulers are competing for the same set of jobs"? The SKIP LOCKED answer demonstrates that you have thought about this concretely, not just hand-waved about distributed locking.