Design an Online Code Judge

TL;DR

Build LeetCode -- a system where users submit code, the system compiles and runs it against test cases, and returns pass/fail results with execution time and memory usage. Sandboxing is THE challenge. User-submitted code is hostile by default: it can attempt to read /etc/passwd, fork bomb the system, allocate 100 GB of RAM, open network connections, or exploit kernel vulnerabilities to escape the container. gVisor (Google's user-space kernel) and Firecracker (AWS's microVM) are the two production-grade solutions. Compilation is a separate dangerous phase -- C++ templates can generate gigabytes of code from a few lines. Codeforces' pretests/systests pattern (test against a subset during contest, full suite after) is a clever trade-off between real-time feedback and compute cost. The system must handle 10K concurrent submissions during a contest while guaranteeing that execution time is deterministic and reproducible.

The System

LeetCode. A user writes a Python solution to "Two Sum," clicks Submit, and within 5 seconds sees: "Accepted. Runtime: 45 ms (beats 92%). Memory: 14.2 MB (beats 78%)." Behind that simple UX, the system has: received the source code, selected the correct language runtime, compiled it (for compiled languages), created a sandboxed execution environment, run the code against 50+ test cases with strict time and memory limits, compared the output against expected results, measured execution time and memory consumption, and returned the verdict.

Why is this a system design problem and not just "run code in Docker"? Because running untrusted code is one of the most dangerous things a server can do. The user's code is adversarial. It can attempt system calls that crash the host kernel. It can try to escape the sandbox and access other users' submissions. It can consume all available CPU, memory, or disk, denying service to other users. Every competitive programming platform has war stories about contestants exploiting the judge. Codeforces has been taken down by deliberately crafted submissions. HackerRank had a container escape vulnerability. The sandbox is not a nice-to-have; it is the entire architecture.

Requirements

Functional Requirements

Requirement	Details
Code submission	Accept source code in 10+ languages (Python, Java, C++, Go, Rust, etc.).
Compilation	Compile submitted code (for compiled languages) with configurable flags.
Test execution	Run compiled code against test cases with stdin/stdout matching.
Verdicts	Accepted, Wrong Answer, Time Limit Exceeded (TLE), Memory Limit Exceeded (MLE), Runtime Error (RE), Compilation Error (CE).
Resource measurement	Report execution time (ms) and memory usage (MB).
Custom test cases	Users can run code against their own input before official submission.

Non-Functional Requirements

Requirement	Target
Submission-to-verdict latency	< 10 seconds for most problems. < 30 seconds for hard problems with many test cases.
Sandboxing	Complete isolation. No access to host filesystem, network, or other submissions.
Deterministic timing	Same code, same test case = same execution time (within 5% variance).
Availability	99.9% (higher during contests: 99.95%)
Scale	10K concurrent submissions during contests, 1K during normal hours

Back-of-Envelope Math

Normal hours:
  Submissions/hour:          ~50,000
  Submissions/sec:           ~14
  Avg test cases per problem: 50
  Executions/sec:            14 * 50 = 700

Contest (2-hour window, 10K participants):
  Submissions/hour:          ~100,000 (10K users * 5 submissions/problem * 4 problems / 2 hours)
  Submissions/sec:           ~28 avg, ~200 peak (end-of-contest rush)
  Executions/sec:            200 * 50 = 10,000

Execution resources per submission:
  Time limit:                2 seconds per test case
  Memory limit:              256 MB per test case
  Worst-case wall time:      50 test cases * 2 sec = 100 seconds (but parallelized)
  Actual wall time:          50 test cases / 5 parallel workers * 2 sec = 20 seconds max

Worker requirements (contest peak, worst case):
  Submissions in flight:     200 submissions/sec * 20 sec = 4,000
  Concurrent executions:     4,000 submissions * 5 parallel test cases = 20,000
  Each execution: 1 CPU core + 256 MB RAM
  CPU cores needed:          20,000
  RAM needed:                20,000 * 256 MB = ~5 TB

Compilation:
  C++ compilation time:      2-10 seconds per submission
  Java compilation:          1-3 seconds
  Python: no compilation (interpreted)
  Compilation workers:       200/sec * 5 sec avg = 1,000 concurrent

The number that matters: up to 20,000 concurrent execution environments in the worst case (every test case running to its full 2-second limit), each isolated from the others, each limited to 2 seconds of CPU and 256 MB of RAM. Short-circuiting and sub-second average runtimes bring the realistic figure far lower, but this is still a massive sandboxing challenge.

Naive Design

Docker container per submission.

Flow:

1. Receive submission (language, code, problem_id).
2. docker run --rm -v "$PWD/code:/submit" language-image \
     g++ -O2 /submit/solution.cpp -o /submit/solution
3. For each test case:
   docker run --rm -i --memory=256m --cpus=1 -v "$PWD/code:/submit:ro" language-image \
     timeout 2 /submit/solution < test_input.txt > output.txt
4. diff output.txt expected_output.txt
5. Report verdict.

This actually works for a prototype. Docker provides filesystem isolation, memory limits (--memory), and CPU limits (--cpus). What could go wrong?

Where It Breaks

Problem 1: Docker Is Not a Security Boundary

Docker containers share the host kernel. A malicious submission can exploit kernel or container-runtime vulnerabilities to escape the container. CVE-2019-5736 allowed a container process to overwrite the host runc binary and gain root access on the host. CVE-2020-15257 exposed the containerd-shim API to containers sharing the host network namespace, which could be used to escalate privileges on the host. Docker is designed for packaging and deployment isolation, not security isolation of hostile code.

Problem 2: Container Startup Is Too Slow

docker run takes 500-1000ms to start a container (image pull, filesystem setup, namespace creation). For 10,000 executions/sec during a contest, you need 10,000 container starts/sec. Docker cannot handle this. You would need to pre-warm containers, but then you have 10,000 idle containers consuming memory.

Problem 3: Fork Bombs

A user submits:

import os
while True:
    os.fork()

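This doubles the number of processes on every iteration. Even with memory limits, each process consumes a PID and kernel memory. The host PID space is finite (kernel.pid_max defaults to 32,768 on many systems, though some distributions raise it to about 4 million), so without a per-container PID limit the fork bomb can exhaust it within seconds, taking down the host (no new processes can be created for any user).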

Problem 4: Compilation as Attack Vector

C++ templates are Turing-complete. A user can submit:

template<int N> struct Bomb { enum { value = Bomb<N-1>::value + Bomb<N-2>::value }; };
template<> struct Bomb<0> { enum { value = 0 }; };
template<> struct Bomb<1> { enum { value = 1 }; };
int main() { return Bomb<100>::value; }

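This is a relatively mild example: compilers cache each template instantiation, so Bomb<100> costs only about a hundred instantiations. More extreme template metaprogramming (for example, instantiations that fan out into exponentially many distinct types) can force the compiler to generate gigabytes of intermediate code, exhaust RAM, and crash the compilation server. The compilation phase is as dangerous as the execution phase and needs its own resource limits.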

Problem 5: Non-Deterministic Timing

Docker containers on a shared host experience CPU scheduling jitter. A solution that runs in 1.95 seconds on a quiet host might take 2.05 seconds on a busy host and get TLE. Users complain: "My solution is correct but the judge is too slow!" Timing must be deterministic.

Real Design

[Figure: Online Code Judge, high-level design]

Architecture Overview

┌──────────────┐
│  Web API     │ ── receives submissions, returns verdicts
└──────┬───────┘
       │
┌──────┴───────┐
│  Submission  │ ── queues, deduplicates, prioritizes
│  Queue       │
│  (Redis +    │
│   Kafka)     │
└──────┬───────┘
       │
┌──────┴───────────────────────────────────┐
│  Judge Workers (pool of N machines)       │
│                                           │
│  ┌─────────────────────────────────────┐  │
│  │  Sandbox Layer                      │  │
│  │  ┌──────────┐  ┌──────────┐        │  │
│  │  │ gVisor   │  │ gVisor   │  ...   │  │
│  │  │ sandbox  │  │ sandbox  │        │  │
│  │  │ (sub 1)  │  │ (sub 2)  │        │  │
│  │  └──────────┘  └──────────┘        │  │
│  │  OR                                 │  │
│  │  ┌──────────┐  ┌──────────┐        │  │
│  │  │Firecracker│  │Firecracker│ ...  │  │
│  │  │ microVM  │  │ microVM  │        │  │
│  │  └──────────┘  └──────────┘        │  │
│  └─────────────────────────────────────┘  │
│                                           │
│  ┌─────────────────────────────────────┐  │
│  │  Resource Controller                │  │
│  │  - cgroups v2 (CPU, memory, PIDs)   │  │
│  │  - seccomp (syscall filtering)      │  │
│  │  - time measurement (CPU clock)     │  │
│  └─────────────────────────────────────┘  │
└───────────────────────────────────────────┘

Component 1: Sandboxing -- gVisor vs. Firecracker

These are the two production-grade options for running untrusted code.

gVisor (Google):

gVisor implements a user-space kernel called Sentry. When the user's code makes a system call (e.g., open(), read(), write()), it does NOT reach the host Linux kernel. Instead, gVisor intercepts the syscall and implements it in user space.

Normal Docker:  user code -> syscall -> host kernel -> hardware
gVisor:         user code -> syscall -> gVisor Sentry (user space) -> limited host syscalls -> host kernel

Why is this safer? The host kernel is a massive attack surface (~30 million lines of code, hundreds of syscalls). gVisor reimplements the roughly 200 syscalls it exposes to user code (vs. 300+ in Linux) in a memory-safe language (Go), and the Sentry itself is restricted to a much smaller set of host syscalls. A kernel exploit that works against Linux generally does not work against gVisor, because the code handling the malicious syscall is not the Linux kernel.

Performance cost: gVisor adds 5-30% overhead for CPU-bound workloads and 2-10x overhead for I/O-heavy workloads (because every I/O syscall passes through the Sentry). For a code judge, the workload is mostly CPU-bound (algorithms), so 5-15% overhead is typical.

Firecracker (AWS):

Firecracker runs each submission in a lightweight virtual machine (microVM). Each microVM has its own kernel. A kernel exploit inside the VM cannot escape to the host because the VM is hardware-isolated (Intel VT-x / AMD-V).

Firecracker:   user code -> guest kernel -> KVM hypervisor -> host kernel

Startup time: Firecracker boots a microVM in ~125 ms (vs. ~500 ms for Docker). It can launch 150 microVMs per second per host. With pre-booted VM pools, startup is near-instant (hand a pre-warmed VM to the submission).

Performance cost: Near-native. Hardware virtualization adds < 5% overhead for CPU-bound workloads.

Trade-off:

Property	gVisor	Firecracker
Isolation level	User-space kernel (process-level)	Hardware VM (kernel-level)
Startup time	~50 ms (container + gVisor runtime)	~125 ms (microVM boot)
CPU overhead	5-15%	< 5%
I/O overhead	2-10x	Near-native
Security boundary	Syscall interception by a user-space kernel	Hardware virtualization
Used by	Google Cloud Run, GKE Sandbox	AWS Lambda, Kata Containers

Recommendation for a code judge: Firecracker for maximum security (hardware isolation is harder to escape than user-space syscall filtering). gVisor if you need faster startup and can accept slightly weaker isolation.

Component 2: Resource Limiting with cgroups v2

Even inside a sandbox, you must limit resources. cgroups v2 is the Linux mechanism for this.

CPU limit:

# Limit to 1 CPU core (100,000 microseconds per 100,000 microsecond period)
echo "100000 100000" > /sys/fs/cgroup/submission_1/cpu.max

Memory limit:

# Limit to 256 MB
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/submission_1/memory.max
# Disable swap to prevent eviction to disk (which would slow down, not kill, the process)
echo 0 > /sys/fs/cgroup/submission_1/memory.swap.max

PID limit (fork bomb prevention):

# Maximum 50 processes (main process + threads + children)
echo 50 > /sys/fs/cgroup/submission_1/pids.max

This is the fork bomb defense. pids.max = 50 means the fork bomb creates 49 children and then fork() returns EAGAIN. The bomb is contained. The host PID space is unaffected.

Disk I/O limit:

# Limit to 10 MB/s read, 5 MB/s write
echo "8:0 rbps=10485760 wbps=5242880" > /sys/fs/cgroup/submission_1/io.max
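
The four limits above can be collected into one place and written out together. The following is a minimal sketch, not a production controller: the function names `cgroup_settings` and `apply_settings` are hypothetical, the defaults simply mirror the values in this section, and the I/O limit is left out because its device number (`8:0`) is host-specific. Writing to a real cgroup v2 mount requires root; the helper can be pointed at a temporary directory for a dry run.

```python
from pathlib import Path


def cgroup_settings(quota_us=100_000, period_us=100_000,
                    memory_mb=256, max_pids=50):
    """Map cgroup v2 control-file names to the values to write into them."""
    return {
        "cpu.max": f"{quota_us} {period_us}",        # 1 core: quota == period
        "memory.max": str(memory_mb * 1024 * 1024),  # bytes
        "memory.swap.max": "0",                      # no swap: exceed RAM -> OOM kill
        "pids.max": str(max_pids),                   # fork bomb ceiling
    }


def apply_settings(cgroup_dir, settings):
    """Write each value into its control file under cgroup_dir.

    On a real host cgroup_dir would be something like
    /sys/fs/cgroup/submission_1 (assumed name, as in the text).
    """
    root = Path(cgroup_dir)
    root.mkdir(parents=True, exist_ok=True)
    for name, value in settings.items():
        (root / name).write_text(value + "\n")
```

Keeping the values in a dictionary makes it easy to use a different profile per language, for example a larger `max_pids` for runtimes that start many threads.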

Component 3: Compilation as a Separate Phase

Compilation is dangerous and must be sandboxed separately from execution.

Why separate?

Different resource profile: Compiling template-heavy C++ can take gigabytes of RAM and tens of seconds, while execution needs 256 MB and 2 seconds. Using execution limits for compilation would OOM the compiler.
Different attack surface: Compiler exploits (malicious includes, template bombs) are distinct from runtime exploits (syscall attacks, fork bombs).
Caching: If two users submit identical code, compile once and reuse the binary. Compilation is expensive; execution is cheap.

Compilation sandbox:

Resource limits:
  CPU time:   30 seconds
  Memory:     2 GB
  PIDs:       100 (compiler spawns child processes for linking)
  Disk write: 100 MB (compiled binary + temp files)
  Network:    NONE (no outbound connections during compilation)
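
As a rough illustration of the CPU, memory, and disk limits above, the sketch below applies POSIX rlimits to a child process before it executes the compiler. It is an approximation under stated assumptions: `RLIMIT_AS` caps virtual address space rather than resident memory, `RLIMIT_FSIZE` caps the size of each written file rather than total disk usage, and the PID and network restrictions are assumed to be enforced elsewhere (cgroups and a network namespace). The names `COMPILE_LIMITS` and `run_limited` are hypothetical.

```python
import resource
import subprocess
import sys

# Values taken from the limits listed above.
COMPILE_LIMITS = {
    resource.RLIMIT_CPU: 30,               # CPU seconds
    resource.RLIMIT_AS: 2 * 1024**3,       # 2 GB of address space
    resource.RLIMIT_FSIZE: 100 * 1024**2,  # 100 MB per written file
}


def _apply(limits):
    """Runs in the child between fork and exec."""
    for res, wanted in limits.items():
        _, hard = resource.getrlimit(res)
        # An unprivileged process may not raise a limit above the current hard limit.
        if hard != resource.RLIM_INFINITY:
            wanted = min(wanted, hard)
        resource.setrlimit(res, (wanted, wanted))


def run_limited(cmd, limits=COMPILE_LIMITS, wall_timeout=60):
    """Run cmd with the rlimits applied; returns a CompletedProcess."""
    return subprocess.run(cmd, capture_output=True, text=True,
                          timeout=wall_timeout,
                          preexec_fn=lambda: _apply(limits))


# Example with a harmless command standing in for the compiler.
result = run_limited([sys.executable, "-c", "print('compiled')"])
```

A child that exceeds its CPU limit is terminated by a signal, which shows up as a negative `returncode`; the judge would map that to a Compilation Error verdict.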

Filesystem isolation: The compilation sandbox has access to:

/usr/bin/g++, /usr/bin/javac (compiler binaries, read-only)
/usr/include, /usr/lib (standard libraries, read-only)
/submit/ (user's source code, read-only; output binary, write)
Nothing else. No /etc/passwd, no /proc, no /dev.

Compilation cache: Hash the source code together with the language and compiler flags. If the hash exists in the cache, skip compilation and use the cached binary. Only byte-identical resubmissions hit, because the hash covers the entire source and a one-character difference misses. Expect a hit rate of roughly 20% during contests, mostly from identical code being submitted more than once.
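
A minimal sketch of such a cache key, assuming SHA-256 and a hypothetical helper named `cache_key`. Each field is length-prefixed so that different splits of the same bytes cannot produce the same key. A real deployment would probably also mix in the compiler version, which is not shown here.

```python
import hashlib


def cache_key(source: str, language: str, flags: list[str]) -> str:
    """Hash language, flags, and full source into one hex digest."""
    h = hashlib.sha256()
    for part in (language, "\0".join(flags), source):
        data = part.encode("utf-8")
        # Length prefix keeps ("ab", "c") and ("a", "bc") from colliding.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


key = cache_key("int main() { return 0; }", "cpp", ["-O2", "-std=c++17"])
```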

Component 4: Test Case Execution Pipeline

After compilation, the binary runs against test cases.

Sequential execution (simpler, used by most judges):

for test_case in test_cases:
    result = run_in_sandbox(binary, test_case.input, time_limit, memory_limit)
    if result.status != ACCEPTED:
        return result  # Short-circuit on first failure
return ACCEPTED

Short-circuit: Most judges stop on the first failed test case. If test case 3 fails with Wrong Answer, they do not run test cases 4-50. This saves compute. LeetCode shows which test case failed.

Parallel execution (used during contests for speed):

Run 5-10 test cases in parallel, each in its own sandbox. If any fails, cancel the rest.

Parallel execution math:
  50 test cases, 5 parallel workers
  Each test case: 2 sec max
  Sequential: 50 * 2 = 100 sec worst case
  Parallel (5 workers): 10 rounds * 2 sec = 20 sec worst case
  With short-circuit: typically 2-6 sec (fails early on wrong submissions)
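
The two strategies can be sketched as follows. This is an illustrative outline only: `run_case` is a hypothetical callable that runs one test case in a sandbox and returns a `Result`, and the parallel version works in rounds of `workers` cases, matching the "10 rounds" arithmetic above. Cases that have already started in a round are allowed to finish.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Result:
    status: str      # "AC", "WA", "TLE", "MLE", or "RE"
    test_index: int


def judge_sequential(run_case, num_cases):
    """Run cases in order and stop at the first non-AC result."""
    for i in range(num_cases):
        result = run_case(i)
        if result.status != "AC":
            return result
    return Result("AC", num_cases - 1)


def judge_parallel(run_case, num_cases, workers=5):
    """Run cases in rounds of `workers`; return the first failure found."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, num_cases, workers):
            batch = range(start, min(start + workers, num_cases))
            # pool.map yields results in input order, so the failure
            # reported is the lowest-numbered one in this round.
            for result in pool.map(run_case, batch):
                if result.status != "AC":
                    return result
    return Result("AC", num_cases - 1)
```

Threads are enough here because each `run_case` call would spend its time waiting on an external sandbox process rather than executing Python code.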

Component 5: Deterministic Timing

Users complain when their solution passes on their machine but gets TLE on the judge. Timing must be reproducible.

Problem: CPU scheduling on a shared host introduces jitter. Two runs of the same code can differ by 20-50%.

Solution 1: CPU pinning.

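Pin each execution to a specific CPU core using taskset, and keep everything else off that core (for example with the isolcpus boot parameter or a dedicated cpuset). taskset alone only restricts where the submission runs; it does not stop other processes from being scheduled there. Together these eliminate scheduling jitter from other workloads.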

taskset -c 3 /sandbox/solution < input.txt

Solution 2: Measure CPU time, not wall time.

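Use getrusage() to measure CPU time (time the process actually spent on a core) instead of wall clock time (which includes time waiting for I/O and scheduling). Count user plus system time so that work done inside syscalls is also charged. CPU time is far less sensitive to host load than wall time, though cache and memory-bandwidth contention still cause small variations. Keep a looser wall-clock limit as well, because a sleeping process accumulates no CPU time.

#include <sys/resource.h>
struct rusage usage;
getrusage(RUSAGE_CHILDREN, &usage);  /* covers children that have already been waited for */
long cpu_time_ms = (usage.ru_utime.tv_sec + usage.ru_stime.tv_sec) * 1000
                 + (usage.ru_utime.tv_usec + usage.ru_stime.tv_usec) / 1000;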
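
The same measurement can be sketched in Python with the standard `resource` module. The helper name `run_and_measure` is hypothetical, and the sketch charges user plus system time, which is an assumption that goes slightly beyond user time alone. Because `RUSAGE_CHILDREN` accumulates over every child that has been waited for, the code takes a before/after difference; that is only accurate if no other child exits in between.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(cmd, stdin_data=b""):
    """Run one child process and return (cpu_ms, wall_ms)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    t0 = time.monotonic()
    subprocess.run(cmd, input=stdin_data, stdout=subprocess.DEVNULL, check=False)
    wall_ms = (time.monotonic() - t0) * 1000
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # Delta of user + system CPU seconds consumed by the child we just reaped.
    cpu_s = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return cpu_s * 1000, wall_ms


# A child that only sleeps uses far less CPU time than wall time.
cpu_ms, wall_ms = run_and_measure(
    [sys.executable, "-c", "import time; time.sleep(0.3)"])
```

The sleeping child illustrates why a judge needs both numbers: CPU time for the verdict, and a wall-clock cap so an idle process cannot hold a worker forever.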

Solution 3: Dedicated judge machines.

Run judge workers on dedicated machines that do not host any other workload. This is what Codeforces does during contests -- they provision bare-metal servers dedicated to judging.

LeetCode's approach: LeetCode normalizes execution times using a reference benchmark. They run a calibration program on each judge machine and compute a "speed factor" (for instance, the machine's calibration runtime divided by the runtime on a reference machine, so that a slower machine has a factor above 1). User execution times are divided by the speed factor to produce a normalized time. This allows different machine types in the judge pool.
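
One way to read this normalization, as a sketch under an assumed definition: the speed factor is the calibration program's runtime on this machine divided by its runtime on a reference machine, so a slower machine has a factor above 1. The function names are hypothetical and the exact formula any real judge uses is not given in the text.

```python
def speed_factor(calibration_ms_here: float, calibration_ms_reference: float) -> float:
    """Above 1.0 means this machine is slower than the reference."""
    return calibration_ms_here / calibration_ms_reference


def normalized_ms(measured_ms: float, factor: float) -> float:
    """Convert a time measured on this machine to reference-machine time."""
    return measured_ms / factor


# A machine that needs 1250 ms for a benchmark the reference runs in 1000 ms.
factor = speed_factor(1250, 1000)           # 1.25
reported = normalized_ms(2000, factor)      # 1600.0
```

Under this definition a solution measured at 2,000 ms on the slower machine is reported as 1,600 ms, the time it would be expected to take on the reference machine.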

Component 6: Codeforces Pretests/Systests Pattern

During a Codeforces contest, submissions are not judged against the full test suite.

Pretests (during contest):

A small subset of test cases (5-15) designed to catch common errors.
Fast judging: < 5 seconds per submission.
Purpose: give contestants immediate feedback.
NOT definitive: passing pretests does not guarantee correctness.

Systests (after contest ends):

Full test suite (50-200 test cases) including edge cases, stress tests, and anti-hack tests.
Takes 10-30 minutes to judge all submissions.
This is when the real verdict is determined. Many submissions that passed pretests fail systests.

Why this pattern?

During a 2-hour contest with 10K participants, each submitting 5 times per problem across 5 problems = 250K submissions. Judging each against 200 test cases = 50M executions. At 2 seconds each = 100M seconds of CPU time = 27,778 CPU-hours. In 2 hours, you need 13,889 CPU cores. With pretests (15 test cases), you need 15/200 * 13,889 = 1,042 cores. A 13x reduction in peak compute.
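
The arithmetic in the paragraph above can be checked with a few lines. This is only a calculator for the stated assumptions (2 seconds per test, a 2-hour window); `cores_needed` is a hypothetical helper name.

```python
import math


def cores_needed(submissions, tests_per_submission, sec_per_test, window_hours):
    """Cores required to finish all executions within the window."""
    cpu_seconds = submissions * tests_per_submission * sec_per_test
    return math.ceil(cpu_seconds / (window_hours * 3600))


submissions = 10_000 * 5 * 5                           # users * subs/problem * problems
full_suite = cores_needed(submissions, 200, 2, 2)      # 13889
pretests = cores_needed(submissions, 15, 2, 2)         # 1042
reduction = 200 / 15                                   # about 13.3x
```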

System design implications: The judge system has two modes:

Contest mode: Use pretests. Optimize for latency (< 5 seconds per verdict). Queue priority: contest submissions first.
Post-contest mode: Run systests. Optimize for throughput. Process in batch. Priority: none (batch processing).

Deep Dives

[Figure: Online Code Judge, sandbox]

Deep Dive 1: Syscall Filtering with seccomp

Beyond gVisor/Firecracker, an additional defense layer is seccomp (Secure Computing Mode), which filters system calls at the kernel level.

seccomp-bpf: A BPF (Berkeley Packet Filter) program that runs before every syscall. Based on the syscall number and arguments, it can allow the call, fail it with an error code, or kill the process.

Allowlist for a code judge:

ALLOWED syscalls:
  read, write, exit, exit_group        -- basic I/O
  brk, mmap, munmap, mprotect          -- memory management
  clock_gettime                         -- timing
  futex                                 -- threading

DENIED syscalls (kill process if attempted):
  execve     -- no spawning new processes (prevents shell escapes)
  fork, vfork, clone -- no new processes (fork bomb prevention at syscall level); permit clone only with CLONE_THREAD if the language runtime needs threads
  socket, connect, bind -- no networking
  open, openat (with write flags) -- no writing to filesystem
  ptrace     -- no debugging other processes
  mount      -- no mounting filesystems
  reboot     -- obviously not

Defense in depth: seccomp is the last line of defense after cgroups, namespaces, and gVisor/Firecracker. Even if the sandbox is compromised, seccomp prevents the most dangerous syscalls from reaching the kernel.
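
For container runtimes that take a JSON seccomp profile, the allowlist above could be expressed roughly as follows. This is a sketch that assumes the Docker/OCI profile layout (`defaultAction` plus a `syscalls` list) and a runtime that accepts `SCMP_ACT_KILL_PROCESS`. The allowlist is copied from this section and is almost certainly too small to start a real language runtime, which typically needs additional calls during startup; `seccomp_profile` is a hypothetical helper name.

```python
import json

# Syscalls taken from the ALLOWED list in this section.
ALLOWED = [
    "read", "write", "exit", "exit_group",
    "brk", "mmap", "munmap", "mprotect",
    "clock_gettime", "futex",
]


def seccomp_profile(allowed=ALLOWED):
    """Build an allowlist profile: anything not listed kills the process."""
    return {
        "defaultAction": "SCMP_ACT_KILL_PROCESS",
        "syscalls": [
            {"names": sorted(allowed), "action": "SCMP_ACT_ALLOW"},
        ],
    }


profile_json = json.dumps(seccomp_profile(), indent=2)
```

Expressing the policy as a default-deny allowlist means newly added kernel syscalls are blocked automatically, without anyone having to remember to deny them.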

Deep Dive 2: Language-Specific Challenges

Each programming language has unique sandboxing challenges.

Python:

import os; os.system("rm -rf /") -- must block os.system, subprocess, ctypes.
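Solution: seccomp blocks fork and execve. os.system needs both to start a shell, so the attempt is stopped: the process is killed under the kill-on-violation policy above, or the call fails with EPERM if the filter is configured to return an error instead.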
BUT: Python's ctypes module can call libc functions directly, bypassing Python-level restrictions. seccomp at the kernel level is the only reliable defense.

Java:

The JVM itself is a large runtime. It needs 100+ MB of RAM just to start.
JVM startup time: 500-2000 ms. For a 2-second time limit, this is a significant fraction.
Solution: pre-warm JVMs. Keep a pool of started JVMs waiting for submissions. Load user code via classloader.
Java's SecurityManager (deprecated for removal in JDK 17 by JEP 411) was historically used for sandboxing. With its deprecation, external sandboxing (gVisor, seccomp) is essential.

C/C++:

Inline assembly allows direct syscall invocation, bypassing libc. syscall(SYS_socket, ...) can create a network socket even if socket() is restricted at the library level.
Solution: seccomp at the kernel level catches ALL syscalls, regardless of how they are invoked.
#include </etc/shadow> -- the preprocessor reads arbitrary files during compilation. The compilation sandbox must restrict filesystem access.

Go, Rust:

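Go produces statically linked binaries by default (when cgo is not used), and Rust can do the same with a musl target; by default Rust still links libc dynamically. With few or no shared-library dependencies, both are clean to sandbox.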
Go's goroutines use threads internally, so pids.max must be generous (200+) for Go programs.

Deep Dive 3: Submission Queue and Priority

During a contest, the judge must balance fairness, latency, and throughput.

Priority scheme:

Priority 1 (highest): Contest submissions from users who have not received any verdict yet.
                       These users are blocked, waiting for feedback.
Priority 2:           Contest submissions from users who have at least one verdict.
                       They can work on other problems while waiting.
Priority 3:           Practice submissions (non-contest).
Priority 4:           Custom test case runs (lowest priority).

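Queue implementation: Redis sorted set where the score is (priority * 10^12) + submission_timestamp, with the timestamp in seconds. (A millisecond Unix timestamp already exceeds 10^12 and would spill into the next priority band; use a larger multiplier such as 10^13 in that case.) Lower score = higher priority. Workers ZPOPMIN from the sorted set to get the next submission to judge.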
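
The scoring scheme can be sketched without a Redis server by using a heap as a stand-in for the sorted set. The class `LocalQueue` and the function `score` are hypothetical names for illustration; in Redis the `add` step would be a ZADD with the same score and `pop_min` would be ZPOPMIN. The sketch assumes timestamps in whole seconds, which keeps every score well below 2^53 and therefore exactly representable as a Redis double.

```python
import heapq

PRIORITY_BAND = 10**12  # multiplier from the text; assumes timestamps in seconds


def score(priority: int, submitted_at_s: int) -> int:
    """Lower score means judged sooner: priority first, then submission time."""
    if not 0 <= submitted_at_s < PRIORITY_BAND:
        raise ValueError("timestamp would spill into the next priority band")
    return priority * PRIORITY_BAND + submitted_at_s


class LocalQueue:
    """In-memory stand-in for a sorted set with ZADD / ZPOPMIN."""

    def __init__(self):
        self._heap = []

    def add(self, submission_id, priority, submitted_at_s):
        heapq.heappush(self._heap, (score(priority, submitted_at_s), submission_id))

    def pop_min(self):
        return heapq.heappop(self._heap)[1]
```

Within one priority level the timestamp term gives first-come, first-served ordering, which is what keeps the queue fair during a contest.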

Rate limiting: During contests, limit each user to 1 pending submission per problem. If they have a submission being judged, they cannot submit again until the verdict is returned. This prevents a single user from flooding the queue.

Rejudging: If a test case is found to be wrong (incorrect expected output), the system must rejudge all submissions for that problem. This is a batch operation that runs at Priority 3, after current contest submissions.

Alternative Designs

Approach	Pros	Cons	When to Use
gVisor + cgroups + seccomp (described above)	Strong isolation. Fast startup. Production-proven at Google.	gVisor overhead (5-15%). Complex setup.	LeetCode, HackerRank, production judges.
Firecracker microVMs	Hardware-level isolation. Near-native performance.	Higher startup time (125 ms). More memory overhead per VM.	Maximum security environments. AWS Lambda uses this.
Docker with seccomp profile	Simple. Familiar tooling. Adequate for low-risk environments.	Docker is NOT a security boundary. Container escape CVEs exist.	Internal tools, low-stakes coding challenges, prototype judges.
Remote code execution API (Judge0, Sphere Engine)	Fully managed. Zero infrastructure. API-based.	Latency (network round trip). Cost per execution. Rate limits. Less control.	Startups, hackathon projects, when judging is not core product.
WebAssembly (Wasm) sandbox	Memory-safe by design. No syscall access. Fast startup (< 1 ms).	Limited language support (C/C++/Rust compile to Wasm natively, but Java/Python do not). No filesystem or network access.	Browser-based judges. When you need the strongest sandbox with the narrowest language support.

Scaling Math Verification

Judge Worker Pool

Contest peak:                 200 submissions/sec
Test cases per submission:    15 (pretests during contest)
Executions/sec:               200 * 15 = 3,000
Execution time per test:      2 sec max, 0.5 sec avg
Concurrent executions:        3,000 * 0.5 = 1,500 concurrent
CPU cores per execution:      1
Workers needed:               1,500 cores / 16 cores per machine = 94 machines

With Firecracker:
  Memory per VM:              256 MB (submission) + 100 MB (VM overhead) = 356 MB
  VMs per machine (64 GB):    64,000 / 356 = ~180 concurrent VMs
  But CPU-limited to 16       -> 16 concurrent VMs per machine
  Machines needed:            1,500 / 16 = 94 machines (CPU-bound, not memory-bound)
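
The sizing above follows from Little's law (concurrent work = arrival rate * time in system) plus a check of which resource runs out first. A small calculator under the same assumptions, with the hypothetical name `machines_needed`:

```python
import math


def machines_needed(subs_per_sec, tests_per_sub, avg_sec_per_test,
                    cores_per_machine=16, ram_gb=64, mb_per_vm=356):
    """Return (machines, VMs per machine, VMs that would fit in RAM)."""
    concurrent = subs_per_sec * tests_per_sub * avg_sec_per_test  # Little's law
    fit_by_cpu = cores_per_machine               # one VM per core
    fit_by_mem = (ram_gb * 1000) // mb_per_vm    # decimal GB, as in the text
    per_machine = min(fit_by_cpu, fit_by_mem)
    return math.ceil(concurrent / per_machine), per_machine, fit_by_mem


machines, per_machine, fit_by_mem = machines_needed(200, 15, 0.5)
```

With these inputs memory would allow about 179 VMs per machine but the 16 cores allow only 16, so the pool is CPU-bound and comes out at 94 machines.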

Systests (Post-Contest)

Total submissions to systest:  250,000 (10K users * 5 subs/problem * 5 problems)
Test cases per submission:     200 (full suite)
Total executions:              250K * 200 = 50 million
Execution time:                0.5 sec avg
Total CPU time:                25M seconds = 6,944 CPU-hours
Time budget:                   1 hour (users want results quickly)
Cores needed:                  6,944 cores
Machines (16 cores):           434 machines

Cloud cost:                    434 * $0.10/hr * 1 hr = $43.40 per contest systest (assuming $0.10 per machine-hour, an optimistic rate for a 16-core machine)

Storage

Submission source code:       ~5 KB average
Compiled binary:              ~1 MB average
Test case data:               ~50 KB per test case * 50 test cases * 1,000 problems = 2.5 GB
Submission metadata:          ~500 bytes per submission

Daily submissions:            ~100K
Daily storage:                100K * (5 KB + 1 MB + 500 B) = ~100 GB
Annual:                       36 TB
Retention policy:             Keep source code forever. Delete compiled binaries after 30 days.
Annual after cleanup:         ~0.2 TB/year of source code and metadata, plus a rolling ~3 TB of compiled binaries (30 days * 100 GB)

Failure Analysis

Failure	Impact	Mitigation
Sandbox escape	Attacker gains access to host. Can read other submissions, modify judge results, or pivot to other systems.	Defense in depth: gVisor/Firecracker + seccomp + cgroups + namespace isolation. Regular security audits. Bug bounty program. Immediate container/VM termination on anomaly.
Fork bomb exhausts host PIDs	Judge host cannot create new processes. All submissions on that host fail.	`pids.max` in cgroups limits per-submission PID count. seccomp blocks `fork()` entirely for single-threaded problems.
Compilation bomb (template explosion)	Compiler OOMs. Compilation worker crashes.	Memory and time limits on compilation (2 GB, 30 seconds). Kill compiler process on limit exceeded. Return CE verdict.
Judge worker crashes mid-execution	Submission gets no verdict. User waits indefinitely.	Timeout on the submission queue. If no verdict within 60 seconds, mark as "judge error" and requeue on a different worker.
Non-deterministic timing	Correct solution gets TLE on one run but passes on another. User complains.	CPU pinning + CPU time measurement (not wall time). Dedicated judge machines. Timing calibration factor. Allow 2 retries for borderline TLE.
Test case is wrong	Correct submissions marked as Wrong Answer.	Manual test case validation before contests. Allow contestants to challenge test cases (Codeforces "hack" system). Rejudge affected submissions.
Queue flooding during contest	One user submits 100 times per minute. Queue backs up for everyone.	Rate limit: 1 pending submission per user per problem. Queue priority: first-time submissions before retries.

Level Expectations

Level	What the Interviewer Expects
Mid (L4)	Docker container per submission. Time and memory limits via Docker flags. Sequential test case execution. Basic pass/fail verdict. Knows sandboxing is important but cannot articulate specific threats.
Senior (L5)	gVisor or Firecracker for sandbox isolation (not just Docker). cgroups v2 for resource limits including pids.max for fork bomb prevention. Separate compilation and execution phases with different resource limits. seccomp for syscall filtering. Queue with priority during contests. Deterministic timing via CPU time measurement.
Staff+ (L6)	Specific CVEs for container escape (CVE-2019-5736) and why Docker alone is insufficient. Compilation as an attack vector (template bombs, preprocessor file inclusion). Codeforces pretests/systests pattern with quantified compute savings. Language-specific sandboxing challenges (Python ctypes, Java SecurityManager deprecation, C inline assembly). Timing calibration across heterogeneous judge machines. WebAssembly as an alternative sandbox with trade-off analysis.

