Supercomputer Schedule

Level: L3-L5 Topics: API Design, Data Structures, Heaps, Bitmaps, System Design

Problem Statement

Design a task scheduler for a supercomputer with N cores numbered 0 to N-1. Implement the following API:

assign_task() -> core_id: Find the lowest-numbered free core, mark it as busy, and return its ID. If no core is free, return -1 (or block, depending on design choice).
return_core(core_id): Mark the given core as free again.

Your solution should optimize for worst-case performance, not just average case. Assume single-threaded execution initially.

Background & Constraints

N can be large (e.g., thousands or millions of cores).
Operations should be as efficient as possible in the worst case.
The scheduler must always assign the lowest-numbered available core.
Initially, all cores are free.
return_core will only be called with a currently busy core ID.

Examples

N = 4 (cores: 0, 1, 2, 3 — all free)

assign_task() → 0   (core 0 now busy)
assign_task() → 1   (core 1 now busy)
return_core(0)       (core 0 now free)
assign_task() → 0   (core 0 is lowest free, now busy again)
assign_task() → 2   (cores 0,1 busy; core 2 is next)
assign_task() → 3
assign_task() → -1   (all cores busy)
return_core(1)
assign_task() → 1

Hints & Common Pitfalls

A simple boolean array gives O(N) for assign_task (linear scan) and O(1) for return_core. Can you do better?
Consider using a min-heap (priority queue) of free core IDs. This gives O(log N) for both operations and always yields the smallest available ID.
A sorted set (e.g., TreeSet in Java, std::set in C++) also works and provides O(log N) with easy access to the minimum.
Be careful about the "lowest-numbered" requirement — a regular queue or stack will not satisfy this.
Think about what happens when return_core is called with an already-free core. Should you guard against that?

Follow-Up Questions

Balance run time across cores: Instead of always assigning the lowest-numbered core, assign the core that has been used the least (cumulative run time). How does this change your data structure? Consider a min-heap keyed by total usage time, with core ID as a tiebreaker.
Time-out mechanism: Tasks have a maximum allowed duration. If a task exceeds its time limit, the core should be automatically freed. How would you implement periodic cleanup or lazy expiration?
Real-world optimizations (L5): For very large N, consider a bitmap with hierarchical indexing. A segment tree over a bit array can find the first free core in O(log N) while using compact memory. How would you design this?
System design extension (L5): If this scheduler runs as a distributed service handling requests from thousands of clients, how would you handle concurrency, consistency, and fault tolerance? What if a client crashes without returning a core?