Find Top N Largest Files

Level: L3-L4 Topics: Trees, Heaps, BFS/DFS, API Design

Problem Statement

You are given a cloud object storage bucket URI (e.g., gs://my-bucket/). Write a program to find the top 10 largest files in the bucket.

The bucket has a hierarchical structure of directories and files. You are provided with two API calls:

List<String> storage.list(String uri)
    // Returns the immediate children (files and directories) of the given URI.
    // Directories end with '/'.

long storage.size(String uri)
    // Returns the size of a file in bytes.
    // Returns -1 if the URI is a directory.

Background & Constraints

The bucket can contain millions of files nested in a deep directory hierarchy.
Each API call has ~50ms latency. This is the dominant cost — minimize the number of sequential API calls.
storage.list() returns only immediate children, not recursive contents.
Directory URIs end with / (e.g., gs://my-bucket/data/), file URIs do not.
The output should be the 10 largest files with their sizes, sorted by size descending.

Examples

Bucket structure:

gs://my-bucket/
├── readme.txt          (1 KB)
├── data/
│   ├── train.csv       (2.5 GB)
│   ├── test.csv        (800 MB)
│   └── models/
│       ├── v1.bin       (4.1 GB)
│       └── v2.bin       (4.3 GB)
├── logs/
│   ├── 2024-01.log     (500 MB)
│   └── 2024-02.log     (600 MB)
└── backup.tar.gz       (10 GB)

Output (top 3 for brevity):

1. gs://my-bucket/backup.tar.gz         — 10.0 GB
2. gs://my-bucket/data/models/v2.bin    — 4.3 GB
3. gs://my-bucket/data/models/v1.bin    — 4.1 GB

Hints & Common Pitfalls

Naive approach: Recursive DFS — list a directory, recurse into subdirectories, call size() on each file, maintain a min-heap of size 10. This works but is very slow due to sequential API calls.
Parallelism is key: Since each API call takes 50ms, making them sequentially on a large bucket is prohibitively slow. Use a thread pool or async I/O to issue multiple list() and size() calls concurrently. A BFS approach with concurrent expansion of all directories at each level works well.
Heap for top-N: Use a min-heap of size 10. For each file, if its size is larger than the heap's minimum, replace the minimum. This keeps memory constant regardless of bucket size.
Distinguishing files from directories: Check if the URI ends with / (directory) or use the return value of size() (-1 for directories).
Common mistake: Calling size() on directories, or forgetting to handle empty directories (where list() returns an empty list).

Follow-Up Questions

Top 10 largest directories: Instead of files, find the top 10 largest directories (where a directory's size is the sum of all files recursively contained in it). How does this change the traversal strategy? Can you still prune early?
Latency optimization: If you have access to 100 threads, what is the theoretical minimum wall-clock time to scan a bucket with depth D and branching factor B? How close can your solution get?
Incremental updates: If the bucket changes over time, how would you efficiently update the top-10 list without rescanning the entire bucket?
Memory constraints: If the directory listing for a single directory is very large (millions of entries), how do you handle pagination in the list() API?