Find Top N Largest Files
Level: L3-L4 Topics: Trees, Heaps, BFS/DFS, API Design
Problem Statement
You are given a cloud object storage bucket URI (e.g., gs://my-bucket/). Write a program to find the top 10 largest files in the bucket.
The bucket has a hierarchical structure of directories and files. You are provided with two API calls:
List<String> storage.list(String uri)
// Returns the immediate children (files and directories) of the given URI.
// Directories end with '/'.
long storage.size(String uri)
// Returns the size of a file in bytes.
// Returns -1 if the URI is a directory.
Background & Constraints
- The bucket can contain millions of files nested in a deep directory hierarchy.
- Each API call has ~50ms latency. This is the dominant cost — minimize the number of sequential API calls.
storage.list()returns only immediate children, not recursive contents.- Directory URIs end with
/(e.g.,gs://my-bucket/data/), file URIs do not. - The output should be the 10 largest files with their sizes, sorted by size descending.
Examples
Bucket structure:
gs://my-bucket/
├── readme.txt (1 KB)
├── data/
│ ├── train.csv (2.5 GB)
│ ├── test.csv (800 MB)
│ └── models/
│ ├── v1.bin (4.1 GB)
│ └── v2.bin (4.3 GB)
├── logs/
│ ├── 2024-01.log (500 MB)
│ └── 2024-02.log (600 MB)
└── backup.tar.gz (10 GB)
Output (top 3 for brevity):
1. gs://my-bucket/backup.tar.gz — 10.0 GB
2. gs://my-bucket/data/models/v2.bin — 4.3 GB
3. gs://my-bucket/data/models/v1.bin — 4.1 GB
Hints & Common Pitfalls
-
Naive approach: Recursive DFS — list a directory, recurse into subdirectories, call
size()on each file, maintain a min-heap of size 10. This works but is very slow due to sequential API calls. -
Parallelism is key: Since each API call takes 50ms, making them sequentially on a large bucket is prohibitively slow. Use a thread pool or async I/O to issue multiple
list()andsize()calls concurrently. A BFS approach with concurrent expansion of all directories at each level works well. -
Heap for top-N: Use a min-heap of size 10. For each file, if its size is larger than the heap's minimum, replace the minimum. This keeps memory constant regardless of bucket size.
-
Distinguishing files from directories: Check if the URI ends with
/(directory) or use the return value ofsize()(-1 for directories). -
Common mistake: Calling
size()on directories, or forgetting to handle empty directories (wherelist()returns an empty list).
Follow-Up Questions
-
Top 10 largest directories: Instead of files, find the top 10 largest directories (where a directory's size is the sum of all files recursively contained in it). How does this change the traversal strategy? Can you still prune early?
-
Latency optimization: If you have access to 100 threads, what is the theoretical minimum wall-clock time to scan a bucket with depth D and branching factor B? How close can your solution get?
-
Incremental updates: If the bucket changes over time, how would you efficiently update the top-10 list without rescanning the entire bucket?
-
Memory constraints: If the directory listing for a single directory is very large (millions of entries), how do you handle pagination in the
list()API?