Skip to content

Find Top N Largest Files

Level: L3-L4 Topics: Trees, Heaps, BFS/DFS, API Design

Problem Statement

You are given a cloud object storage bucket URI (e.g., gs://my-bucket/). Write a program to find the top 10 largest files in the bucket.

The bucket has a hierarchical structure of directories and files. You are provided with two API calls:

List<String> storage.list(String uri)
    // Returns the immediate children (files and directories) of the given URI.
    // Directories end with '/'.

long storage.size(String uri)
    // Returns the size of a file in bytes.
    // Returns -1 if the URI is a directory.

Background & Constraints

  • The bucket can contain millions of files nested in a deep directory hierarchy.
  • Each API call has ~50ms latency. This is the dominant cost — minimize the number of sequential API calls.
  • storage.list() returns only immediate children, not recursive contents.
  • Directory URIs end with / (e.g., gs://my-bucket/data/), file URIs do not.
  • The output should be the 10 largest files with their sizes, sorted by size descending.

Examples

Bucket structure:

gs://my-bucket/
├── readme.txt          (1 KB)
├── data/
│   ├── train.csv       (2.5 GB)
│   ├── test.csv        (800 MB)
│   └── models/
│       ├── v1.bin       (4.1 GB)
│       └── v2.bin       (4.3 GB)
├── logs/
│   ├── 2024-01.log     (500 MB)
│   └── 2024-02.log     (600 MB)
└── backup.tar.gz       (10 GB)

Output (top 3 for brevity):

1. gs://my-bucket/backup.tar.gz         — 10.0 GB
2. gs://my-bucket/data/models/v2.bin    — 4.3 GB
3. gs://my-bucket/data/models/v1.bin    — 4.1 GB

Hints & Common Pitfalls

  • Naive approach: Recursive DFS — list a directory, recurse into subdirectories, call size() on each file, maintain a min-heap of size 10. This works but is very slow due to sequential API calls.

  • Parallelism is key: Since each API call takes 50ms, making them sequentially on a large bucket is prohibitively slow. Use a thread pool or async I/O to issue multiple list() and size() calls concurrently. A BFS approach with concurrent expansion of all directories at each level works well.

  • Heap for top-N: Use a min-heap of size 10. For each file, if its size is larger than the heap's minimum, replace the minimum. This keeps memory constant regardless of bucket size.

  • Distinguishing files from directories: Check if the URI ends with / (directory) or use the return value of size() (-1 for directories).

  • Common mistake: Calling size() on directories, or forgetting to handle empty directories (where list() returns an empty list).

Follow-Up Questions

  1. Top 10 largest directories: Instead of files, find the top 10 largest directories (where a directory's size is the sum of all files recursively contained in it). How does this change the traversal strategy? Can you still prune early?

  2. Latency optimization: If you have access to 100 threads, what is the theoretical minimum wall-clock time to scan a bucket with depth D and branching factor B? How close can your solution get?

  3. Incremental updates: If the bucket changes over time, how would you efficiently update the top-10 list without rescanning the entire bucket?

  4. Memory constraints: If the directory listing for a single directory is very large (millions of entries), how do you handle pagination in the list() API?