Find Strings with Prefix
Level: L4 Topics: Binary Search, Trie, String Matching, Sorted Arrays
Problem Statement
Given a sorted array of strings and a prefix string, find the count of strings in the array that start with the given prefix.
Background & Constraints
- The array is sorted in lexicographic order.
- The array can be very large (millions of strings).
- The prefix can be any string (including empty, which matches all strings).
- You need an efficient solution -- linear scan is not acceptable.
Examples
Example 1:
array = ["apple", "application", "apply", "banana", "band", "bandana", "cat"]
prefix = "app"
Result: 3 ("apple", "application", "apply")
Example 2:
array = ["apple", "application", "apply", "banana", "band", "bandana", "cat"]
prefix = "ban"
Result: 3 ("banana", "band", "bandana")
Example 3:
array = ["apple", "application", "apply", "banana", "band", "bandana", "cat"]
prefix = "dog"
Result: 0
Hints & Common Pitfalls
-
Binary search is the natural approach for a sorted array. Find the first string that is >= the prefix and the first string that is >= the prefix with its last character incremented. The count is the difference between these two positions.
-
Finding the lower bound: Use binary search to find the first index where
array[i] >= prefix. All strings with the prefix will start at or after this index. -
Finding the upper bound: You need the first index where strings no longer start with the prefix. One approach: construct the "next prefix" by incrementing the last character of the prefix (e.g.,
"app"becomes"apq"), then binary search for that. The count isupper - lower. -
Edge case with incrementing: If the prefix ends with
'z', incrementing is not as simple. You may need to drop the last character and increment the second-to-last, and so on. Alternatively, use a comparison function that directly checks the prefix relationship. -
Alternative upper bound approach: Binary search for the last string that starts with the prefix, then the count is
last - first + 1. Finding the "last" requires a modified binary search that checks prefix membership.
Follow-Up Questions
-
If the array is unsorted and you will receive hundreds of millions of prefix queries, what data structure would you build during preprocessing?
-
Trie-based solution: Build a Trie where each node stores the count of strings in its subtree. To answer a prefix query, traverse the Trie following the prefix characters and return the count stored at the final node. What is the time complexity per query?
-
What are the trade-offs between the binary search approach (on a sorted array) and the Trie approach? Consider memory usage, build time, query time, and cache performance.
-
How would you handle the case where strings can be added or removed dynamically? Which approach (sorted array vs. Trie) adapts more easily?
-
If the strings are very long and share many common prefixes, does a Trie save significant memory compared to storing all strings independently? What about a compressed Trie (Patricia/Radix tree)?