Find Strings with Prefix

Level: L4 Topics: Binary Search, Trie, String Matching, Sorted Arrays

Problem Statement

Given a sorted array of strings and a prefix string, find the count of strings in the array that start with the given prefix.

Example 1:

array  = ["apple", "application", "apply", "banana", "band", "bandana", "cat"]
prefix = "app"

Result: 3  ("apple", "application", "apply")

Example 2:

array  = ["apple", "application", "apply", "banana", "band", "bandana", "cat"]
prefix = "ban"

Result: 3  ("banana", "band", "bandana")

Example 3:

array  = ["apple", "application", "apply", "banana", "band", "bandana", "cat"]
prefix = "dog"

Result: 0

Binary search is the natural approach for a sorted array. Find the first string that is >= the prefix and the first string that is >= the prefix with its last character incremented. The count is the difference between these two positions.
Finding the lower bound: Use binary search to find the first index where array[i] >= prefix. All strings with the prefix will start at or after this index.
Finding the upper bound: You need the first index where strings no longer start with the prefix. One approach: construct the "next prefix" by incrementing the last character of the prefix (e.g., "app" becomes "apq"), then binary search for that. The count is upper - lower.
Edge case with incrementing: If the prefix ends with 'z', incrementing is not as simple. You may need to drop the last character and increment the second-to-last, and so on. Alternatively, use a comparison function that directly checks the prefix relationship.
Alternative upper bound approach: Binary search for the last string that starts with the prefix, then the count is last - first + 1. Finding the "last" requires a modified binary search that checks prefix membership.

If the array is unsorted and you will receive hundreds of millions of prefix queries, what data structure would you build during preprocessing?
Trie-based solution: Build a Trie where each node stores the count of strings in its subtree. To answer a prefix query, traverse the Trie following the prefix characters and return the count stored at the final node. What is the time complexity per query?
What are the trade-offs between the binary search approach (on a sorted array) and the Trie approach? Consider memory usage, build time, query time, and cache performance.
How would you handle the case where strings can be added or removed dynamically? Which approach (sorted array vs. Trie) adapts more easily?
If the strings are very long and share many common prefixes, does a Trie save significant memory compared to storing all strings independently? What about a compressed Trie (Patricia/Radix tree)?