Skip to content

The Pigeonhole Principle

Pillar: Set — "Map items to a set of buckets. The principle bounds the minimum bucket size." Pillar: Shape — "When the number of items exceeds distinct values, a collision is forced."


The Setup

You have an array of \(n + 1\) integers, each between \(1\) and \(n\). Prove that at least one value appears more than once.

No algorithm needed. You have \(n + 1\) items (array entries) and only \(n\) possible values (the buckets \(1, 2, \ldots, n\)). Some bucket must contain at least two items. A duplicate exists — guaranteed.

That one-sentence argument is the Pigeonhole Principle. It sounds trivial, but it powers a surprising range of results: cycle detection in linked lists, subarray divisibility problems, and existence proofs that no constructive method can easily replicate.


The Key Idea

Think in terms of the Set pillar: you have a set of buckets and a collection of items being placed into them. If there are more items than buckets, at least one bucket is non-empty twice. The Shape pillar adds a geometric angle: when a sequence of values is longer than the range of distinct values, the sequence must "fold back" on itself — a collision is inevitable.

The principle doesn't tell you which bucket overflows or which items collide. It only guarantees existence. That's exactly what makes it powerful — you get a conclusion without exhaustive search.


Basic Form

The Pigeonhole Principle (basic): If \(n + 1\) items are placed into \(n\) buckets, then at least one bucket contains \(2\) or more items.

Proof: suppose every bucket contains at most \(1\) item. Then the total number of items is at most \(n\). But you have \(n + 1\) items. Contradiction.


Generalized Form

The Generalized Pigeonhole Principle: If \(kn + 1\) items are placed into \(n\) buckets, then at least one bucket contains \(k + 1\) or more items.

Proof: suppose every bucket contains at most \(k\) items. Then the total is at most \(kn\). But you have \(kn + 1\) items. Contradiction.

Example: among \(13\) people, at least \(\lceil 13 / 12 \rceil = 2\) share the same birth month. Among \(25\) people, at least \(\lceil 25 / 12 \rceil = 3\) share the same birth month.

The general bound: with \(m\) items in \(n\) buckets, some bucket has at least \(\lceil m / n \rceil\) items.


Application 1 — Prefix Sums Mod \(K\)

Claim: Given any array of integers and a positive integer \(K\), there always exists a non-empty contiguous subarray whose sum is divisible by \(K\), provided the array has at least \(K\) elements.

Proof. Let the array be \(a_0, a_1, \ldots, a_{K-1}\) (at least \(K\) elements). Define prefix sums:

\[S_0 = 0, \quad S_i = a_0 + a_1 + \cdots + a_{i-1} \quad \text{for } i = 1, 2, \ldots, K\]

You have \(K + 1\) prefix sums (\(S_0\) through \(S_K\)) and only \(K\) possible remainders modulo \(K\) (namely \(0, 1, \ldots, K-1\)). By the Pigeonhole Principle, two prefix sums share the same remainder: \(S_i \equiv S_j \pmod{K}\) for some \(0 \leq i < j \leq K\).

Then \(S_j - S_i = a_i + a_{i+1} + \cdots + a_{j-1}\) is divisible by \(K\). That's the subarray from index \(i\) to \(j - 1\).

This proof is why LC 974 (Subarray Sums Divisible by K) works — you count pairs of prefix sums with equal remainders, not individual subarrays.


Application 2 — Floyd's Cycle Detection (Tortoise and Hare)

This is the main algorithmic application of the Pigeonhole Principle. It shows up in FAANG interviews constantly.

Why a Cycle Must Exist

Consider a function \(f: \{0, 1, \ldots, n\} \to \{1, 2, \ldots, n\}\). The domain has \(n + 1\) elements, the range has \(n\). Start at \(x_0 = 0\) and iterate: \(x_1 = f(0)\), \(x_2 = f(x_1)\), \(x_3 = f(x_2)\), and so on.

The sequence \(x_0, x_1, x_2, \ldots\) visits values in \(\{0, 1, \ldots, n\}\). Since the range of \(f\) has only \(n\) values, and each \(x_i\) for \(i \geq 1\) lives in \(\{1, \ldots, n\}\), the sequence must eventually revisit a value. Once it does, it cycles forever.

The sequence has a "tail" of length \(\mu\) (the distance from the start to the cycle entry) and a "cycle" of length \(\lambda\). The Pigeonhole Principle guarantees \(\lambda \geq 1\).

The Algorithm

Floyd's algorithm finds the cycle entry in \(O(n)\) time and \(O(1)\) space using two pointers:

Phase 1: Detect the meeting point.

  • Slow pointer moves one step at a time: \(\text{slow} = f(\text{slow})\).
  • Fast pointer moves two steps: \(\text{fast} = f(f(\text{fast}))\).
  • Both start at \(x_0\). They will meet inside the cycle.

Why they meet. Once both pointers are inside the cycle, the fast pointer gains one step per iteration on the slow pointer. The gap closes by \(1\) each step, so they must collide after at most \(\lambda\) steps inside the cycle.

Phase 2: Find the cycle entry.

  • Reset slow to the start (\(x_0\)). Keep fast where it is.
  • Advance both at speed \(1\).
  • They meet at the cycle entry.

Why this works. Suppose the slow pointer has taken \(t\) steps when they first meet in Phase 1. Then:

  • Slow is at position \(t\) in the sequence.
  • Fast is at position \(2t\) in the sequence.
  • Since they're at the same node: \(2t - t = t\) is a multiple of \(\lambda\). So \(t = m\lambda\) for some integer \(m\).

The cycle entry is at position \(\mu\). Starting from the meeting point (position \(t\) in the iteration), taking \(\mu\) more steps lands you at position \(t + \mu\). Since \(t\) is a multiple of \(\lambda\), position \(t + \mu \equiv \mu \pmod{\lambda}\) — that's the cycle entry.

Starting from \(x_0\), taking \(\mu\) steps also lands at the cycle entry. So both pointers reach the cycle entry simultaneously after \(\mu\) steps.

Full Trace

Array: \([1, 3, 4, 2, 2]\). Treat as \(f(i) = \text{numbers}[i]\).

The sequence starting at index \(0\): \(0 \to 1 \to 3 \to 2 \to 4 \to 2 \to 4 \to 2 \to \cdots\)

Cycle entry is at value \(2\) (index \(2\)). Tail length \(\mu = 2\), cycle length \(\lambda = 2\).

Phase 1:

Step Slow Fast
Start \(f(0) = 1\) \(f(f(0)) = f(1) = 3\)
1 \(f(1) = 3\) \(f(f(3)) = f(2) = 4\)
2 \(f(3) = 2\) \(f(f(4)) = f(2) = 4\)
3 \(f(2) = 4\) \(f(f(4)) = f(2) = 4\)

Meeting point: both at value \(4\).

Phase 2: Reset slow to \(f(0) = 1\).

Step Slow Fast
Start \(1\) \(4\)
1 \(f(1) = 3\) \(f(4) = 2\)
2 \(f(3) = 2\) \(f(2) = 4\)

Wait — let me retrace. The standard implementation starts both at numbers[0] for the initial value, then enters the do-while loop.

Let me redo this carefully with the standard code logic:

slow = numbers[0] = 1
fast = numbers[0] = 1

Iteration 1: slow = numbers[1] = 3, fast = numbers[numbers[1]] = numbers[3] = 2
Iteration 2: slow = numbers[3] = 2, fast = numbers[numbers[2]] = numbers[4] = 2

Slow \(= 2\), fast \(= 2\). They meet.

Phase 2: Reset slow to numbers[0] \(= 1\). Fast stays at \(2\).

Iteration 1: slow = numbers[1] = 3, fast = numbers[2] = 4
Iteration 2: slow = numbers[3] = 2, fast = numbers[4] = 2

Both at \(2\). The duplicate is \(2\).

\(O(1)\) space, \(O(n)\) time.


Application 3 — Existence Proofs

The Pigeonhole Principle is a workhorse for quick existence arguments:

Claim: There exist at least two people in New York City who have exactly the same number of hairs on their head.

Proof. A human head has at most about \(200{,}000\) hairs. New York City has over \(8\) million residents. There are \(8{,}000{,}000+\) items (people) and at most \(200{,}001\) buckets (possible hair counts from \(0\) to \(200{,}000\)). By the Pigeonhole Principle, at least two people share the same count.

In fact, the generalized form gives something stronger: at least \(\lceil 8{,}000{,}000 / 200{,}001 \rceil = 40\) people share the same hair count.

This style of reasoning — items vastly outnumber buckets — appears in interview questions whenever you need to argue "a collision must exist" without finding it explicitly.


Application 4 — Erdos-Szekeres Theorem

Theorem (Erdos-Szekeres): In any sequence of \(n^2 + 1\) distinct integers, there exists either an increasing subsequence of length \(n + 1\) or a decreasing subsequence of length \(n + 1\).

Proof sketch. For each element \(a_i\), let \(\ell_i\) be the length of the longest increasing subsequence ending at \(a_i\). If any \(\ell_i \geq n + 1\), we're done. Otherwise, \(\ell_i \in \{1, 2, \ldots, n\}\) for all \(i\).

Similarly, let \(d_i\) be the length of the longest decreasing subsequence ending at \(a_i\). If any \(d_i \geq n + 1\), we're done. Otherwise, \(d_i \in \{1, 2, \ldots, n\}\).

Assign each element the pair \((\ell_i, d_i)\). There are \(n^2 + 1\) elements and at most \(n \times n = n^2\) distinct pairs. By pigeonhole, two elements share the same pair: \((\ell_i, d_i) = (\ell_j, d_j)\) with \(i < j\).

But if \(a_i < a_j\), then \(\ell_j \geq \ell_i + 1\), contradicting \(\ell_i = \ell_j\). If \(a_i > a_j\), then \(d_j \geq d_i + 1\), contradicting \(d_i = d_j\). Either way, contradiction.

So our assumption that all \(\ell_i\) and \(d_i\) are at most \(n\) must fail.

This is the theoretical backbone behind LIS (Longest Increasing Subsequence) problems. The bound is tight: a sequence of length \(n^2\) can avoid both — arrange \(n\) decreasing blocks of \(n\) elements each.


Application 5 — The Infinite Pigeonhole Principle

Claim: In an infinite sequence of values drawn from a finite set, at least one value appears infinitely often.

Proof. Suppose the finite set has \(k\) values and each appears only finitely many times. Then the total number of terms is at most \(k \times (\text{maximum finite count})\), which is finite. But the sequence is infinite. Contradiction.

This version drives arguments in combinatorics and theoretical CS. For example: if a deterministic finite automaton processes an infinite input stream, it must revisit some state infinitely often — the automaton has finitely many states.


The Code

Floyd's Cycle Detection — Find the Duplicate Number (LC 287)

int findDuplicate(vector<int>& numbers) {
    int slow = numbers[0];
    int fast = numbers[0];
    do {
        slow = numbers[slow];
        fast = numbers[numbers[fast]];
    } while (slow != fast);
    slow = numbers[0];
    while (slow != fast) {
        slow = numbers[slow];
        fast = numbers[fast];
    }
    return slow;
}

\(O(n)\) time, \(O(1)\) space.

The array itself encodes the function \(f\). Index \(i\) maps to numbers[i]. Since values are in \([1, n]\) and there are \(n + 1\) entries, pigeonhole guarantees a cycle. The cycle entry is the duplicate value.

Subarray Sums Divisible by K (LC 974)

int subarraysDivByK(vector<int>& values, int divisor) {
    unordered_map<int, int> remainderCount;
    remainderCount[0] = 1;
    int prefixSum = 0;
    int totalCount = 0;
    for (int value : values) {
        prefixSum += value;
        int remainder = ((prefixSum % divisor) + divisor) % divisor;
        totalCount += remainderCount[remainder];
        remainderCount[remainder]++;
    }
    return totalCount;
}

\(O(n)\) time, \(O(K)\) space.

The double-mod ((prefixSum % divisor) + divisor) % divisor handles negative numbers. In C++, % can return negative values for negative operands. Adding divisor before the second % forces the result into \([0, K - 1]\).

For each new prefix sum, every previous prefix sum with the same remainder forms a valid subarray. The map counts how many previous prefix sums had each remainder.

Happy Number (LC 202)

A happy number is defined by the process: start with any positive integer, replace it by the sum of the squares of its digits, and repeat. If the process eventually reaches \(1\), the number is happy. Otherwise, it cycles forever.

The digit-square-sum of any number with at most \(d\) digits is at most \(81d\). For any starting value, the sequence quickly enters the range \([1, 81 \times 10] = [1, 810]\) (since inputs up to \(2^{31}\) have at most \(10\) digits). With only \(810\) possible values, the sequence must cycle (pigeonhole). Either the cycle includes \(1\) or it doesn't.

int digitSquareSum(int value) {
    int total = 0;
    while (value > 0) {
        int digit = value % 10;
        total += digit * digit;
        value /= 10;
    }
    return total;
}

bool isHappy(int startValue) {
    int slow = startValue;
    int fast = startValue;
    do {
        slow = digitSquareSum(slow);
        fast = digitSquareSum(digitSquareSum(fast));
    } while (slow != fast);
    return slow == 1;
}

\(O(\log n)\) per step (digit extraction), \(O(1)\) space.


Common Mistakes

  • Forgetting negative remainders. In C++, -7 % 5 is -2, not 3. For the subarray-sum problem, you must normalize: ((sum % k) + k) % k. Skipping this produces wrong bucket assignments and incorrect counts.

  • Applying Floyd's to the wrong graph. Floyd's cycle detection requires a functional graph — each node has exactly one outgoing edge. If your structure has branching (e.g., a general graph or a tree), the two-pointer trick doesn't apply.

  • Confusing "a duplicate exists" with "finding the duplicate." The Pigeonhole Principle only proves existence. To actually find the duplicate value in \(O(1)\) space, you need the full Floyd's algorithm — the principle alone won't hand you the answer.


Quick Recap

  • Basic pigeonhole: \(n + 1\) items into \(n\) buckets forces a bucket with \(\geq 2\) items.
  • Generalized: \(kn + 1\) items into \(n\) buckets forces a bucket with \(\geq k + 1\) items.
  • Prefix sums mod \(K\): \(K + 1\) prefix sums, \(K\) remainders — two must collide, giving a subarray divisible by \(K\).
  • Floyd's cycle detection exploits pigeonhole on functional graphs: slow pointer (1 step) and fast pointer (2 steps) meet inside the cycle, then a reset-and-walk phase finds the cycle entry. \(O(1)\) space, \(O(n)\) time.
  • Existence proofs (hairs in NYC, Erdos-Szekeres) use pigeonhole to guarantee collisions without constructing them.

Problems

Problem Link Difficulty Focus
LC 974 — Subarray Sums Divisible by K leetcode.com/problems/subarray-sums-divisible-by-k Medium Prefix sums mod \(K\); pigeonhole counting of remainder collisions
LC 202 — Happy Number leetcode.com/problems/happy-number Easy Floyd's cycle detection on the digit-square-sum function
LC 287 — Find the Duplicate Number leetcode.com/problems/find-the-duplicate-number Medium Floyd's cycle detection on array-as-function; the classic pigeonhole interview problem