FFT: The Divide-and-Conquer Algorithm
Pillar: Chain --- "Split even/odd coefficients into two half-size polynomials. Each recursion level halves the problem --- the chain of \(\log n\) levels gives \(O(n \log n)\) total work."
The Setup
Ch11 L2 established the pipeline: convert to point-value form (DFT), multiply pointwise, convert back (IDFT). The bottleneck is the DFT itself, which is \(O(n^2)\) when done naively.
The Fast Fourier Transform (FFT), discovered by Cooley and Tukey in 1965 (and arguably by Gauss a century earlier), computes the DFT in \(O(n \log n)\). The trick exploits the halving property of roots of unity: squaring the \(n\)-th roots gives the \((n/2)\)-th roots, reducing one size-\(n\) DFT to two size-\(n/2\) DFTs.
This single algorithm unlocks polynomial multiplication in \(O(n \log n)\), which in turn unlocks fast convolution for counting problems, string matching, big integer multiplication, and dozens of other applications.
The Cooley-Tukey Insight
Take a polynomial \(P(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}\) where \(n\) is a power of \(2\).
Split the coefficients by parity of index:
- Even-indexed: \(P_{\text{even}}(x) = a_0 + a_2 x + a_4 x^2 + \cdots\)
- Odd-indexed: \(P_{\text{odd}}(x) = a_1 + a_3 x + a_5 x^2 + \cdots\)
Then:
Verify: \(P_{\text{even}}(x^2)\) contributes terms at even positions (\(a_0, a_2 x^2, a_4 x^4, \ldots\)) and \(x \cdot P_{\text{odd}}(x^2)\) contributes terms at odd positions (\(a_1 x, a_3 x^3, a_5 x^5, \ldots\)). Together they reconstruct \(P(x)\).
Now evaluate \(P\) at the \(n\)-th root of unity \(\omega_n^k\):
By the halving property, \(\omega_n^{2k} = \omega_{n/2}^k\). So evaluating \(P\) at \(n\) roots requires evaluating \(P_{\text{even}}\) and \(P_{\text{odd}}\) at \(n/2\) roots. Each is a DFT of size \(n/2\).
For \(k \geq n/2\), use the cancellation property. When \(k' = k - n/2\):
The minus sign comes from \(\omega_n^{k' + n/2} = -\omega_n^{k'}\).
So for each \(k\) in \(0, \ldots, n/2 - 1\):
One pair of additions/subtractions gives you two outputs from the same inputs. This is the butterfly operation.
Recursive FFT: Trace on \(n = 4\)
\(P(x) = 1 + 2x + 3x^2 + 4x^3\). Roots: \(\omega_4^0 = 1\), \(\omega_4^1 = i\), \(\omega_4^2 = -1\), \(\omega_4^3 = -i\).
Level 0: Split into even \((1, 3)\) and odd \((2, 4)\).
Level 1 (even): \(P_{\text{even}}(x) = 1 + 3x\). Evaluate at \(\omega_2^0 = 1, \omega_2^1 = -1\):
- \(P_{\text{even}}(1) = 4\)
- \(P_{\text{even}}(-1) = -2\)
Level 1 (odd): \(P_{\text{odd}}(x) = 2 + 4x\). Evaluate at \(\omega_2^0 = 1, \omega_2^1 = -1\):
- \(P_{\text{odd}}(1) = 6\)
- \(P_{\text{odd}}(-1) = -2\)
Combine (Level 0): Using the butterfly with \(\omega_4^k\):
| \(k\) | \(P_{\text{even}}(\omega_2^k)\) | \(\omega_4^k \cdot P_{\text{odd}}(\omega_2^k)\) | \(P(\omega_4^k) = \text{sum}\) | \(P(\omega_4^{k+2}) = \text{diff}\) |
|---|---|---|---|---|
| \(0\) | \(4\) | \(1 \cdot 6 = 6\) | \(10\) | \(-2\) |
| \(1\) | \(-2\) | \(i \cdot (-2) = -2i\) | \(-2 - 2i\) | \(-2 + 2i\) |
Result: \((10, \; -2-2i, \; -2, \; -2+2i)\).
Matches the naive DFT from Ch11 L2. But instead of \(4 \times 4 = 16\) multiplications, we used \(4 + 2 + 2 = 8\) (roughly \(n \log_2 n = 4 \times 2 = 8\)).
The Butterfly Diagram
At each level of recursion, pairs of values are combined using the butterfly operation:
Visually, draw the input values on the left, output values on the right, with crossing lines between paired elements. The \(\omega\) factor (called the twiddle factor) varies by position and level.
For \(n = 8\), there are \(\log_2 8 = 3\) levels. Each level has \(n/2 = 4\) butterfly operations. Total: \(3 \times 4 = 12\) complex multiplications, versus \(64\) for the naive approach.
From Recursive to Iterative: Bit-Reversal
The recursive FFT is clean conceptually but inefficient in practice due to memory allocation at each level. The iterative version avoids recursion entirely.
The key observation: after all the even/odd splits, the base-level elements end up in bit-reversed order. For \(n = 8\), the elements at positions \(0, 1, 2, 3, 4, 5, 6, 7\) end up at positions \(0, 4, 2, 6, 1, 5, 3, 7\).
In binary: \(000, 001, 010, 011, 100, 101, 110, 111\) becomes \(000, 100, 010, 110, 001, 101, 011, 111\). Each index is the bit-reversal of the original.
The iterative FFT:
- Bit-reverse permute the input array.
- Bottom-up butterfly: combine pairs of size \(1\) into blocks of size \(2\), then blocks of \(2\) into \(4\), then \(4\) into \(8\), etc.
using Complex = complex<double>;
const double PI = acos(-1.0);
void iterativeFFT(vector<Complex>& coefficients, bool inverse) {
int length = coefficients.size();
for (int i = 1, j = 0; i < length; i++) {
int bit = length >> 1;
for (; j & bit; bit >>= 1) {
j ^= bit;
}
j ^= bit;
if (i < j) swap(coefficients[i], coefficients[j]);
}
for (int blockSize = 2; blockSize <= length; blockSize <<= 1) {
double angle = 2.0 * PI / blockSize * (inverse ? -1 : 1);
Complex stepRoot(cos(angle), sin(angle));
for (int blockStart = 0; blockStart < length; blockStart += blockSize) {
Complex currentRoot(1, 0);
for (int offset = 0; offset < blockSize / 2; offset++) {
Complex upper = coefficients[blockStart + offset];
Complex lower = coefficients[blockStart + offset + blockSize / 2] * currentRoot;
coefficients[blockStart + offset] = upper + lower;
coefficients[blockStart + offset + blockSize / 2] = upper - lower;
currentRoot *= stepRoot;
}
}
}
if (inverse) {
for (auto& value : coefficients) {
value /= length;
}
}
}
The bit-reversal permutation in the first loop uses a standard trick: increment a "reversed" counter by toggling bits from the high end. This runs in \(O(n)\) total.
The butterfly loops run \(\log_2 n\) levels, each doing \(n/2\) butterfly operations. Total: \(O(n \log n)\).
The Full Multiplication Pipeline
Putting it all together --- multiply two polynomials in \(O(n \log n)\):
vector<long long> multiplyPolynomials(const vector<long long>& polyA, const vector<long long>& polyB) {
vector<Complex> transformA(polyA.begin(), polyA.end());
vector<Complex> transformB(polyB.begin(), polyB.end());
int resultSize = polyA.size() + polyB.size() - 1;
int paddedLength = 1;
while (paddedLength < resultSize) paddedLength <<= 1;
transformA.resize(paddedLength);
transformB.resize(paddedLength);
iterativeFFT(transformA, false);
iterativeFFT(transformB, false);
for (int i = 0; i < paddedLength; i++) {
transformA[i] *= transformB[i];
}
iterativeFFT(transformA, true);
vector<long long> product(resultSize);
for (int i = 0; i < resultSize; i++) {
product[i] = llround(transformA[i].real());
}
return product;
}
Steps:
- Pad both polynomials to the next power of \(2\) that's at least \(\deg A + \deg B + 1\).
- Forward FFT on both.
- Pointwise multiply in the frequency domain.
- Inverse FFT on the product.
- Round each coefficient to the nearest integer (to handle floating-point error).
Complexity: \(O(n \log n)\) time, \(O(n)\) space, where \(n\) is the padded length.
Trace: Multiply \((1, 2, 3) \times (4, 5)\)
Expected product: \((1 \cdot 4) + (1 \cdot 5 + 2 \cdot 4)x + (2 \cdot 5 + 3 \cdot 4)x^2 + (3 \cdot 5)x^3 = 4 + 13x + 22x^2 + 15x^3\).
Result size: \(3 + 2 - 1 = 4\). Padded length: \(4\) (already a power of \(2\)).
After padding: \(A = (1, 2, 3, 0)\), \(B = (4, 5, 0, 0)\).
After FFT of \(A\): \((6, \; -2+2i, \; -2, \; -2-2i)\) (evaluated at \(1, i, -1, -i\)).
After FFT of \(B\): \((9, \; 4-5i, \; -1, \; 4+5i)\).
Pointwise multiply:
| \(k\) | \(A[k]\) | \(B[k]\) | \(A[k] \cdot B[k]\) |
|---|---|---|---|
| \(0\) | \(6\) | \(9\) | \(54\) |
| \(1\) | \(-2+2i\) | \(4-5i\) | \(2+18i\) |
| \(2\) | \(-2\) | \(-1\) | \(2\) |
| \(3\) | \(-2-2i\) | \(4+5i\) | \(2-18i\) |
After inverse FFT: \((54 + 2 + 2 + 2)/4 = 15\)... let's trust the algorithm and check the final output: \((4, 13, 22, 15)\). Correct.
Floating-Point Precision Issues
The elephant in the room: FFT uses complex floating-point arithmetic. Every multiplication accumulates rounding error.
When is this a problem?
- Coefficients up to \(\sim 10^9\) and polynomial length up to \(\sim 10^5\): floating-point FFT with
doubleworks. The error per coefficient stays below \(0.5\), sollroundgives exact results. - Coefficients up to \(\sim 10^{15}\) or length \(\sim 10^6\): errors can exceed \(0.5\). You need
long doubleor split coefficients into two halves (low 15 bits and high 15 bits) and do three FFTs. - Answers needed modulo a prime: use NTT (Ch11 L4) for exact arithmetic with zero floating-point error.
Rule of thumb: if max_coefficient * polynomial_length stays below roughly \(10^{15}\), double FFT is safe. Beyond that, switch to NTT or the split technique.
Why \(O(n \log n)\): The Recurrence
Let \(T(n)\) be the time to compute a DFT of size \(n\). The FFT splits into two DFTs of size \(n/2\) plus \(O(n)\) work for the butterfly operations:
By the Master Theorem (or by unrolling: \(\log_2 n\) levels, each doing \(O(n)\) work), this gives \(T(n) = O(n \log n)\).
For \(n = 2^{20} \approx 10^6\): the naive DFT does \(10^{12}\) operations. FFT does \(\sim 2 \times 10^7\). That's a \(50{,}000\times\) speedup.
Common Mistakes
-
Not padding to a power of 2. The Cooley-Tukey algorithm requires \(n\) to be a power of \(2\) (for the standard radix-2 version). If your polynomial has \(5\) coefficients, pad to \(8\). Forgetting this causes out-of-bounds access or incorrect results.
-
Padding to the wrong size. You need \(n \geq \deg A + \deg B + 1\), not \(n \geq \max(\deg A, \deg B) + 1\). If \(A\) has degree \(3\) and \(B\) has degree \(4\), the product has degree \(7\), so you need \(n \geq 8\). Padding to \(n = 4\) (the next power of \(2\) after the larger input) causes aliasing.
-
Forgetting to divide by \(n\) in the inverse FFT. The inverse DFT formula has a \(\frac{1}{n}\) factor. If you skip it, all coefficients are scaled by \(n\) and your answers are garbage. In the code above, the
if (inverse)block handles this. -
Using
roundinstead ofllround. Theroundfunction returns adouble, which can lose precision for large integers. Usellroundto get along longdirectly.
Quick Recap
- FFT computes the DFT in \(O(n \log n)\) by splitting even/odd coefficients into two half-size DFTs (Cooley-Tukey).
- The butterfly operation combines two half-size results: \(\text{upper} \pm \omega \cdot \text{lower}\).
- The iterative implementation uses bit-reversal permutation, then bottom-up butterfly passes.
- Full multiplication pipeline: pad to power of \(2\) \(\to\) FFT both \(\to\) pointwise multiply \(\to\) inverse FFT \(\to\) round.
- Floating-point FFT works for coefficients up to \(\sim 10^9\) and lengths up to \(\sim 10^5\). For exact modular results, use NTT (Ch11 L4).
Problems
| Problem | Link | Difficulty | Focus |
|---|---|---|---|
| CSES -- Polynomial Multiplication | cses.fi/problemset/task/2111 | Medium | Direct FFT application |
| CF 632E -- Thief in a Shop | codeforces.com/problemset/problem/632/E | Hard | Polynomial exponentiation via repeated FFT multiplication |
| CF 1096G -- Lucky Tickets | codeforces.com/problemset/problem/1096/G | Hard | Count valid digit strings; reduce to polynomial squaring |
Next lesson: NTT --- the same algorithm, but over a finite field. No floating point, exact modular arithmetic.