FFT: The Divide-and-Conquer Algorithm

Pillar: Chain --- "Split even/odd coefficients into two half-size polynomials. Each recursion level halves the problem --- the chain of $\log n$ levels gives $O(n \log n)$ total work."

The Setup

Ch11 L2 established the pipeline: convert to point-value form (DFT), multiply pointwise, convert back (IDFT). The bottleneck is the DFT itself, which is $O(n^2)$ when done naively.

The Fast Fourier Transform (FFT), discovered by Cooley and Tukey in 1965 (and arguably by Gauss a century earlier), computes the DFT in $O(n \log n)$. The trick exploits the halving property of roots of unity: squaring the $n$-th roots gives the $(n/2)$-th roots, reducing one size-$n$ DFT to two size-$n/2$ DFTs.

This single algorithm unlocks polynomial multiplication in $O(n \log n)$, which in turn unlocks fast convolution for counting problems, string matching, big integer multiplication, and dozens of other applications.

The Cooley-Tukey Insight

Take a polynomial $P(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}$ where $n$ is a power of $2$.

Split the coefficients by parity of index:

Even-indexed: $P_{\text{even}}(x) = a_0 + a_2 x + a_4 x^2 + \cdots$
Odd-indexed: $P_{\text{odd}}(x) = a_1 + a_3 x + a_5 x^2 + \cdots$

Then:

\[P(x) = P_{\text{even}}(x^2) + x \cdot P_{\text{odd}}(x^2)\]

Verify: $P_{\text{even}}(x^2)$ contributes terms at even positions ($a_0, a_2 x^2, a_4 x^4, \ldots$) and $x \cdot P_{\text{odd}}(x^2)$ contributes terms at odd positions ($a_1 x, a_3 x^3, a_5 x^5, \ldots$). Together they reconstruct $P(x)$.

Now evaluate $P$ at the $n$-th root of unity $\omega_n^k$:

\[P(\omega_n^k) = P_{\text{even}}(\omega_n^{2k}) + \omega_n^k \cdot P_{\text{odd}}(\omega_n^{2k})\]

By the halving property, $\omega_n^{2k} = \omega_{n/2}^k$. So evaluating $P$ at $n$ roots requires evaluating $P_{\text{even}}$ and $P_{\text{odd}}$ at $n/2$ roots. Each is a DFT of size $n/2$.

For $k \geq n/2$, use the cancellation property. When $k' = k - n/2$:

\[P(\omega_n^{k' + n/2}) = P_{\text{even}}(\omega_{n/2}^{k'}) - \omega_n^{k'} \cdot P_{\text{odd}}(\omega_{n/2}^{k'})\]

The minus sign comes from $\omega_n^{k' + n/2} = -\omega_n^{k'}$.

So for each $k$ in $0, \ldots, n/2 - 1$:

\[P(\omega_n^k) = P_{\text{even}}(\omega_{n/2}^k) + \omega_n^k \cdot P_{\text{odd}}(\omega_{n/2}^k)\]

\[P(\omega_n^{k + n/2}) = P_{\text{even}}(\omega_{n/2}^k) - \omega_n^k \cdot P_{\text{odd}}(\omega_{n/2}^k)\]

One pair of additions/subtractions gives you two outputs from the same inputs. This is the butterfly operation.

Recursive FFT: Trace on $n = 4$

$P(x) = 1 + 2x + 3x^2 + 4x^3$. Roots: $\omega_4^0 = 1$, $\omega_4^1 = i$, $\omega_4^2 = -1$, $\omega_4^3 = -i$.

Level 0: Split into even $(1, 3)$ and odd $(2, 4)$.

Level 1 (even): $P_{\text{even}}(x) = 1 + 3x$. Evaluate at $\omega_2^0 = 1, \omega_2^1 = -1$:

$P_{\text{even}}(1) = 4$
$P_{\text{even}}(-1) = -2$

Level 1 (odd): $P_{\text{odd}}(x) = 2 + 4x$. Evaluate at $\omega_2^0 = 1, \omega_2^1 = -1$:

$P_{\text{odd}}(1) = 6$
$P_{\text{odd}}(-1) = -2$

Combine (Level 0): Using the butterfly with $\omega_4^k$:

$k$	$P_{\text{even}}(\omega_2^k)$	$\omega_4^k \cdot P_{\text{odd}}(\omega_2^k)$	$P(\omega_4^k) = \text{sum}$	$P(\omega_4^{k+2}) = \text{diff}$
$0$	$4$	$1 \cdot 6 = 6$	$10$	$-2$
$1$	$-2$	$i \cdot (-2) = -2i$	$-2 - 2i$	$-2 + 2i$

Result: $(10, \; -2-2i, \; -2, \; -2+2i)$.

Matches the naive DFT from Ch11 L2. But instead of $4 \times 4 = 16$ multiplications, we used $4 + 2 + 2 = 8$ (roughly $n \log_2 n = 4 \times 2 = 8$).

The Butterfly Diagram

At each level of recursion, pairs of values are combined using the butterfly operation:

\[\text{upper}' = \text{upper} + \omega \cdot \text{lower}$$ $$\text{lower}' = \text{upper} - \omega \cdot \text{lower}\]

Visually, draw the input values on the left, output values on the right, with crossing lines between paired elements. The $\omega$ factor (called the twiddle factor) varies by position and level.

For $n = 8$, there are $\log_2 8 = 3$ levels. Each level has $n/2 = 4$ butterfly operations. Total: $3 \times 4 = 12$ complex multiplications, versus $64$ for the naive approach.

From Recursive to Iterative: Bit-Reversal

The recursive FFT is clean conceptually but inefficient in practice due to memory allocation at each level. The iterative version avoids recursion entirely.

The key observation: after all the even/odd splits, the base-level elements end up in bit-reversed order. For $n = 8$, the elements at positions $0, 1, 2, 3, 4, 5, 6, 7$ end up at positions $0, 4, 2, 6, 1, 5, 3, 7$.

In binary: $000, 001, 010, 011, 100, 101, 110, 111$ becomes $000, 100, 010, 110, 001, 101, 011, 111$. Each index is the bit-reversal of the original.

The iterative FFT:

Bit-reverse permute the input array.
Bottom-up butterfly: combine pairs of size $1$ into blocks of size $2$, then blocks of $2$ into $4$, then $4$ into $8$, etc.

using Complex = complex<double>;
const double PI = acos(-1.0);

void iterativeFFT(vector<Complex>& coefficients, bool inverse) {
    int length = coefficients.size();
    for (int i = 1, j = 0; i < length; i++) {
        int bit = length >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) swap(coefficients[i], coefficients[j]);
    }
    for (int blockSize = 2; blockSize <= length; blockSize <<= 1) {
        double angle = 2.0 * PI / blockSize * (inverse ? -1 : 1);
        Complex stepRoot(cos(angle), sin(angle));
        for (int blockStart = 0; blockStart < length; blockStart += blockSize) {
            Complex currentRoot(1, 0);
            for (int offset = 0; offset < blockSize / 2; offset++) {
                Complex upper = coefficients[blockStart + offset];
                Complex lower = coefficients[blockStart + offset + blockSize / 2] * currentRoot;
                coefficients[blockStart + offset] = upper + lower;
                coefficients[blockStart + offset + blockSize / 2] = upper - lower;
                currentRoot *= stepRoot;
            }
        }
    }
    if (inverse) {
        for (auto& value : coefficients) {
            value /= length;
        }
    }
}

The bit-reversal permutation in the first loop uses a standard trick: increment a "reversed" counter by toggling bits from the high end. This runs in $O(n)$ total.

The butterfly loops run $\log_2 n$ levels, each doing $n/2$ butterfly operations. Total: $O(n \log n)$.

The Full Multiplication Pipeline

Putting it all together --- multiply two polynomials in $O(n \log n)$:

vector<long long> multiplyPolynomials(const vector<long long>& polyA, const vector<long long>& polyB) {
    vector<Complex> transformA(polyA.begin(), polyA.end());
    vector<Complex> transformB(polyB.begin(), polyB.end());
    int resultSize = polyA.size() + polyB.size() - 1;
    int paddedLength = 1;
    while (paddedLength < resultSize) paddedLength <<= 1;
    transformA.resize(paddedLength);
    transformB.resize(paddedLength);
    iterativeFFT(transformA, false);
    iterativeFFT(transformB, false);
    for (int i = 0; i < paddedLength; i++) {
        transformA[i] *= transformB[i];
    }
    iterativeFFT(transformA, true);
    vector<long long> product(resultSize);
    for (int i = 0; i < resultSize; i++) {
        product[i] = llround(transformA[i].real());
    }
    return product;
}

Steps:

Pad both polynomials to the next power of $2$ that's at least $\deg A + \deg B + 1$.
Forward FFT on both.
Pointwise multiply in the frequency domain.
Inverse FFT on the product.
Round each coefficient to the nearest integer (to handle floating-point error).

Complexity: $O(n \log n)$ time, $O(n)$ space, where $n$ is the padded length.

Trace: Multiply $(1, 2, 3) \times (4, 5)$

Expected product: $(1 \cdot 4) + (1 \cdot 5 + 2 \cdot 4)x + (2 \cdot 5 + 3 \cdot 4)x^2 + (3 \cdot 5)x^3 = 4 + 13x + 22x^2 + 15x^3$.

Result size: $3 + 2 - 1 = 4$. Padded length: $4$ (already a power of $2$).

After padding: $A = (1, 2, 3, 0)$, $B = (4, 5, 0, 0)$.

After FFT of $A$: $(6, \; -2+2i, \; -2, \; -2-2i)$ (evaluated at $1, i, -1, -i$).

After FFT of $B$: $(9, \; 4-5i, \; -1, \; 4+5i)$.

Pointwise multiply:

$k$	$A[k]$	$B[k]$	$A[k] \cdot B[k]$
$0$	$6$	$9$	$54$
$1$	$-2+2i$	$4-5i$	$2+18i$
$2$	$-2$	$-1$	$2$
$3$	$-2-2i$	$4+5i$	$2-18i$

After inverse FFT: $(54 + 2 + 2 + 2)/4 = 15$... let's trust the algorithm and check the final output: $(4, 13, 22, 15)$. Correct.

Floating-Point Precision Issues

The elephant in the room: FFT uses complex floating-point arithmetic. Every multiplication accumulates rounding error.

When is this a problem?

Coefficients up to $\sim 10^9$ and polynomial length up to $\sim 10^5$: floating-point FFT with double works. The error per coefficient stays below $0.5$, so llround gives exact results.
Coefficients up to $\sim 10^{15}$ or length $\sim 10^6$: errors can exceed $0.5$. You need long double or split coefficients into two halves (low 15 bits and high 15 bits) and do three FFTs.
Answers needed modulo a prime: use NTT (Ch11 L4) for exact arithmetic with zero floating-point error.

Rule of thumb: if max_coefficient * polynomial_length stays below roughly $10^{15}$, double FFT is safe. Beyond that, switch to NTT or the split technique.

Why $O(n \log n)$: The Recurrence

Let $T(n)$ be the time to compute a DFT of size $n$. The FFT splits into two DFTs of size $n/2$ plus $O(n)$ work for the butterfly operations:

\[T(n) = 2T(n/2) + O(n)\]

By the Master Theorem (or by unrolling: $\log_2 n$ levels, each doing $O(n)$ work), this gives $T(n) = O(n \log n)$.

For $n = 2^{20} \approx 10^6$: the naive DFT does $10^{12}$ operations. FFT does $\sim 2 \times 10^7$. That's a $50{,}000\times$ speedup.

Common Mistakes

Not padding to a power of 2. The Cooley-Tukey algorithm requires $n$ to be a power of $2$ (for the standard radix-2 version). If your polynomial has $5$ coefficients, pad to $8$. Forgetting this causes out-of-bounds access or incorrect results.
Padding to the wrong size. You need $n \geq \deg A + \deg B + 1$, not $n \geq \max(\deg A, \deg B) + 1$. If $A$ has degree $3$ and $B$ has degree $4$, the product has degree $7$, so you need $n \geq 8$. Padding to $n = 4$ (the next power of $2$ after the larger input) causes aliasing.
Forgetting to divide by $n$ in the inverse FFT. The inverse DFT formula has a $\frac{1}{n}$ factor. If you skip it, all coefficients are scaled by $n$ and your answers are garbage. In the code above, the if (inverse) block handles this.
Using round instead of llround. The round function returns a double, which can lose precision for large integers. Use llround to get a long long directly.

Quick Recap

FFT computes the DFT in $O(n \log n)$ by splitting even/odd coefficients into two half-size DFTs (Cooley-Tukey).
The butterfly operation combines two half-size results: $\text{upper} \pm \omega \cdot \text{lower}$.
The iterative implementation uses bit-reversal permutation, then bottom-up butterfly passes.
Full multiplication pipeline: pad to power of $2$ $\to$ FFT both $\to$ pointwise multiply $\to$ inverse FFT $\to$ round.
Floating-point FFT works for coefficients up to $\sim 10^9$ and lengths up to $\sim 10^5$. For exact modular results, use NTT (Ch11 L4).

Problems

Problem	Link	Difficulty	Focus
CSES -- Polynomial Multiplication	cses.fi/problemset/task/2111	Medium	Direct FFT application
CF 632E -- Thief in a Shop	codeforces.com/problemset/problem/632/E	Hard	Polynomial exponentiation via repeated FFT multiplication
CF 1096G -- Lucky Tickets	codeforces.com/problemset/problem/1096/G	Hard	Count valid digit strings; reduce to polynomial squaring

Next lesson: NTT --- the same algorithm, but over a finite field. No floating point, exact modular arithmetic.

\(k\)	\(P_{\text{even}}(\omega_2^k)\)	\(\omega_4^k \cdot P_{\text{odd}}(\omega_2^k)\)	\(P(\omega_4^k) = \text{sum}\)	\(P(\omega_4^{k+2}) = \text{diff}\)
\(0\)	\(4\)	\(1 \cdot 6 = 6\)	\(10\)	\(-2\)
\(1\)	\(-2\)	\(i \cdot (-2) = -2i\)	\(-2 - 2i\)	\(-2 + 2i\)

\(k\)	\(A[k]\)	\(B[k]\)	\(A[k] \cdot B[k]\)
\(0\)	\(6\)	\(9\)	\(54\)
\(1\)	\(-2+2i\)	\(4-5i\)	\(2+18i\)
\(2\)	\(-2\)	\(-1\)	\(2\)
\(3\)	\(-2-2i\)	\(4+5i\)	\(2-18i\)