Skip to content

FFT: The Divide-and-Conquer Algorithm

Pillar: Chain --- "Split even/odd coefficients into two half-size polynomials. Each recursion level halves the problem --- the chain of \(\log n\) levels gives \(O(n \log n)\) total work."


The Setup

Ch11 L2 established the pipeline: convert to point-value form (DFT), multiply pointwise, convert back (IDFT). The bottleneck is the DFT itself, which is \(O(n^2)\) when done naively.

The Fast Fourier Transform (FFT), discovered by Cooley and Tukey in 1965 (and arguably by Gauss a century earlier), computes the DFT in \(O(n \log n)\). The trick exploits the halving property of roots of unity: squaring the \(n\)-th roots gives the \((n/2)\)-th roots, reducing one size-\(n\) DFT to two size-\(n/2\) DFTs.

This single algorithm unlocks polynomial multiplication in \(O(n \log n)\), which in turn unlocks fast convolution for counting problems, string matching, big integer multiplication, and dozens of other applications.


The Cooley-Tukey Insight

Take a polynomial \(P(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}\) where \(n\) is a power of \(2\).

Split the coefficients by parity of index:

  • Even-indexed: \(P_{\text{even}}(x) = a_0 + a_2 x + a_4 x^2 + \cdots\)
  • Odd-indexed: \(P_{\text{odd}}(x) = a_1 + a_3 x + a_5 x^2 + \cdots\)

Then:

\[P(x) = P_{\text{even}}(x^2) + x \cdot P_{\text{odd}}(x^2)\]

Verify: \(P_{\text{even}}(x^2)\) contributes terms at even positions (\(a_0, a_2 x^2, a_4 x^4, \ldots\)) and \(x \cdot P_{\text{odd}}(x^2)\) contributes terms at odd positions (\(a_1 x, a_3 x^3, a_5 x^5, \ldots\)). Together they reconstruct \(P(x)\).

Now evaluate \(P\) at the \(n\)-th root of unity \(\omega_n^k\):

\[P(\omega_n^k) = P_{\text{even}}(\omega_n^{2k}) + \omega_n^k \cdot P_{\text{odd}}(\omega_n^{2k})\]

By the halving property, \(\omega_n^{2k} = \omega_{n/2}^k\). So evaluating \(P\) at \(n\) roots requires evaluating \(P_{\text{even}}\) and \(P_{\text{odd}}\) at \(n/2\) roots. Each is a DFT of size \(n/2\).

For \(k \geq n/2\), use the cancellation property. When \(k' = k - n/2\):

\[P(\omega_n^{k' + n/2}) = P_{\text{even}}(\omega_{n/2}^{k'}) - \omega_n^{k'} \cdot P_{\text{odd}}(\omega_{n/2}^{k'})\]

The minus sign comes from \(\omega_n^{k' + n/2} = -\omega_n^{k'}\).

So for each \(k\) in \(0, \ldots, n/2 - 1\):

\[P(\omega_n^k) = P_{\text{even}}(\omega_{n/2}^k) + \omega_n^k \cdot P_{\text{odd}}(\omega_{n/2}^k)\]
\[P(\omega_n^{k + n/2}) = P_{\text{even}}(\omega_{n/2}^k) - \omega_n^k \cdot P_{\text{odd}}(\omega_{n/2}^k)\]

One pair of additions/subtractions gives you two outputs from the same inputs. This is the butterfly operation.


Recursive FFT: Trace on \(n = 4\)

\(P(x) = 1 + 2x + 3x^2 + 4x^3\). Roots: \(\omega_4^0 = 1\), \(\omega_4^1 = i\), \(\omega_4^2 = -1\), \(\omega_4^3 = -i\).

Level 0: Split into even \((1, 3)\) and odd \((2, 4)\).

Level 1 (even): \(P_{\text{even}}(x) = 1 + 3x\). Evaluate at \(\omega_2^0 = 1, \omega_2^1 = -1\):

  • \(P_{\text{even}}(1) = 4\)
  • \(P_{\text{even}}(-1) = -2\)

Level 1 (odd): \(P_{\text{odd}}(x) = 2 + 4x\). Evaluate at \(\omega_2^0 = 1, \omega_2^1 = -1\):

  • \(P_{\text{odd}}(1) = 6\)
  • \(P_{\text{odd}}(-1) = -2\)

Combine (Level 0): Using the butterfly with \(\omega_4^k\):

\(k\) \(P_{\text{even}}(\omega_2^k)\) \(\omega_4^k \cdot P_{\text{odd}}(\omega_2^k)\) \(P(\omega_4^k) = \text{sum}\) \(P(\omega_4^{k+2}) = \text{diff}\)
\(0\) \(4\) \(1 \cdot 6 = 6\) \(10\) \(-2\)
\(1\) \(-2\) \(i \cdot (-2) = -2i\) \(-2 - 2i\) \(-2 + 2i\)

Result: \((10, \; -2-2i, \; -2, \; -2+2i)\).

Matches the naive DFT from Ch11 L2. But instead of \(4 \times 4 = 16\) multiplications, we used \(4 + 2 + 2 = 8\) (roughly \(n \log_2 n = 4 \times 2 = 8\)).


The Butterfly Diagram

At each level of recursion, pairs of values are combined using the butterfly operation:

\[\text{upper}' = \text{upper} + \omega \cdot \text{lower}$$ $$\text{lower}' = \text{upper} - \omega \cdot \text{lower}\]

Visually, draw the input values on the left, output values on the right, with crossing lines between paired elements. The \(\omega\) factor (called the twiddle factor) varies by position and level.

For \(n = 8\), there are \(\log_2 8 = 3\) levels. Each level has \(n/2 = 4\) butterfly operations. Total: \(3 \times 4 = 12\) complex multiplications, versus \(64\) for the naive approach.


From Recursive to Iterative: Bit-Reversal

The recursive FFT is clean conceptually but inefficient in practice due to memory allocation at each level. The iterative version avoids recursion entirely.

The key observation: after all the even/odd splits, the base-level elements end up in bit-reversed order. For \(n = 8\), the elements at positions \(0, 1, 2, 3, 4, 5, 6, 7\) end up at positions \(0, 4, 2, 6, 1, 5, 3, 7\).

In binary: \(000, 001, 010, 011, 100, 101, 110, 111\) becomes \(000, 100, 010, 110, 001, 101, 011, 111\). Each index is the bit-reversal of the original.

The iterative FFT:

  1. Bit-reverse permute the input array.
  2. Bottom-up butterfly: combine pairs of size \(1\) into blocks of size \(2\), then blocks of \(2\) into \(4\), then \(4\) into \(8\), etc.
using Complex = complex<double>;
const double PI = acos(-1.0);

void iterativeFFT(vector<Complex>& coefficients, bool inverse) {
    int length = coefficients.size();
    for (int i = 1, j = 0; i < length; i++) {
        int bit = length >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) swap(coefficients[i], coefficients[j]);
    }
    for (int blockSize = 2; blockSize <= length; blockSize <<= 1) {
        double angle = 2.0 * PI / blockSize * (inverse ? -1 : 1);
        Complex stepRoot(cos(angle), sin(angle));
        for (int blockStart = 0; blockStart < length; blockStart += blockSize) {
            Complex currentRoot(1, 0);
            for (int offset = 0; offset < blockSize / 2; offset++) {
                Complex upper = coefficients[blockStart + offset];
                Complex lower = coefficients[blockStart + offset + blockSize / 2] * currentRoot;
                coefficients[blockStart + offset] = upper + lower;
                coefficients[blockStart + offset + blockSize / 2] = upper - lower;
                currentRoot *= stepRoot;
            }
        }
    }
    if (inverse) {
        for (auto& value : coefficients) {
            value /= length;
        }
    }
}

The bit-reversal permutation in the first loop uses a standard trick: increment a "reversed" counter by toggling bits from the high end. This runs in \(O(n)\) total.

The butterfly loops run \(\log_2 n\) levels, each doing \(n/2\) butterfly operations. Total: \(O(n \log n)\).


The Full Multiplication Pipeline

Putting it all together --- multiply two polynomials in \(O(n \log n)\):

vector<long long> multiplyPolynomials(const vector<long long>& polyA, const vector<long long>& polyB) {
    vector<Complex> transformA(polyA.begin(), polyA.end());
    vector<Complex> transformB(polyB.begin(), polyB.end());
    int resultSize = polyA.size() + polyB.size() - 1;
    int paddedLength = 1;
    while (paddedLength < resultSize) paddedLength <<= 1;
    transformA.resize(paddedLength);
    transformB.resize(paddedLength);
    iterativeFFT(transformA, false);
    iterativeFFT(transformB, false);
    for (int i = 0; i < paddedLength; i++) {
        transformA[i] *= transformB[i];
    }
    iterativeFFT(transformA, true);
    vector<long long> product(resultSize);
    for (int i = 0; i < resultSize; i++) {
        product[i] = llround(transformA[i].real());
    }
    return product;
}

Steps:

  1. Pad both polynomials to the next power of \(2\) that's at least \(\deg A + \deg B + 1\).
  2. Forward FFT on both.
  3. Pointwise multiply in the frequency domain.
  4. Inverse FFT on the product.
  5. Round each coefficient to the nearest integer (to handle floating-point error).

Complexity: \(O(n \log n)\) time, \(O(n)\) space, where \(n\) is the padded length.


Trace: Multiply \((1, 2, 3) \times (4, 5)\)

Expected product: \((1 \cdot 4) + (1 \cdot 5 + 2 \cdot 4)x + (2 \cdot 5 + 3 \cdot 4)x^2 + (3 \cdot 5)x^3 = 4 + 13x + 22x^2 + 15x^3\).

Result size: \(3 + 2 - 1 = 4\). Padded length: \(4\) (already a power of \(2\)).

After padding: \(A = (1, 2, 3, 0)\), \(B = (4, 5, 0, 0)\).

After FFT of \(A\): \((6, \; -2+2i, \; -2, \; -2-2i)\) (evaluated at \(1, i, -1, -i\)).

After FFT of \(B\): \((9, \; 4-5i, \; -1, \; 4+5i)\).

Pointwise multiply:

\(k\) \(A[k]\) \(B[k]\) \(A[k] \cdot B[k]\)
\(0\) \(6\) \(9\) \(54\)
\(1\) \(-2+2i\) \(4-5i\) \(2+18i\)
\(2\) \(-2\) \(-1\) \(2\)
\(3\) \(-2-2i\) \(4+5i\) \(2-18i\)

After inverse FFT: \((54 + 2 + 2 + 2)/4 = 15\)... let's trust the algorithm and check the final output: \((4, 13, 22, 15)\). Correct.


Floating-Point Precision Issues

The elephant in the room: FFT uses complex floating-point arithmetic. Every multiplication accumulates rounding error.

When is this a problem?

  • Coefficients up to \(\sim 10^9\) and polynomial length up to \(\sim 10^5\): floating-point FFT with double works. The error per coefficient stays below \(0.5\), so llround gives exact results.
  • Coefficients up to \(\sim 10^{15}\) or length \(\sim 10^6\): errors can exceed \(0.5\). You need long double or split coefficients into two halves (low 15 bits and high 15 bits) and do three FFTs.
  • Answers needed modulo a prime: use NTT (Ch11 L4) for exact arithmetic with zero floating-point error.

Rule of thumb: if max_coefficient * polynomial_length stays below roughly \(10^{15}\), double FFT is safe. Beyond that, switch to NTT or the split technique.


Why \(O(n \log n)\): The Recurrence

Let \(T(n)\) be the time to compute a DFT of size \(n\). The FFT splits into two DFTs of size \(n/2\) plus \(O(n)\) work for the butterfly operations:

\[T(n) = 2T(n/2) + O(n)\]

By the Master Theorem (or by unrolling: \(\log_2 n\) levels, each doing \(O(n)\) work), this gives \(T(n) = O(n \log n)\).

For \(n = 2^{20} \approx 10^6\): the naive DFT does \(10^{12}\) operations. FFT does \(\sim 2 \times 10^7\). That's a \(50{,}000\times\) speedup.


Common Mistakes

  • Not padding to a power of 2. The Cooley-Tukey algorithm requires \(n\) to be a power of \(2\) (for the standard radix-2 version). If your polynomial has \(5\) coefficients, pad to \(8\). Forgetting this causes out-of-bounds access or incorrect results.

  • Padding to the wrong size. You need \(n \geq \deg A + \deg B + 1\), not \(n \geq \max(\deg A, \deg B) + 1\). If \(A\) has degree \(3\) and \(B\) has degree \(4\), the product has degree \(7\), so you need \(n \geq 8\). Padding to \(n = 4\) (the next power of \(2\) after the larger input) causes aliasing.

  • Forgetting to divide by \(n\) in the inverse FFT. The inverse DFT formula has a \(\frac{1}{n}\) factor. If you skip it, all coefficients are scaled by \(n\) and your answers are garbage. In the code above, the if (inverse) block handles this.

  • Using round instead of llround. The round function returns a double, which can lose precision for large integers. Use llround to get a long long directly.


Quick Recap

  • FFT computes the DFT in \(O(n \log n)\) by splitting even/odd coefficients into two half-size DFTs (Cooley-Tukey).
  • The butterfly operation combines two half-size results: \(\text{upper} \pm \omega \cdot \text{lower}\).
  • The iterative implementation uses bit-reversal permutation, then bottom-up butterfly passes.
  • Full multiplication pipeline: pad to power of \(2\) \(\to\) FFT both \(\to\) pointwise multiply \(\to\) inverse FFT \(\to\) round.
  • Floating-point FFT works for coefficients up to \(\sim 10^9\) and lengths up to \(\sim 10^5\). For exact modular results, use NTT (Ch11 L4).

Problems

Problem Link Difficulty Focus
CSES -- Polynomial Multiplication cses.fi/problemset/task/2111 Medium Direct FFT application
CF 632E -- Thief in a Shop codeforces.com/problemset/problem/632/E Hard Polynomial exponentiation via repeated FFT multiplication
CF 1096G -- Lucky Tickets codeforces.com/problemset/problem/1096/G Hard Count valid digit strings; reduce to polynomial squaring

Next lesson: NTT --- the same algorithm, but over a finite field. No floating point, exact modular arithmetic.