NTT and Modular Convolution

Pillar: Set --- "NTT works over a finite field --- the SET of integers mod a prime with a primitive root. No floating point, exact arithmetic."

The Setup

The FFT from Ch11 L3 works perfectly, but it has a flaw: floating-point arithmetic. When a problem asks for the answer modulo \(10^9 + 7\) or modulo \(998244353\), you compute in complex numbers, round, then take the modulus. For large polynomials or large coefficients, rounding errors creep in and your answers are wrong.

The fix is to replace complex numbers with modular arithmetic entirely. Instead of roots of unity on the complex unit circle, use roots of unity in a finite field \(\mathbb{Z}/p\mathbb{Z}\). The algorithm structure is identical --- same bit-reversal, same butterfly, same \(O(n \log n)\) complexity --- but every operation is exact integer arithmetic modulo a prime.

This is the Number Theoretic Transform (NTT).

The Key Idea: Roots of Unity in Finite Fields

In Ch11 L2, the \(n\)-th roots of unity were complex numbers \(\omega\) with \(\omega^n = 1\), evenly spaced on the unit circle. The FFT needs three properties: cancellation, halving, and orthogonality.

Those same three properties exist in modular arithmetic. If \(p\) is a prime, the multiplicative group \((\mathbb{Z}/p\mathbb{Z})^*\) has order \(p - 1\). A primitive root modulo \(p\) is an element \(g\) whose powers generate every nonzero element: \(g^1, g^2, \ldots, g^{p-1} = 1 \pmod{p}\).

If \(n\) divides \(p - 1\), then \(\omega = g^{(p-1)/n}\) is a primitive \(n\)-th root of unity modulo \(p\):

\(\omega^n = g^{p-1} = 1 \pmod{p}\) (by Fermat's little theorem)
\(\omega^k \neq 1\) for \(0 < k < n\) (because \(g\) is a primitive root)
The cancellation, halving, and summation properties all carry over, with every operation done modulo \(p\).

The entire FFT algorithm works unchanged. Replace complex<double> with long long, replace complex multiplication with modular multiplication, and replace \(e^{2\pi i / n}\) with \(g^{(p-1)/n} \bmod p\).

Why 998244353?

Not every prime works for NTT. You need \(n\) (a power of \(2\)) to divide \(p - 1\), so that primitive \(n\)-th roots of unity exist modulo \(p\).

\(998244353 = 119 \times 2^{23} + 1\).

Since \(p - 1 = 998244352 = 119 \times 2^{23}\), the factor of \(2^{23}\) means you can do NTT with \(n\) up to \(2^{23} = 8{,}388{,}608\). That's polynomials of degree up to \(\sim 8\) million. More than enough for competitive programming.

The primitive root of \(998244353\) is \(g = 3\). Verify: \(3\) generates the full multiplicative group modulo \(998244353\).

Other NTT-friendly primes:

Prime	Factorization of \(p - 1\)	Max \(n\)	Primitive root
\(998244353\)	\(119 \times 2^{23}\)	\(2^{23}\)	\(3\)
\(985661441\)	\(235 \times 2^{22}\)	\(2^{22}\)	\(3\)
\(754974721\)	\(45 \times 2^{24}\)	\(2^{24}\)	\(11\)
\(469762049\)	\(7 \times 2^{26}\)	\(2^{26}\)	\(3\)

\(998244353\) is the most common in competitive programming because \(2^{23}\) is large enough for typical constraints and \(3\) is a convenient primitive root.

Formal Definition: The Number Theoretic Transform

Given a prime \(p\) with primitive root \(g\), and \(n\) a power of \(2\) dividing \(p - 1\), define \(\omega = g^{(p-1)/n} \bmod p\).

The Number Theoretic Transform of a coefficient vector \((a_0, a_1, \ldots, a_{n-1})\) is:

\[y_k = \sum_{j=0}^{n-1} a_j \cdot \omega^{jk} \pmod{p}, \quad k = 0, 1, \ldots, n-1\]

The inverse NTT is:

\[a_j = n^{-1} \sum_{k=0}^{n-1} y_k \cdot \omega^{-jk} \pmod{p}, \quad j = 0, 1, \ldots, n-1\]

Here \(n^{-1}\) is the modular inverse of \(n\) modulo \(p\), computed as \(n^{p-2} \bmod p\) using the modpow function from Ch4 L2.

And \(\omega^{-1} = g^{(p-1) - (p-1)/n} \bmod p\), the modular inverse of the root.

Worked Example: NTT with \(p = 5\), \(n = 4\)

\(p = 5\), \(p - 1 = 4 = 2^2\). Primitive root: \(g = 2\) (since \(2^1 = 2, 2^2 = 4, 2^3 = 3, 2^4 = 1 \pmod{5}\)).

\(\omega = g^{(p-1)/n} = 2^{4/4} = 2 \pmod{5}\).

The four roots of unity: \(\omega^0 = 1, \omega^1 = 2, \omega^2 = 4, \omega^3 = 3 \pmod{5}\).

Check: \(\omega^4 = 2^4 = 16 \equiv 1 \pmod{5}\). Good.

Transform \((a_0, a_1, a_2, a_3) = (1, 2, 3, 4)\):

\(k\)	\(y_k = \sum_{j} a_j \omega^{jk} \bmod 5\)	Computation	Result
\(0\)	\(1 + 2 + 3 + 4\)	\(10 \bmod 5\)	\(0\)
\(1\)	\(1 + 2 \cdot 2 + 3 \cdot 4 + 4 \cdot 3\)	\(1 + 4 + 12 + 12 = 29 \bmod 5\)	\(4\)
\(2\)	\(1 + 2 \cdot 4 + 3 \cdot 1 + 4 \cdot 4\)	\(1 + 8 + 3 + 16 = 28 \bmod 5\)	\(3\)
\(3\)	\(1 + 2 \cdot 3 + 3 \cdot 4 + 4 \cdot 2\)	\(1 + 6 + 12 + 8 = 27 \bmod 5\)	\(2\)

NTT output: \((0, 4, 3, 2)\).

Inverse: \(n^{-1} = 4^{-1} \bmod 5 = 4\) (since \(4 \cdot 4 = 16 \equiv 1\)). \(\omega^{-1} = 2^{-1} \bmod 5 = 3\) (since \(2 \cdot 3 = 6 \equiv 1\)).

\[a_0 = 4 \cdot (0 + 4 + 3 + 2) \bmod 5 = 4 \cdot 9 \bmod 5 = 36 \bmod 5 = 1 \;\checkmark\]

The same computation recovers all four original coefficients. Exact. No rounding.

The Implementation

The NTT code is structurally identical to the iterative FFT from Ch11 L3. The differences:

complex<double> becomes long long.
The twiddle factor \(e^{2\pi i / \text{blockSize}}\) becomes \(g^{(p-1)/\text{blockSize}} \bmod p\).
Addition/subtraction/multiplication are modular.
Division by \(n\) in the inverse uses modular inverse.

const long long MOD = 998244353;
const long long PRIMITIVE_ROOT = 3;

long long modpow(long long base, long long exponent, long long mod) {
    long long result = 1;
    base %= mod;
    while (exponent > 0) {
        if (exponent & 1) result = result * base % mod;
        base = base * base % mod;
        exponent >>= 1;
    }
    return result;
}

void ntt(vector<long long>& coefficients, bool inverse) {
    int length = coefficients.size();
    for (int i = 1, j = 0; i < length; i++) {
        int bit = length >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) swap(coefficients[i], coefficients[j]);
    }
    for (int blockSize = 2; blockSize <= length; blockSize <<= 1) {
        long long rootOfUnity = inverse
            ? modpow(PRIMITIVE_ROOT, MOD - 1 - (MOD - 1) / blockSize, MOD)
            : modpow(PRIMITIVE_ROOT, (MOD - 1) / blockSize, MOD);
        for (int blockStart = 0; blockStart < length; blockStart += blockSize) {
            long long currentRoot = 1;
            for (int offset = 0; offset < blockSize / 2; offset++) {
                long long upper = coefficients[blockStart + offset];
                long long lower = coefficients[blockStart + offset + blockSize / 2] * currentRoot % MOD;
                coefficients[blockStart + offset] = (upper + lower) % MOD;
                coefficients[blockStart + offset + blockSize / 2] = (upper - lower + MOD) % MOD;
                currentRoot = currentRoot * rootOfUnity % MOD;
            }
        }
    }
    if (inverse) {
        long long inverseLength = modpow(length, MOD - 2, MOD);
        for (auto& value : coefficients) {
            value = value * inverseLength % MOD;
        }
    }
}

The inverse root computation deserves a closer look. The forward root is \(g^{(p-1)/\text{blockSize}}\). The inverse root is its modular inverse: \(g^{-(p-1)/\text{blockSize}} = g^{(p-1) - (p-1)/\text{blockSize}}\). That's what MOD - 1 - (MOD - 1) / blockSize computes.

Polynomial Multiplication with NTT

vector<long long> multiplyPolynomials(const vector<long long>& polyA, const vector<long long>& polyB) {
    vector<long long> transformA(polyA.begin(), polyA.end());
    vector<long long> transformB(polyB.begin(), polyB.end());
    int resultSize = polyA.size() + polyB.size() - 1;
    int paddedLength = 1;
    while (paddedLength < resultSize) paddedLength <<= 1;
    transformA.resize(paddedLength);
    transformB.resize(paddedLength);
    ntt(transformA, false);
    ntt(transformB, false);
    for (int i = 0; i < paddedLength; i++) {
        transformA[i] = transformA[i] * transformB[i] % MOD;
    }
    ntt(transformA, true);
    transformA.resize(resultSize);
    return transformA;
}

Identical pipeline to the FFT version: pad, forward transform, pointwise multiply, inverse transform. But every answer is exact modulo \(998244353\). No floating-point errors, no rounding.

Complexity: \(O(n \log n)\) time, \(O(n)\) space.

What If the Problem Uses a Different Modulus?

Many problems use \(10^9 + 7\) as the modulus. Since \(10^9 + 6 = 2 \times 500000003\), it only has a factor of \(2^1\), so \(n\) can only be \(2\). Useless for NTT.

Solution: Chinese Remainder Theorem (CRT).

Pick two or three NTT-friendly primes whose product exceeds the maximum possible coefficient value (before taking the final modulus). Perform NTT multiplication modulo each prime separately, then reconstruct the true coefficient using CRT, and finally reduce modulo the target.

For example, \(998244353 \times 985661441 \approx 9.8 \times 10^{17}\). If coefficients before modding can be at most \(\sim 10^{18}\), two primes suffice. For larger values, use three primes.

The CRT reconstruction for two primes \(p_1, p_2\):

Given \(c \equiv r_1 \pmod{p_1}\) and \(c \equiv r_2 \pmod{p_2}\), the unique \(c\) modulo \(p_1 p_2\) is:

\[c = r_1 + p_1 \cdot ((r_2 - r_1) \cdot p_1^{-1} \bmod p_2)\]

Then reduce \(c\) modulo the target prime \(q\).

This costs \(3\times\) the NTT work but gives exact results modulo any prime.

Subset Sum Convolution: A Preview

Standard polynomial multiplication handles "additive convolution" --- combine contributions that add up to \(k\). But some problems need OR convolution or subset sum convolution over bitmasks.

The connection to Ch1 L3 (if the course covers bitmask DP): given two functions \(f\) and \(g\) over subsets of \(\{0, 1, \ldots, m-1\}\), the subset sum convolution is:

\[(f *_{\text{subset}} g)(S) = \sum_{T \subseteq S} f(T) \cdot g(S \setminus T)\]

This isn't a standard polynomial multiplication --- you can't just NTT and pointwise multiply. The trick involves "ranking" by popcount: stratify \(f\) and \(g\) by the number of bits, perform additive convolution at each rank, then recombine. Each rank uses NTT.

The full technique is beyond this lesson's scope, but the takeaway is that NTT is the computational backbone. Once you have fast modular convolution, you can build subset sum convolution, AND/OR convolution, and other transforms on top of it.

NTT vs FFT: When to Use Which

Criterion	FFT (complex)	NTT (modular)
Precision	Floating-point, ~\(15\) digits	Exact modular arithmetic
Modular answers	Compute, round, then mod (risky)	Native --- all operations mod \(p\)
Speed	Faster per operation (hardware FP)	Slightly slower (modular multiply)
Max poly size	Limited by FP precision	Limited by \(p - 1\)'s power-of-\(2\) factor
Arbitrary modulus	Works directly (then round and mod)	Needs CRT for non-NTT primes

Rule of thumb: if the problem says "output modulo \(998244353\)," use NTT. If the problem says "output modulo \(10^9 + 7\)," use NTT with CRT. If the problem has no modulus and coefficients are small, FFT with double is fine.

Common Mistakes

Using the wrong primitive root. \(3\) is the primitive root for \(998244353\), not for every prime. For a different NTT prime, you need to find its primitive root. Using the wrong one means \(\omega^n \neq 1 \pmod{p}\) and the transform is nonsense.
Forgetting + MOD before taking mod of a subtraction. In the butterfly, upper - lower can be negative. Always write (upper - lower + MOD) % MOD. This is the single most common NTT bug.
Exceeding the maximum transform size. For \(p = 998244353\), the maximum power-of-\(2\) dividing \(p - 1\) is \(2^{23}\). If your padded polynomial length exceeds \(2^{23}\), the primitive root of the required order doesn't exist modulo this prime. You'd need a different prime or a different approach.
Not reducing input coefficients modulo \(p\) before NTT. If input coefficients can be negative or exceed \(p\), reduce them first. The NTT assumes all values are in \([0, p)\).

Quick Recap

NTT is FFT over a finite field \(\mathbb{Z}/p\mathbb{Z}\). Same algorithm, same \(O(n \log n)\) complexity, but exact modular arithmetic instead of floating point.
It requires a prime \(p\) where \(p - 1\) has a large power-of-\(2\) factor, and a primitive root \(g\) modulo \(p\).
\(998244353 = 119 \times 2^{23} + 1\) with primitive root \(3\) is the standard choice in competitive programming.
For problems with a non-NTT-friendly modulus (like \(10^9 + 7\)), use CRT: perform NTT modulo two or three NTT-friendly primes and reconstruct.
NTT is the computational backbone for modular convolution, subset sum convolution, and many advanced polynomial techniques.
Bitwise convolution (XOR, OR, AND) is handled by FWHT --- a different transform with a different butterfly. It gets its own lesson next (Ch11 L5).
FWHT extends the transform-multiply-invert paradigm to bitwise convolutions (XOR, OR, AND) in \(O(M \log M)\). Each operation has its own butterfly rule: XOR uses \(L' = L+R, R' = L-R\); OR uses \(R' = L+R\); AND uses \(L' = L+R\). The OR forward transform is equivalent to Sum Over Subsets DP.

Problems

Problem	Link	Difficulty	Focus
CSES -- Polynomial Multiplication	cses.fi/problemset/task/2111	Medium	NTT or FFT, straightforward application
CF 1251F -- Red-White Fence	codeforces.com/problemset/problem/1251/F	Hard	Polynomial multiplication with combinatorial setup, mod arithmetic
CF 632E -- Thief in a Shop	codeforces.com/problemset/problem/632/E	Hard	Polynomial exponentiation via NTT, knapsack as convolution

This concludes Chapter 11: Polynomials and Transforms. You now have the full pipeline --- encoding counting problems as polynomials (L1), understanding the DFT and why roots of unity make it invertible (L2), implementing FFT for \(O(n \log n)\) multiplication (L3), and NTT for exact modular convolution (L4).