Probability Basics for CP

Pillar: Set — "A probability distribution is a SET of outcomes, each with a weight. DP on probability = DP on distributions."

The Setup

You roll two 6-sided dice. What's the probability the sum is exactly \(7\)?

You could list all \(36\) pairs and count the ones that sum to \(7\): \((1,6), (2,5), (3,4), (4,3), (5,2), (6,1)\). That's \(6\) out of \(36\), so \(1/6\). Easy for two dice. But what if you roll \(100\) dice and need the probability the sum falls between \(300\) and \(400\)? Brute-force enumeration is hopeless. You need DP.

The trick: treat the distribution of possible sums as a set — specifically, a mapping from each possible sum value to its probability. After rolling \(i\) dice, you have a distribution \(dp[i]\). Rolling one more die transforms \(dp[i]\) into \(dp[i+1]\) by spreading each entry across the face values. This is DP on distributions, and it's the backbone of probability in competitive programming.

Sample Space and Events

The sample space \(\Omega\) is the set of all possible outcomes of an experiment. For one 6-sided die, \(\Omega = \{1, 2, 3, 4, 5, 6\}\). For two dice, \(\Omega = \{(a, b) : a, b \in \{1, \ldots, 6\}\}\), which has \(|\Omega| = 36\) elements.

An event is a subset of \(\Omega\). "The sum is \(7\)" is the event \(\{(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)\}\).

For a uniform distribution (all outcomes equally likely):

\[P(\text{event}) = \frac{|\text{favorable outcomes}|}{|\Omega|}\]

This is the counting formula you already know from combinatorics. Probability is just counting with a denominator.

Conditional Probability

Sometimes you learn partial information. You rolled two dice and someone tells you the first die is a \(3\). What's the probability the sum is \(7\)?

The answer isn't \(6/36\) anymore. You've restricted the sample space to outcomes where the first die is \(3\) — that's \(\{(3,1), (3,2), (3,3), (3,4), (3,5), (3,6)\}\), six outcomes. Among those, only \((3,4)\) gives sum \(7\). So the probability is \(1/6\).

Conditional probability formalizes this. For events \(A\) and \(B\) with \(P(B) > 0\):

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}\]

In the example: \(A\) = "sum is \(7\)", \(B\) = "first die is \(3\)". \(P(A \cap B) = P(\{(3,4)\}) = 1/36\), \(P(B) = 6/36 = 1/6\). So \(P(A \mid B) = (1/36)/(1/6) = 1/6\).

Think of conditioning as shrinking the sample space to only the outcomes in \(B\), then re-measuring \(A\) within that smaller space.

The Chain Rule: Chaining Conditional Probabilities

What if you have a sequence of events and each one depends on the previous ones? The chain rule (also called the multiplication rule) lets you compute the joint probability by chaining conditionals:

\[P(A \cap B) = P(A) \cdot P(B \mid A)\]

For three events:

\[P(A \cap B \cap C) = P(A) \cdot P(B \mid A) \cdot P(C \mid A \cap B)\]

In general, for \(n\) events:

\[P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2 \mid A_1) \cdot P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1})\]

Example: Drawing without replacement. A bag has 5 red and 3 blue balls. You draw two balls without replacement. What's the probability both are red?

\[P(\text{both red}) = P(R_1) \cdot P(R_2 \mid R_1) = \frac{5}{8} \cdot \frac{4}{7} = \frac{20}{56} = \frac{5}{14}\]

The second draw depends on what happened in the first — the sample space shrinks. The chain rule handles this by making each term conditional on everything before it.

When does the chain rule simplify?

When events are independent, each conditional collapses: \(P(B \mid A) = P(B)\). The chain rule becomes \(P(A \cap B) = P(A) \cdot P(B)\) — the multiplication rule for independent events. Independence is the special case where conditioning doesn't change anything. In CP, coin flips and dice rolls are independent (chain rule simplifies), but drawing without replacement or random walks on graphs are dependent (you need the full chain).

In CP, the chain rule appears whenever you're computing the probability of a path through a decision tree — each edge is a conditional probability, and the path probability is the product of all edges.

Bayes' Theorem: Reversing the Condition

The conditional probability formula gives you \(P(A \mid B)\). But what if you know \(P(B \mid A)\) and want \(P(A \mid B)\)? Bayes' theorem flips the direction:

\[P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}\]

Where does this come from?

Start with two ways to write \(P(A \cap B)\):

\[P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)\]

Set them equal and solve for \(P(A \mid B)\). That's the entire derivation — one line.

Example: Medical test. A disease affects 1% of the population. A test detects the disease 99% of the time (true positive), but has a 5% false positive rate. You test positive. What's the probability you actually have the disease?

Let \(D\) = has disease, \(+\) = tests positive.

\(P(D) = 0.01\), \(P(\neg D) = 0.99\)
\(P(+ \mid D) = 0.99\), \(P(+ \mid \neg D) = 0.05\)

First compute \(P(+)\) using the law of total probability:

\[P(+) = P(+ \mid D) \cdot P(D) + P(+ \mid \neg D) \cdot P(\neg D) = 0.99 \times 0.01 + 0.05 \times 0.99 = 0.0594\]

Then Bayes:

\[P(D \mid +) = \frac{0.99 \times 0.01}{0.0594} \approx 0.167\]

Only about 17% — even with a positive test, you probably don't have the disease. The base rate (1%) is so low that the false positives overwhelm the true positives.

Does Bayes' theorem show up in competitive programming?

Directly, rarely — most CP probability problems use linearity of expectation or DP instead. But the underlying pattern appears whenever you need to "update beliefs given evidence": hidden Markov models in string problems, posterior probabilities in randomized algorithms, and the "law of total probability" decomposition (\(P(B) = \sum_i P(B \mid A_i) P(A_i)\)) shows up constantly when computing expected values by conditioning on cases. The law of total probability is often more useful than Bayes itself — it lets you split a hard probability into easy conditional pieces.

Independence

Two events \(A\) and \(B\) are independent if knowing one gives you no information about the other. Formally:

\[P(A \cap B) = P(A) \cdot P(B)\]

Equivalently, \(P(A \mid B) = P(A)\) — conditioning on \(B\) doesn't change the probability of \(A\).

Dice rolls are independent: the outcome of the first die doesn't affect the second. So \(P(\text{first die} = 3 \text{ and second die} = 4) = (1/6)(1/6) = 1/36\).

Independence matters in CP because it tells you when you can multiply probabilities. If you're drawing cards without replacement, successive draws are NOT independent — you need conditional probability. If you're flipping coins, each flip IS independent — you can multiply directly.

DP on Probability Distributions

Here's where the Set pillar comes alive. After rolling \(i\) dice, the distribution of possible sums is a set of (sum, probability) pairs. Represent it as an array: \(dp[i][s]\) = probability that the sum equals \(s\) after \(i\) dice.

Base case. Before rolling any dice, the sum is \(0\) with certainty: \(dp[0][0] = 1\).

Transition. When you roll die \(i\), each face value \(f \in \{1, \ldots, F\}\) appears with probability \(1/F\) (for an \(F\)-sided die). Each existing sum \(s\) from \(dp[i-1]\) spawns \(F\) new sums:

\[dp[i][s + f] \mathrel{+}= dp[i-1][s] \cdot \frac{1}{F}\]

This is a convolution: you're combining the distribution \(dp[i-1]\) with the uniform distribution over \(\{1, \ldots, F\}\). Each die roll adds another layer to the convolution, spreading and flattening the distribution.

Trace Table: 2 Dice, 6 Faces

After die 1: Each face appears with probability \(1/6\).

Sum \(s\)	\(dp[1][s]\)
\(1\)	\(1/6\)
\(2\)	\(1/6\)
\(3\)	\(1/6\)
\(4\)	\(1/6\)
\(5\)	\(1/6\)
\(6\)	\(1/6\)

After die 2: For each entry in \(dp[1]\), spread it across \(6\) faces.

Take \(dp[1][1] = 1/6\). Adding faces \(1\) through \(6\) gives sums \(2\) through \(7\), each receiving \(1/6 \times 1/6 = 1/36\).

Take \(dp[1][2] = 1/6\). Adding faces \(1\) through \(6\) gives sums \(3\) through \(8\), each receiving \(1/36\).

Repeat for all entries in \(dp[1]\) and sum the contributions:

Sum \(s\)	\(dp[2][s]\)	As fraction
\(2\)	\(1/36\)	\(0.0278\)
\(3\)	\(2/36\)	\(0.0556\)
\(4\)	\(3/36\)	\(0.0833\)
\(5\)	\(4/36\)	\(0.1111\)
\(6\)	\(5/36\)	\(0.1389\)
\(7\)	\(6/36\)	\(0.1667\)
\(8\)	\(5/36\)	\(0.1389\)
\(9\)	\(4/36\)	\(0.1111\)
\(10\)	\(3/36\)	\(0.0833\)
\(11\)	\(2/36\)	\(0.0556\)
\(12\)	\(1/36\)	\(0.0278\)

The distribution is triangular — peaked at \(7\), symmetric around the midpoint. Sum of all probabilities: \(36/36 = 1\). Check passed.

The probability of sum in \([6, 8]\) is \(5/36 + 6/36 + 5/36 = 16/36 \approx 0.4444\).

Representing Distributions in Code

The distribution after \(i\) dice is an array indexed by sum value. If each die has \(F\) faces and you roll \(n\) dice, the possible sums range from \(n\) (all ones) to \(nF\) (all max). The array has \(n(F - 1) + 1\) entries.

This representation works when the state space is small enough to enumerate. For \(100\) six-sided dice, the sum ranges from \(100\) to \(600\) — only \(501\) values. That's tiny.

When the state space is too large to enumerate (e.g., the sum could be up to \(10^{18}\)), you can't store the full distribution. In that case, switch to tracking just the expected value — a single number that summarizes the distribution. That's what Ch7 L2 covers with linearity of expectation.

Modular Probability

In competitive programming, problems often ask for the probability \(p/q\) as \(p \cdot q^{-1} \bmod (10^9 + 7)\), where \(q^{-1}\) is the modular inverse.

The idea: instead of tracking floating-point probabilities, count the number of favorable outcomes (an integer) and divide by the total outcomes at the end — all in modular arithmetic.

For the dice problem, let \(\text{ways}[i][s]\) = number of ways to get sum \(s\) after \(i\) dice (integers, not probabilities). The total number of outcomes after \(n\) dice is \(F^n\). The probability of sum in \([a, b]\) is:

\[\frac{\sum_{s=a}^{b} \text{ways}[n][s]}{F^n} \bmod (10^9 + 7)\]

Compute the numerator modulo \(10^9 + 7\). Compute \(F^n \bmod (10^9 + 7)\) using binary exponentiation. Then multiply the numerator by the modular inverse of the denominator, using modpow from Ch4 L3.

This avoids all floating-point issues. No precision loss, no rounding errors. The DP transition becomes:

\[\text{ways}[i][s + f] \mathrel{+}= \text{ways}[i-1][s] \pmod{10^9 + 7}\]

Same structure as before, just with integers and a modulus instead of doubles.

The Code

Floating-point version for problems that ask for a decimal answer (like CSES Dice Probability):

#include <bits/stdc++.h>
using namespace std;

int main() {
    int numDice, numFaces;
    cin >> numDice >> numFaces;
    int targetLow, targetHigh;
    cin >> targetLow >> targetHigh;

    int maxSum = numDice * numFaces;
    vector<vector<double>> probability(numDice + 1, vector<double>(maxSum + 1, 0.0));
    probability[0][0] = 1.0;

    double perFaceProbability = 1.0 / numFaces;

    for (int die = 1; die <= numDice; die++) {
        for (int prevSum = 0; prevSum <= (die - 1) * numFaces; prevSum++) {
            if (probability[die - 1][prevSum] == 0.0) continue;
            for (int face = 1; face <= numFaces; face++) {
                probability[die][prevSum + face] += probability[die - 1][prevSum] * perFaceProbability;
            }
        }
    }

    double totalProbability = 0.0;
    for (int sumValue = targetLow; sumValue <= targetHigh; sumValue++) {
        totalProbability += probability[numDice][sumValue];
    }
    cout << fixed << setprecision(6) << totalProbability << "\n";
    return 0;
}

Modular version for problems that ask for \(p \cdot q^{-1} \bmod 10^9 + 7\):

#include <bits/stdc++.h>
using namespace std;

const long long MOD = 1e9 + 7;

long long modpow(long long base, long long exponent, long long modulus) {
    long long result = 1;
    base %= modulus;
    while (exponent > 0) {
        if (exponent & 1) result = result * base % modulus;
        base = base * base % modulus;
        exponent >>= 1;
    }
    return result;
}

long long modinv(long long value, long long modulus) {
    return modpow(value, modulus - 2, modulus);
}

int main() {
    int numDice, numFaces;
    cin >> numDice >> numFaces;
    int targetLow, targetHigh;
    cin >> targetLow >> targetHigh;

    int maxSum = numDice * numFaces;
    vector<vector<long long>> ways(numDice + 1, vector<long long>(maxSum + 1, 0));
    ways[0][0] = 1;

    for (int die = 1; die <= numDice; die++) {
        for (int prevSum = 0; prevSum <= (die - 1) * numFaces; prevSum++) {
            if (ways[die - 1][prevSum] == 0) continue;
            for (int face = 1; face <= numFaces; face++) {
                ways[die][prevSum + face] = (ways[die][prevSum + face] + ways[die - 1][prevSum]) % MOD;
            }
        }
    }

    long long favorableWays = 0;
    for (int sumValue = targetLow; sumValue <= targetHigh; sumValue++) {
        favorableWays = (favorableWays + ways[numDice][sumValue]) % MOD;
    }

    long long totalOutcomes = modpow(numFaces, numDice, MOD);
    long long answer = favorableWays % MOD * modinv(totalOutcomes, MOD) % MOD;
    cout << answer << "\n";
    return 0;
}

Complexity

\(O(n \times n F \times F)\) = \(O(n^2 F^2)\) time, \(O(n F)\) space (can be optimized to two rows).

Common Mistakes

Using floating-point when the problem asks for modular output. If the problem says "output \(p \cdot q^{-1} \bmod 10^9 + 7\)," you need the integer DP counting ways, not the double DP computing probabilities. Mixing these up gives wrong answers.
Off-by-one on the sum range. The minimum sum after \(n\) dice is \(n\) (not \(0\)), and the maximum is \(nF\). If you initialize your DP bounds wrong or iterate outside \([n, nF]\) when summing the answer, you'll read uninitialized values or miss valid sums.
Forgetting the base case. The DP starts at \(dp[0][0] = 1\) (or \(\text{ways}[0][0] = 1\)). If you start at \(dp[1][\cdot]\) and try to build from there, you'll miss the proper initialization and the entire distribution will be zero.

Quick Recap

The sample space \(\Omega\) is the set of all outcomes. An event is a subset. For uniform distributions, \(P(\text{event}) = |\text{event}|/|\Omega|\).
Conditional probability shrinks the sample space: \(P(A \mid B) = P(A \cap B) / P(B)\). Independence means \(P(A \cap B) = P(A) \cdot P(B)\).
DP on probability distributions: the state is a distribution (array of probabilities indexed by outcome), and each step convolves it with a transition distribution.
For modular probability, count integer ways instead of tracking doubles, then divide by the total outcomes using modular inverse at the end.
When the state space is too large for a full distribution, switch to expected value (Ch7 L2).

Problems

Problem	Link	Difficulty	Focus
CSES — Dice Probability	cses.fi/problemset/task/1725	Medium	Direct DP on probability distributions; floating-point output
LC 1227 — Airplane Seat Assignment	leetcode.com/problems/airplane-seat-assignment-probability	Medium	Probability recurrence; recognizing the pattern collapses to \(1/2\) for \(n \ge 2\)