Feature Flags and A/B Testing Infrastructure

TL;DR

A feature flag is a runtime boolean that controls which code path executes. Deploy code with the flag off, then enable it for 1% of users, then 10%, then 100%. If something breaks, disable the flag instantly -- no deploy needed. Sticky assignment uses hash(user_id + flag_name) % 100 to ensure the same user always sees the same variant. A/B testing extends this with statistical rigor: experiment framework, metrics collection, significance testing, and automated guardrails. LaunchDarkly, Statsig, and Unleash are the major platforms. The operational pattern -- deploy dark, enable gradually, kill instantly -- is how every serious engineering organization ships features.

The Problem

You are shipping a new recommendation algorithm. The old algorithm works. The new one is theoretically better but untested in production. Three options:

Option 1: Big-bang deploy. Merge to main, deploy to all users. If the new algorithm crashes or recommends garbage, all users are affected. Rollback requires a new deploy (5-30 minutes).

Option 2: Code branch. Maintain two separate codebases -- one with the old algorithm, one with the new. Deploy the new branch to a staging environment. Test. Then do a big-bang deploy to production. This is just Option 1 with a slower feedback loop.

Option 3: Feature flag. Deploy the new code behind a flag. Enable for 1% of users in production. Monitor metrics. If recommendations improve, ramp to 10%, 50%, 100%. If anything breaks, disable the flag in seconds. Both code paths exist in the same deploy. No separate branches. No risky big-bang.

Option 3 is the approach used by teams that want to ship fast without breaking production.

The Algorithm

Basic Feature Flag

def get_recommendations(user):
    if feature_flags.is_enabled("new_reco_algorithm", user):
        return new_recommendation_engine(user)
    else:
        return old_recommendation_engine(user)

The flag evaluation must be:
- Fast: <1ms. Feature flags are checked millions of times per second.
- Deterministic: Same user always gets the same result for the same flag.
- Configurable: Change the flag without deploying code.

Flag Types

Type	What It Controls	Example
Kill switch	On/off for a feature	Disable payment processing
Percentage rollout	Enable for X% of users	Show new UI to 5% of users
User targeting	Enable for specific users/segments	Enable for beta testers only
A/B experiment	Assign users to variants with metrics	Compare conversion rate of two UIs

Sticky Assignment

The critical requirement: the same user must always see the same variant. If a user sees the new UI on one page load and the old UI on the next, the experience is confusing and the A/B test data is contaminated.

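import hashlib

def is_enabled(flag_name, user_id, rollout_percentage):
    # Python's built-in hash() is salted per process, so use a stable hash.
    # SHA-256 keeps this example stdlib-only; production systems usually
    # pick a faster non-cryptographic hash such as MurmurHash3.
    digest = hashlib.sha256(f"{user_id}:{flag_name}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # deterministic, in [0, 99]

    # Same user + same flag = same bucket = same result
    return bucket < rollout_percentage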

Why flag_name is included in the hash: Without it, the same 5% of users would be in every experiment. By including flag_name, each experiment gets a different random 5%. This prevents correlation between experiments.

Worked Example: Gradual Rollout

Flag: "new_checkout_flow"
Rollout: 0% → 5% → 25% → 50% → 100%

Day 1: rollout = 0%
  All users: no bucket is < 0, so hash(user_id + "new_checkout_flow") % 100 < 0 is never true. Old flow.

Day 2: rollout = 5%
  Users with bucket < 5: new flow (e.g., user "alice" → bucket 3 → new flow)
  Users with bucket >= 5: old flow (e.g., user "bob" → bucket 47 → old flow)

Day 3: rollout = 25%
  User "alice" still bucket 3. Still in new flow. Consistent.
  Users with bucket < 25: new flow. Includes all users from 5% group.

Day 5: rollout = 50%. Monitor conversion rate.
Day 7: rollout = 100%. Old code can be removed.

Key property: Users who were in the 5% group are still in the 25% group and the 50% group. The rollout only adds users, never removes them. This holds because the check is bucket < rollout_percentage: raising the percentage from 5% to 25% keeps all buckets in [0, 5) (already in) and adds [5, 25) (newly added).
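The nesting property can be checked directly. The following is a minimal sketch, assuming a SHA-256-based bucket function (an illustrative stdlib choice, not necessarily what a given flag platform uses) and hypothetical user IDs. It verifies that everyone enabled at 5% is still enabled at 25% and 50%.

```python
import hashlib

def bucket(user_id: str, flag_name: str) -> int:
    """Stable bucket in [0, 99]; SHA-256 is an illustrative stdlib choice."""
    digest = hashlib.sha256(f"{user_id}:{flag_name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def enabled_users(users, flag_name, rollout_percentage):
    """Return the set of users whose bucket falls below the rollout percentage."""
    return {u for u in users if bucket(u, flag_name) < rollout_percentage}

users = [f"user_{i}" for i in range(10_000)]  # hypothetical user IDs
at_5 = enabled_users(users, "new_checkout_flow", 5)
at_25 = enabled_users(users, "new_checkout_flow", 25)
at_50 = enabled_users(users, "new_checkout_flow", 50)

# Raising the percentage only adds users: each group contains the previous one.
print(at_5 <= at_25 <= at_50)
print(len(at_5), len(at_25), len(at_50))
```

The group sizes will be close to, but not exactly, 5%, 25%, and 50% of the population, because bucket assignment is pseudo-random.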

Kill Switch Pattern

def process_payment(order):
    if feature_flags.is_enabled("payment_processing"):
        return payment_gateway.charge(order)
    else:
        return {"error": "Payment processing is temporarily unavailable"}

When the payment gateway has an incident:

1. On-call engineer opens feature flag dashboard.
2. Disables "payment_processing" flag.
3. All payment attempts immediately return a friendly error.
4. No deploy needed. Takes effect in seconds.

This is faster and safer than deploying a code change. The flag state is fetched from the flag service, not embedded in the binary, so the change reaches each instance as soon as its local copy of the flag definitions refreshes.
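To make the "state lives outside the binary" point concrete, here is a small sketch. FlagStore, its set method, and fake_charge are hypothetical stand-ins for the SDK's cached flag state and the payment gateway; none of them come from a real library.

```python
class FlagStore:
    """Hypothetical in-memory stand-in for the SDK's cached flag state."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def set(self, name: str, enabled: bool) -> None:
        # In a real system this update would arrive from the flag service.
        self._flags[name] = enabled

    def is_enabled(self, name: str, default: bool = False) -> bool:
        # Unknown flags fall back to an explicit default.
        return self._flags.get(name, default)

feature_flags = FlagStore({"payment_processing": True})

def process_payment(order, charge):
    """`charge` is a placeholder for the payment gateway call."""
    if feature_flags.is_enabled("payment_processing", default=True):
        return charge(order)
    return {"error": "Payment processing is temporarily unavailable"}

def fake_charge(order):
    return {"status": "charged", "order_id": order["id"]}

print(process_payment({"id": 1}, fake_charge))   # flag on: gateway is called
feature_flags.set("payment_processing", False)   # on-call flips the switch
print(process_payment({"id": 1}, fake_charge))   # flag off: friendly error
```

The default of True is a design choice for this sketch: for a kill switch that guards an existing feature, a missing flag definition arguably should not take the feature down.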

A/B Testing Infrastructure

Experiment Framework

A/B testing extends feature flags with statistical rigor:

Experiment: "New checkout button color"
  Variant A (control): Blue button (50% of users)
  Variant B (treatment): Green button (50% of users)
  Primary metric: Conversion rate (purchases / checkout page views)
  Guardrail metrics: Error rate, page load time, revenue per user
  Duration: 2 weeks
  Required sample size: 10,000 per variant (calculated from desired statistical power)

Assignment

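import hashlib

def get_variant(experiment_name, user_id):
    # Stable hash (not Python's per-process salted hash()) so every
    # server assigns the same user to the same variant.
    digest = hashlib.sha256(f"{user_id}:{experiment_name}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100

    if bucket < 50:
        return "control"    # Variant A
    else:
        return "treatment"  # Variant B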
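The same bucketing idea extends to more than two variants or to uneven splits. The sketch below is one possible generalization, not something the 50/50 example requires. The weights argument (variant name mapped to an integer percentage) is an assumption introduced here, and SHA-256 is again only an illustrative stable hash.

```python
import hashlib

def get_variant(experiment_name: str, user_id: str, weights: dict) -> str:
    """Map a stable bucket in [0, 99] onto cumulative variant weights.

    `weights` maps variant name -> integer percentage; values must sum to 100.
    """
    if sum(weights.values()) != 100:
        raise ValueError("variant weights must sum to 100")
    digest = hashlib.sha256(f"{user_id}:{experiment_name}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    upper = 0
    for variant, weight in weights.items():
        upper += weight
        if bucket < upper:
            return variant
    raise AssertionError("unreachable: bucket is always below 100")

weights = {"control": 34, "treatment_a": 33, "treatment_b": 33}
print(get_variant("checkout_color", "u123", weights))
```

Because the ranges are cumulative, the order of entries in weights matters: reordering them mid-experiment would reassign users.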

Metrics Collection

For each user-experiment-variant assignment:

Event log:
  {user_id: "u123", experiment: "checkout_color", variant: "B",
   event: "page_view", timestamp: "2024-01-15T10:00:00Z"}
  {user_id: "u123", experiment: "checkout_color", variant: "B",
   event: "purchase", timestamp: "2024-01-15T10:02:30Z"}

Aggregate:

Variant A: 10,000 views, 340 purchases → 3.40% conversion
Variant B: 10,000 views, 380 purchases → 3.80% conversion

Difference: +0.40 percentage points
Is this statistically significant? (not just random noise?)
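The aggregation step might look like the following sketch. The handful of events are made-up illustration data in the same shape as the event log above (timestamps and experiment name omitted for brevity), and the function counts events rather than unique users, which is a simplification.

```python
from collections import defaultdict

events = [
    {"user_id": "u1", "variant": "A", "event": "page_view"},
    {"user_id": "u1", "variant": "A", "event": "purchase"},
    {"user_id": "u2", "variant": "A", "event": "page_view"},
    {"user_id": "u3", "variant": "B", "event": "page_view"},
    {"user_id": "u3", "variant": "B", "event": "purchase"},
    {"user_id": "u4", "variant": "B", "event": "page_view"},
    {"user_id": "u4", "variant": "B", "event": "purchase"},
]

def conversion_by_variant(events):
    """Return {variant: (views, purchases, conversion_rate)}."""
    counts = defaultdict(lambda: {"page_view": 0, "purchase": 0})
    for e in events:
        if e["event"] in ("page_view", "purchase"):
            counts[e["variant"]][e["event"]] += 1
    result = {}
    for variant, c in counts.items():
        views, purchases = c["page_view"], c["purchase"]
        rate = purchases / views if views else 0.0
        result[variant] = (views, purchases, rate)
    return result

for variant, (views, purchases, rate) in sorted(conversion_by_variant(events).items()):
    print(f"Variant {variant}: {views} views, {purchases} purchases, {rate:.2%} conversion")
```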

Statistical Significance

Use a two-proportion z-test (or chi-squared test):

H0: conversion rate A = conversion rate B (no difference)
H1: conversion rate A ≠ conversion rate B

p_A = 340/10000 = 0.034
p_B = 380/10000 = 0.038
p_pooled = 720/20000 = 0.036

SE = sqrt(p_pooled * (1 - p_pooled) * (1/n_A + 1/n_B))
   = sqrt(0.036 * 0.964 * (1/10000 + 1/10000))
   = 0.00263

z = (p_B - p_A) / SE = (0.038 - 0.034) / 0.00263 = 1.52

p-value ≈ 0.129 (two-tailed)

At significance level 0.05: p-value > 0.05. NOT significant.
We cannot conclude the green button is better. Need more data or larger effect.
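The arithmetic above can be reproduced with the standard library alone. This is a minimal sketch of the pooled two-proportion z-test; the function name and signature are chosen here for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-tailed p-value) for H0: rate A == rate B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-tailed p-value under the standard normal:
    # P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p = two_proportion_z_test(340, 10_000, 380, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```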

Sample size planning: Before running an experiment, calculate the required sample size based on:
- Minimum detectable effect (e.g., 0.5 percentage points improvement)
- Baseline conversion rate (e.g., 3.4%)
- Desired power (typically 80%)
- Significance level (typically 5%)

This determines how long the experiment must run. Running an experiment for too short a time leads to underpowered results. Running for too long wastes time.
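As a rough illustration of the calculation, the sketch below uses one common normal-approximation formula for comparing two proportions, n = (z_{alpha/2} + z_{beta})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2. Experimentation platforms may use a different variant of this formula, so treat the result as an estimate.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-sided two-proportion test.

    baseline: control conversion rate (e.g. 0.034)
    mde: minimum detectable effect as an absolute difference (e.g. 0.005)
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

n = sample_size_per_variant(baseline=0.034, mde=0.005)
print(n)
```

With the example inputs from the list above (3.4% baseline, 0.5 percentage point effect, 80% power, 5% significance), this formula gives roughly 22,000 users per variant. Dividing that by the daily eligible traffic per variant gives a minimum run time.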

Guardrails

Automated guardrails prevent experiments from causing harm:

Guardrail checks (run every hour):
  1. Error rate for variant B > 2x control → auto-disable experiment
  2. P99 latency for variant B > 500ms → alert on-call
  3. Revenue per user for variant B < 90% of control → auto-disable
  4. Crash rate for variant B > 1% → auto-disable

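Guardrails use one-sided tests, since only a change in the harmful direction matters. Because the checks run every hour, each extra look is another chance for a false alarm, so the per-check threshold is typically stricter than 0.05 (e.g., p < 0.01 for an error rate increase). The goal is to catch large regressions fast, even at the cost of occasional false alarms.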
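The four checks listed above could be encoded as in the sketch below. VariantStats and the action names are hypothetical, and the sketch compares raw point estimates against the thresholds; it leaves out the significance test that a production guardrail would apply on top.

```python
from dataclasses import dataclass

@dataclass
class VariantStats:
    error_rate: float        # fraction of requests that failed
    p99_latency_ms: float
    revenue_per_user: float
    crash_rate: float        # fraction of sessions that crashed

def check_guardrails(control: VariantStats, treatment: VariantStats) -> list:
    """Return a list of (action, reason) pairs for every violated guardrail."""
    actions = []
    if treatment.error_rate > 2 * control.error_rate:
        actions.append(("auto_disable", "error rate > 2x control"))
    if treatment.p99_latency_ms > 500:
        actions.append(("alert_on_call", "p99 latency > 500ms"))
    if treatment.revenue_per_user < 0.9 * control.revenue_per_user:
        actions.append(("auto_disable", "revenue per user < 90% of control"))
    if treatment.crash_rate > 0.01:
        actions.append(("auto_disable", "crash rate > 1%"))
    return actions

# Made-up numbers for illustration only.
control = VariantStats(error_rate=0.010, p99_latency_ms=320.0,
                       revenue_per_user=4.00, crash_rate=0.001)
treatment = VariantStats(error_rate=0.025, p99_latency_ms=540.0,
                         revenue_per_user=3.90, crash_rate=0.002)

for action, reason in check_guardrails(control, treatment):
    print(action, "-", reason)
```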

Architecture

Flag Evaluation Architecture

┌─────────────┐         ┌──────────────┐
│ Flag Service │ ◄────── │  Admin UI     │ (engineer updates flag)
│ (source of   │         └──────────────┘
│  truth)      │
└──────┬──────┘
       │ periodic sync (every 30s)
       │ or streaming (SSE / WebSocket)
       │
┌──────▼──────┐
│ Local SDK    │  (in each application instance)
│ Cache        │  - stores all flag definitions
│              │  - evaluates flags locally (no network call)
└─────────────┘

Critical design decision: Flag evaluation happens locally from a cached copy of flag definitions. The application does not make a network call for every flag check. Instead, it periodically synchronizes the flag definitions from the flag service. This keeps evaluation fast (<1ms) and available even if the flag service is temporarily down.
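A stripped-down version of that local cache is sketched below. LocalFlagCache, fetch_definitions, and flaky_fetch are hypothetical names; a real SDK would call refresh on a timer or from a streaming connection, which is omitted here.

```python
import hashlib

class LocalFlagCache:
    """Hypothetical SDK-side cache: evaluates flags locally from synced definitions."""

    def __init__(self, fetch_definitions):
        # fetch_definitions stands in for the network call to the flag service;
        # it should return {flag_name: {"rollout_percentage": int}}.
        self._fetch = fetch_definitions
        self._definitions = {}

    def refresh(self) -> bool:
        """Sync from the flag service; keep the last known copy if the call fails."""
        try:
            self._definitions = dict(self._fetch())
            return True
        except Exception:
            return False

    def is_enabled(self, flag_name: str, user_id: str) -> bool:
        """Local evaluation only: no network call on this path."""
        definition = self._definitions.get(flag_name)
        if definition is None:
            return False  # unknown flag: default to off
        digest = hashlib.sha256(f"{user_id}:{flag_name}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % 100
        return bucket < definition["rollout_percentage"]

calls = {"n": 0}

def flaky_fetch():
    """Succeeds once, then simulates an unreachable flag service."""
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("flag service unreachable")
    return {"new_checkout_flow": {"rollout_percentage": 100}}

cache = LocalFlagCache(flaky_fetch)
print(cache.refresh())                                 # first sync succeeds
print(cache.refresh())                                 # second sync fails
print(cache.is_enabled("new_checkout_flow", "alice"))  # still served from the cache
```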

Storage

Flag definition:
{
  "name": "new_checkout_flow",
  "type": "percentage_rollout",
  "rollout_percentage": 25,
  "targeting_rules": [
    {"attribute": "country", "operator": "in", "values": ["US", "CA"]},
    {"attribute": "plan", "operator": "eq", "value": "enterprise"}
  ],
  "kill_switch": false,
  "created_by": "eng@example.com",
  "created_at": "2024-01-10T10:00:00Z"
}

Targeting rules are evaluated in order. If a rule matches, its outcome (on/off/percentage) is used. Otherwise, the default rollout percentage applies.
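First-match rule evaluation can be sketched as follows. The sample definition above does not show where a rule's outcome is stored, so this sketch assumes a per-rule "rollout_percentage" field. It also assumes that "kill_switch": true forces the flag off and that the user is a plain dict with an "id" key. All three are assumptions for illustration.

```python
import hashlib

def rule_matches(rule: dict, user: dict) -> bool:
    """Evaluate one targeting rule against a user's attributes."""
    actual = user.get(rule["attribute"])
    if rule["operator"] == "in":
        return actual in rule["values"]
    if rule["operator"] == "eq":
        return actual == rule["value"]
    return False  # unknown operators never match

def evaluate(flag: dict, user: dict) -> bool:
    """First matching rule wins; otherwise the default percentage applies."""
    if flag.get("kill_switch"):
        return False  # assumed semantics: kill switch forces the flag off
    percentage = flag["rollout_percentage"]
    for rule in flag.get("targeting_rules", []):
        if rule_matches(rule, user):
            # Per-rule "rollout_percentage" is an assumed field for this sketch.
            percentage = rule.get("rollout_percentage", percentage)
            break
    digest = hashlib.sha256(f"{user['id']}:{flag['name']}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percentage

flag = {
    "name": "new_checkout_flow",
    "rollout_percentage": 0,
    "targeting_rules": [
        {"attribute": "plan", "operator": "eq", "value": "enterprise",
         "rollout_percentage": 100},
    ],
    "kill_switch": False,
}
print(evaluate(flag, {"id": "u1", "plan": "enterprise"}))  # rule matches: 100%
print(evaluate(flag, {"id": "u2", "plan": "free"}))        # no match: default 0%
```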

Proof/Correctness Intuition

Why Hash-Based Assignment Is Sticky

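hash(user_id + flag_name) is a deterministic function. The same inputs always produce the same output. As long as the user's ID and the flag name do not change, the bucket assignment is the same on every evaluation, on every server, without any shared state. This assumes a hash whose output is stable across processes and languages; built-ins such as Python's hash() are salted per process for strings and do not qualify.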

The hash function should be fast and have uniform distribution. MurmurHash3 or FNV-1a are common choices. Cryptographic hashes (SHA-256) also work, but bucketing needs uniformity rather than collision resistance, so their extra cost buys nothing here.
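For reference, 32-bit FNV-1a is short enough to write out in full. The sketch below follows the published algorithm (offset basis 0x811C9DC5, prime 0x01000193); the bucket helper and the ":" separator between user ID and flag name are assumptions of this sketch.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the result within 32 bits
    return h

def bucket(user_id: str, flag_name: str) -> int:
    """Stable bucket in [0, 99], identical on every server and every run."""
    return fnv1a_32(f"{user_id}:{flag_name}".encode("utf-8")) % 100

# The same inputs give the same bucket no matter where or when this runs.
print(bucket("user_123", "new_checkout_flow"))
print(bucket("user_123", "new_checkout_flow") == bucket("user_123", "new_checkout_flow"))
```

FNV-1a mixes the last few input bytes less thoroughly than MurmurHash3 does, so it is worth checking bucket uniformity and cross-flag overlap on real IDs and flag names before relying on it.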

Why Including Flag Name Prevents Correlation

Without flag_name in the hash:

hash("user_123") % 100 = 7

This user is in the bottom 10% for EVERY experiment.
If 10% rollouts A, B, and C all use hash(user_id),
user_123 is in all three. Experiments are correlated.

With flag_name:

hash("user_123" + "experiment_A") % 100 = 7    → in A
hash("user_123" + "experiment_B") % 100 = 63   → NOT in B
hash("user_123" + "experiment_C") % 100 = 91   → NOT in C

Each experiment gets an independent random sample of users.
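The effect can be measured on synthetic data. The sketch below assumes SHA-256 as the stable hash (chosen here because it mixes inputs well and ships with Python) and hypothetical user IDs. It compares how many of experiment A's users also land in experiment B with and without the flag name in the hash input.

```python
import hashlib

def stable_bucket(key: str) -> int:
    """Stable bucket in [0, 99]; SHA-256 is an illustrative stdlib choice."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

users = [f"user_{i}" for i in range(50_000)]  # hypothetical user IDs

# Without the flag name: one bucket per user, reused by every experiment.
without_a = {u for u in users if stable_bucket(u) < 10}
without_b = {u for u in users if stable_bucket(u) < 10}

# With the flag name: a separate bucket per (user, experiment) pair.
with_a = {u for u in users if stable_bucket(f"{u}:experiment_A") < 10}
with_b = {u for u in users if stable_bucket(f"{u}:experiment_B") < 10}

overlap_without = len(without_a & without_b) / len(without_a)
overlap_with = len(with_a & with_b) / len(with_a)

print(f"share of A's users also in B, without flag name: {overlap_without:.0%}")
print(f"share of A's users also in B, with flag name:    {overlap_with:.1%}")
```

Without the flag name the overlap is total by construction. With it, the overlap should sit near 10%, which is what two independent 10% samples would produce.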

Real-World Usage

Platform	Key Feature	Pricing Model
LaunchDarkly	Enterprise-grade, real-time streaming, rich targeting	Per-seat + per-MAU
Statsig	Built-in experimentation + statistical analysis	Free tier, per-event
Unleash	Open source, self-hosted option	Free (OSS) / paid cloud
Split.io	Feature flags + experimentation platform	Per-seat
Optimizely	A/B testing focused, visual editor	Per-impression
Flagsmith	Open source, remote config + flags	Free (OSS) / paid cloud
GrowthBook	Open source, Bayesian statistics	Free (OSS) / paid cloud

Build vs buy: For a small team, a feature flag system can be roughly 200 lines of code (flag definitions in a config file, hash-based assignment, no UI). For a large organization with hundreds of experiments, a platform like LaunchDarkly provides targeting rules, audit logs, scheduled rollouts, and integration with metrics pipelines. As a rough rule of thumb, the build-vs-buy breakpoint is around 10-20 concurrent experiments.

Industry Practice

Netflix, Google, Facebook, LinkedIn, Uber, and Airbnb all run thousands of concurrent experiments. Facebook has reported running 10,000+ experiments simultaneously. The feature flag / experimentation platform is considered core infrastructure at these companies.

The standard deployment pattern:

1. Developer writes code with feature flag.
2. Code is merged and deployed with flag OFF (dark launch).
3. Enable flag for internal employees (dogfooding).
4. Enable for 1% of production users. Monitor for 24 hours.
5. If metrics are positive: ramp to 10%, 50%, 100%.
6. If metrics are negative or neutral: disable flag. Iterate.
7. After 100% rollout is stable for 2 weeks: remove flag and old code path.

Step 7 is critical and often neglected. Stale feature flags accumulate as technical debt. The flag service should track flag age and alert when flags are older than 30 days without being fully rolled out or removed.

Interview Application

When to mention feature flags:

"How would you safely deploy a new feature?" -- Feature flag. Deploy dark, enable gradually, monitor, rollback instantly.
"How would you do A/B testing?" -- Feature flag with experiment framework. Sticky assignment, metrics collection, significance testing.
"How would you implement a kill switch?" -- Feature flag that disables a feature instantly without deploy.
"How do you roll back a bad deployment?" -- Feature flag is faster than rollback. Toggle the flag.

What interviewers want to hear:

You understand the deploy-dark-then-enable pattern.
You know sticky assignment via hashing (same user, same experience).
You understand why experiments need statistical significance testing.
You know the operational pattern: 1% → 10% → 50% → 100%.
You can discuss guardrails (auto-disable if error rate spikes).

Trade-offs

[Figure: A/B test lifecycle]

Advantage	Disadvantage
Instant enable/disable (no deploy)	Code complexity (branching logic)
Gradual rollout (reduce risk)	Stale flags become tech debt
A/B testing enables data-driven decisions	Testing combinatorics (flag interactions)
Kill switch for emergencies	Flag evaluation overhead (minimal but nonzero)
Per-user targeting	Debugging is harder (which flags are on?)
Decouples deploy from release	Configuration drift between environments

The Testing Problem

With 10 feature flags, there are 2^10 = 1024 possible combinations. You cannot test all of them. In practice:

Each flag should be independently testable (no hidden dependencies between flags).
Integration tests should run with all flags ON and all flags OFF.
The experimentation platform should prevent conflicting experiments (e.g., two experiments both changing the checkout page).
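One way to keep the test matrix small is sketched below: all flags OFF, all flags ON, and then each flag flipped on its own. The one-at-a-time configurations are an addition made here to exercise the "independently testable" guideline; they are not prescribed by the list above.

```python
def flag_test_matrix(flag_names):
    """Yield flag configurations: all OFF, all ON, then each flag flipped alone.

    This gives 2 + 2n configurations instead of 2^n.
    """
    all_off = {name: False for name in flag_names}
    all_on = {name: True for name in flag_names}
    yield all_off
    yield all_on
    for name in flag_names:
        yield {**all_off, name: True}   # only this flag ON
        yield {**all_on, name: False}   # only this flag OFF

ten_flags = [f"flag_{i}" for i in range(10)]  # hypothetical flag names
configs = list(flag_test_matrix(ten_flags))
print(len(configs), "configurations instead of", 2 ** len(ten_flags))
```

Each yielded dict can be fed to a parametrized integration test that installs the configuration before exercising the code under test.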

Flag Lifecycle Management

States: Draft → Active → Rolled Out (100%) → Archived (code removed)

Draft:     Flag defined but not enabled for any users.
Active:    Flag is being rolled out (1-99% or targeted).
Rolled Out: Flag is at 100%. Code path is proven.
Archived:  Old code path has been removed. Flag definition kept for audit.

Set SLAs: a flag should not be in "Active" state for more than 30 days. A flag at 100% should have its old code path removed within 2 weeks. Violating these SLAs indicates stale flags that should be cleaned up.
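A periodic job could enforce these SLAs along the lines of the sketch below. The flag records, the "state" and "state_since" field names, and the dates are hypothetical; a real flag service would expose its own schema.

```python
from datetime import datetime, timedelta, timezone

ACTIVE_LIMIT = timedelta(days=30)       # max time in "Active"
ROLLED_OUT_LIMIT = timedelta(days=14)   # max time at 100% before cleanup

def stale_flags(flags, now):
    """Return names of flags that have exceeded the SLA for their current state.

    Each flag is a dict with "name", "state", and "state_since" (aware datetime);
    these field names are assumptions for this sketch.
    """
    limits = {"Active": ACTIVE_LIMIT, "Rolled Out": ROLLED_OUT_LIMIT}
    stale = []
    for flag in flags:
        limit = limits.get(flag["state"])
        if limit is not None and now - flag["state_since"] > limit:
            stale.append(flag["name"])
    return stale

now = datetime(2024, 3, 1, tzinfo=timezone.utc)
flags = [
    {"name": "new_checkout_flow", "state": "Active",
     "state_since": datetime(2024, 1, 10, tzinfo=timezone.utc)},   # 51 days
    {"name": "new_reco_algorithm", "state": "Rolled Out",
     "state_since": datetime(2024, 2, 20, tzinfo=timezone.utc)},   # 10 days
    {"name": "old_banner", "state": "Archived",
     "state_since": datetime(2023, 6, 1, tzinfo=timezone.utc)},    # no SLA
]
print(stale_flags(flags, now))
```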

Common Mistakes

Feature flags are just if-else statements

At the simplest level, yes. But a production feature flag system requires: persistent flag storage, admin UI, targeting rules, audit logging, SDK with local caching, gradual rollout support, and integration with metrics. The if-else statement is 1% of the system.

Leave feature flags in the code forever

Stale flags are technical debt. They make the codebase harder to understand, increase testing surface area, and create confusion about which code paths are actually active. Remove flags within 2 weeks of full rollout.

A/B test results are reliable after 1 day

Statistical significance requires sufficient sample size. Running an experiment for 1 day when you need 2 weeks of data leads to false conclusions. Additionally, behavior varies by day-of-week (weekday vs weekend traffic), so experiments should run for at least one full week.

Just compare averages to decide which variant wins

Averages can be misleading. If variant B has a higher average conversion rate but the difference is not statistically significant, it could be random noise. Always compute a p-value or confidence interval. Bayesian methods (used by GrowthBook) provide a probability that one variant is better, which is often more intuitive.
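A confidence interval for the lift can be computed alongside the p-value. The sketch below uses the standard Wald interval with the unpooled standard error, applied to the 340/10,000 versus 380/10,000 example from earlier. It is one common choice among several interval methods.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for (rate B - rate A), using the unpooled SE."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = diff_confidence_interval(340, 10_000, 380, 10_000)
print(f"95% CI for the lift: [{low:+.4f}, {high:+.4f}]")
print("interval includes zero" if low <= 0 <= high else "interval excludes zero")
```

For the example data the interval spans zero, which agrees with the non-significant z-test result: the observed +0.40 percentage points is compatible with no real difference.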

Feature flags and A/B tests are the same thing

Feature flags are an operational tool (deploy safely). A/B tests are a measurement tool (decide which variant is better). Feature flags can be used WITHOUT measurement (kill switches, gradual rollouts by gut feel). A/B tests REQUIRE measurement (metrics, significance testing). The infrastructure overlaps (both use sticky assignment), but the purposes differ.

Percentage rollout is random each time

If the percentage rollout were random, a user would see the new feature 5% of the time and the old feature 95% of the time, randomly switching between them. This would be a terrible experience and invalidate A/B test data. Sticky assignment via hashing ensures the same user always sees the same variant.