Skip to content

Indexing — B-tree, GIN, GiST, and When to Use Each

TL;DR

PostgreSQL has six index types and most engineers only know one. Picking the wrong index is worse than having no index -- it burns disk space, slows down writes, and gives you a false sense of security when the planner ignores it entirely.


Why Indexes Matter in System Design

Explain Scans

In interviews, you'll sketch a schema and say "we'll index this column." The interviewer asks: "What kind of index?" If your only answer is "a normal one," you've lost points.

PostgreSQL's index diversity is one of its superpowers over MySQL. You can build a full-text search engine, a geospatial query system, and a time-series analytics pipeline -- all inside a single database. Knowing which index to reach for is a design skill, not just a DBA skill.


B-tree: The Default Workhorse

B-tree indexes handle equality (=) and range queries (<, >, BETWEEN, ORDER BY). They cover 90% of real-world use cases. When you write CREATE INDEX without specifying a type, you get a B-tree.

How It Works

A B-tree is a balanced tree where each node contains sorted keys and pointers. Leaf nodes form a linked list for efficient range scans. Lookup is O(log N). For a table with 100 million rows and a typical fanout of ~500 keys per page, that's about 3-4 page reads to find any row (log500(10^8) ≈ 3).

-- These all use B-tree efficiently
CREATE INDEX idx_users_email ON users (email);

SELECT * FROM users WHERE email = 'alice@example.com';     -- equality
SELECT * FROM users WHERE email > 'a' AND email < 'b';     -- range
SELECT * FROM users ORDER BY email LIMIT 10;                -- sorted access
SELECT * FROM users WHERE email IS NULL;                    -- NULL lookup

What B-tree Cannot Do

-- B-tree is useless for these
SELECT * FROM users WHERE email LIKE '%@gmail.com';  -- leading wildcard
SELECT * FROM articles WHERE tags @> ARRAY['postgres'];  -- array containment
SELECT * FROM products WHERE description @@ 'coffee & organic';  -- full-text
SELECT * FROM locations WHERE point <-> '(40.7,-74.0)' < 0.1;  -- spatial

If your query pattern doesn't start from the left side of the index, B-tree can't help. This trips up more candidates than any other indexing concept.


GIN: The Inverted Index Inside PostgreSQL

GIN stands for Generalized Inverted Index. It maps each element (word, array value, JSON key) to the list of rows that contain it. This is the same structure Elasticsearch uses internally.

-- Add a tsvector column and GIN index
ALTER TABLE articles ADD COLUMN search_vector tsvector;
UPDATE articles SET search_vector = to_tsvector('english', title || ' ' || body);
CREATE INDEX idx_articles_search ON articles USING gin(search_vector);

-- Now query it
SELECT title FROM articles
WHERE search_vector @@ to_tsquery('english', 'postgres & replication');

For applications under 10 million documents, this replaces Elasticsearch entirely. Notion used PostgreSQL full-text search for years before their scale demanded a dedicated search service. Fewer moving parts, fewer failure modes.

Spicy opinion: If you're adding Elasticsearch to your architecture for fewer than 5 million documents, you're creating an operational burden for zero gain. PostgreSQL GIN does the job with one less service to monitor, back up, and debug at 3 AM.

JSONB Containment

-- Index JSONB documents for containment queries
CREATE INDEX idx_events_data ON events USING gin(data);

-- Finds rows where data contains {"type": "click", "page": "/home"}
SELECT * FROM events WHERE data @> '{"type": "click", "page": "/home"}';

Array Membership

-- Index array columns
CREATE INDEX idx_products_tags ON products USING gin(tags);

-- Find products tagged with 'organic'
SELECT * FROM products WHERE tags @> ARRAY['organic'];

-- Find products tagged with ANY of these
SELECT * FROM products WHERE tags && ARRAY['organic', 'vegan'];

GIN Trade-offs

GIN indexes are slow to build and update because inserting one row might mean updating dozens of index entries (one per word/element). PostgreSQL mitigates this with a "fastupdate" pending list, but writes are still heavier than B-tree.

Aspect B-tree GIN
Build time Fast 3-10x slower
Write overhead Low (one entry per row) High (one entry per element)
Read performance O(log N) O(1) per element lookup
Good for Scalar equality/range Multi-valued data (arrays, text, JSONB)

GiST: The Spatial and Range Index

GiST (Generalized Search Tree) handles data types where "equals" and "less than" aren't enough. Think: "Which restaurants are within 500 meters?" or "Which bookings overlap this time range?"

PostGIS Geospatial Queries

-- Index geographic points
CREATE INDEX idx_restaurants_location ON restaurants USING gist(location);

-- Find restaurants within 500m of Times Square
SELECT name, ST_Distance(location, ST_MakePoint(-73.985, 40.758))
FROM restaurants
WHERE ST_DWithin(location, ST_MakePoint(-73.985, 40.758), 500)
ORDER BY location <-> ST_MakePoint(-73.985, 40.758);

Uber, Lyft, and DoorDash all use PostGIS for driver/rider matching. The <-> operator gives you nearest-neighbor search using the GiST index -- no need for a separate geospatial database.

Range Type Queries

-- Booking system: find overlapping reservations
CREATE TABLE bookings (
    room_id int,
    during tstzrange  -- timestamp range
);
CREATE INDEX idx_bookings_during ON bookings USING gist(during);

-- Find all bookings that overlap with a proposed time
SELECT * FROM bookings
WHERE room_id = 101
AND during && tstzrange('2025-03-15 14:00', '2025-03-15 16:00');

Exclusion Constraints

GiST enables one of PostgreSQL's most underused features: exclusion constraints. They prevent overlapping data at the database level.

-- Guarantee no double-bookings. The database enforces it.
ALTER TABLE bookings ADD CONSTRAINT no_overlap
    EXCLUDE USING gist (room_id WITH =, during WITH &&);

-- This INSERT will fail if it overlaps an existing booking
INSERT INTO bookings (room_id, during)
VALUES (101, tstzrange('2025-03-15 15:00', '2025-03-15 17:00'));
-- ERROR: conflicting key value violates exclusion constraint

No application code needed. No race conditions. The constraint uses the GiST index to check efficiently.


BRIN: Tiny Index for Huge Tables

BRIN (Block Range Index) stores summary statistics (min/max) for each block of pages. It's tiny -- often 1000x smaller than a B-tree on the same column. The catch: it only works when data is physically ordered on disk.

When BRIN Shines

-- Log table: rows are inserted chronologically, so created_at
-- is naturally correlated with physical disk order
CREATE TABLE logs (
    id bigserial,
    created_at timestamptz DEFAULT now(),
    level text,
    message text
);

-- BRIN index: ~100KB for a table where B-tree would be ~2GB
CREATE INDEX idx_logs_created ON logs USING brin(created_at);

-- Query scans only the blocks that might contain matching rows
SELECT * FROM logs
WHERE created_at BETWEEN '2025-03-01' AND '2025-03-02';

When BRIN Fails

If data is inserted randomly (not ordered by the indexed column), BRIN's block summaries become useless. Every block's min/max range overlaps with every query range, so PostgreSQL scans everything.

-- BAD: user_id has no correlation with insertion order
CREATE INDEX idx_logs_user ON logs USING brin(user_id);  -- useless
Aspect B-tree BRIN
Index size (100M rows) ~2 GB ~1 MB
Works with random data Yes No -- needs physical ordering
Lookup precision Exact row Block range (may scan extra rows)
Maintenance cost Medium Very low
Best for OLTP lookups Time-series, append-only logs

Partial Indexes: Index Only What You Query

Why index 100 million rows when you only ever query the 50,000 active ones?

-- Full index: covers all 100M orders, most of which are 'completed'
CREATE INDEX idx_orders_status ON orders (status);  -- huge, mostly wasted

-- Partial index: covers only the ~50K pending/processing orders
CREATE INDEX idx_orders_active ON orders (status)
WHERE status IN ('pending', 'processing');  -- tiny, exactly what you need

Partial indexes are smaller, faster to scan, and cheaper to maintain. They're one of the first optimizations you should reach for when you notice an index covers mostly irrelevant rows.

System design pattern: In a marketplace system, most orders are completed. Active orders (pending, in-transit) are a tiny fraction but get queried constantly. A partial index on WHERE status != 'completed' cuts index size by 95%.


Composite Indexes: Column Order Is Everything

A composite index on (a, b, c) is actually three indexes in one:

  • Efficient for WHERE a = ?
  • Efficient for WHERE a = ? AND b = ?
  • Efficient for WHERE a = ? AND b = ? AND c = ?
  • NOT efficient for WHERE b = ? alone (skips the leading column)
  • NOT efficient for WHERE c = ? alone

Think of it like a phone book sorted by (last_name, first_name). You can look up "Smith" or "Smith, John" but not "John" across all last names.

-- Design the index to match your WHERE clause, left to right
CREATE INDEX idx_orders_user_status ON orders (user_id, status, created_at);

-- Uses the index (full match)
SELECT * FROM orders WHERE user_id = 42 AND status = 'pending'
ORDER BY created_at DESC;

-- Uses the index (prefix match)
SELECT * FROM orders WHERE user_id = 42;

-- DOES NOT use the index (skips leading column)
SELECT * FROM orders WHERE status = 'pending';

Rule of thumb: Put high-selectivity columns first (the column that filters the most rows). Then add columns in the order they appear in your WHERE and ORDER BY clauses.


Covering Indexes: Skip the Table Lookup

When PostgreSQL finds a row using an index, it normally has to visit the heap (table) to fetch the remaining columns. A covering index stores extra columns inside the index using INCLUDE, making the table lookup unnecessary.

-- Regular index: finds user_id, then goes to heap for email and name
CREATE INDEX idx_users_id ON users (id);

-- Covering index: stores email and name in the index itself
CREATE INDEX idx_users_id_covering ON users (id) INCLUDE (email, name);

-- This query is now an "index-only scan" -- no heap access
SELECT email, name FROM users WHERE id = 42;

The trade-off: the index is bigger. But for hot queries that run thousands of times per second, eliminating the heap lookup can cut latency in half.


Expression Indexes: Index Computed Values

Sometimes you don't query the raw column. You query a transformation of it.

-- Problem: case-insensitive email lookup
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- PostgreSQL can't use idx_users_email because LOWER(email) != email

-- Solution: index the expression
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
-- Now the query uses this index

Other common expression indexes:

-- Index a JSONB field extraction
CREATE INDEX idx_events_type ON events ((data->>'type'));

-- Index date truncation for daily aggregation queries
CREATE INDEX idx_orders_day ON orders (date_trunc('day', created_at));

Reading EXPLAIN ANALYZE

You designed a query and added an index. How do you know if PostgreSQL actually uses it?

EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42 AND status = 'pending';

Scan Types

Scan Type What It Means Good or Bad?
Seq Scan Reads every row in the table Bad for selective queries. Fine if you're reading >10% of the table.
Index Scan Looks up the index, then fetches each row from the heap Good. Standard indexed lookup.
Index Only Scan Reads everything from the index, never touches the heap Best. You have a covering index.
Bitmap Index Scan Builds a bitmap of matching pages, then reads those pages Good for medium selectivity. Combines multiple indexes.

Key Numbers to Check

Index Scan using idx_orders_user_status on orders
  Index Cond: (user_id = 42 AND status = 'pending')
  Rows Removed by Filter: 0           -- good: index is precise
  actual time=0.045..0.052             -- 52 microseconds total
  rows=3                               -- found 3 rows
  Planning Time: 0.2 ms
  Execution Time: 0.1 ms

Red flags:

  • Rows Removed by Filter: 50000 -- The index returns too many rows, then the filter removes most. Needs a better index.
  • Seq Scan on a table with millions of rows when you expect only a few results -- the planner chose not to use your index.
  • actual rows far exceeds estimated rows -- statistics are stale. Run ANALYZE.

Why PostgreSQL Might Ignore Your Index

  1. Table is small. A sequential scan on 1000 rows is faster than index overhead.
  2. Query returns >10% of rows. Seq scan is cheaper than random I/O from index.
  3. Statistics are outdated. Run ANALYZE tablename to refresh.
  4. Data type mismatch. Comparing int to text silently disables the index.
  5. Function wrapping. WHERE UPPER(email) = ... won't use an index on email. Need an expression index.

The Decision Table

Given a query pattern, which index type should you choose?

Query Pattern Index Type Example
WHERE id = 42 B-tree Primary key lookup
WHERE created_at > '2025-01-01' B-tree (or BRIN if append-only) Range filter
WHERE tags @> ARRAY['postgres'] GIN Array containment
WHERE doc @@ 'search query' GIN (with tsvector) Full-text search
WHERE data @> '{"type":"click"}' GIN JSONB containment
WHERE ST_DWithin(location, point, 500) GiST (PostGIS) Geospatial proximity
WHERE time_range && requested_range GiST Range overlap
WHERE log_time BETWEEN ... AND ... (append-only) BRIN Time-series data
WHERE status = 'active' (5% of rows) Partial B-tree Filtering rare values
WHERE LOWER(email) = ... Expression B-tree Computed value lookup

When in doubt, start with B-tree. Graduate to specialized indexes only when you can explain why B-tree doesn't fit.


Patterns for System Design Interviews

Pattern 1: "Design a search feature"

"For under 5 million documents, I'd use PostgreSQL's full-text search with a GIN index on a tsvector column. It supports ranking, stemming, and phrase matching. Above that threshold, I'd add Elasticsearch as a read-optimized replica."

Pattern 2: "Design a location-based service"

"Store coordinates as PostGIS geometry types with a GiST index. For nearest-neighbor queries, use the <-> distance operator with ORDER BY ... LIMIT K. This gives us indexed K-nearest-neighbor search without computing distances for every row."

Pattern 3: "This table has 500 million rows and queries are slow"

"First, check EXPLAIN ANALYZE. If we're doing Seq Scans, we either need an index or the index exists but the planner ignores it. If the table is append-only (like logs), consider BRIN -- it's 1000x smaller than B-tree. If we only ever query recent data, a partial index on WHERE created_at > NOW() - INTERVAL '30 days' keeps the index tiny."

Pattern 4: "How do you handle multi-tenant queries?"

"Composite index on (tenant_id, ...) with tenant_id as the leading column. Every query includes WHERE tenant_id = ? so the index prefix is always used. For extreme isolation, partition by tenant_id and each partition gets its own smaller index."


Trade-offs Table

Decision Upside Downside
More indexes Faster reads Slower writes (each INSERT/UPDATE touches every index)
Covering index (INCLUDE) Index-only scans, no heap lookup Larger index, more storage
GIN for full-text search Avoids Elasticsearch dependency Slower writes, 3-10x longer index build
BRIN on time-series data Tiny index, minimal write overhead Only works for physically ordered data
Partial index Smaller, faster, targeted Must match WHERE clause exactly
Expression index Enables computed-value lookups Must match function call exactly in queries
Composite index Multi-column query support Column order must match query patterns

Index Decision

Interview Gotchas

Gotcha 1: 'Add an index on every column'

Each index slows writes. A table with 10 indexes means every INSERT does 11 writes (1 heap + 10 indexes). For write-heavy tables, this is a death sentence. Instagram's engineering team famously keeps index counts minimal on their highest-write tables.

Gotcha 2: 'PostgreSQL can replace Elasticsearch'

It can -- up to a point. PostgreSQL's GIN full-text search lacks Elasticsearch's fuzzy matching, custom analyzers, and distributed search. For a product search page with 100 million SKUs, typo tolerance, and faceted filtering, you'll outgrow PostgreSQL. For blog search or internal tools, GIN is more than enough.

Gotcha 3: 'B-tree works for everything'

Try doing a geospatial radius search with a B-tree. You can't. B-tree understands less-than and greater-than on scalar values. It has no concept of "distance" or "containment" or "overlap." That's why GiST and GIN exist.

Gotcha 4: 'Index order doesn't matter'

A composite index on (a, b) is useless for WHERE b = ?. The phone book analogy: you can look up all Smiths, but you can't look up everyone named John across all last names. Leading column first. Always.

Gotcha 5: 'We have an index, so queries are fast'

An index PostgreSQL ignores is worse than no index -- it wastes disk and slows writes for zero read benefit. Always check EXPLAIN ANALYZE. If you see Seq Scan on a large table when you expected Index Scan, something is wrong. Stale statistics, type mismatch, or the planner determined the index is not worth it.


Key Takeaways

  1. B-tree is your default. Use it until you have a reason not to.
  2. GIN replaces Elasticsearch for small-to-medium datasets. One fewer service in your architecture.
  3. GiST is mandatory for geospatial. PostGIS + GiST is how ride-sharing apps match drivers.
  4. BRIN is magic for time-series. 1000x smaller than B-tree. Use it for logs and events.
  5. Partial and expression indexes are free performance. Most candidates don't know they exist. You should.
  6. Always verify with EXPLAIN ANALYZE. An unused index is a silent tax on every write.