Skip to content

Partitioning, Replication, and Connection Pooling

TL;DR

A single PostgreSQL instance can handle far more than most teams think -- Instagram ran 30 million users on sharded PostgreSQL with a 3-person backend team. But you need three operational tools to get there: partitioning for large tables, replication for read scaling, and connection pooling to stop PostgreSQL from eating 10GB of RAM just on connection overhead.


Table Partitioning

PgBouncer

Partitioning splits one logical table into multiple physical tables. The database handles routing queries to the right partition. Your application code doesn't change.

Why Partition?

Three reasons, in order of importance:

  1. Fast data removal. DROP TABLE partition_2024_01 is instant. DELETE FROM orders WHERE created_at < '2024-02-01' on a 500M row table takes hours and generates massive WAL traffic.
  2. Query performance. If queries always filter by the partition key, PostgreSQL scans only the relevant partition instead of the whole table.
  3. Maintenance isolation. VACUUM, REINDEX, and backups operate on smaller chunks.

Declarative Partitioning (PostgreSQL 10+)

Before PostgreSQL 10, you had to manually create child tables with CHECK constraints and write trigger functions to route inserts. It was ugly. Declarative partitioning made it clean.

Range Partitioning (Most Common)

-- Partition orders by month
CREATE TABLE orders (
    id bigserial,
    user_id bigint NOT NULL,
    created_at timestamptz NOT NULL,
    total numeric(10,2),
    status text
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE orders_2025_01 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE orders_2025_02 PARTITION OF orders
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
CREATE TABLE orders_2025_03 PARTITION OF orders
    FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');

-- Queries automatically hit only the relevant partition
EXPLAIN SELECT * FROM orders WHERE created_at = '2025-02-15';
-- Shows: Seq Scan on orders_2025_02 (partition pruning!)

List Partitioning

-- Partition by region for data residency compliance
CREATE TABLE users (
    id bigserial,
    email text,
    region text NOT NULL
) PARTITION BY LIST (region);

CREATE TABLE users_us PARTITION OF users FOR VALUES IN ('us-east', 'us-west');
CREATE TABLE users_eu PARTITION OF users FOR VALUES IN ('eu-west', 'eu-central');
CREATE TABLE users_ap PARTITION OF users FOR VALUES IN ('ap-south', 'ap-east');

Data residency is a real use case. If EU regulations require European user data to stay on EU servers, list partitioning lets you store users_eu on a physically separate tablespace while keeping a unified query interface.

Hash Partitioning

-- Distribute evenly by user_id (useful for multi-tenant)
CREATE TABLE events (
    id bigserial,
    user_id bigint NOT NULL,
    event_type text,
    payload jsonb
) PARTITION BY HASH (user_id);

CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);

Hash partitioning spreads data evenly but you can't prune partitions on range queries. Use it when you want uniform distribution and queries always include the partition key.

When to Partition

Signal Partition? Why
Table > 100M rows Probably VACUUM and indexes become painful
Need to drop old data regularly Yes DROP TABLE vs multi-hour DELETE
Queries always filter by date/region/tenant Yes Partition pruning speeds reads
Table < 10M rows No Overhead isn't worth it
Queries don't include partition key No Every query scans all partitions (worse than no partitioning)

Spicy opinion: Teams partition too early. If your table has 5 million rows and your queries are slow, the problem is missing indexes, not missing partitions. Partitioning adds operational complexity -- creating new partitions, managing partition-aware migrations, testing across boundaries. Don't reach for it until you actually need it.

Partition Pitfalls

  1. Unique constraints must include the partition key. You can't have a global unique constraint on id alone -- it must be (id, created_at) if you partition by created_at.
  2. Foreign keys to partitioned tables only work from PostgreSQL 12+. And they're still limited.
  3. Too many partitions hurt. Planning time grows with partition count. 10,000 daily partitions over 30 years = 10,950 partitions. PostgreSQL's planner will choke. Monthly or weekly partitions are usually right.
  4. Default partition catches strays. Always create one: CREATE TABLE orders_default PARTITION OF orders DEFAULT; Otherwise INSERTs with unexpected values fail with an ERROR ("no partition of relation found for row").

Replication

PostgreSQL is a single-primary database. All writes go to one node. Replication copies data to standby nodes for read scaling and high availability.

Streaming Replication

The primary writes WAL (Write-Ahead Log) records. Standby nodes receive these WAL records over a TCP connection and replay them.

Primary → WAL stream → Standby 1 (read replica)
                     → Standby 2 (read replica)
                     → Standby 3 (disaster recovery)

Asynchronous (Default)

The primary doesn't wait for standbys to confirm receipt. Fastest writes, but if the primary crashes, you might lose the last few transactions.

-- On the primary: check replication status
SELECT client_addr, state, sent_lsn, replay_lsn,
       sent_lsn - replay_lsn AS replication_lag
FROM pg_stat_replication;

Synchronous

The primary waits for at least one standby to confirm it received (or replayed) the WAL before committing. Zero data loss, but write latency increases by the network round-trip time.

-- postgresql.conf on primary
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
synchronous_commit = on  -- or 'remote_apply' for strongest guarantee

The trade-off is simple: async = faster writes, possible data loss on crash. Sync = slower writes, zero data loss. Most production systems use async with one sync standby for the "last resort" failover target.

Logical Replication

Streaming replication copies everything. Logical replication lets you selectively replicate specific tables or even filter rows. It decodes WAL into logical change events.

-- On primary: create a publication
CREATE PUBLICATION orders_pub FOR TABLE orders, order_items;

-- On subscriber (can be a different PG version!)
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=primary dbname=mydb'
    PUBLICATION orders_pub;

Use cases:

  • Zero-downtime major version upgrades. Set up logical replication from PG 14 to PG 16, let it catch up, switch traffic.
  • Selective data sharing. Replicate the products table to a read-only analytics database without copying users or payments.
  • Multi-region partial replication. Each region gets the subset of data it needs.

Read Replicas: The Read Scaling Pattern

Route writes to the primary. Route reads to replicas. This is the most common PostgreSQL scaling pattern.

Application
    ├── Writes → Primary (single node)
    └── Reads  → Load balancer → Replica 1
                               → Replica 2
                               → Replica 3

Replication Lag

Replicas are eventually consistent. The lag is usually under 1 second but can spike during:

  • Large batch writes or migrations
  • Standby running expensive queries (resource contention)
  • Network issues between primary and standby

The consistency problem: User writes a comment, then refreshes the page. The read goes to a replica that hasn't received the comment yet. User thinks their comment was lost.

Solutions:

  1. Read-your-own-writes. After a write, route that user's reads to the primary for a few seconds.
  2. Monotonic reads. Track the LSN (Log Sequence Number) of the last write and wait until the replica has reached that LSN before serving the read.
  3. Session stickiness. Pin a user's session to a specific replica (not great -- uneven load).
-- Check if replica has caught up to a specific LSN
SELECT pg_last_wal_replay_lsn() >= '0/1A3C5D8'::pg_lsn;

GitHub uses read replicas extensively. Their gh-ost tool for schema migrations was built specifically because ALTER TABLE on a primary blocks writes, and they needed a way to do schema changes without downtime across their replica fleet.


Connection Pooling with PgBouncer

This is the single most impactful operational optimization for PostgreSQL. Yet most candidates never mention it in interviews.

The Problem

PostgreSQL forks a new OS process for every client connection. Each process consumes ~10MB of RAM. The math is brutal:

Connections RAM for Connection Overhead
100 ~1 GB
500 ~5 GB
1,000 ~10 GB
5,000 ~50 GB

A modern microservices architecture with 20 services, each running 10 pods, each with a connection pool of 10 = 2,000 connections. That's 20GB of RAM just for connection management, before PostgreSQL does any actual work.

Worse: PostgreSQL's max_connections default is 100. Each connection beyond ~300 degrades performance due to process scheduling and shared buffer contention. More connections doesn't mean more throughput. It means less.

PgBouncer: The Fix

PgBouncer sits between your application and PostgreSQL. It maintains a small pool of actual PostgreSQL connections and multiplexes hundreds of client connections onto them.

20 services × 10 pods × 10 connections = 2,000 application connections
                PgBouncer
        30 actual PostgreSQL connections

2,000 connections to PgBouncer, 30 to PostgreSQL. PgBouncer processes are lightweight (a few KB each, not 10MB).

PgBouncer Pooling Modes

Mode How It Works Best For Limitation
Session Client gets a dedicated PG connection for the entire session Apps using session-level features (LISTEN/NOTIFY, temp tables) Minimal connection saving
Transaction Client gets a PG connection for one transaction, then releases it Most web applications Can't use session-level features
Statement Client gets a PG connection for one statement Simple read-only queries Can't use multi-statement transactions

Transaction mode is the default choice. It gives you the best connection multiplexing while still supporting transactions. The only things it breaks are session-level features like prepared statements, SET commands, and LISTEN/NOTIFY.

# pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 30
min_pool_size = 10
reserve_pool_size = 5
reserve_pool_timeout = 3

Why Not Just Increase max_connections?

Because PostgreSQL performance degrades after ~300 connections regardless of available RAM. The bottleneck shifts to:

  • Process context switching. Each connection is an OS process. The kernel scheduler thrashes with 1000+ processes.
  • Lock contention. More concurrent processes = more contention on lightweight locks for shared buffers.
  • Snapshot overhead. Each transaction snapshot must check against all active transaction IDs. More connections = bigger snapshots.

Benchmarks consistently show: 50 active PostgreSQL connections outperform 500 active connections on the same hardware, doing the same workload. This is unintuitive and trips up candidates who think "more connections = more parallelism."


Horizontal Sharding with Citus

When vertical scaling and read replicas aren't enough, you shard. Citus is a PostgreSQL extension (now part of Azure) that distributes tables across multiple PostgreSQL nodes.

-- Distribute the orders table by user_id
SELECT create_distributed_table('orders', 'user_id');

-- Queries that include user_id in WHERE clause are routed to one shard
SELECT * FROM orders WHERE user_id = 42;  -- goes to one node

-- Queries without user_id hit all shards (expensive)
SELECT count(*) FROM orders WHERE status = 'pending';  -- scatter-gather

When Citus Makes Sense

Pattern Citus Fits? Why
Multi-tenant SaaS (shard by tenant_id) Yes Each tenant's data lives on one shard. Queries are local.
Time-series analytics (shard by time) Yes Each time chunk is a shard. Old shards can be archived.
Social graph (query across users) No Friend-of-friend queries hit every shard.
Global search (no natural shard key) No Every query is scatter-gather.

The key constraint: Citus works when your queries naturally include the shard key. If they don't, every query becomes a distributed scatter-gather operation and performance is worse than a single large PostgreSQL instance.


Aurora PostgreSQL

Amazon Aurora is PostgreSQL-compatible with a rewritten storage layer. It separates compute from storage, which changes the operational model.

How It Differs from Plain PostgreSQL

Aspect PostgreSQL on EC2/RDS Aurora PostgreSQL
Storage EBS volumes, you manage growth Auto-scales to 128 TB, replicated 6 ways across 3 AZs
Replication WAL streaming (minutes of lag possible) Shared storage (replica lag typically <20ms)
Failover Promote standby (30-120 seconds) Automatic, typically <30 seconds
Backups Manual snapshots or continuous archiving Continuous, automatic, point-in-time recovery
Cost Cheaper at small scale 20-50% more expensive, but less ops burden

When Aurora Makes Sense

Aurora shines when you want PostgreSQL semantics without the operational burden of managing replication, failover, and storage. It does NOT shard for you -- it's still a single-primary database. You get resilient storage and fast failover, not horizontal write scaling.

Spicy opinion: Aurora is the right default for any startup using PostgreSQL on AWS. The 20% cost premium pays for itself the first time you avoid a 2 AM storage emergency or a manual failover. Self-managing PostgreSQL replication is a skill you don't need unless you're saving more than $50K/month on database costs.


When PostgreSQL Is NOT the Right Choice

PostgreSQL is phenomenal. It's not for everything.

Scenario Why Not PostgreSQL Better Option
Single table > 10TB VACUUM, indexes, and backups become unmanageable Citus, CockroachDB, or purpose-built OLAP (ClickHouse)
> 100K writes/sec sustained Single-primary bottleneck Cassandra, ScyllaDB, DynamoDB
Schema-less event streams Rigid schema slows iteration; JSONB works but isn't native MongoDB, DynamoDB
Full-text search > 50M docs with fuzzy matching GIN full-text search lacks fuzzy/typo tolerance at scale Elasticsearch
Graph traversals (friend-of-friend-of-friend) Recursive CTEs work but don't scale Neo4j, Neptune
Sub-millisecond key-value lookups PostgreSQL's process-per-connection model adds latency Redis, Memcached

The right answer in an interview is never "PostgreSQL for everything." It's "PostgreSQL as the primary datastore, with specialized systems for workloads that don't fit."


Instagram's PostgreSQL Story

Instagram ran on sharded PostgreSQL from launch. At 30 million users, they had:

  • Sharded PostgreSQL for user data, photos metadata, likes, comments
  • 3 backend engineers managing the entire data layer
  • Custom sharding logic that hashed user IDs to one of several thousand logical shards, mapped to a few dozen physical PostgreSQL instances

They didn't use Cassandra. They didn't use MongoDB. They used the database their team knew best and scaled it with application-level sharding plus PgBouncer for connection management.

The lesson: PostgreSQL scales further than you think. The bottleneck is usually missing indexes, bad connection management, or untuned autovacuum -- not the database itself.


Patterns for System Design Interviews

Pattern 1: "Design a system that stores 2 years of order history"

"Partition orders by month using range partitioning on created_at. Current month is hot -- optimized for writes with aggressive autovacuum. Historical partitions are rarely written, so VACUUM overhead is minimal. Dropping data older than 2 years is instant: DROP TABLE orders_2023_01. No DELETE, no VACUUM, no bloat."

Pattern 2: "How do you handle read-heavy traffic?"

"Read replicas behind a load balancer. Async streaming replication gives us sub-second lag. For the consistency problem, route reads after writes to the primary for 2 seconds using application-level routing. PgBouncer in front of each instance multiplexes application connections down to ~30 actual PostgreSQL connections."

Pattern 3: "Your database is running out of connections"

"PgBouncer in transaction mode. 2,000 application connections map to 30 PostgreSQL connections. Each connection only holds a server-side connection during an active transaction, then releases it. This alone typically solves the problem. If not, audit connection pool sizes -- most ORMs default to 10 connections per pod, which is too high."

Pattern 4: "Multi-tenant SaaS data isolation"

"Three options in ascending isolation: shared table with tenant_id column and row-level security, schema-per-tenant (each tenant gets their own PostgreSQL schema), or database-per-tenant. For most SaaS: shared table with Citus sharding by tenant_id. Queries include tenant_id by default (enforced by RLS), and Citus routes them to the right shard."


Trade-offs Table

Decision Upside Downside
Range partitioning by date Fast data removal, partition pruning on time queries Unique constraints must include partition key
Async streaming replication No write latency hit, simple setup Possible data loss on primary crash (seconds of WAL)
Sync replication Zero data loss Write latency increases by network round-trip
Read replicas Scales reads linearly Replication lag causes stale reads
PgBouncer (transaction mode) 10-100x connection multiplexing Can't use session-level features (prepared statements, LISTEN/NOTIFY)
Citus sharding Horizontal write scaling Cross-shard queries are expensive; shard key in every query
Aurora PostgreSQL Managed failover, auto-scaling storage 20-50% cost premium; vendor lock-in
Hash partitioning Even data distribution No range-based partition pruning

Streaming Replication

Interview Gotchas

Gotcha 1: 'We'll just add more replicas for writes'

Replicas are read-only. All writes go to the primary. Adding 10 replicas gives you 10x read capacity and exactly 1x write capacity. For write scaling, you need sharding (Citus) or a different database.

Gotcha 2: 'Set max_connections to 10,000'

PostgreSQL performance degrades after ~300 connections. More connections means more context switching, more lock contention, and bigger snapshots. PgBouncer is the answer, not max_connections.

Gotcha 3: 'Partition every table'

Partitioning adds planning overhead. For small tables (under 10M rows), the partition routing cost exceeds the benefit. Only partition tables where you have a concrete reason: data retention, query performance on the partition key, or maintenance isolation.

Gotcha 4: 'Replication lag is always low'

Under normal conditions, yes -- sub-second. But large transactions (batch updates, schema migrations), resource contention on replicas (expensive analytics queries), and network issues can push lag to minutes. Design for eventual consistency on replicas. Use read-your-own-writes patterns for critical paths.

Gotcha 5: 'PostgreSQL can scale infinitely'

No single-primary database can. PostgreSQL tops out around 50K-100K writes/sec depending on hardware and row size. Beyond that, you need sharding, partitioning across nodes, or a fundamentally different architecture. The Instagram story is inspiring but they had custom sharding -- not a single PostgreSQL instance serving 30 million users.


Key Takeaways

  1. Partition for data lifecycle, not just performance. The ability to DROP TABLE instead of DELETE FROM is the real win.
  2. Replication is for read scaling and HA. It does not help with write throughput.
  3. PgBouncer is non-negotiable in production. Any PostgreSQL deployment handling more than a few hundred connections needs it. Full stop.
  4. Aurora is the right default on AWS. Self-managing PostgreSQL replication is only worth it at significant scale.
  5. PostgreSQL scales further than you think. Before reaching for NoSQL, ask: have you tried partitioning, read replicas, PgBouncer, and better indexes? Instagram's 3-engineer team says hi.