Partitioning, Replication, and Connection Pooling
TL;DR
A single PostgreSQL instance can handle far more than most teams think -- Instagram ran 30 million users on sharded PostgreSQL with a 3-person backend team. But you need three operational tools to get there: partitioning for large tables, replication for read scaling, and connection pooling to stop PostgreSQL from eating 10GB of RAM just on connection overhead.
Table Partitioning

Partitioning splits one logical table into multiple physical tables. The database handles routing queries to the right partition. Your application code doesn't change.
Why Partition?
Three reasons, in order of importance:
- Fast data removal.
DROP TABLE partition_2024_01is instant.DELETE FROM orders WHERE created_at < '2024-02-01'on a 500M row table takes hours and generates massive WAL traffic. - Query performance. If queries always filter by the partition key, PostgreSQL scans only the relevant partition instead of the whole table.
- Maintenance isolation. VACUUM, REINDEX, and backups operate on smaller chunks.
Declarative Partitioning (PostgreSQL 10+)
Before PostgreSQL 10, you had to manually create child tables with CHECK constraints and write trigger functions to route inserts. It was ugly. Declarative partitioning made it clean.
Range Partitioning (Most Common)
-- Partition orders by month
CREATE TABLE orders (
id bigserial,
user_id bigint NOT NULL,
created_at timestamptz NOT NULL,
total numeric(10,2),
status text
) PARTITION BY RANGE (created_at);
-- Create monthly partitions
CREATE TABLE orders_2025_01 PARTITION OF orders
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE orders_2025_02 PARTITION OF orders
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
CREATE TABLE orders_2025_03 PARTITION OF orders
FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');
-- Queries automatically hit only the relevant partition
EXPLAIN SELECT * FROM orders WHERE created_at = '2025-02-15';
-- Shows: Seq Scan on orders_2025_02 (partition pruning!)
List Partitioning
-- Partition by region for data residency compliance
CREATE TABLE users (
id bigserial,
email text,
region text NOT NULL
) PARTITION BY LIST (region);
CREATE TABLE users_us PARTITION OF users FOR VALUES IN ('us-east', 'us-west');
CREATE TABLE users_eu PARTITION OF users FOR VALUES IN ('eu-west', 'eu-central');
CREATE TABLE users_ap PARTITION OF users FOR VALUES IN ('ap-south', 'ap-east');
Data residency is a real use case. If EU regulations require European user data to stay on EU servers, list partitioning lets you store users_eu on a physically separate tablespace while keeping a unified query interface.
Hash Partitioning
-- Distribute evenly by user_id (useful for multi-tenant)
CREATE TABLE events (
id bigserial,
user_id bigint NOT NULL,
event_type text,
payload jsonb
) PARTITION BY HASH (user_id);
CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);
Hash partitioning spreads data evenly but you can't prune partitions on range queries. Use it when you want uniform distribution and queries always include the partition key.
When to Partition
| Signal | Partition? | Why |
|---|---|---|
| Table > 100M rows | Probably | VACUUM and indexes become painful |
| Need to drop old data regularly | Yes | DROP TABLE vs multi-hour DELETE |
| Queries always filter by date/region/tenant | Yes | Partition pruning speeds reads |
| Table < 10M rows | No | Overhead isn't worth it |
| Queries don't include partition key | No | Every query scans all partitions (worse than no partitioning) |
Spicy opinion: Teams partition too early. If your table has 5 million rows and your queries are slow, the problem is missing indexes, not missing partitions. Partitioning adds operational complexity -- creating new partitions, managing partition-aware migrations, testing across boundaries. Don't reach for it until you actually need it.
Partition Pitfalls
- Unique constraints must include the partition key. You can't have a global unique constraint on
idalone -- it must be(id, created_at)if you partition bycreated_at. - Foreign keys to partitioned tables only work from PostgreSQL 12+. And they're still limited.
- Too many partitions hurt. Planning time grows with partition count. 10,000 daily partitions over 30 years = 10,950 partitions. PostgreSQL's planner will choke. Monthly or weekly partitions are usually right.
- Default partition catches strays. Always create one:
CREATE TABLE orders_default PARTITION OF orders DEFAULT;Otherwise INSERTs with unexpected values fail with an ERROR ("no partition of relation found for row").
Replication
PostgreSQL is a single-primary database. All writes go to one node. Replication copies data to standby nodes for read scaling and high availability.
Streaming Replication
The primary writes WAL (Write-Ahead Log) records. Standby nodes receive these WAL records over a TCP connection and replay them.
Primary → WAL stream → Standby 1 (read replica)
→ Standby 2 (read replica)
→ Standby 3 (disaster recovery)
Asynchronous (Default)
The primary doesn't wait for standbys to confirm receipt. Fastest writes, but if the primary crashes, you might lose the last few transactions.
-- On the primary: check replication status
SELECT client_addr, state, sent_lsn, replay_lsn,
sent_lsn - replay_lsn AS replication_lag
FROM pg_stat_replication;
Synchronous
The primary waits for at least one standby to confirm it received (or replayed) the WAL before committing. Zero data loss, but write latency increases by the network round-trip time.
-- postgresql.conf on primary
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
synchronous_commit = on -- or 'remote_apply' for strongest guarantee
The trade-off is simple: async = faster writes, possible data loss on crash. Sync = slower writes, zero data loss. Most production systems use async with one sync standby for the "last resort" failover target.
Logical Replication
Streaming replication copies everything. Logical replication lets you selectively replicate specific tables or even filter rows. It decodes WAL into logical change events.
-- On primary: create a publication
CREATE PUBLICATION orders_pub FOR TABLE orders, order_items;
-- On subscriber (can be a different PG version!)
CREATE SUBSCRIPTION orders_sub
CONNECTION 'host=primary dbname=mydb'
PUBLICATION orders_pub;
Use cases:
- Zero-downtime major version upgrades. Set up logical replication from PG 14 to PG 16, let it catch up, switch traffic.
- Selective data sharing. Replicate the
productstable to a read-only analytics database without copyingusersorpayments. - Multi-region partial replication. Each region gets the subset of data it needs.
Read Replicas: The Read Scaling Pattern
Route writes to the primary. Route reads to replicas. This is the most common PostgreSQL scaling pattern.
Application
├── Writes → Primary (single node)
└── Reads → Load balancer → Replica 1
→ Replica 2
→ Replica 3
Replication Lag
Replicas are eventually consistent. The lag is usually under 1 second but can spike during:
- Large batch writes or migrations
- Standby running expensive queries (resource contention)
- Network issues between primary and standby
The consistency problem: User writes a comment, then refreshes the page. The read goes to a replica that hasn't received the comment yet. User thinks their comment was lost.
Solutions:
- Read-your-own-writes. After a write, route that user's reads to the primary for a few seconds.
- Monotonic reads. Track the LSN (Log Sequence Number) of the last write and wait until the replica has reached that LSN before serving the read.
- Session stickiness. Pin a user's session to a specific replica (not great -- uneven load).
-- Check if replica has caught up to a specific LSN
SELECT pg_last_wal_replay_lsn() >= '0/1A3C5D8'::pg_lsn;
GitHub uses read replicas extensively. Their gh-ost tool for schema migrations was built specifically because ALTER TABLE on a primary blocks writes, and they needed a way to do schema changes without downtime across their replica fleet.
Connection Pooling with PgBouncer
This is the single most impactful operational optimization for PostgreSQL. Yet most candidates never mention it in interviews.
The Problem
PostgreSQL forks a new OS process for every client connection. Each process consumes ~10MB of RAM. The math is brutal:
| Connections | RAM for Connection Overhead |
|---|---|
| 100 | ~1 GB |
| 500 | ~5 GB |
| 1,000 | ~10 GB |
| 5,000 | ~50 GB |
A modern microservices architecture with 20 services, each running 10 pods, each with a connection pool of 10 = 2,000 connections. That's 20GB of RAM just for connection management, before PostgreSQL does any actual work.
Worse: PostgreSQL's max_connections default is 100. Each connection beyond ~300 degrades performance due to process scheduling and shared buffer contention. More connections doesn't mean more throughput. It means less.
PgBouncer: The Fix
PgBouncer sits between your application and PostgreSQL. It maintains a small pool of actual PostgreSQL connections and multiplexes hundreds of client connections onto them.
20 services × 10 pods × 10 connections = 2,000 application connections
↓
PgBouncer
↓
30 actual PostgreSQL connections
2,000 connections to PgBouncer, 30 to PostgreSQL. PgBouncer processes are lightweight (a few KB each, not 10MB).
PgBouncer Pooling Modes
| Mode | How It Works | Best For | Limitation |
|---|---|---|---|
| Session | Client gets a dedicated PG connection for the entire session | Apps using session-level features (LISTEN/NOTIFY, temp tables) | Minimal connection saving |
| Transaction | Client gets a PG connection for one transaction, then releases it | Most web applications | Can't use session-level features |
| Statement | Client gets a PG connection for one statement | Simple read-only queries | Can't use multi-statement transactions |
Transaction mode is the default choice. It gives you the best connection multiplexing while still supporting transactions. The only things it breaks are session-level features like prepared statements, SET commands, and LISTEN/NOTIFY.
# pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 30
min_pool_size = 10
reserve_pool_size = 5
reserve_pool_timeout = 3
Why Not Just Increase max_connections?
Because PostgreSQL performance degrades after ~300 connections regardless of available RAM. The bottleneck shifts to:
- Process context switching. Each connection is an OS process. The kernel scheduler thrashes with 1000+ processes.
- Lock contention. More concurrent processes = more contention on lightweight locks for shared buffers.
- Snapshot overhead. Each transaction snapshot must check against all active transaction IDs. More connections = bigger snapshots.
Benchmarks consistently show: 50 active PostgreSQL connections outperform 500 active connections on the same hardware, doing the same workload. This is unintuitive and trips up candidates who think "more connections = more parallelism."
Horizontal Sharding with Citus
When vertical scaling and read replicas aren't enough, you shard. Citus is a PostgreSQL extension (now part of Azure) that distributes tables across multiple PostgreSQL nodes.
-- Distribute the orders table by user_id
SELECT create_distributed_table('orders', 'user_id');
-- Queries that include user_id in WHERE clause are routed to one shard
SELECT * FROM orders WHERE user_id = 42; -- goes to one node
-- Queries without user_id hit all shards (expensive)
SELECT count(*) FROM orders WHERE status = 'pending'; -- scatter-gather
When Citus Makes Sense
| Pattern | Citus Fits? | Why |
|---|---|---|
| Multi-tenant SaaS (shard by tenant_id) | Yes | Each tenant's data lives on one shard. Queries are local. |
| Time-series analytics (shard by time) | Yes | Each time chunk is a shard. Old shards can be archived. |
| Social graph (query across users) | No | Friend-of-friend queries hit every shard. |
| Global search (no natural shard key) | No | Every query is scatter-gather. |
The key constraint: Citus works when your queries naturally include the shard key. If they don't, every query becomes a distributed scatter-gather operation and performance is worse than a single large PostgreSQL instance.
Aurora PostgreSQL
Amazon Aurora is PostgreSQL-compatible with a rewritten storage layer. It separates compute from storage, which changes the operational model.
How It Differs from Plain PostgreSQL
| Aspect | PostgreSQL on EC2/RDS | Aurora PostgreSQL |
|---|---|---|
| Storage | EBS volumes, you manage growth | Auto-scales to 128 TB, replicated 6 ways across 3 AZs |
| Replication | WAL streaming (minutes of lag possible) | Shared storage (replica lag typically <20ms) |
| Failover | Promote standby (30-120 seconds) | Automatic, typically <30 seconds |
| Backups | Manual snapshots or continuous archiving | Continuous, automatic, point-in-time recovery |
| Cost | Cheaper at small scale | 20-50% more expensive, but less ops burden |
When Aurora Makes Sense
Aurora shines when you want PostgreSQL semantics without the operational burden of managing replication, failover, and storage. It does NOT shard for you -- it's still a single-primary database. You get resilient storage and fast failover, not horizontal write scaling.
Spicy opinion: Aurora is the right default for any startup using PostgreSQL on AWS. The 20% cost premium pays for itself the first time you avoid a 2 AM storage emergency or a manual failover. Self-managing PostgreSQL replication is a skill you don't need unless you're saving more than $50K/month on database costs.
When PostgreSQL Is NOT the Right Choice
PostgreSQL is phenomenal. It's not for everything.
| Scenario | Why Not PostgreSQL | Better Option |
|---|---|---|
| Single table > 10TB | VACUUM, indexes, and backups become unmanageable | Citus, CockroachDB, or purpose-built OLAP (ClickHouse) |
| > 100K writes/sec sustained | Single-primary bottleneck | Cassandra, ScyllaDB, DynamoDB |
| Schema-less event streams | Rigid schema slows iteration; JSONB works but isn't native | MongoDB, DynamoDB |
| Full-text search > 50M docs with fuzzy matching | GIN full-text search lacks fuzzy/typo tolerance at scale | Elasticsearch |
| Graph traversals (friend-of-friend-of-friend) | Recursive CTEs work but don't scale | Neo4j, Neptune |
| Sub-millisecond key-value lookups | PostgreSQL's process-per-connection model adds latency | Redis, Memcached |
The right answer in an interview is never "PostgreSQL for everything." It's "PostgreSQL as the primary datastore, with specialized systems for workloads that don't fit."
Instagram's PostgreSQL Story
Instagram ran on sharded PostgreSQL from launch. At 30 million users, they had:
- Sharded PostgreSQL for user data, photos metadata, likes, comments
- 3 backend engineers managing the entire data layer
- Custom sharding logic that hashed user IDs to one of several thousand logical shards, mapped to a few dozen physical PostgreSQL instances
They didn't use Cassandra. They didn't use MongoDB. They used the database their team knew best and scaled it with application-level sharding plus PgBouncer for connection management.
The lesson: PostgreSQL scales further than you think. The bottleneck is usually missing indexes, bad connection management, or untuned autovacuum -- not the database itself.
Patterns for System Design Interviews
Pattern 1: "Design a system that stores 2 years of order history"
"Partition orders by month using range partitioning on created_at. Current month is hot -- optimized for writes with aggressive autovacuum. Historical partitions are rarely written, so VACUUM overhead is minimal. Dropping data older than 2 years is instant: DROP TABLE orders_2023_01. No DELETE, no VACUUM, no bloat."
Pattern 2: "How do you handle read-heavy traffic?"
"Read replicas behind a load balancer. Async streaming replication gives us sub-second lag. For the consistency problem, route reads after writes to the primary for 2 seconds using application-level routing. PgBouncer in front of each instance multiplexes application connections down to ~30 actual PostgreSQL connections."
Pattern 3: "Your database is running out of connections"
"PgBouncer in transaction mode. 2,000 application connections map to 30 PostgreSQL connections. Each connection only holds a server-side connection during an active transaction, then releases it. This alone typically solves the problem. If not, audit connection pool sizes -- most ORMs default to 10 connections per pod, which is too high."
Pattern 4: "Multi-tenant SaaS data isolation"
"Three options in ascending isolation: shared table with tenant_id column and row-level security, schema-per-tenant (each tenant gets their own PostgreSQL schema), or database-per-tenant. For most SaaS: shared table with Citus sharding by tenant_id. Queries include tenant_id by default (enforced by RLS), and Citus routes them to the right shard."
Trade-offs Table
| Decision | Upside | Downside |
|---|---|---|
| Range partitioning by date | Fast data removal, partition pruning on time queries | Unique constraints must include partition key |
| Async streaming replication | No write latency hit, simple setup | Possible data loss on primary crash (seconds of WAL) |
| Sync replication | Zero data loss | Write latency increases by network round-trip |
| Read replicas | Scales reads linearly | Replication lag causes stale reads |
| PgBouncer (transaction mode) | 10-100x connection multiplexing | Can't use session-level features (prepared statements, LISTEN/NOTIFY) |
| Citus sharding | Horizontal write scaling | Cross-shard queries are expensive; shard key in every query |
| Aurora PostgreSQL | Managed failover, auto-scaling storage | 20-50% cost premium; vendor lock-in |
| Hash partitioning | Even data distribution | No range-based partition pruning |

Interview Gotchas
Gotcha 1: 'We'll just add more replicas for writes'
Replicas are read-only. All writes go to the primary. Adding 10 replicas gives you 10x read capacity and exactly 1x write capacity. For write scaling, you need sharding (Citus) or a different database.
Gotcha 2: 'Set max_connections to 10,000'
PostgreSQL performance degrades after ~300 connections. More connections means more context switching, more lock contention, and bigger snapshots. PgBouncer is the answer, not max_connections.
Gotcha 3: 'Partition every table'
Partitioning adds planning overhead. For small tables (under 10M rows), the partition routing cost exceeds the benefit. Only partition tables where you have a concrete reason: data retention, query performance on the partition key, or maintenance isolation.
Gotcha 4: 'Replication lag is always low'
Under normal conditions, yes -- sub-second. But large transactions (batch updates, schema migrations), resource contention on replicas (expensive analytics queries), and network issues can push lag to minutes. Design for eventual consistency on replicas. Use read-your-own-writes patterns for critical paths.
Gotcha 5: 'PostgreSQL can scale infinitely'
No single-primary database can. PostgreSQL tops out around 50K-100K writes/sec depending on hardware and row size. Beyond that, you need sharding, partitioning across nodes, or a fundamentally different architecture. The Instagram story is inspiring but they had custom sharding -- not a single PostgreSQL instance serving 30 million users.
Key Takeaways
- Partition for data lifecycle, not just performance. The ability to
DROP TABLEinstead ofDELETE FROMis the real win. - Replication is for read scaling and HA. It does not help with write throughput.
- PgBouncer is non-negotiable in production. Any PostgreSQL deployment handling more than a few hundred connections needs it. Full stop.
- Aurora is the right default on AWS. Self-managing PostgreSQL replication is only worth it at significant scale.
- PostgreSQL scales further than you think. Before reaching for NoSQL, ask: have you tried partitioning, read replicas, PgBouncer, and better indexes? Instagram's 3-engineer team says hi.