ClickHouse vs Redshift vs BigQuery vs Snowflake

TL;DR

ClickHouse is the speed demon you self-host, Redshift is the AWS workhorse, BigQuery is Google's serverless magic, and Snowflake is the cloud-agnostic diplomat — pick based on your cloud, team, and whether you want to manage infrastructure.

Why This Comparison Matters

Columnar Compression

In a system design interview, you'll never need to justify ClickHouse vs Redshift in detail. Say "column store like ClickHouse or Redshift" and move on. But understanding why these tools exist differently teaches you real architectural trade-offs: managed vs self-hosted, compute-storage coupling vs separation, real-time vs batch, serverless vs provisioned.

These are the same trade-offs you make when designing any system. The OLAP landscape just makes them visible.

ClickHouse: The Speed Demon

Yandex built ClickHouse to power Yandex Metrica — their web analytics platform that processes 20+ billion events per day. They needed to count ad clicks, page views, and user sessions across billions of rows in real time. No existing tool was fast enough, so they built their own.

ClickHouse is open-source, self-hosted (or managed via ClickHouse Cloud), and absurdly fast for a specific workload: appending data and querying aggregations.

MergeTree Engine

The heart of ClickHouse is the MergeTree engine family. It's a log-structured merge tree optimized for column storage.

Insert Flow:
  Data arrives → written to in-memory buffer
  Buffer fills → flushed as a new "part" (sorted by primary key)
  Background merges combine small parts into larger parts
  → Similar to LSM trees, but column-oriented

On Disk:
  /data/table/part_1/
    ├── user_id.bin        (compressed column)
    ├── event_type.bin     (compressed column)
    ├── timestamp.bin      (compressed column)
    ├── primary.idx        (sparse index, every N-th row)
    └── minmax_timestamp.idx (zone map)

Sparse primary index: unlike B-trees that index every row, ClickHouse indexes every 8,192 rows (configurable). This keeps the index small enough to fit in memory even for billion-row tables. The trade-off: point lookups are slower, but ClickHouse doesn't care — it's built for scans.

What Makes ClickHouse Fast

Vectorized execution: processes columns in batches using SIMD instructions.
Aggressive compression: LZ4 by default, ZSTD for better ratios. 10-40x compression on real data.
Parallel processing: queries automatically use all CPU cores.
Specialized codecs: delta encoding for timestamps, Gorilla for floats, T64 for integers.
No transaction overhead: no MVCC, no row-level locking, no undo logs.

Spicy opinion: ClickHouse is the most underrated database in the industry. Teams spend $50K/month on Snowflake for workloads that ClickHouse could handle on three $500/month servers. The catch is you have to operate it yourself — and most teams underestimate that cost.

Who Uses ClickHouse

Cloudflare: analytics for their CDN (trillions of events)
Uber: real-time logging and analytics
eBay: user behavior analytics
GitLab: product analytics
Spotify: event processing pipeline

When ClickHouse

Real-time analytics (sub-second queries on fresh data)
High ingest rate (millions of rows per second)
Cost-sensitive (open-source, runs on commodity hardware)
Already have Kafka or a streaming pipeline feeding data in
Team can operate a database (upgrades, monitoring, backups)

When Not ClickHouse

Need managed service with zero ops (use BigQuery or Snowflake)
Need concurrent ad-hoc queries from many analysts (ClickHouse has limited concurrency)
Need to update/delete individual rows frequently (MergeTree is append-optimized)

Redshift: The AWS Workhorse

Amazon Redshift launched in 2012, based on ParAccel (a PostgreSQL fork for column storage). It's the default OLAP choice for AWS-native teams.

Architecture

Redshift was originally compute-storage coupled — your data lived on the same nodes that ran your queries. This meant you paid for compute to store cold data.

Original Redshift (DC2 nodes):
┌─────────────────────────────────┐
│ Leader Node (query planning)    │
├─────────────────────────────────┤
│ Compute Node 1                  │
│  ├── CPU + RAM                  │
│  └── Local SSD (data stored here)│
│ Compute Node 2                  │
│  ├── CPU + RAM                  │
│  └── Local SSD                  │
│ Compute Node 3                  │
│  ├── CPU + RAM                  │
│  └── Local SSD                  │
└─────────────────────────────────┘
Scale compute = scale storage (expensive)

Then came RA3 nodes and Redshift Serverless, which separated compute and storage:

Modern Redshift (RA3 / Serverless):
┌─────────────────────────────────┐
│ Compute Layer (scales independently)│
│  ├── Leader Node                │
│  ├── Compute Node 1 (hot cache) │
│  ├── Compute Node 2 (hot cache) │
│  └── Compute Node 3 (hot cache) │
├─────────────────────────────────┤
│ Storage Layer: S3 Managed Storage│
│  (pay only for data stored)     │
└─────────────────────────────────┘

Key Features

PostgreSQL-compatible SQL: most PostgreSQL clients and tools work with Redshift.
Distribution keys: you choose how data is distributed across nodes (KEY, EVEN, ALL). Good distribution key = queries don't shuffle data between nodes.
Sort keys: like ClickHouse's ORDER BY. Critical for zone map effectiveness.
Spectrum: query data directly in S3 (Parquet, CSV, JSON) without loading it into Redshift. Turns your data lake into a queryable layer.
Concurrency Scaling: automatically spins up additional clusters for burst query load.

When Redshift

Deep in the AWS ecosystem (S3, Glue, Kinesis, SageMaker)
Team knows PostgreSQL (Redshift SQL is familiar)
Need Spectrum for data lake queries
Predictable workloads where provisioned capacity makes sense

When Not Redshift

Multi-cloud (Redshift is AWS-only)
Truly serverless (Redshift Serverless exists but pricing can surprise you)
Real-time ingest (Redshift is batch-oriented; Streaming Ingestion exists but ClickHouse is better here)

BigQuery: Google's Serverless Bet

BigQuery is Google's fully managed, serverless data warehouse. There are no clusters to provision, no nodes to configure, no disks to manage. You upload data and run SQL. Google handles everything else.

Architecture: Dremel Under the Hood

BigQuery is built on Google's internal Dremel engine (published in a 2010 paper). The key innovation: separation of storage and compute at Google scale.

BigQuery Architecture:
┌────────────────────────────────────┐
│ Query Engine (Dremel)              │
│  ├── Root server (query planning)  │
│  ├── Mixer nodes (aggregation)     │
│  └── Leaf nodes (column scanning)  │
│  Thousands of nodes, auto-scaled   │
├────────────────────────────────────┤
│ Storage (Colossus / GCS)           │
│  Columnar format (Capacitor)       │
│  Petabytes. You never manage this. │
└────────────────────────────────────┘

Compute is ephemeral — allocated per query.
Storage is persistent — always available.

Pricing Model: Slots vs On-Demand

This is where BigQuery gets interesting (and confusing).

On-demand: $5 per TB scanned. No upfront cost. Great for occasional queries. Terrible if you scan 100 TB daily.
Slots (capacity pricing): buy a fixed number of compute "slots" (vCPUs). Predictable cost. Better for heavy, consistent workloads.

On-demand example:
  Query scans 500 GB → costs $2.50
  100 queries/day × 500 GB → $250/day → $7,500/month

Slots example:
  100 slots → ~$2,000/month (flat rate)
  Run unlimited queries on those slots
  → Cheaper if you run lots of queries

Spicy opinion: BigQuery's on-demand pricing is a trap for production analytics. It's amazing for exploration and one-off queries. But the moment your data scientists start scheduling dashboards, costs spike unpredictably. Switch to slots for production workloads.

Key Features

Zero infrastructure: no clusters, no nodes, no disks. Just SQL.
Automatic scaling: Google allocates compute dynamically per query.
Federated queries: query data in Cloud SQL, Sheets, Drive, Bigtable without moving it.
ML integration: CREATE MODEL — train ML models in SQL directly in BigQuery.
Time travel: query data as it existed at any point in the last 7 days.
BI Engine: in-memory analysis layer for sub-second dashboard queries.

When BigQuery

Google Cloud native
Want zero infrastructure management
Occasional large analytical queries (on-demand pricing shines)
Federated queries across multiple data sources
Data science teams who want SQL-based ML

When Not BigQuery

AWS or Azure shop (use Redshift or Synapse)
Real-time analytics with sub-second latency (BigQuery has seconds of overhead per query)
Need to control costs precisely (on-demand pricing is unpredictable)

Snowflake: The Cloud-Agnostic Diplomat

Snowflake was founded in 2012 with a radical idea: build a data warehouse from scratch for the cloud, with no legacy baggage, and make it work on every cloud provider.

Architecture: Virtual Warehouses

Snowflake's architecture separates three layers:

┌─────────────────────────────────────┐
│ Cloud Services Layer                │
│  Query optimization, metadata,     │
│  access control, transaction mgmt  │
├─────────────────────────────────────┤
│ Compute Layer: Virtual Warehouses   │
│  ┌─────────┐ ┌─────────┐          │
│  │  WH-1   │ │  WH-2   │          │
│  │ (XS, 1  │ │ (L, 8   │          │
│  │  node)  │ │  nodes) │          │
│  └─────────┘ └─────────┘          │
│  Independently scalable             │
│  Pause when idle (stop billing)     │
├─────────────────────────────────────┤
│ Storage Layer (S3 / GCS / Azure Blob)│
│  Micro-partitions (columnar,       │
│  compressed, immutable)            │
└─────────────────────────────────────┘

Virtual warehouses are named compute clusters. You can have wh_etl for loading data, wh_analysts for ad-hoc queries, and wh_dashboards for BI tools — each with independent sizing, concurrency, and billing. One team's heavy query doesn't slow down another team.

Key Features

Multi-cloud: runs on AWS, GCP, and Azure. Same SQL, same features.
Time travel: query historical data (up to 90 days). SELECT * FROM orders AT(TIMESTAMP => '2024-01-15 10:00:00').
Zero-copy cloning: clone a table or entire database instantly for testing. No data copied.
Data sharing: share live data with external organizations without copying.
Automatic clustering: Snowflake re-sorts data automatically based on query patterns. You don't manage sort keys.
Pause/resume: virtual warehouses stop billing when idle. No queries = no compute cost.

When Snowflake

Multi-cloud strategy (avoid AWS/GCP lock-in)
Need compute isolation between teams
Data sharing with partners or customers
Want auto-suspend/resume to save costs on bursty workloads
Teams that don't want to manage sort keys or distribution strategies

When Not Snowflake

Cost-sensitive (Snowflake is the most expensive option on this list)
Real-time analytics (Snowflake is batch-oriented)
Need full control over infrastructure (it's fully managed, you can't tune internals)

The Comparison Table

Feature	ClickHouse	Redshift	BigQuery	Snowflake
Hosting	Self-hosted + Cloud	AWS managed	GCP serverless	Multi-cloud managed
Pricing	Infrastructure cost	Per-node / serverless	Per-TB scanned or slots	Per-credit (compute time)
Real-time ingest	Excellent (millions/s)	Moderate (Streaming Ingestion)	Good (Streaming API)	Moderate (Snowpipe)
Query latency	Sub-second (tuned)	Seconds	Seconds	Seconds
Concurrency	Limited (~100)	Moderate (Concurrency Scaling)	High (auto-scaled)	High (multi-warehouse)
SQL compatibility	ClickHouse SQL	PostgreSQL-like	Standard SQL	Standard SQL
Ecosystem	Kafka, Grafana	AWS (S3, Glue, SageMaker)	GCP (Dataflow, Vertex AI)	Cloud-agnostic
Operational burden	High (self-managed)	Medium	None	None
Updates/deletes	Limited (async)	Supported (slow)	DML supported	DML supported
Best for	Real-time analytics	AWS-native analytics	Serverless analytics	Multi-cloud analytics

Patterns for System Design Interviews

Pattern 1: "We need analytics/reporting." Say "OLAP database like ClickHouse or Redshift." Don't bikeshed on which one. The interviewer cares that you separated OLTP and OLAP, not which specific tool you chose.

Pattern 2: "Real-time dashboard showing last 5 minutes of data." Mention ClickHouse or Druid. They handle real-time ingest and sub-second queries. Redshift and BigQuery have seconds of overhead that makes sub-second dashboards difficult.

Pattern 3: "We need to analyze historical data for trends." Any of these tools works. Mention batch loading from your OLTP database via CDC or ETL. Historical analysis is the bread-and-butter use case for all four.

Pattern 4: "Multiple teams need independent analytics workloads." Snowflake's virtual warehouses or BigQuery's slot reservations. Both provide compute isolation so one team's heavy query doesn't impact another.

Pattern 5: "Cost is a primary concern." ClickHouse (self-hosted) for teams that can operate it. BigQuery on-demand for light usage. Redshift reserved instances for steady workloads. Snowflake is rarely the cheapest option.

Trade-offs Table

Trade-off	Winners	Losers
Lowest cost	ClickHouse (self-hosted)	Snowflake
Least ops burden	BigQuery	ClickHouse
Fastest real-time	ClickHouse	Redshift, Snowflake
Best multi-cloud	Snowflake	Redshift (AWS-only)
Best ecosystem	Redshift (AWS), BigQuery (GCP)	ClickHouse (DIY)
Best concurrency	BigQuery, Snowflake	ClickHouse
Best SQL compat	Redshift (PostgreSQL), BigQuery	ClickHouse (custom SQL)

OLAP Comparison

Interview Gotchas

Gotcha 1: "Which OLAP database should I use in my design?" Don't pick one unless asked. Say "column store like ClickHouse or Redshift." The architectural decision (separate OLTP and OLAP) matters more than the specific tool.

Gotcha 2: "Can ClickHouse replace PostgreSQL?" No. ClickHouse is terrible for OLTP workloads — it doesn't support efficient point lookups, single-row updates, or transactions. It's append-optimized. Use PostgreSQL for OLTP, ClickHouse for OLAP.

Gotcha 3: "Is BigQuery expensive?" Depends on how you use it. A single full-scan query on a 10 TB table costs $50 (on-demand). If you run that query hourly, you're paying $1,200/day. Partition your tables, use clustering, and switch to slots for production workloads.

Gotcha 4: "Why does Snowflake cost more than Redshift?" Snowflake charges per-second for compute (credits) plus storage. The credit rate is higher than equivalent Redshift node costs. You're paying for multi-cloud, zero-copy cloning, time travel, and operational simplicity. Whether that premium is worth it depends on your team and workload.

Gotcha 5: "What about Apache Druid or Apache Pinot?" They're real-time OLAP engines designed for sub-second queries on streaming data. LinkedIn built Pinot, and Metamarkets (now Snap) built Druid. Use them when you need real-time dashboards with pre-aggregation. They're more specialized than the four databases we covered — optimized for a narrower set of access patterns.