ETL, ELT, and the Modern Data Stack
TL;DR
ETL transforms data before loading it into the warehouse; ELT loads raw data first and transforms inside the warehouse — and ELT is winning because modern warehouses are cheap enough and fast enough to do the heavy lifting themselves.
The Core Problem

Your production databases are optimized for serving users. Your analytics warehouse is optimized for answering business questions. Data needs to flow from one to the other.
That flow has three steps:
- Extract: pull data from source systems (PostgreSQL, MongoDB, APIs, logs)
- Transform: clean, reshape, enrich, and aggregate the data
- Load: put the data into the analytics warehouse
The question is: when do you transform? Before loading (ETL) or after (ELT)?
This isn't an academic distinction. It determines your tooling, your costs, your latency, and how many engineers you need.
ETL: The Traditional Approach
Extract → Transform → Load.
You pull raw data from sources, transform it in a separate compute layer (Spark, Informatica, custom Python scripts), then load the clean data into the warehouse.
How this looked in 2010:
- Nightly cron job runs at 2 AM
- SQL queries extract yesterday's data from PostgreSQL
- Spark job or Python script cleans nulls, deduplicates, joins reference data
- Clean data loaded into Oracle/Teradata warehouse
- Analysts come in at 9 AM and query yesterday's data
This worked. For decades. Major banks, retailers, and telecoms ran their entire analytics on nightly ETL. Some still do.
The problem: the transform step is a bottleneck. Every new data source needs custom transform code. Every schema change in the source breaks the transform. The ETL engineers become a gatekeeping team that every analyst depends on. Nothing moves fast.
Spotify's early data platform used custom ETL jobs in Luigi (their workflow manager). As they scaled, the transform layer became the slowest-moving part of their data infrastructure.
ELT: The Modern Approach
Extract → Load → Transform.
You pull raw data from sources, load it directly into the warehouse as-is, then transform it using SQL inside the warehouse.
Why this works now: warehouses got cheap and fast. In 2010, warehouse compute was expensive — you wanted to minimize what went in. In 2024, BigQuery charges $5/TB scanned, Redshift scales elastically, and Snowflake auto-suspends unused compute. Loading raw data and transforming it later costs pennies compared to maintaining a separate Spark cluster.
Spicy opinion: ETL was a workaround for expensive warehouses. Now that warehouses are cheap, the workaround is unnecessary. ELT is simpler, faster to develop, and easier to debug. If you're starting a new data stack in 2024, default to ELT unless you have a specific reason not to.
dbt: The Tool That Made ELT Practical
dbt (data build tool) is the reason ELT won. It turns warehouse transformations into a software engineering workflow.
Before dbt, transformations were stored procedures, scheduled SQL scripts, and tribal knowledge. Nobody tested them. Nobody version-controlled them. Nobody documented them.
dbt changes this:
-- models/staging/stg_orders.sql
-- This is a dbt model. It's just SQL with Jinja templating.
SELECT
id AS order_id,
user_id,
CAST(total_cents AS DECIMAL(10,2)) / 100 AS total_amount,
status,
created_at
FROM {{ source('production', 'orders') }}
WHERE status != 'cancelled' -- filter junk early
-- models/marts/revenue_by_country.sql
SELECT
u.country,
DATE_TRUNC('month', o.created_at) AS month,
SUM(o.total_amount) AS revenue,
COUNT(DISTINCT o.user_id) AS customers
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_users') }} u ON o.user_id = u.user_id
GROUP BY 1, 2
What dbt gives you:
- Version control: models are SQL files in Git. PR reviews for data transformations.
- Testing:
dbt testchecks for nulls, uniqueness, referential integrity. - Documentation: auto-generated docs and lineage graphs.
- Dependency management:
{{ ref('stg_orders') }}creates a DAG. dbt runs models in the right order. - Incremental models: only process new/changed rows instead of full table rebuilds.
# tests/schema.yml
models:
- name: stg_orders
columns:
- name: order_id
tests:
- not_null
- unique
- name: total_amount
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
max_value: 100000
GitLab, JetBlue, and Hubspot all use dbt in production. It's become the default tool for the "T" in ELT.
The Data Pipeline: End to End
Here's what a modern data pipeline looks like.
Batch Pipeline
┌──────────┐ ┌──────────────┐ ┌───────────┐ ┌───────────┐
│ Source DB │───▶│ Airbyte / │───▶│ Warehouse │───▶│ dbt │
│ (Postgres)│ │ Fivetran │ │ (raw data)│ │ (transform│
└──────────┘ │ (extract+load)│ └───────────┘ │ in SQL) │
└──────────────┘ └─────┬─────┘
│
┌───────▼───────┐
│ Clean tables │
│ (marts) │
│ → BI tools │
│ → Dashboards │
└───────────────┘
- Airbyte (open-source) or Fivetran (managed) handle extraction from 300+ sources. They pull data, handle schema changes, and load raw data into the warehouse.
- dbt transforms the raw data into clean, tested, documented tables.
- BI tools (Looker, Metabase, Tableau) query the clean tables.
Frequency: hourly or daily. Most analytics don't need real-time data.
Streaming Pipeline
┌──────────┐ ┌──────────┐ ┌───────┐ ┌───────────┐
│ Source DB │───▶│ Debezium │───▶│ Kafka │───▶│ ClickHouse│
│ (Postgres)│ │ (CDC) │ │ │ │ (real-time│
└──────────┘ └──────────┘ └───────┘ │ OLAP) │
└───────────┘
- Debezium captures every INSERT, UPDATE, and DELETE from the source database's transaction log (WAL in PostgreSQL, binlog in MySQL). This is Change Data Capture (CDC).
- Kafka buffers the changes and provides ordering guarantees.
- ClickHouse (or Druid/Pinot) ingests from Kafka and makes data queryable within seconds.
Use this when you need near-real-time analytics: fraud detection dashboards, live metrics, real-time recommendation updates.
CDC: Change Data Capture
CDC deserves special attention because it's the best way to get data out of a production database without hurting it.
Old approach: query the production database directly.
-- Every hour, run this against production PostgreSQL:
SELECT * FROM orders WHERE updated_at > :last_run_time;
Problems: adds load to the production database, might miss updates between runs (if updated_at isn't indexed or transactions are long), and doesn't capture deletes.
CDC approach: read the database's transaction log.
PostgreSQL WAL → Debezium → Kafka → Warehouse
Debezium reads the WAL (Write-Ahead Log). Every committed
change is captured — inserts, updates, AND deletes.
Zero impact on production query performance.
Debezium is the standard open-source CDC tool. It supports PostgreSQL, MySQL, MongoDB, SQL Server, and Oracle. Managed alternatives include AWS DMS, Fivetran's CDC connectors, and Airbyte's CDC mode.
Batch vs Streaming: When Does It Matter?
Spicy opinion: 90% of analytics workloads are fine with hourly or daily batch. Teams that build streaming pipelines "because real-time sounds better" end up with more complexity, more failure modes, and data that's 30 seconds fresher than batch — which nobody actually notices.
| Scenario | Batch or Streaming? | Why |
|---|---|---|
| Executive dashboards | Batch (daily) | Nobody refreshes at midnight |
| Product analytics | Batch (hourly) | "What happened today" is fine |
| Fraud detection | Streaming | Need to block transactions in real time |
| Ad click attribution | Streaming | 30-minute window for attribution |
| Financial reporting | Batch (daily) | Regulated, needs reconciliation |
| Live monitoring | Streaming | Alert on anomalies immediately |
| ML feature store | Depends | Training = batch. Serving = streaming |
Decision rule: if a human is the consumer, batch is fine. If a machine is the consumer (fraud model, recommendation engine), streaming might be necessary.
The Data Lake
A data lake is just object storage (S3, GCS, Azure Blob) holding files in columnar formats like Parquet or ORC. It's the cheapest way to store large datasets.
Data Lake (S3):
s3://data-lake/raw/orders/2024/01/15/part-00001.parquet
s3://data-lake/raw/orders/2024/01/15/part-00002.parquet
s3://data-lake/raw/events/2024/01/15/hour=00/part-00001.parquet
s3://data-lake/raw/events/2024/01/15/hour=01/part-00001.parquet
Query engines: Athena (AWS), Presto/Trino, Spark SQL can query Parquet files in S3 directly. No warehouse needed for occasional queries.
-- AWS Athena: query S3 directly
SELECT country, COUNT(*) FROM "s3://data-lake/raw/orders/"
WHERE year = 2024 AND month = 1
GROUP BY country;
When data lake vs warehouse:
| Aspect | Data Lake (S3) | Data Warehouse |
|---|---|---|
| Cost | $0.023/GB/month (S3) | $0.25-2/GB/month |
| Query speed | Seconds to minutes | Sub-second to seconds |
| Data types | Any (Parquet, JSON, CSV, images) | Structured (tables) |
| Schema | Schema-on-read | Schema-on-write |
| Governance | DIY | Built-in (permissions, auditing) |
Store everything in the lake. Load only what you need into the warehouse. This two-tier approach saves money and keeps your warehouse focused on clean, queryable data.
The Lakehouse: Best of Both Worlds
The lakehouse pattern adds ACID transactions and table management to a data lake. Instead of choosing between lake and warehouse, you get both.
Three competing formats:
Delta Lake (Databricks)
s3://data-lake/orders/
├── _delta_log/ ← transaction log
│ ├── 00000.json ← version 0: created table
│ ├── 00001.json ← version 1: inserted 1M rows
│ └── 00002.json ← version 2: updated 500 rows
├── part-00001.parquet ← actual data files
└── part-00002.parquet
Delta Lake stores a transaction log alongside the Parquet files. This enables ACID transactions, time travel, and MERGE operations — things Parquet files alone can't do.
Apache Iceberg (Netflix)
Netflix built Iceberg because they needed reliable table operations on petabyte-scale datasets in S3. Iceberg tracks every file in a metadata layer, enabling:
- Hidden partitioning: partition by month without changing query syntax.
- Schema evolution: add, rename, or remove columns without rewriting data.
- Time travel: query any historical snapshot.
Iceberg is gaining momentum fast. Snowflake, Spark, Trino, Flink, and Dremio all support it natively.
Apache Hudi (Uber)
Uber built Hudi for incremental processing — updating individual records in a data lake without rewriting entire partitions. Best for CDC workloads where you need upserts (insert or update).
Spicy opinion: Iceberg is winning the lakehouse format war. It has the broadest ecosystem support, the cleanest spec, and Netflix's backing gives it credibility. Delta Lake is Databricks' play (strong but vendor-tilted), and Hudi is narrower in scope. Bet on Iceberg for new projects.
The Modern Data Stack (End to End)
┌─────────────────────────────────────────────────────┐
│ SOURCES │
│ PostgreSQL, MongoDB, Stripe API, Segment, Logs │
└──────────────────────┬──────────────────────────────┘
│
┌────────▼────────┐
│ EXTRACT + LOAD │
│ Fivetran/Airbyte│
└────────┬────────┘
│
┌────────▼────────┐
│ RAW LAYER │
│ (warehouse) │
│ Untouched data │
└────────┬────────┘
│
┌────────▼────────┐
│ TRANSFORM │
│ dbt │
│ SQL models │
│ Tests │
└────────┬────────┘
│
┌────────▼────────┐
│ MARTS │
│ Clean tables │
│ Business logic │
└────────┬────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌─────────┐ ┌──────────┐ ┌──────────┐
│ Looker │ │ Notebooks│ │ Reverse │
│ Metabase│ │ (Jupyter)│ │ ETL │
│ Tableau │ │ │ │ (Census, │
└─────────┘ └──────────┘ │ Hightouch)│
└──────────┘
Reverse ETL is the newest layer. Tools like Census and Hightouch push analytics data back into operational tools — Salesforce, HubSpot, Braze. "Your best customers by LTV" computed in the warehouse, synced to Salesforce as a list.
Patterns for System Design Interviews
Pattern 1: Data separation. "OLTP database for transactions, replicate to OLAP warehouse for analytics. Never query production for reports."
Pattern 2: CDC for real-time replication. "Use Debezium to capture changes from the production database's transaction log. Feed through Kafka into the warehouse. Near-real-time, zero impact on production."
Pattern 3: Batch is usually enough. "For this dashboard, hourly batch ETL is fine. We don't need real-time analytics for a report that executives check once a day." Shows you can evaluate trade-offs rather than defaulting to the most complex option.
Pattern 4: Lakehouse for flexibility. "Store raw data in S3 as Parquet with Iceberg metadata. Query directly with Trino for ad-hoc analysis. Load aggregated data into Redshift for dashboards."
Trade-offs Table
| Approach | Latency | Complexity | Cost | Best For |
|---|---|---|---|---|
| Batch ETL (Spark) | Hours | High (custom code) | Medium | Legacy, complex transforms |
| Batch ELT (dbt) | Hours | Low (SQL + Git) | Low | Most analytics workloads |
| CDC → Kafka → OLAP | Seconds | High (ops burden) | Medium-High | Real-time dashboards |
| Managed EL (Fivetran) | Minutes-hours | Lowest | High (per-connector pricing) | Teams without data engineers |
| Data lake (S3 + Athena) | Minutes | Low | Lowest | Ad-hoc exploration |
| Lakehouse (Iceberg) | Seconds-minutes | Medium | Low-Medium | Unified batch + ad-hoc |

Interview Gotchas
Gotcha 1: "Why not just query the production database for analytics?" Analytical queries (full table scans, aggregations) compete with transactional queries for CPU, memory, and I/O. A heavy dashboard refresh can cause timeouts for user-facing requests. Always separate.
Gotcha 2: "When would you use ETL instead of ELT?" When transforms are compute-intensive and the warehouse can't handle them (e.g., ML feature engineering with custom Python, image processing, complex graph algorithms). Or when data must be cleaned before it enters the warehouse for compliance reasons (PII stripping).
Gotcha 3: "What's the difference between CDC and polling?"
Polling runs a SELECT WHERE updated_at > X query against production. It adds load, misses deletes, and has a blind spot between polls. CDC reads the transaction log — zero query load, captures everything, and maintains ordering.
Gotcha 4: "Is real-time analytics always better?" No. Real-time pipelines are more complex, more expensive, and harder to debug. If the consumer is a human looking at a dashboard, hourly refresh is indistinguishable from real-time. Save streaming for machine consumers (fraud detection, recommendation engines).
Gotcha 5: "What is dbt and why does everyone use it?" dbt is a SQL-based transformation framework. It turns warehouse SQL into version-controlled, tested, documented models with dependency management. It's popular because it lets analytics engineers (SQL-proficient, not necessarily Python-proficient) own the entire transform layer without depending on data engineering teams.