ETL, ELT, and the Modern Data Stack

TL;DR

ETL transforms data before loading it into the warehouse; ELT loads raw data first and transforms inside the warehouse — and ELT is winning because modern warehouses are cheap enough and fast enough to do the heavy lifting themselves.

The Core Problem

ETL Vs ELT

Your production databases are optimized for serving users. Your analytics warehouse is optimized for answering business questions. Data needs to flow from one to the other.

That flow has three steps:

Extract: pull data from source systems (PostgreSQL, MongoDB, APIs, logs)
Transform: clean, reshape, enrich, and aggregate the data
Load: put the data into the analytics warehouse

The question is: when do you transform? Before loading (ETL) or after (ELT)?

This isn't an academic distinction. It determines your tooling, your costs, your latency, and how many engineers you need.

ETL: The Traditional Approach

Extract → Transform → Load.

Source DBs ──Extract──▶ ETL Server ──Transform──▶ Load ──▶ Warehouse
                        (Spark, custom code)

You pull raw data from sources, transform it in a separate compute layer (Spark, Informatica, custom Python scripts), then load the clean data into the warehouse.

How this looked in 2010:

Nightly cron job runs at 2 AM
SQL queries extract yesterday's data from PostgreSQL
Spark job or Python script cleans nulls, deduplicates, joins reference data
Clean data loaded into Oracle/Teradata warehouse
Analysts come in at 9 AM and query yesterday's data

This worked. For decades. Major banks, retailers, and telecoms ran their entire analytics on nightly ETL. Some still do.

The problem: the transform step is a bottleneck. Every new data source needs custom transform code. Every schema change in the source breaks the transform. The ETL engineers become a gatekeeping team that every analyst depends on. Nothing moves fast.

Spotify's early data platform used custom ETL jobs in Luigi (their workflow manager). As they scaled, the transform layer became the slowest-moving part of their data infrastructure.

ELT: The Modern Approach

Extract → Load → Transform.

Source DBs ──Extract──▶ Load (raw) ──▶ Warehouse ──Transform──▶ Clean Tables
                                       (dbt, SQL)

You pull raw data from sources, load it directly into the warehouse as-is, then transform it using SQL inside the warehouse.

Why this works now: warehouses got cheap and fast. In 2010, warehouse compute was expensive — you wanted to minimize what went in. In 2024, BigQuery charges $5/TB scanned, Redshift scales elastically, and Snowflake auto-suspends unused compute. Loading raw data and transforming it later costs pennies compared to maintaining a separate Spark cluster.

Spicy opinion: ETL was a workaround for expensive warehouses. Now that warehouses are cheap, the workaround is unnecessary. ELT is simpler, faster to develop, and easier to debug. If you're starting a new data stack in 2024, default to ELT unless you have a specific reason not to.

dbt: The Tool That Made ELT Practical

dbt (data build tool) is the reason ELT won. It turns warehouse transformations into a software engineering workflow.

Before dbt, transformations were stored procedures, scheduled SQL scripts, and tribal knowledge. Nobody tested them. Nobody version-controlled them. Nobody documented them.

dbt changes this:

-- models/staging/stg_orders.sql
-- This is a dbt model. It's just SQL with Jinja templating.

SELECT
    id AS order_id,
    user_id,
    CAST(total_cents AS DECIMAL(10,2)) / 100 AS total_amount,
    status,
    created_at
FROM {{ source('production', 'orders') }}
WHERE status != 'cancelled'  -- filter junk early

-- models/marts/revenue_by_country.sql
SELECT
    u.country,
    DATE_TRUNC('month', o.created_at) AS month,
    SUM(o.total_amount) AS revenue,
    COUNT(DISTINCT o.user_id) AS customers
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_users') }} u ON o.user_id = u.user_id
GROUP BY 1, 2

What dbt gives you:

Version control: models are SQL files in Git. PR reviews for data transformations.
Testing: dbt test checks for nulls, uniqueness, referential integrity.
Documentation: auto-generated docs and lineage graphs.
Dependency management: {{ ref('stg_orders') }} creates a DAG. dbt runs models in the right order.
Incremental models: only process new/changed rows instead of full table rebuilds.

# tests/schema.yml
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000

GitLab, JetBlue, and Hubspot all use dbt in production. It's become the default tool for the "T" in ELT.

The Data Pipeline: End to End

Here's what a modern data pipeline looks like.

Batch Pipeline

┌──────────┐    ┌──────────────┐    ┌───────────┐    ┌───────────┐
│ Source DB │───▶│ Airbyte /    │───▶│ Warehouse │───▶│ dbt       │
│ (Postgres)│    │ Fivetran     │    │ (raw data)│    │ (transform│
└──────────┘    │ (extract+load)│    └───────────┘    │  in SQL)  │
                └──────────────┘                      └─────┬─────┘
                                                            │
                                                    ┌───────▼───────┐
                                                    │ Clean tables  │
                                                    │ (marts)       │
                                                    │ → BI tools    │
                                                    │ → Dashboards  │
                                                    └───────────────┘

Airbyte (open-source) or Fivetran (managed) handle extraction from 300+ sources. They pull data, handle schema changes, and load raw data into the warehouse.
dbt transforms the raw data into clean, tested, documented tables.
BI tools (Looker, Metabase, Tableau) query the clean tables.

Frequency: hourly or daily. Most analytics don't need real-time data.

Streaming Pipeline

┌──────────┐    ┌──────────┐    ┌───────┐    ┌───────────┐
│ Source DB │───▶│ Debezium │───▶│ Kafka │───▶│ ClickHouse│
│ (Postgres)│    │ (CDC)    │    │       │    │ (real-time│
└──────────┘    └──────────┘    └───────┘    │  OLAP)    │
                                              └───────────┘

Debezium captures every INSERT, UPDATE, and DELETE from the source database's transaction log (WAL in PostgreSQL, binlog in MySQL). This is Change Data Capture (CDC).
Kafka buffers the changes and provides ordering guarantees.
ClickHouse (or Druid/Pinot) ingests from Kafka and makes data queryable within seconds.

Use this when you need near-real-time analytics: fraud detection dashboards, live metrics, real-time recommendation updates.

CDC: Change Data Capture

CDC deserves special attention because it's the best way to get data out of a production database without hurting it.

Old approach: query the production database directly.

-- Every hour, run this against production PostgreSQL:
SELECT * FROM orders WHERE updated_at > :last_run_time;

Problems: adds load to the production database, might miss updates between runs (if updated_at isn't indexed or transactions are long), and doesn't capture deletes.

CDC approach: read the database's transaction log.

PostgreSQL WAL → Debezium → Kafka → Warehouse

Debezium reads the WAL (Write-Ahead Log). Every committed
change is captured — inserts, updates, AND deletes.
Zero impact on production query performance.

Debezium is the standard open-source CDC tool. It supports PostgreSQL, MySQL, MongoDB, SQL Server, and Oracle. Managed alternatives include AWS DMS, Fivetran's CDC connectors, and Airbyte's CDC mode.

Batch vs Streaming: When Does It Matter?

Spicy opinion: 90% of analytics workloads are fine with hourly or daily batch. Teams that build streaming pipelines "because real-time sounds better" end up with more complexity, more failure modes, and data that's 30 seconds fresher than batch — which nobody actually notices.

Scenario	Batch or Streaming?	Why
Executive dashboards	Batch (daily)	Nobody refreshes at midnight
Product analytics	Batch (hourly)	"What happened today" is fine
Fraud detection	Streaming	Need to block transactions in real time
Ad click attribution	Streaming	30-minute window for attribution
Financial reporting	Batch (daily)	Regulated, needs reconciliation
Live monitoring	Streaming	Alert on anomalies immediately
ML feature store	Depends	Training = batch. Serving = streaming

Decision rule: if a human is the consumer, batch is fine. If a machine is the consumer (fraud model, recommendation engine), streaming might be necessary.

The Data Lake

A data lake is just object storage (S3, GCS, Azure Blob) holding files in columnar formats like Parquet or ORC. It's the cheapest way to store large datasets.

Data Lake (S3):
  s3://data-lake/raw/orders/2024/01/15/part-00001.parquet
  s3://data-lake/raw/orders/2024/01/15/part-00002.parquet
  s3://data-lake/raw/events/2024/01/15/hour=00/part-00001.parquet
  s3://data-lake/raw/events/2024/01/15/hour=01/part-00001.parquet

Query engines: Athena (AWS), Presto/Trino, Spark SQL can query Parquet files in S3 directly. No warehouse needed for occasional queries.

-- AWS Athena: query S3 directly
SELECT country, COUNT(*) FROM "s3://data-lake/raw/orders/"
WHERE year = 2024 AND month = 1
GROUP BY country;

When data lake vs warehouse:

Aspect	Data Lake (S3)	Data Warehouse
Cost	$0.023/GB/month (S3)	$0.25-2/GB/month
Query speed	Seconds to minutes	Sub-second to seconds
Data types	Any (Parquet, JSON, CSV, images)	Structured (tables)
Schema	Schema-on-read	Schema-on-write
Governance	DIY	Built-in (permissions, auditing)

Store everything in the lake. Load only what you need into the warehouse. This two-tier approach saves money and keeps your warehouse focused on clean, queryable data.

The Lakehouse: Best of Both Worlds

The lakehouse pattern adds ACID transactions and table management to a data lake. Instead of choosing between lake and warehouse, you get both.

Three competing formats:

Delta Lake (Databricks)

s3://data-lake/orders/
  ├── _delta_log/          ← transaction log
  │   ├── 00000.json       ← version 0: created table
  │   ├── 00001.json       ← version 1: inserted 1M rows
  │   └── 00002.json       ← version 2: updated 500 rows
  ├── part-00001.parquet   ← actual data files
  └── part-00002.parquet

Delta Lake stores a transaction log alongside the Parquet files. This enables ACID transactions, time travel, and MERGE operations — things Parquet files alone can't do.

Apache Iceberg (Netflix)

Netflix built Iceberg because they needed reliable table operations on petabyte-scale datasets in S3. Iceberg tracks every file in a metadata layer, enabling:

Hidden partitioning: partition by month without changing query syntax.
Schema evolution: add, rename, or remove columns without rewriting data.
Time travel: query any historical snapshot.

Iceberg is gaining momentum fast. Snowflake, Spark, Trino, Flink, and Dremio all support it natively.

Apache Hudi (Uber)

Uber built Hudi for incremental processing — updating individual records in a data lake without rewriting entire partitions. Best for CDC workloads where you need upserts (insert or update).

Spicy opinion: Iceberg is winning the lakehouse format war. It has the broadest ecosystem support, the cleanest spec, and Netflix's backing gives it credibility. Delta Lake is Databricks' play (strong but vendor-tilted), and Hudi is narrower in scope. Bet on Iceberg for new projects.

The Modern Data Stack (End to End)

┌─────────────────────────────────────────────────────┐
│                 SOURCES                              │
│  PostgreSQL, MongoDB, Stripe API, Segment, Logs     │
└──────────────────────┬──────────────────────────────┘
                       │
              ┌────────▼────────┐
              │  EXTRACT + LOAD  │
              │  Fivetran/Airbyte│
              └────────┬────────┘
                       │
              ┌────────▼────────┐
              │   RAW LAYER      │
              │  (warehouse)     │
              │  Untouched data  │
              └────────┬────────┘
                       │
              ┌────────▼────────┐
              │   TRANSFORM      │
              │   dbt            │
              │   SQL models     │
              │   Tests          │
              └────────┬────────┘
                       │
              ┌────────▼────────┐
              │   MARTS          │
              │  Clean tables    │
              │  Business logic  │
              └────────┬────────┘
                       │
         ┌─────────────┼─────────────┐
         ▼             ▼             ▼
    ┌─────────┐  ┌──────────┐  ┌──────────┐
    │ Looker  │  │ Notebooks│  │ Reverse  │
    │ Metabase│  │ (Jupyter)│  │ ETL      │
    │ Tableau │  │          │  │ (Census, │
    └─────────┘  └──────────┘  │  Hightouch)│
                               └──────────┘

Reverse ETL is the newest layer. Tools like Census and Hightouch push analytics data back into operational tools — Salesforce, HubSpot, Braze. "Your best customers by LTV" computed in the warehouse, synced to Salesforce as a list.

Patterns for System Design Interviews

Pattern 1: Data separation. "OLTP database for transactions, replicate to OLAP warehouse for analytics. Never query production for reports."

Pattern 2: CDC for real-time replication. "Use Debezium to capture changes from the production database's transaction log. Feed through Kafka into the warehouse. Near-real-time, zero impact on production."

Pattern 3: Batch is usually enough. "For this dashboard, hourly batch ETL is fine. We don't need real-time analytics for a report that executives check once a day." Shows you can evaluate trade-offs rather than defaulting to the most complex option.

Pattern 4: Lakehouse for flexibility. "Store raw data in S3 as Parquet with Iceberg metadata. Query directly with Trino for ad-hoc analysis. Load aggregated data into Redshift for dashboards."

Trade-offs Table

Approach	Latency	Complexity	Cost	Best For
Batch ETL (Spark)	Hours	High (custom code)	Medium	Legacy, complex transforms
Batch ELT (dbt)	Hours	Low (SQL + Git)	Low	Most analytics workloads
CDC → Kafka → OLAP	Seconds	High (ops burden)	Medium-High	Real-time dashboards
Managed EL (Fivetran)	Minutes-hours	Lowest	High (per-connector pricing)	Teams without data engineers
Data lake (S3 + Athena)	Minutes	Low	Lowest	Ad-hoc exploration
Lakehouse (Iceberg)	Seconds-minutes	Medium	Low-Medium	Unified batch + ad-hoc

Modern Data Stack

Interview Gotchas

Gotcha 1: "Why not just query the production database for analytics?" Analytical queries (full table scans, aggregations) compete with transactional queries for CPU, memory, and I/O. A heavy dashboard refresh can cause timeouts for user-facing requests. Always separate.

Gotcha 2: "When would you use ETL instead of ELT?" When transforms are compute-intensive and the warehouse can't handle them (e.g., ML feature engineering with custom Python, image processing, complex graph algorithms). Or when data must be cleaned before it enters the warehouse for compliance reasons (PII stripping).

Gotcha 3: "What's the difference between CDC and polling?" Polling runs a SELECT WHERE updated_at > X query against production. It adds load, misses deletes, and has a blind spot between polls. CDC reads the transaction log — zero query load, captures everything, and maintains ordering.

Gotcha 4: "Is real-time analytics always better?" No. Real-time pipelines are more complex, more expensive, and harder to debug. If the consumer is a human looking at a dashboard, hourly refresh is indistinguishable from real-time. Save streaming for machine consumers (fraud detection, recommendation engines).

Gotcha 5: "What is dbt and why does everyone use it?" dbt is a SQL-based transformation framework. It turns warehouse SQL into version-controlled, tested, documented models with dependency management. It's popular because it lets analytics engineers (SQL-proficient, not necessarily Python-proficient) own the entire transform layer without depending on data engineering teams.