Time-Series Databases — Storage, Compression, and Downsampling

TL;DR

Time-series data is write-heavy, append-only, and read in time ranges. Gorilla compression (delta-of-delta for timestamps, XOR for values) achieves 12x compression, averaging 1.37 bytes per data point. Downsampling converts raw data into progressively coarser aggregates (raw -> 5 min -> 1 hour -> 1 day) to keep old data queryable without drowning in storage. InfluxDB, TimescaleDB, and Prometheus each take a different architectural approach. The key interview insight: use a TSDB when cardinality is low and writes are time-ordered; use an OLAP engine when cardinality is high and queries are ad-hoc.

The Problem

You monitor 10,000 servers. Each server emits 200 metrics (CPU, memory, disk, network, etc.) every 10 seconds. That is 2 million data points per 10 seconds, or 17.3 billion per day. Each data point is a (timestamp, value) pair: 16 bytes raw. That is 277 GB per day of raw data.

You need to: - Write all of it with sub-second latency. - Query any time range for any metric in under 100ms. - Keep raw data for 30 days, hourly aggregates for 1 year, daily aggregates for 5 years. - Handle queries like "show me CPU usage for server X over the last 24 hours" and "what was the 99th percentile latency across all servers last month."

A general-purpose database (PostgreSQL, MySQL) can handle this at small scale. At monitoring scale, it collapses. The write throughput is too high for B-tree updates. The data volume is too large for full retention. The query pattern (scan a time range for one metric) does not match the row-based storage layout. Time-series databases are purpose-built for this workload.

The Algorithm: Gorilla Compression

The Insight

Time-series data has two strong patterns: 1. Timestamps arrive at regular intervals. If you sample every 10 seconds, consecutive timestamps differ by exactly 10 seconds, with occasional jitter. 2. Values change slowly. CPU usage at time T and T+10s are usually similar. The floating-point representations differ by only a few bits.

Gorilla (Facebook, 2015) exploits both patterns.

Timestamp Compression: Delta-of-Delta

Raw timestamps:     1000, 1010, 1020, 1030, 1041, 1051
Deltas (Δ):               10,   10,   10,   11,   10
Delta-of-delta (ΔΔ):             0,    0,    1,   -1

Most ΔΔ values are 0 (timestamps arrive at exact intervals). Encode 0 with a single bit (0). Encode small deviations with a few bits. Full timestamps are rare and encoded with more bits.

Encoding scheme:
  ΔΔ = 0:                    '0'                   (1 bit)
  ΔΔ in [-63, 64]:           '10' + 7 bits          (9 bits)
  ΔΔ in [-255, 256]:         '110' + 9 bits          (12 bits)
  ΔΔ in [-2047, 2048]:       '1110' + 12 bits         (16 bits)
  Otherwise:                 '1111' + 32 bits         (36 bits)

For typical metrics with regular sampling, 96% of timestamps compress to 1 bit each.

Value Compression: XOR Encoding

Raw values (IEEE 754 double):
  v1 = 72.5    → 0x4052200000000000
  v2 = 72.6    → 0x4052266666666666
  v3 = 72.6    → 0x4052266666666666

XOR:
  v1 XOR v2 = 0x0000066666666666  (few bits differ)
  v2 XOR v3 = 0x0000000000000000  (identical)

If XOR is 0 (value unchanged), encode with a single bit (0). Otherwise, encode the position and length of the changed bits. For slowly changing metrics, most XORs have very few set bits.

Encoding scheme:
  XOR = 0:                '0'                             (1 bit)
  Leading/trailing zeros match previous:
                          '10' + meaningful bits            (2 + k bits)
  Different bit positions:
                          '11' + 5-bit leading + 6-bit length + meaningful bits

Combined Result

Facebook reported that Gorilla compression achieves 1.37 bytes per data point on average, compared to 16 bytes for raw (timestamp + double). That is a 12x compression ratio. This means the 277 GB/day raw data compresses to ~23 GB/day.

The compression is block-based: a block covers a fixed time window (2 hours in Gorilla). The first data point is stored raw. All subsequent points use delta-of-delta and XOR encoding relative to the previous point. Blocks are immutable once sealed.

Storage Architecture

InfluxDB (TSM Engine)

InfluxDB uses a custom storage engine called TSM (Time Structured Merge tree), which is an LSM variant optimized for time-series:

Write path:
  1. Write to WAL (write-ahead log).
  2. Write to in-memory cache (sorted by time).
  3. When cache is full, flush to TSM file on disk.

TSM file structure:
  - Data sorted by (measurement, tags, time).
  - Each series has contiguous blocks of timestamps + values.
  - Blocks are compressed (Gorilla-style for floats, run-length for ints).
  - Index at end of file maps series → block offsets.

InfluxDB's data model:

Measurement: cpu_usage
Tags: {host: "web-01", region: "us-east"}  ← indexed, low cardinality
Fields: {value: 72.5}                       ← not indexed, numeric
Timestamp: 2024-01-15T10:00:00Z

High cardinality warning: InfluxDB indexes by tag combinations. If you put user IDs or UUIDs in tags, the index explodes. This is the number-one operational issue with InfluxDB.

TimescaleDB (PostgreSQL Extension)

TimescaleDB extends PostgreSQL with automatic time-based partitioning (hypertables):

CREATE TABLE metrics (
    time    TIMESTAMPTZ NOT NULL,
    host    TEXT,
    cpu     DOUBLE PRECISION
);

SELECT create_hypertable('metrics', 'time');
-- Automatically partitions by time intervals (default: 7 days per chunk)

Under the hood, each "chunk" is a standard PostgreSQL table. This means: - Full SQL support, JOINs, CTEs, window functions. - Standard PostgreSQL B-tree indices within each chunk. - VACUUM and ANALYZE work per chunk. - Drop old data by dropping chunks (DROP CHUNKS older than '30 days').

The SQL advantage: If your queries are complex (JOIN metrics with a configuration table, compute percentiles with window functions, correlate across different metric types), TimescaleDB lets you use SQL. InfluxDB and Prometheus have custom query languages that cannot do JOINs.

Prometheus TSDB

Prometheus scrapes targets at fixed intervals (default 15 seconds) and stores locally:

Storage structure:
  data/
  ├── 01BKGV7JBM69T2G1BGBGM6KB12/  ← 2-hour block
  │   ├── chunks/
  │   │   └── 000001                 ← compressed data chunks
  │   ├── index                       ← inverted index: label → series
  │   └── meta.json
  ├── 01BKGTZQ1SYQJTR4PB43C8PD98/  ← another 2-hour block
  └── wal/                            ← write-ahead log for current block

Write path:
  1. Append to in-memory head block + WAL.
  2. Every 2 hours, seal head block → on-disk block.
  3. Compaction: merge small blocks into larger blocks.

Pull model: Prometheus scrapes targets (HTTP GET to /metrics). This is fundamentally different from InfluxDB and TimescaleDB which receive pushes. The pull model means Prometheus controls the sampling rate, can detect target failures (scrape returns error), and does not need authentication on the write path.

Local storage only: Prometheus stores data on a single machine's disk. For long-term storage, use Thanos or Cortex, which upload Prometheus blocks to object storage (S3) and provide a global query layer.

Downsampling and Retention

Downsampling is the practice of pre-computing lower-resolution aggregates from high-resolution data:

Raw data: 1 point per 10 seconds
  → 5-minute aggregate: min, max, avg, count per 5 min
  → 1-hour aggregate: min, max, avg, count per hour
  → 1-day aggregate: min, max, avg, count per day

Storage:
  Raw (10s):    keep 7 days     → 60,480 points/metric/week
  5-min:        keep 30 days    → 8,640 points/metric/month
  1-hour:       keep 1 year     → 8,760 points/metric/year
  1-day:        keep 5 years    → 1,825 points/metric/5 years

Retention policies automatically delete data older than the configured retention period. This is what makes long-term monitoring feasible: you do not keep 5 years of 10-second data, but you keep enough granularity to answer historical questions.

Common aggregates: min, max, avg, count, sum, percentiles (p50, p95, p99). For monitoring, avg is usually insufficient -- a spike to 100% CPU averaged with 0% looks like 50%, hiding the problem. Always keep min and max, or better, keep percentiles.

Proof/Correctness Intuition

Why Delta-of-Delta Works

For a metric sampled at a fixed interval, timestamps form an arithmetic sequence: T, T+d, T+2d, T+3d... Deltas are all d. Delta-of-deltas are all 0. This reduces a 64-bit timestamp to a single bit.

Real-world jitter (network delay, scheduling variance) causes occasional non-zero delta-of-deltas, but they are small (usually within a few seconds). The variable-length encoding handles these efficiently.

Why XOR Works for Floating-Point Values

IEEE 754 floating-point numbers represent nearby values with similar bit patterns. The exponent bits are often identical for values in the same range. The mantissa bits differ only in the lower positions. XOR between consecutive similar values produces a result with many leading and trailing zeros, which compresses well.

This breaks down for highly volatile data (random numbers, hash values). For such data, XOR compression degrades to near-zero compression. In practice, metrics are not random -- they represent physical quantities that change gradually.

Real-World Usage

System	Architecture	Best For
InfluxDB	Custom TSM engine	IoT, standalone metric collection
TimescaleDB	PostgreSQL extension	Complex queries, SQL, JOINs
Prometheus	Local TSDB + pull model	Kubernetes monitoring, alerting
Thanos/Cortex	Prometheus + object storage	Long-term Prometheus storage
QuestDB	Column-oriented, memory-mapped	Ultra-low-latency ingestion
ClickHouse	Columnar OLAP	High-cardinality analytics

TSDB vs OLAP decision: If your cardinality is low (hundreds of metric names, thousands of tag combinations), a TSDB is the right choice. If your cardinality is high (millions of user IDs, billions of event types), an OLAP engine (ClickHouse, Druid) is better. TSDBs optimize for "all values of metric X over time range Y." OLAP engines optimize for "count of events with arbitrary filter combinations."

Interview Application

When to mention TSDBs:

"Design a monitoring/alerting system." -- Prometheus for scraping + alerting, Thanos for long-term storage.
"Design a metrics dashboard." -- TimescaleDB if SQL JOINs needed, InfluxDB if standalone.
"How do you store IoT sensor data?" -- InfluxDB or TimescaleDB with downsampling.
"How do you keep 5 years of monitoring data?" -- Downsampling + retention policies.

What interviewers want to hear:

You understand why general-purpose databases are insufficient for time-series at scale.
You know Gorilla compression (delta-of-delta + XOR) and can explain why it works.
You understand downsampling and retention policies.
You can articulate the TSDB vs OLAP decision.

Trade-offs

Downsampling Tiers

Aspect	InfluxDB	TimescaleDB	Prometheus
Query language	InfluxQL / Flux	Full SQL	PromQL
Write model	Push	Push	Pull (scrape)
Cardinality limit	Low-medium	Medium-high	Low-medium
JOINs	No	Yes (full SQL)	No
Long-term storage	Built-in	PostgreSQL tooling	Thanos/Cortex
Operational cost	Standalone binary	PostgreSQL ops	Simple but limited
Clustering	Enterprise only	Multi-node (paid)	Thanos/Cortex

Common Mistakes

Just use PostgreSQL with a timestamp column

It works at small scale. At monitoring scale (millions of writes/second), B-tree index updates become a bottleneck. Range scans across months of data are slow because row-based storage interleaves unrelated columns. TimescaleDB solves this by adding time partitioning to PostgreSQL.

Store everything at the highest resolution forever

10-second resolution for 5 years: 15.8 million data points per metric. At 1000 metrics, that is 15.8 billion data points. Even compressed, this is terabytes. Downsampling is not optional at scale -- it is the only way to keep long-term data queryable.

Averages are sufficient for downsampled data

An average hides spikes and drops. If CPU usage was 100% for 10 seconds and 0% for 290 seconds, the 5-minute average is 3.3% -- masking a critical spike. Always keep min, max, and count alongside averages. Percentiles are even better but more expensive to store.

High cardinality is fine in InfluxDB

InfluxDB indexes tag combinations. If you store user_id as a tag, each user creates a new series. 10 million users * 100 metrics = 1 billion series. The index size explodes, query performance degrades, and the system becomes unstable. Put high-cardinality fields in the field set (not indexed) or use an OLAP engine.

Prometheus can store data long-term

Prometheus local storage has a default retention of 15 days. It is designed for recent, operational data. For long-term storage, use Thanos (sidecar uploads blocks to S3, provides global query layer) or Cortex (horizontally scalable, uses object storage).

Gorilla compression works on all data

Gorilla compression exploits temporal regularity (regular intervals) and value stability (slow changes). For irregular timestamps or rapidly changing values, compression ratios degrade significantly. If your data is event-driven (user clicks, API calls) rather than periodic samples, Gorilla compression provides limited benefit.