Distributed File Systems — GFS and HDFS

TL;DR

GFS (Google, 2003) and HDFS (Apache, 2006) solve the same problem: store petabytes of data across thousands of commodity machines with automatic fault tolerance. Architecture: a single master (GFS Master / HDFS NameNode) holds all metadata in memory, while data lives on chunkservers/DataNodes in large chunks (64-128 MB), each replicated 3 times. Writes go through a pipeline (chain replication). Reads go to the nearest replica. The single master is simple but is a bottleneck -- HDFS Federation and Erasure Coding are the solutions. Every big data system (MapReduce, Spark, Hive) assumes HDFS underneath. Understanding GFS/HDFS is understanding the storage layer of the data infrastructure.

The Problem

Google in 2003 had thousands of commodity servers. Disks failed daily. They needed to store the entire web crawl (petabytes) and process it with MapReduce. Requirements:

Store petabytes across thousands of machines.
Tolerate hardware failures automatically (disks die, machines crash, network goes down).
High throughput for large sequential reads and writes (not random I/O).
Large files: web crawl data, log files, index data. Typical file: hundreds of MB to multi-GB.
Append-heavy: most writes append data, rarely modify existing data.

A traditional file system (ext4, NTFS) runs on a single machine. A NAS/SAN provides shared storage but does not scale to thousands of machines. GFS was purpose-built for this workload.

The Algorithm: GFS Architecture

Components

┌──────────────┐
│  GFS Master  │  (single server, all metadata in memory)
│              │  - file namespace (directory tree)
│              │  - file → chunk mapping
│              │  - chunk → chunkserver mapping
│              │  - chunk lease management
└──────┬───────┘
       │ metadata operations
       │
┌──────┴───────┐
│   Clients    │
│              │
└──┬────┬───┬──┘
   │    │   │  data operations (direct to chunkservers)
   │    │   │
┌──┴─┐┌─┴──┐┌┴───┐
│ CS1 ││ CS2 ││ CS3 │  ... hundreds of chunkservers
└────┘└────┘└────┘

Chunks

Files are divided into fixed-size chunks (64 MB in GFS, 128 MB in HDFS). Each chunk is stored as a regular file on the chunkserver's local file system. Each chunk is replicated to 3 chunkservers (default replication factor).

Why 64 MB chunks?

Reduces metadata: a 1 TB file has ~16,000 chunks. With 1 MB chunks, it would be ~1 million chunks. Less metadata = fits in master's memory.
Reduces client-master interaction: one metadata lookup covers 64 MB of data.
Enables large sequential I/O: reading a chunk is one large sequential disk read.
Trade-off: small files still consume one chunk (wastes replication). Many small files create "small file problem" -- too many metadata entries.

Master (NameNode in HDFS)

The master stores three types of metadata, ALL in memory:

Namespace: the directory tree and file names.
File-to-chunk mapping: which chunks make up each file, in order.
Chunk-to-chunkserver mapping: which chunkservers hold each chunk's replicas.

The first two are persisted to disk via an operation log (equivalent to a WAL) and periodic checkpoints. The third is NOT persisted -- at startup, the master asks each chunkserver "what chunks do you have?" This is simpler and avoids the problem of stale mappings when chunkservers join/leave.

Memory footprint: Each chunk requires ~64 bytes of metadata. 1 billion chunks = 64 GB of metadata. This is feasible for a single machine with 128+ GB RAM.

Read Path

1. Client → Master: "I want to read file F, offset O."
2. Master computes: chunk index = O / chunk_size.
3. Master → Client: (chunk_handle, [chunkserver1, chunkserver2, chunkserver3])
   (Client caches this mapping for future reads.)
4. Client → nearest chunkserver: "Give me chunk C, byte range [start, end]."
5. Chunkserver reads from local disk, returns data to client.

The master is NOT in the data path. It only provides metadata. This is critical for scalability: the master handles lightweight metadata requests while chunkservers handle heavy data transfers.

Write Path (Append)

GFS optimizes for appends (not random writes). The append operation:

1. Client → Master: "I want to append to file F."
2. Master identifies the last chunk of F. One replica holds the "lease" (primary).
3. Master → Client: (chunk_handle, primary_chunkserver, secondary_chunkservers)

4. Client pushes data to ALL replicas (primary + secondaries) in a pipeline:
   Client → closest replica → next closest → ... (chain replication style)
   Data is buffered but NOT applied yet.

5. Client → Primary: "Commit the append."
6. Primary assigns a serial number to the append.
7. Primary applies the write to its local chunk.
8. Primary → Secondaries: "Apply write with serial number N."
9. Secondaries apply in serial-number order.
10. Secondaries → Primary: ACK.
11. Primary → Client: success (or failure, with retry).

Why push data before commit? The data push can be pipelined along the chain (each chunkserver starts forwarding as it receives data). The commit is a small control message. This separates the data plane (large, network-bound) from the control plane (small, latency-sensitive).

Consistency Model

GFS has a relaxed consistency model:

Defined: After a successful mutation, all clients see the same data. (All replicas are identical.)
Consistent: All clients see the same data, but it may include fragments from different writes. (Replicas are identical but content may be mixed.)
Inconsistent: Different clients may see different data. (Replicas differ.)

Concurrent appends result in "consistent but undefined" regions -- all replicas have the same bytes, but the bytes may be interleaved from different appends. Application-level checksums and record markers are used to handle this.

This relaxed model is a deliberate choice. Stronger consistency would require synchronous coordination for every write, reducing throughput. For MapReduce-style workloads (write once, read many, tolerate duplicates), this is sufficient.

HDFS: The Open-Source Clone

HDFS (Hadoop Distributed File System) is the open-source implementation inspired by GFS. The architecture is nearly identical:

GFS	HDFS
Master	NameNode
Chunkserver	DataNode
64 MB chunks	128 MB blocks (default)
Operation log	EditLog + FsImage
In-memory metadata	In-memory metadata
Lease-based primary	Block pipeline

HDFS Write Pipeline

1. Client → NameNode: create file, allocate first block.
2. NameNode → Client: DataNode1, DataNode2, DataNode3 (pipeline order).

3. Client → DN1 → DN2 → DN3: data packets (pipelined).
   Each DataNode writes to disk while forwarding to the next.

4. ACKs flow back: DN3 → DN2 → DN1 → Client.

5. Client → NameNode: close file (or allocate next block).

This pipeline is chain replication with pipelining (each node starts forwarding before it finishes receiving), reducing end-to-end latency.

The Single-Master Problem

Why It Works

For most workloads, the single master is not a bottleneck:

Metadata operations are lightweight (file open, chunk location lookup).
Clients cache chunk locations and go directly to chunkservers for data.
The master handles ~100K metadata operations per second, which is sufficient for clusters with a few thousand clients.

When It Breaks

Too many files: Each file and chunk requires metadata in memory. Billions of small files exhaust memory.
Too many clients: Thousands of concurrent metadata requests can saturate the master.
Single point of failure: If the master crashes, the entire filesystem is unavailable until it recovers.

HDFS Federation

HDFS Federation splits the namespace across multiple NameNodes:

NameNode 1: /user/          ← owns this namespace subtree
NameNode 2: /data/          ← owns this namespace subtree
NameNode 3: /tmp/           ← owns this namespace subtree

All NameNodes share the same pool of DataNodes.

Each NameNode manages its own namespace and block mapping independently. DataNodes register with ALL NameNodes and store blocks from multiple namespaces.

Benefit: Horizontal scaling of metadata capacity. If one NameNode runs out of memory, add another NameNode for a different namespace.

Limitation: No cross-namespace operations (you cannot mv /user/file.txt /data/file.txt atomically). Applications must be aware of which namespace they are using.

HDFS High Availability

Standard HDFS uses a Standby NameNode that keeps a copy of the namespace:

Active NameNode ←──── shared EditLog (on NFS or JournalNodes) ────→ Standby NameNode

The active NameNode writes to the EditLog. The standby replays the EditLog to maintain an up-to-date namespace. On active failure, the standby takes over. Failover time: ~30 seconds (with automatic failover via ZooKeeper).

JournalNodes (a quorum of 3-5 nodes running a Paxos-like protocol) provide a fault-tolerant EditLog, replacing the NFS shared storage.

Erasure Coding

3x replication means every byte of data consumes 3 bytes of storage. For cold data that is rarely accessed, this is wasteful.

Erasure coding (specifically Reed-Solomon codes) reduces storage overhead from 3x to ~1.5x while maintaining the same fault tolerance.

3x Replication:
  Data block: 128 MB
  Stored as: 128 MB + 128 MB + 128 MB = 384 MB
  Overhead: 200%
  Can tolerate: 2 replica losses

Reed-Solomon (6,3):
  Data: 6 blocks of 128 MB = 768 MB
  Parity: 3 blocks of 128 MB = 384 MB
  Total: 1152 MB for 768 MB of data
  Overhead: 50%
  Can tolerate: 3 block losses (any 3 of 9 blocks)

HDFS Erasure Coding (added in HDFS 3.0):

Policy: RS-6-3-1024k
  6 data blocks + 3 parity blocks
  Each block: 1 MB (striped across DataNodes)

Write: client stripes data across 6 DataNodes, computes 3 parity blocks,
       writes parity to 3 more DataNodes.
Read: client reads from any 6 of 9 DataNodes. If some are unavailable,
      reconstruct from parity.

Trade-offs of Erasure Coding:

Aspect	3x Replication	Erasure Coding (RS-6-3)
Storage overhead	200%	50%
Fault tolerance	2 failures	3 failures
Read performance	Fast (any replica)	Slower (may need decoding)
Write performance	Pipeline to 3 nodes	Encode + write to 9 nodes
Recovery cost	Copy 1 block	Read 6 blocks, decode, write 1
Best for	Hot data	Cold/warm data

Real-World Usage

System	Based On	Key Difference
GFS	Original	Google internal, evolved into Colossus
HDFS	GFS clone	Open source, Hadoop ecosystem
Colossus	GFS v2	Distributed master, no single NameNode
Ceph	Independent	No single metadata server (CRUSH algorithm)
MinIO	S3-compatible	Object storage, erasure coding built-in

Colossus (Google's successor to GFS) eliminated the single-master bottleneck by distributing metadata across multiple machines using a database (BigTable). This allowed Google to scale to exabytes. The lesson: single-master is fine for petabytes, but exabyte scale requires distributed metadata.

Interview Application

When to mention GFS/HDFS:

"Design a distributed file storage system." -- GFS architecture: single master for metadata, chunkservers for data, 3x replication, large chunks.
"How does Hadoop store data?" -- HDFS: NameNode + DataNodes, 128 MB blocks, pipeline replication.
"How would you store petabytes of log data?" -- HDFS or S3 with appropriate partitioning.
"What is the small file problem?" -- Each file consumes NameNode memory for metadata. Millions of small files exhaust NameNode memory. Solution: combine small files into larger ones (SequenceFile, HAR).

What interviewers want to hear:

You understand the master/chunkserver split and why the master is not in the data path.
You know why chunks are large (64-128 MB): less metadata, better throughput.
You can explain the write pipeline (chain replication with pipelining).
You know the single-master limitation and how Federation addresses it.
You understand the 3x replication vs erasure coding trade-off.

Trade-offs

Erasure Coding

Advantage	Disadvantage
Simple architecture (single master)	Single master is a bottleneck/SPOF
High throughput for large files	Poor for small files (metadata overhead)
Automatic replication (3x default)	3x storage overhead
Fault tolerant (any 2 replicas lost OK)	Not designed for random writes
Scales to petabytes	Does not scale to exabytes (single master)
Write-optimized (append-only)	Relaxed consistency model

When NOT to Use HDFS

Small files: Millions of 1 KB files waste NameNode memory and chunk storage.
Low-latency random reads: HDFS is optimized for sequential reads of large files. For random access, use HBase (built on top of HDFS) or a key-value store.
Frequent updates: HDFS files are write-once (or append-only). If you need to update records in place, use a database.
Small clusters: HDFS overhead is not justified for clusters under ~10 nodes. Use local storage or NFS.

Common Mistakes

The NameNode stores data

The NameNode stores only metadata (file names, block mappings, permissions). All data resides on DataNodes. The NameNode never sees the actual file contents. This separation is what makes the architecture scalable.

Clients read from the NameNode

Clients contact the NameNode once to get block locations, then read directly from DataNodes. The NameNode is not in the data path. If it were, it would be a massive bottleneck.

HDFS supports random writes

HDFS supports only appends (and, as of HDFS 2.x, truncation). It does not support modifying bytes in the middle of a file. This is a deliberate design choice: append-only simplifies replication consistency. If you need random writes, use HBase.

More replicas are always better

3 replicas means 3x storage. For cold data (archival logs, old backups), this is wasteful. Erasure coding reduces overhead to 1.5x with equal or better fault tolerance. Hot data should be replicated (fast reads from any replica). Cold data should be erasure-coded (save storage).

HDFS Federation solves the single-master problem completely

Federation splits the namespace but each NameNode is still a single point of failure for its namespace. You still need NameNode HA (Active/Standby with shared EditLog) for each NameNode. Federation solves scalability, not availability.

GFS and HDFS are obsolete because of cloud object storage (S3)

S3 provides similar functionality (store large files, high durability, scalable) with lower operational overhead. For new systems, S3 (or GCS/Azure Blob) is often the right choice. But HDFS remains relevant for on-premises Hadoop clusters, for workloads that need data locality (compute near data), and for understanding the principles that S3 itself was built on.