VOL.01 // ISS.04

How Replicated Machines Achieve Agreement Without Trust

In 2014, Ongaro and Ousterhout showed that one of the hardest problems in distributed computing, getting crash-prone machines to agree, could be solved by an algorithm simple enough for an engineer to hold in their head.


The Problem

You have five servers. Each can crash at any time. Messages between them can be delayed or lost. Yet clients expect a single, consistent service.

This is the consensus problem: getting unreliable machines to agree on a sequence of operations, as if they were one reliable machine.

All Start as Followers

Every node begins in the follower state. Followers are passive — they listen for heartbeats and respond to requests, but never initiate actions.

Each follower runs an election timeout — a random countdown between 150 and 300 ms. If it expires without a heartbeat, something has gone wrong.
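The timeout mechanism is small enough to sketch in a few lines. A minimal illustration in Python, assuming a millisecond clock; the function names are invented here, not taken from any particular implementation:

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)  # randomized range from the Raft paper

def new_election_deadline(now_ms: float) -> float:
    # Pick a fresh random deadline; called at startup and again
    # every time a heartbeat arrives from the leader.
    return now_ms + random.uniform(*ELECTION_TIMEOUT_MS)

def should_start_election(now_ms: float, deadline_ms: float) -> bool:
    # A follower that reaches its deadline without hearing a heartbeat
    # assumes the leader is gone and starts an election.
    return now_ms >= deadline_ms
```

Because each node draws its own random deadline, the timeouts rarely fire at the same moment.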

The Timeout Fires

Node C's timer runs out first. It assumes the leader has crashed, increments its term number, and transitions to candidate state.

The randomized timeout breaks symmetry: a little randomness prevents endless split votes.

Requesting Votes

The candidate votes for itself, then sends RequestVote RPCs to every other node. Each node votes for at most one candidate per term, first-come-first-served.

Critical rule: a node will not vote for a candidate whose log is less up-to-date than its own.
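Both rules fit in one small handler. A simplified sketch, assuming node state is a plain dictionary (the field names are illustrative):

```python
def log_is_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    # Raft compares the last entries: a higher last term wins;
    # with equal terms, the longer log wins.
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

def handle_request_vote(state, term, candidate_id, last_log_term, last_log_index):
    if term < state["current_term"]:
        return False                      # stale candidate: reject
    if term > state["current_term"]:
        state["current_term"] = term      # newer term: forget any old vote
        state["voted_for"] = None
    if state["voted_for"] not in (None, candidate_id):
        return False                      # at most one vote per term
    if not log_is_up_to_date(last_log_term, last_log_index,
                             state["last_log_term"], state["last_log_index"]):
        return False                      # election restriction
    state["voted_for"] = candidate_id
    return True
```

The final check is what later guarantees that a winning candidate already holds every committed entry.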

A Leader Emerges

Node C receives votes from a majority (3 of 5, including itself). It becomes leader and immediately begins sending heartbeats.

Heartbeats are empty AppendEntries RPCs — proof of life that reset followers' election timers and prevent unnecessary elections.
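"Empty AppendEntries" is meant literally. A sketch of what a leader might put on the wire, with hypothetical field names mirroring the paper's RPC arguments:

```python
def make_heartbeat(state: dict) -> dict:
    # A heartbeat is an AppendEntries RPC whose entries list is empty.
    # Followers treat it like any replication traffic: check the term,
    # reset the election timer, and learn the leader's commit index.
    return {
        "term": state["current_term"],
        "leader_id": state["id"],
        "prev_log_index": len(state["log"]) - 1,
        "prev_log_term": state["log"][-1][0],
        "entries": [],                      # empty: nothing to replicate
        "leader_commit": state["commit_index"],
    }
```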

Log Replication

A client sends a command. The leader appends it to its log with the current term number, then sends AppendEntries RPCs to all followers.

Each follower checks the Log Matching Property: if the preceding entry matches, it appends. If not, the leader retries with earlier entries until the logs converge.
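The follower-side check can be sketched in a few lines. Here a log is a Python list of (term, command) pairs with a sentinel at index 0; this simplified version truncates any conflicting suffix before appending, which is safe as long as RPCs are processed in order:

```python
def append_entries(log, prev_index, prev_term, entries):
    # Log Matching Property: we may only append if our entry at prev_index
    # carries prev_term. Otherwise the leader decrements prev_index and retries.
    if prev_index >= len(log) or log[prev_index][0] != prev_term:
        return False
    del log[prev_index + 1:]   # drop any conflicting suffix
    log.extend(entries)        # then splice in the leader's entries
    return True
```

Repeated failures walk prev_index backward until the two logs agree on a common prefix, at which point replication resumes.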

Commitment

Once the leader has replicated an entry to a majority of nodes, that entry is committed — it will never be lost or overwritten.

The leader advances its commitIndex, notifies followers, and all nodes apply the entry to their state machines. The command takes effect.

Leader Failure

The leader crashes. Followers stop receiving heartbeats. After their election timeouts expire, a new election begins.

The election restriction ensures the new leader has every committed entry. Safety is never violated — only liveness is temporarily lost.

Recovery

A new leader is typically elected within a few hundred milliseconds. The crashed node eventually restarts as a follower and catches up by receiving the entries it missed.

The cluster continues serving requests. Clients may not even notice the interruption. This is Raft: agreement without trust, continuity without perfection.

The Algorithm Humanity Could Read

For thirty years, distributed consensus had a canonical solution, Paxos, that almost nobody could implement correctly. Google's Chubby team spent years hardening their implementation. The ZooKeeper developers built a new protocol, ZAB, rather than fight with it.

Ongaro and Ousterhout's radical insight was that understandability is a first-class engineering requirement. An algorithm that engineers cannot hold in their heads is an algorithm they cannot implement safely. Raft proved them right: within a decade, dozens of correct implementations appeared in every major language.

Terms: Raft's Logical Clock

Time in Raft is divided into terms of arbitrary length. Each begins with an election. The term number acts as a distributed consistency check — stale messages from old terms are rejected immediately.

Each colored block is a term. Green indicates a successful leader; amber marks elections that split the vote and produced no leader. The term number only increases — it is Raft's logical clock.
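The rejection rule every RPC handler applies can be sketched as follows (a simplified illustration; the state fields are assumptions, not any library's API):

```python
def observe_term(state: dict, rpc_term: int) -> str:
    # Every RPC carries the sender's term. A stale term is rejected
    # outright; a newer term forces this node to step down to follower
    # and forget any vote it cast in the old term.
    if rpc_term < state["current_term"]:
        return "reject"
    if rpc_term > state["current_term"]:
        state["current_term"] = rpc_term
        state["role"] = "follower"
        state["voted_for"] = None
    return "accept"
```

This single comparison is why a deposed leader that wakes up after a network partition cannot corrupt the cluster: its first RPC reveals its stale term and it steps down.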

The Machines That Run on Agreement

Raft is not an academic exercise. It is the consensus backbone of Kubernetes (via etcd), CockroachDB, TiKV, and HashiCorp Consul. When you deploy a container, commit a distributed transaction, or register a service, Raft is the mechanism ensuring that the operation is consistent and durable.

Every time a hospital records patient data across replicated databases, every time a bank processes a transaction through a distributed ledger — the correctness of that operation rests on the same three mechanisms: leader election, log replication, and the safety guarantee that committed entries are never lost.

Log Replication: The Majority Rule

An entry is committed once replicated to a majority. In a 5-node cluster, that means 3 nodes. This visualizes how entries flow from leader to followers and become committed.

Each row is a node's log. Green entries are committed (replicated to a majority). Dim entries exist only on one node and may be overwritten. The leader's log is always authoritative.

Decomposition as Design

Raft's deepest contribution is not a new theorem but a design principle: decompose the hard problem into subproblems that humans can reason about independently. Leader election, log replication, and safety are each understandable on their own. Together they produce a system that is provably correct and practically buildable.

This is the lesson that extends beyond distributed systems. In any engineering discipline, the systems that endure are not the cleverest — they are the most legible.

Build Your Own Raft Cluster

Configure a cluster, crash nodes, trigger elections, and send log entries. Watch Raft maintain consensus in real time.
