Paxos series: Implementing a distributed system

In the last post of the series, we covered single decree Paxos. You might now wonder: what is all of this useful for?! Paxos is abstract and hard to understand, no wonder you are confused. Leslie Lamport focused on the formal description of a general solution to a formal problem. In this post, we cover the ways it can be used to implement a resilient distributed system.

State Machine

Before understanding Paxos's applicability, let's talk about state machines. A state machine is a computation model. It allows describing how a system behaves. Although not as powerful as Turing machines, they offer a simpler representation.

A state machine, as its name implies, consists of a state S, and a function apply. apply can be defined as the following in pseudo-code:

apply: State(current) + Input ← State(next) + Output

If State(current) = State(next) then the input is a read operation. Otherwise, it's an update or a write operation.

Any system can be interpreted as a state machine (SM). For instance, let's take a SQL database. The inputs are SQL queries. The output is the answer of the database. The state is the collection of tuples persisted in an optimized filesystem representation.

State Machine Replication

SMR stands for State Machine Replication. It begs the question: can multiple nodes act as a single SM? Even better: can an SMR still make progress while some nodes are failing?

Paxos can help nodes agree on a single decision in face of node failure. Intuitively Paxos could also help replicate many decisions, right? But first, we need to cover ways to replicate an SM.

As stated before an SM consist of a State S and the apply function. Since the apply function is static, replicating it is not an issue. For S it's more complex. It potentially changes for every input received. There are two ways:

  1. Replicate S entirely: it's inefficient, but the easy optimization is to replicate S(next) - S(current) which is lighter
  2. Replicate the log of inputs and use apply to rebuild the S

The two options are possible, but the second option leads to a more intuitive solution. We're gonna replicate an ordered collection of inputs I. We call the log at step n the collection of inputs:

Log(n) = {I(0), I(1), ..., I(n)}

We can now recover S(n) by recursively calling apply with inputs from Log(n).

Enters Paxos

Let's figure an algorithm to use Paxos (or another consensus protocol) to replicate the log:

Client submits query "q" to Node "n"
Log(n) = {I(0), I(1), ..., I(n)}
"n" proposes I(n+1) = q
    -> If decided, answer the client
    -> If competing proposition is decided, repeat for n = n + 1

This algorithm inherits Paxos properties: it offers progress guarantees and safetiness. It is relatively simple because Paxos answers the complex problem of making a decision. Without Paxos, it wouldn't be possible to achieve safetiness or progress.

From Single Decree To A Log

On the surface, Paxos and the log algorithms are pretty orthogonal. No need to understand one to implement the other. You could replace Paxos with any consensus protocol (Fast Paxos, Multi Paxos, etc.) and you're good to go.

Except that it is not as easy. The way you implement the log interacts with Paxos in corner cases. The closer you get to a realistic implementation, the more problems you face. As the author of Raft puts it:

The composition rules for multi-Paxos [Paxos with a log] add significant additional complexity and subtlety.

For instance, what if a node fails? Paxos guarantees progress, so the remaining nodes still decide on inputs. But when the node comes back up, it is late and needs to learn the decisions it missed. Paxos does not provide a solution to this issue. Implementing a successful distributed system requires answering these corner cases.

Say running nodes are at state S(current), while failing node is at state S(old). Once the node is up, it needs to learn inputs I(old + 1), ..., I(current).

This log repairing process can be achieved by:

  1. Node detects it is outdated by receiving a message
  2. Node stops accepting queries from clients until it is repaired
  3. Node "asks" for I(old + 1), ..., I(current) to other running nodes
  4. When the log is repaired, the node can accept client queries again

This is possible if you clear out 2 questions:

Decision Making While Repairing

The first problem isn't one. A repairing node can take part in Paxos decision for I(current + x). It does not need to know the preceding inputs to take part. However, it needs to wait to apply inputs to its internal state machine until it has received the next input.

Say node is at state S(100) and receives I(102), it buffers the input until it receives I(101). It then applies I(101) and I(102) to reach the correct state S(102).

This is why consensus protocols are also said to solve the total ordering of messages. You can guarantee that messages are received in the perfect order by the recipient with a consensus protocol. Consensus = Total order.

Safely Restoring A Decision

You could naively repair decisions I(old + 1), ..., I(current) by asking other nodes the learned decree. This works for the happy scenario, but consider this edge case:

Node N1: repairing I(101), I(100) = v100, I(101) = ?

Node N2: Running, I(100) = v100, I(101) = Accepted v101 (phase 2 of Paxos)

Node N3: Failed, I(100) = v100, I(101) = v101 (failed before broadcasting message)

N1 cannot ask N3 for I(101), it won't answer. N1 cannot ask N2 for I(101), it has not decided anything yet. Is I(101) lost?

This is where the knowledge of Paxos helps. Remember that a value that is decided is the only safe value. So if another round of Paxos is run, only v101 is safe. Therefore, by running another round, N1 will decide v101!

"Asking" for missed inputs is running another Paxos round. If nodes already learned the decided value, then the recovery round is interrupted. Otherwise, the already accepted value is decided by Paxos completion.


Knowledge and understanding of Paxos algorithm is crucial to implement an SMR. As Paxos is not easy, implementing an SMR is not easy! We managed to implement one and covered some edge cases. We haven't discussed performance. Performance is key to make sure the system is worth deploying in production. It'll be the subject of the next and last post of the series.

PS: you can prove that the repair algorithm works with a TLA+ specification. Writing it led me to a deeper understanding of Paxos, it was an epiphany! If you are not sure of your algorithm, check them with TLA+ and everything becomes crystal clear.