One year ago I wrote about Paxos in the review of Leslie Lamport's work. I already read Raft paper and felt confident that I understood the single-decree Paxos... What a fool of me.
The most difficult part for me was:
The tricky part to me is understanding when the leader does not respond, how the new leader is chosen
I've been working a bit on Hermes. I enjoyed the protocol and wanted to implement it. At some point, I needed to implement Paxos. As I said, I was pretty confident I could do that...
I realised how little I understood Paxos, and this piece is an attempt to explain what I learned afterwards.
Single decree Paxos is a protocol to make sure multiple agents agree on a single decision. Paxos guarantees the following properties:
Paxos has 2 phases that alternate indefinitely.
Phase 1 is a pre-vote phase. It is about finding a safe value. Safe means that the value does not break consensus.
Phase 2 is the vote phase. It propapagates the safe value of phase 1. If it completes, the value is decided or chosen.
Participants in Paxos have different roles:
Even though roles are independent, a single node can be both a proposer, an acceptor or a learner. We are gonna work with a strong premise: every node takes on every role.
Clients query nodes of the system. We consider any kind of queries: read request or update/write requests. The role of the system is to execute the query. When the query enters the system, it is called a proposition or a value.
If the proposer has already a running proposition, it can refuse the query from the client. It can also queue the query for a future decision, but let's not get ahead of ourselves... The goal of single decree Paxos is to decide for a single value. Once the value is chosen, it can be safely executed.
Each node has the following state:
struct Node {
// current round of the node
// this state is shared between the proposer and the acceptor role
round: Round,
// Proposer state
proposed_value: Option<Value>,
prepares: HashMap<NodeId, Option<(Round, Value)>>,
accepts: HashSet<NodeId>,
// Acceptor state
accepted: Option<(Round, Value)>
// Learner
decided: Option<Value>,
}
The round
is the last round of Paxos protocol the node ran.
Remember that I said that the protocol repeats indefinitely.
We call each repetition a round of Paxos.
We only care about the latest round of Paxos (we see why later),
so the property is a singleton, not a collection.
You would think we'd start with the phase 1. On the contrary, we're gonna start with phase 2.
Let's say that proposer p
found a safe value v
in phase 1.
We see what safe entails in phase 1.
The role of phase 2 is to propagate this safe value to other nodes.
To do that, p
sends out messages p2a with the following information:
p
's current roundv
the safe valueUpon receiving the message, every acceptor a
answer with a p2b message.
P2b message is a confirmation of the p2a message.
It only contains the round
of the p2a message.
Acceptors run the following pseudo rust code:
impl Acceptor {
fn on_msg(mut self, p2a: P2aMessage) -> Option {
if p2a.round < self.round {
return None;
} else {
self.round = p2a.round;
self.accepted = Some((p2a.round, p2a.value));
return Some(P2bMessage(p2a.round));
}
}
}
Let's go through the pseudo code:
round
of a
to the message round, accept the value for said round, and answer with a p2b messageFor now persisting accepted values can't be explained until we cover phase 1. Let's just ignore it for now.
Once the proposer received a majority of P2b messages, it knows the value is not only safe, it is decided. A value is decided when it is the only value that can be safe in phase 1. Paxos is finished for this node, the node broadcasts a new message:
impl Proposer {
fn decided(value: Value) {
self.decided = Some(value);
self.broadcast(Learn { value: value })
}
}
Whenever a node receives the Learn
message, the algorithm is terminated for it as well.
When a node learned a value, it answers to any request with a Learn
message.
This way the decision is propagated to other nodes.
It can persist the decision with the following pseudo-code:
impl Acceptor {
fn on_msg(mut self, learn: Learn) {
self.decided = Some(learn.value);
}
}
We stated that one node at a time can fail. What happens in such case in phase 2?
Say an acceptor node fails. The proposer keeps its majority, so the value is decided. The two live nodes have learned the value. When the failed node comes back online, it needs to learn the decided value. There's two options:
There is not a single answer. Paxos only guarantees that a single value is decided, not how this decision is propagated. Depending on your needs, you can implement any solution.
The other failure scenario requires more thoughts: what if the proposer fails after broadcasting the P2a message?
Even worst, what if the node failed before the broadcast completes.
In any case, the value is not decided.
Until a proposer starts the Paxos protocol again, you end up in this situation for round n
:
Values of self.accepted | Node 1 (proposer) | Node 2 | Node 3 |
---|---|---|---|
Case 1 | Some(n , v ) | Some(n , v ) | Some(n , v ) |
Case 2 | Some(n , v ) | Some(n , v ) | None |
Case 3 | Some(n , v ) | None | None |
For each case, either nodes received the message or are waiting to receive one. We chose that nodes are both proposer and acceptor. Therefore the proposer node also accepted its own proposal.
If the proposer comes back online, it can launch a new round and succeeds to choose its value.
We need to find a process to recover Some(n, v)
if possible.
This is where phase 1 comes into play.
As I explained in the previous paragraph, phase 1 is first run to check which value is safe to be submitted in phase 2.
Say proposer p
wants to propose a value v2
.
It first set a round
(we see how to choose round
later).
It broadcasts a P1a message to every node. P1a message only consists of the round
.
Acceptors answer to P1a message with their latest accepted value, meaning the value they accepted in phase 2 of the last round.
In pseudo-rust-code, that would be:
impl Acceptor {
fn on_msg(mut self, p1a: P1aMessage) -> Option {
if p1a.round > self.round {
self.round = p1a.round;
return Some(P1bMessage { round: p1a.round, value: self.accepted });
} else {
return None;
}
}
}
Note that acceptors only answer to round that are above the current round. This guarantees progress. If a proposer fails and come back up later, it could think its round is still active. Acceptors ignoring its messages signal that its information is stale.
On its side, the proposer waits for a majority of answers from acceptors.
Every node is both a proposer and an acceptor, so the proposer already receives the p1b message from itself.
Note that the acceptor of the same node can send another value than v2
if it participated in a phase 2 of a previous round.
If there are 3 nodes, proposer waits for a single response.
But what does it do with the majority of reponse? 3 cases arise:
p1b.value
is None
: any value is safe to be proposed.
No proposer submitted a single value.
v2
is safe to be submitted in phase 2.p1b.value
is Some(v2)
: v1
is the only safe value.
Proposer proceeds to phase 2 with v1
.
Note that the proposer must drop its original value v2
.p1b.value
is different: this case is tricky.
The safe value to choose is the latest not None
value.
If it was accepted, it was a safe value.
Besides, the only the latest value is safe because it might contradict previous round.If the proposer started phase 1 in the first round, and there are no conflicting request, it is the first case.
The first case can also happen if no proposer reached phase 2 in preceeding rounds.
Case 2 happens in the failure scenario of our last part.
If a proposer already reached phase 2 then failed, the next proposer will have to assume v1
.
In any case the proposer has a safe value to submit for phase 2!
What happen if there's a failure in phase 1?
If an acceptor fails, there's no problem: the 2 remaning node form a majority. It can decide a value without the failing node.
If the proposer fails at this round, there's no real impact on the output of the protocol. The purpose of this round is to survey other nodes for a safe value. A failure of the proposer means that another proposer can start another round safely. It can even try this round if the failing proposer did not gather a majority of votes.
Now that we covered the basics of Paxos, let's exercise our understanding with some 3 nodes scenarios.
In our first case, a single node has a value to propose. It times out then comes back up before another proposer has a chance.
The role of proposer and acceptor is separated to have a clearer view. Proposer 1 and acceptor 1 correspond to node 1. An × indicate a failing node. An **
Rounds | Proposer 1 | Acceptor 1 | Acceptor 2 | Acceptor 3 | Event |
---|---|---|---|---|---|
0 | P1a(round = 0) | P1b(round = 0, value = None) | P1a broadcast | ||
0 | P1b(round = 0, value = None) | Acceptor 2 accepts | |||
0 | × | × | Node fails | ||
0 | × | × | No other node propose, the protocol stalls | ||
0 | P2a(round = 0, value = v) | P1b(round = 0, value = v) | Node 1 comes back online, progress can resume | ||
0 | P2b(round = 0, value = v) | ||||
0 | Learn(value = v) | Proposer 1 learned that v is the consensus value |
There is no conflicting proposer. The failure only blocks the proposing node. Note that node 3 did not participate. We don't show messages that are ignored, but in this case it also casts a vote.
Let's get some conflicts in there to have a more exciting scenario.
Consider that proposer 2 wants to get v2
through.
Rounds | Proposer 1 | Acceptor 1 | Proposer 2 | Acceptor 2 | Acceptor 3 | Event |
---|---|---|---|---|---|---|
0, 0, 0 | P1a(round = 0) | P1b(round = 0, value = None) | P1a broadcast | |||
0, 0, 0 | × | × | Node 2 fails | |||
0, 0, 0 | × | × | P1b(round = 0, value = None) | Acceptor 3 answers | ||
0, 0, 0 | P2a(round = 0, value = v1 ) | P2b(round = 0, value = v1 ) | × | × | Node 1 starts phase 2 | |
0, 0, 0 | × | × | P2b(round = 0, value = v1 ) | Acceptor 3 accepts v1 | ||
0, 0, 0 | × | × | Node 2 recovers whilst node 1 fails while deciding for v1 | |||
0, 1, 1 | × | × | P1a(round = 1) | P1b(round = 1, value = None) | Proposer 2 start round 1 because of a timeout | |
0, 1, 1 | × | × | P1b(round = 1, value = (0, v1 )) | Acceptor 3 answer to round 1: it accepted v1 in round 0 | ||
0, 1, 1 | × | × | P2a(round = 1, value = v1 ) | P2b(round = 1, value = v1 ) | Proposer 2 is forced to broadcast v1 in phase 2 | |
0, 1, 1 | × | × | P2b(round = 1, value = v1 ) | Acceptor 3 answers | ||
1, 1, 1 | Learn v1 | Learn v1 | Node 1 comes back online, every nodes learn v1 |
This scenario is a bit more complex.
Node 2 then node 1 fail one after the other.
Note that they still come to the same value: consensus is reached!
The key point is that proposer 2 proposed v1
instead of its own proposition in phase 2.
The role of phase 1 is key for that to happen.
I hope this article helps you with single-decree Paxos. I didn't get into 5 nodes examples for clarity sake. But you can think for yourself of more complex examples. Paxos is really hard so don't hesitate to ask questions about it.
There's a lot to cover from there. Paxos was an early and elegant solution to consensus. It came with a formal protocol to solve the consensus problem. The original paper also included also strong proofs of safety and progress (liveness in literature). You can find more information on the wikipedia page.
Understanding basic Paxos allows to understand many optimisations of the Paxos-family protocols. Yes! You read well. There's a whole family of protocols based on Paxos that try to optimize the original protocol. Multi-Paxos, Fast Paxos, Fast Flexible Paxos, Egalitarian Paxos, Vertical Paxos to name a few...
In the next article, I'll try to explain what happens once a value is decided. How can this be applied to implement a distributed database for instance.
Take care!