Paxos Made Moderately Complex

Updated on 2021-08-20

Paxos is a distributed consensus protocol for deterministic state machine replication in an asynchronous environment (i.e., there are no bounds on timing) that admits crash failures.

The discussion below is accompanied by my unfinished reference implementation in Rust.

Check out TLA+ specifications of Paxos protocols, e.g., Paxos. It is interesting to observe how the algorithmic presentation of Paxos differ from the declarative TLA+ specification.

Safety

Paxos shines for its simplicity in correctness reasoning.

A replica may receive multiple decisions for the same slot. Paxos guarantees that, for any slot s, any decision for s received by any replica contains the same command. Namely, for ∀s: slot, ∀ρ1 ρ2: replica, and ∀c1 c2: command, if (s, c1) is in ρ1.decisions and (s, c2) is in ρ2.decisions, then c1 = c2.

This follows from the following property of leaders. Suppose that for ∀α:acceptor among a majority of acceptors, (b, s, c) in α.accepted. Then, ∀b b’: ballot, if b’ > b and a commander is spawned for (b’, s, c’) for some c’, then c = c’.

To understand Paxos, we need to understand the proof of this property. It is proved by induction on b’.

Suppose a commander is spawned for (b’, s, c’). Hence, a majority of acceptors have adopted b’.

  • Suppose the leader did not receive any pvalue for s. This means a majority of acceptors have not and will not accepted b. Contradiction.

  • Let (b’‘, s, c’‘) be the maximum pvalue received. We have c’’ = c’.

    • If b’’ = b, then c’’ = c.
    • Suppose b’’ < b. Since b’’ < b’, by induction hypothesis, c’’ = c.

Termination

Paxos is no exception to the FLP impossibility result, and Paxos does not guarantee termination. For instance, two leaders can preempt each other forever, causing livelock. However, f all leaders but one fail, Paxos is guaranteed to terminate, if we assume that messages are eventually delivered (but possibly out of order).

Under fairly weak assumptions about timing, Paxos can be made guaranteed to choose a proposal. We assume the clock drift of processes and message delivery delay are bounded. (We don’t need to know what those bounds are, but only that such bounds exist.) When a leader λ discovers (via a preempted message) that there is a higher ballot with leader λ’ active, rather than starting a new scout with an even higher ballot number, λ starts monitoring λ’ by pinging it on a regular basis. Only if λ’ stops responding, λ selects a higher ballot number and starts a scout. The value of timeout interval can be chosen with a TCP-like additive increase, multiplicative decrease approach. If a ballot gets preempted, the value of the timeout interval is multiplied by some small factor larger than 1. With each chosen proposal, the timeout is decreased linearly.

Leader-replica colocation

In practice, each machine that runs a replica also runs a leader. When this is the case, the replica can send proposals to its local leader, say λ, rather than broadcasting the request to all leaders. If λ is passive, monitoring another leader λ’, it forwards the proposals to λ’.

References

Paxos Made Moderately Complex (ACM Computing Surveys 2015)