January 08, 2016

Single-Decree Paxos

Paxos is an algorithm which maintains a distributed, consistent log shared by a set of networked computers. Single-Decree Paxos is a slightly simpler algorithm that solves consensus and is used to implement Paxos. Both algorithms were first described by Leslie Lamport in "The Part-Time Parliament" and later described more directly in "Paxos Made Simple". In this article, we describe what consensus is, why it's so hard, and how Single-Decree Paxos solves it. Note that we won't discuss full blown Paxos.

What is Consensus?

Assume computers are fail-stop and connected by an asynchronous network.

  • By fail-stop, we mean that any computer can crash at any time, and that any crashed computer can restart after any amount of time. When a computer crashes, it loses all of the data it is currently operating on, but computers can write values to stable storage and can recover these values upon restart.
  • By asynchronous, we mean that the network can drop, duplicate, re-order, and arbitrarily delay messages. We only assume that the network doesn't corrupt messages and that messages are eventually delivered if they are repeatedly sent.

Consider a set $\{a, b, c\}$ of computers that want to agree on a chosen value. Some computers propose values, and other computers accept values; some computers do both. For example, perhaps the computers want to choose a leader amongst themselves. $a$ could send a message to $b$, $c$, and itself proposing that $b$ should be leader, and all three computers could accept this proposal. The act of a set of computers choosing a single value is known as consensus.

Single-Decree Paxos is one example of an algorithm that can be used to reach consensus. In general, in order for a consensus algorithm to be safe, it has to meet a set of rather obvious conditions whenever it terminates:

  • Only one value can be chosen. Duh!
  • Only values proposed can be chosen. If this weren't a requirement, you could construct a rather silly yet still correct consensus algorithm in which all computers instantly agree on some predefined value.

Moreover, it's desirable that a consensus algorithm guarantee some form of progress. The Fischer, Lynch, Paterson impossibility result tells us that no consensus algorithm can guarantee that it always terminates given fail-stop computers in an asynchronous network, but we'd still like some promise that a given consensus algorithm usually terminates after sufficient time given enough computers haven't crashed. For example, a consensus algorithm that fails to terminate after a single computer failure doesn't guarantee much progress. On the other hand, a consensus algorithm that can still operate correctly even after a minority of computers have failed (e.g. Single-Decree Paxos) guarantees a stronger notion of progress.

Why is Consensus Hard?

Initially, consensus doesn't seem like that hard of a problem. Checking to see if a boolean formula is satisfiable, finding the minimum number of colors needed to color a graph, or checking to see if two graphs are isomorphic: these problems seem tough! Having computers choose a single value; seems kinda easy, huh? Well, it turns out that consensus is tougher than it sounds! To convince ourselves of this fact, let's consider a couple simple consensus algorithms we might think of and show why they fail to solve consensus.

Perhaps the simplest algorithm we could think of is to predetermine some leader which has the exclusive responsibility of choosing the value. Proposers send their proposals to the leader, and the leader accepts the first value it receives, deeming it chosen. While this algorithm is surely safe, it doesn't guarantee much progress. Whenever the leader fails, the algorithm is doomed to not terminate!

Here's a slightly more complicated algorithm that tries to guarantee a bit more progress. The main idea is that we can tolerate more computer failures by sending proposals to more computers. Proposers send proposals to all computers, and all computers accept the first value they receive. Whenever a majority of computers accept a proposal, we'll say it's chosen. This algorithm can still operate even when a minority of computers have crashed; yay! But unfortunately, if multiple proposers concurrently propose values to acceptors, it is possible to reach a split vote where no single proposal has a majority of votes. For example, consider a five computer cluster: $\{a, b, c, d, e\}$. Assume $a$, $c$, and $e$ propose values 1, 2, and 3 respectively. If $a$ and $b$ receive proposal 1 first, $c$ and $d$ receive proposal 2 first, and $e$ receives proposal 3 first, then none of proposal 1, 2, or 3 has a majority of acceptances. Thus, this algorithm can fail to terminate even when no computer fails!

Single-Decree Paxos

Now that we've familiarized ourselves with consensus and convinced ourselves that it's a challenging problem to solve, let's introduce Single-Decree Paxos: an algorithm which successfully solves consensus. We let every computer in our cluster act as a proposer and an acceptor. To tolerate a minority of computer failures, we'll say a value is chosen when it is accepted by a majority of acceptors. First, we discuss the invariants of the algorithm, then we discuss why the invariants imply the algorithm is safe, and finally we present the algorithm.

Invariants

Single-Decree Paxos maintains two invariants:

  1. We let each proposal be of the form $(v, i)$ where $v$ is an arbitrary proposed value and $i$ is an identifier. Our first invariant is that all proposal identifiers are unique. An easy way to construct unique identifiers is to have each computer $c$ maintain a monotonically increasing integer $i$. $i$ is stored on disk and is incremented after $c$ sends a proposal. Whenever computer $c$ sends a proposal, $c$ tags the proposal with the id $ci$. For example, if $a$ proposes values $\text{foo}$, then $\text{bar}$, then $\text{baz}$, its proposals would be of the form: $(\text{foo}, a1)$, $(\text{bar}, a2)$, and $(\text{baz}, a3)$. Also note that we can impose a total ordering on identifiers by comparing them lexicographically (e.g. $a1 < a2 < b1$).
  2. Consider a proposal $(v, i)$ that is sent to some majority $C$ of computers. Let $P$ be the set of proposals with an identifier smaller than $i$ accepted by any member of $C$. For example, if $(v, i) = (\text{apple}, d1)$, $C = \{a, b, c\}$, and $a$, $b$, and $c$ have accepted proposals $\{(\text{banana}, a1), (\text{grape}, b1), (\text{banana}, e1)\}$, $\{\}$, and $\{(\text{grape}, b1), (\text{peach}, c1)\}$ respectively, then $P = \{(\text{banana}, a1), (\text{grape}, b1), (\text{peach}, c1)\}$. Note that $(\text{banana}, e1) \notin P$ because $e1 > d1$. Let $v'$ be the value associated with the largest identifier in $P$. In our example, the largest identifier in $P$ is $c1$, so $v' = \text{peach}$. Our second invariant states that $v$ must equal $v'$. In other words for all proposals $(v, i)$, the value $v$ of the proposal sent to a majority of computers must equal the value of the proposal with the largest identifier less than $i$ accepted by any of the computers. Our simple example doesn't meet this invariant because the $\text{apple}$ in our proposal should be $\text{peach}$.

Why Invariants Imply Safety

Invariant 1 is rather simple and uninteresting. Invariant 2, on the other hand, is the true workhorse behind ensuring safety. Consider an execution of Single-Decree Paxos where $(v, i)$ is the first chosen proposal; that is $(v, i)$ is the first proposal accepted by some majority of acceptors, which we'll denote $C$. After $(v, i)$ is chosen, Invariant 2 tells us that all proposals $(v', i')$ with $i' > i$ will have $v = v'$! Here's why. Consider the first proposal $(v', i')$ issued to a majority $C'$ after $(v, i)$ is chosen. Invariant 2 tells us that $v'$ must be equal to the value of the proposal with the largest identifier less than $i'$ accepted by any computer in $C'$. Since $i'$ is the first proposal larger than $i$, $i$ is the largest identifier of any proposal accepted by any computer. Moreover, all computers in $C$ have accepted $(v, i)$ and since $C$ and $C'$ overlap, some computer in $C'$ must have accepted $(v, i)$ too. Putting these facts together, Invariant 2 says $v' = v$. We can apply this reasoning iteratively to see that every proposal with identifier $i' > i$ has $v' = v$. This means that after a value is chosen, the set of computers will never accept a value other than the chosen one because all larger proposals are proposals for the chosen value!

The Algorithm

Single-Decree Paxos is a two-phase protocol. Assume a proposer wants to propose some value $v_0$. In the first phase, proposers send a prepare message with identifier $i$ to a majority (or more) of acceptors, and the acceptors reply with the largest value they have accepted (if any) with identifier smaller than $i$. Once the proposer receives a majority of responses to its prepare request, it makes a decision. If none of the acceptors it contacted have accepted a value with identifier less than $i$, then it's free to propose $v_0$. Otherwise, if one or more acceptors have accepted some value with identifier less than $i$, it throws away $v_0$ and instead proposes the value with the largest identifier returned by the acceptors. This enforces Invariant 2.

Moreover, when an acceptor receives a prepare message with id $i$, it promises to never accept a proposal with identifier less than $i$. This is to ensure that between a proposer deciding a value to propose in the first phase and proposing it in the second phase, another proposer doesn't get some other value with a smaller identifier chosen. This also enforces Invariant 2.

In the second phase, the proposer sends an accept request with its proposed value determined in the first phase to a majority of acceptors. Acceptors accept a value if they haven't already promised in the first phase not to, and if a majority of acceptors accept the proposal, the value is chosen.