Chain Replication for Supporting High Throughput and Availability (2004)

File systems and databases are two examples of storage systems. A storage service sits somewhere between a file system and database and is expected to

Chain replication is a simple algorithm to implement a distributed storage service with high throughput, high availability, and linearizability (yup, that's right; we've got a CA system on our hands) using fail-stop servers.

Storage Service Interface. A storage service supports two operations:

query(o, opts) performs a query against the object with id o returning something specified by opts. A query is idempotent and does not modify the object. update(o, n, opts) performs a (possibly non-deterministic) computation on the old value of o and n to produce a value v that is assigned to the object. It is not idempotent.

Storage services are allowed to ignore any request received from the user. Since this behavior is indistinguishable from a request being dropped by the network, clients must handle this behavior anyway.

Chain Replication Protocol. Servers are assumed to be fail-stop. That is, servers aren't Byzantine and servers an detect when other servers fail.

A chain replicated service with t + 1 nodes can tolerate t failures before sacrificing availability. The t + 1 nodes arrange themselves into a linear chain with the first node called the head and the last node called the tail. There are three components to the algorithm:

Ignoring failures for the moment, note that this provides linearizability. All operations are totally ordered by the tail. Also note that reads are cheap (they touch only 1 node) and writes are expensive (they touch all nodes).

Coping with Server Failure. Failures are handled by a single master which is assumed to never fail. In practice, this can be achieved by replicating the master using something like Paxos. The master

Let Hist_i be the history of an object at node i. The history of an object is just the sequence of operations applied to the object since the beginning of time. Let Hist_i <= Hist_j if the history at node i is a prefix of the history at node j. The Update Propagation Invariant says that if node i precedes node j in the chain then Hist_j <= Hist_i. In words, each node has a prefix of its predecessor's history. Using this invariant, we analyze three failure scenarios.

To avoid this, each node i maintains a set Sent_i of pending updates that it has sent downstream that may not have been processed yet by the tail. Whenever a node sends an update downstream, it adds it to Sent_i. When the tail processes an update, it sends an acknowledgement to its predecessor. Upon receiving an acknowledgement for an update, node i removes the update from Sent_i and forwards the acknowledgement to its predecessor.

When S fails, S- must sent Send_S- to S+ before it begins forwarding new updates. Doing this preserves the Update Propagation Invariant.

When a node fails, the number of non-crashed nodes in the chain decreases. The fewer the nodes, the more likely it is to lose availability. Thus, it is desirable to add nodes to the chain. A new node can be added anywhere in the chain, but it's easiest to do at the end. The new tail receives the history from the old tail. While this is happening, the old tail is free to process updates, so long as it records them in Sent_i. When the new tail is ready, the old tail forwards the pending operations and the master informs the servers and clients of the new node.

Primary/Backup Protocols. In a Primary/Backup system, a single master replicates operations to a set of backups in parallel. Chain replication is similar to primary/backup techniques, but there are a few differences. First, a query at a primary may need to stall waiting for a pending update to be acknowledged by a backup. With chain replication, queries never stall. On the other hand, chain replication has much higher update latency that primary/backup systems. Moreover, when nodes in a chain replicated system fail, there is typically fewer messages that have to be sent to resolve the failure compared to when a node fails in a primary/backup system.