Lineage-driven Fault Injection (2015)

Building fault tolerant systems is hard, like really hard. There are are a couple approaches to building fault-tolerant systems, but none are perfect.

Lineage-driven Fault Injection (LDFI) is a top-down approach which uses lineage and fault injection to carefully discover fault-tolerance bugs in distributed systems. If a bug is found, the lineage of the bug is given to the user to help discover the root cause of the bug. If no bugs are found, LDFI provides some guarantees that there are no possible bugs for that particular configuration.

In a nutshell, LDFI takes a program written in something like Bloom, inputs to the program, and some parameters (e.g. to bound the length the execution, to bound the number of faults, etc.). It runs the given program on the given input and computes a provenance graph of the output. It then carefully selects a small number of faults that invalidate every derivation. It then injects these faults into the system to see if it surfaces a bug. This is repeated until a bug is found or no such bugs exist.

In this paper, Alvaro presents LDFI and an LDFI implementation named Molly.

System Model

LDFI injects faults, but not every kind of fault. We'll explore dropped messages, failed nodes, and network partitions. We won't explore message reordering or crash recovery. While we sacrifice generality, we gain tractability.

LDFI is governed by three parameters:

  1. LDFI does not operate over arbitrarily long executions. A parameter EOT (end of time) prescribes the maximum logical time of any execution examined by LDFI.
  2. Distributed systems typically tolerate some number of faults, but cannot possible tolerate complete message loss for example. A parameter EFF (end of finite failures) < EOT sets a logical time after which LDFI will not introduce faults. The time between EFF and EOT allows a distributed system to recover from message losses.
  3. A parameter Crashes sets an upper limit on the number of node crashes LDFI will consider.

A failure specification is a three-tuple (EOT, EFF, Crashes). Molly will automatically find a good failure specification by repeatedly increasing EOT until programs can create meaningful results and increasing EFF until faults occur.

We assume programs are written in Dedalus and that pre- and postconditions are expressed as special relations pre and post in the program.

LDFI

Consider an interaction between Molly and a user trying to implement a fault-tolerant broadcast protocol between three nodes A, B, and C where A begins with a message to broadcast. Our correctness condition asserts that if a message is delivered to any non-failed node, it is delivered to all of them.

Molly

Molly begins by rewriting a Dedalus program into a Datalog program.

It then rewrites the Datalog program to maintain its own provenance graph, and extracts lineage from the graph via recursive queries that walk the graph.

Given an execution of the Datalog program, Molly generates a CNF formula where the disjuncts inside each conjunct x1 or .. or xn represent a message drop or node failure that would invalidate a particular derivation. If all derivations can be invalidated, then the formula is unsatisfiable and the program is fault-tolerant. If the formula is satisfiable, then each satisfied conjunct represents a counterexample of the derivation. Molly uses a SMT solver to solve these formulas.