Building fault tolerant systems is hard, like really hard. There are are a couple approaches to building fault-tolerant systems, but none are perfect.
Lineage-driven Fault Injection (LDFI) is a top-down approach which uses lineage and fault injection to carefully discover fault-tolerance bugs in distributed systems. If a bug is found, the lineage of the bug is given to the user to help discover the root cause of the bug. If no bugs are found, LDFI provides some guarantees that there are no possible bugs for that particular configuration.
In a nutshell, LDFI takes a program written in something like Bloom, inputs to the program, and some parameters (e.g. to bound the length the execution, to bound the number of faults, etc.). It runs the given program on the given input and computes a provenance graph of the output. It then carefully selects a small number of faults that invalidate every derivation. It then injects these faults into the system to see if it surfaces a bug. This is repeated until a bug is found or no such bugs exist.
In this paper, Alvaro presents LDFI and an LDFI implementation named Molly.
LDFI injects faults, but not every kind of fault. We'll explore dropped messages, failed nodes, and network partitions. We won't explore message reordering or crash recovery. While we sacrifice generality, we gain tractability.
LDFI is governed by three parameters:
A failure specification is a three-tuple (EOT, EFF, Crashes). Molly will automatically find a good failure specification by repeatedly increasing EOT until programs can create meaningful results and increasing EFF until faults occur.
We assume programs are written in Dedalus and that pre- and postconditions are expressed as special relations pre
and post
in the program.
Consider an interaction between Molly and a user trying to implement a fault-tolerant broadcast protocol between three nodes A
, B
, and C
where A
begins with a message to broadcast. Our correctness condition asserts that if a message is delivered to any non-failed node, it is delivered to all of them.
A
sends a copy of the message to B
and C
once. Molly drops the message from A
to B
inducing a bug.A
continually sends the message to B
and C
. Molly drops the message from A
to B
and then crashes A
. C
has the message but B
doesn't: a bug.Molly begins by rewriting a Dedalus program into a Datalog program.
Each relation is augmented with a time column.
foo(A,B) ==> foo(A,B,T)
foo(B,C) ==> bar(B,C,T)
baz(A,C) ==> baz(A,C,T)
The time column of every predicate in the body of a rule is bound to the same variable T
.
_ :- foo(A,B), bar(B,C) ==>
_ :- foo(A,B,T), bar(B,C,T)
The head of every deductive rule is bound to T
.
baz(A,C) :- foo(A,B), bar(B,C) ==>
baz(A,C,T) :- foo(A,B,T), bar(B,C,T)
The head of every inductive rule is bound to T + 1
.
baz(A,C) :- foo(A,B), bar(B,C) ==>
baz(A,C,T+1) :- foo(A,B,T), bar(B,C,T)
For asynchronous rules, we introduce a Clock(From, To, T)
relation which contains an entry (n, m, T)
if node n
sent a message to m
at time T
. Then, the body of asynchronous rules at node n
whose heads are destined for node n
are augmented with a Clock(n, m, t)
predicate while the head is augmented with T + 1
. Molly can add and remove entries from Clock
to simulate faults.
foo(A,B) :- foo(B,A) ==>
foo(A,B,T+1) :- foo(B,A,T), Clock(B,A,T)
It then rewrites the Datalog program to maintain its own provenance graph, and extracts lineage from the graph via recursive queries that walk the graph.
Given an execution of the Datalog program, Molly generates a CNF formula where the disjuncts inside each conjunct x1 or .. or xn
represent a message drop or node failure that would invalidate a particular derivation. If all derivations can be invalidated, then the formula is unsatisfiable and the program is fault-tolerant. If the formula is satisfiable, then each satisfied conjunct represents a counterexample of the derivation. Molly uses a SMT solver to solve these formulas.