Network provenance is like provenance, but over the network. It lets you point at a piece of state in a distributed protocol and ask how it was derived. Network provenance is useful to debug, find performance bottlenecks, and so on. This paper makes the following contributions:
Declarative networks are distributed systems (or protocols) implemented using NDLog: a variant of Datalog that is very similar to Bloom.
Network provenance can be characterized by three axes:
The provenance of a tuple t
is represented as a graph (V,E)
where the vertex set V
consists of tuple vertices (representing base and intermediate tuples) and rule execution vertices (representing rule instantiations). An edge from tuple vertex t
to rule execution vertex r
signifies that t
is an input to r
. An edge from r
to t
represents that r
derived t
. Essentially, we take a proof tree
w(b,c)
------ [r2]
y(a,b) z(b,c)
-------------- [r1]
x(a)
and convert everything into a vertex:
[w(b,c)]
||
[r2]
||
[y(a,b)] [z(b,c)]
\\ //
[r1]
||
[x(a)]
Each tuple vertex foo(a, b, c)
is assigned a VID SHA-1(foo + a + b + c)
. Each rule execution vertex r@x
with children a, b, c
is assigned a RID SHA-1(r + x + a + b + c)
. Provenance information is then stored in two tables:
prov(@Loc, VID, RID, RLoc)
, andruleExec(@RLoc, RID, R, VIDList)
where
prov(@LOC, VID, RID, RLoc)
signifies that the tuple with VID VID
resides at LOC
and can be derived using the rule with RID RID
which resides at RLoc
; andruleExec(@RLoc, RID, R, VIDList)
signifies that the rule with RID RID
resides at RLoc
, has label R
, and has children with VIDs in VIDList
.With these two tables, we can answer provenance queries (more on this later). When tuples are sent from one node to another, they include the RID
and RLoc
of the rule that derived them. This approach is known as the reference-based approach to network provenance. In alternative approach, known as the value-based approach, each tuple carries around all entries of prov
and ruleExec
needed to form its provenance.
Given an NDLog program, ExSPAN rewrites each inference rule into a set of inference rules which maintains the prov
and ruleExec
rules. That is, whenever a tuple is inserted or deleted, the corresponding entries of prov
and ruleExec
are simultaneously updated. The specifics of the rewrite are complicated; see the paper for details.
Users express a provenance query in the form eProvQuery(@X, QID, VID, Ret)
which returns the provenance of tuple with VID VID
residing at X
to the node Ret
. The query is also given a unique identifier QID
. ExSPAN can answer a query like this with 10 NDLog rules (see paper for details). These ten rules are parametrized on three predicates---f_pEDB
, f_pIDB
, and f_pRULE
---which users can use to customize the result of the query. For example, they can use them to return provenance semirings or return a count of the number of derivations for a particular tuple.
ExSPAN performs optimizations to reduce the latency and bandwidth requirements of provenance queries.