Network provenance---provenance, but over the network---allows network administrators to query the history and derivation of a tuple generated by some distributed system written in NDLog. Applying provenance to distributed systems has a number of challenges:
TAP models provenance as a graph which is stored in a collection of horizontally partitioned relations, similar to Efficient Querying and Maintenance of Network Provenance at Internet-Scale. The difference is in the vertices and edges.
Each TAP vertex is one of four forms:
Insert(n, $\tau$, t)
represents that tuple $\tau$ was created at node n
at time t
.Delete(n, $\tau$, t)
represents that tuple $\tau$ was deleted at node n
at time t
.Derive(n, $\tau$, R, t)
represents that tuple $\tau$ was derived using rule R
at node n
at time t
.Underive(n, $\tau$, R, t)
represents that tuple $\tau$ was underived using rule R
at node n
at time t
.Edges in a TAP provenance graph represent dataflow just like they do in traditional provenance graphs. There are also special update edges that capture non-dataflow changes. For example, imagine the rule
foo(min<A>) :- bar(A)
If a new entry bar(B)
is added which is smaller than the existing minimum bar(A)
, then the bar(A)
entry is removed. This removal cannot be captured by a traditional dataflow edge, but it can be captured by an update edge.
As with network provenance, NDLog programs can be rewritten to maintain provenance information at runtime.
There are three ways to physically store provenance information over time (in addition to the naive approach of copying the provenance information in its entirety for every timestep).
There are a couple of factors which influence which storage format is best:
The authors are working on a query language called TapQL (inspired by ProQL) that allows people to concisely query provenance. TapQL queries are evaluated hierarchically with three layers:
We can perform a couple of query optimizations:
It's future work to implement TAP in such a way that provenance can be queried even when some nodes have been compromised.
In Section 3, the paper makes the following claim. I don't understand how TAP remembers dependencies between tuples that existed at some point in the past. The paper never gave an example showing how a tuple remembers derivations from tuples in the past. Moreover, I don't understand how that helps prevent inconsistent queries that arise because of "transient state".
TAP also remembers dependencies between tuples that existed at some point in the past, which enables TAP to provide consistent answers to provenance queries even while the system is in a transient state.
Delete(c,minCost(@c,a,5),t3)
vertex. I'm confused how the new vertices provide anything new.Section 4 describes that because TAP provenance has an additional time field, we have to be clever about how we store provenance information in order to avoid storing a huge amount of data. Again, it's not clear to me why the time field increases the storage overhead.