TAP: Time-aware Provenance for Distributed Systems (2011)

Network provenance---provenance, but over the network---allows network administrators to query the history and derivation of a tuple generated by some distributed system written in NDLog. Applying provenance to distributed systems has a number of challenges:

  1. Transient and inconsistent state. When a tuple is inserted or deleted from a declarative network, the provenance information has to be updated. This update is orchestrated by multiple nodes over the network, so it can take a while. Provenance queries that occur during the update may see inconsistent results.
  2. Explanations for state changes. Traditional provenance can answer why a particular query exists right now, but it can't answer why the tuple has appeared, disappeared, or changed.
  3. Security without trusted nodes. It would be nice if we could answer provenance queries even if some nodes in the system have been compromised by an attacker.

Time-aware Provenance

TAP models provenance as a graph which is stored in a collection of horizontally partitioned relations, similar to Efficient Querying and Maintenance of Network Provenance at Internet-Scale. The difference is in the vertices and edges.

Vertices

Each TAP vertex is one of four forms:

Edges

Edges in a TAP provenance graph represent dataflow just like they do in traditional provenance graphs. There are also special update edges that capture non-dataflow changes. For example, imagine the rule

foo(min<A>) :- bar(A)

If a new entry bar(B) is added which is smaller than the existing minimum bar(A), then the bar(A) entry is removed. This removal cannot be captured by a traditional dataflow edge, but it can be captured by an update edge.

Derivations

As with network provenance, NDLog programs can be rewritten to maintain provenance information at runtime.

Provenance Maintenance

There are three ways to physically store provenance information over time (in addition to the naive approach of copying the provenance information in its entirety for every timestep).

  1. Provenance deltas. We can store provenance deltas over time. This reduces the storage overhead but increases the cost of querying provenance.
  2. Per-node input logs. If our programs are deterministic, we can log the inputs to a program and recreate the provenance information on the fly.
  3. System input logs. We can record the base tuples of the entire system and replay the entire system on the fly. We can also checkpoint to avoid re-doing massive amounts of work.

There are a couple of factors which influence which storage format is best:

Provenance Querying

The authors are working on a query language called TapQL (inspired by ProQL) that allows people to concisely query provenance. TapQL queries are evaluated hierarchically with three layers:

  1. Macroqueries. TapQL queries can ask for the provenance of more than a single tuple. A macroquery iterates over all candidate queries that match the TapQL query and issues microqueries for each.
  2. Microqueries. A microquery recursively collates provenance by walking the provenance information relations.
  3. Vertex queries. Vertex queries return the vertices and edges incident to a given vertex.

We can perform a couple of query optimizations:

Secure Provenance Querying

It's future work to implement TAP in such a way that provenance can be queried even when some nodes have been compromised.

Commentary