Summary. Decibel is like git for relational data.
Often, teams need to simultaneously query, analyze, clean, or curate a collection of data without clobbering each other's work. Currently, the best available solution is for each team member to make a copy of the entire database. This is undesirable for a number of reasons: each copy wastes storage, and changes made to one copy are difficult to track, share, and merge into the others.
Version control solves these problems, but existing version control systems (e.g. git) have a couple of shortcomings when applied naively to large datasets: they version data at the granularity of lines in a file rather than records in a relation, and they provide no way to query the data, let alone query it across versions.
Decibel manages datasets, which are collections of relations; each relation's schema includes an immutable primary key that is used to track the version of each row. Beginning with an initial snapshot of a dataset, users can check out, branch, and commit changes to the dataset in a style very similar to git. When data is merged, non-conflicting changes to separate columns of the same row are both applied; if conflicting changes are made to the same column of the same row, one branch's changes take priority. Diffing a table across two branches yields a table of insertions and deletions. Finally, data can be queried across versions using VQuel, a query language which is notably not SQL.
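To make the merge rule concrete, here is a minimal sketch in Python of a column-wise, three-way merge of a single row (rows as dicts and the function name are my own illustration, not Decibel's implementation):

```python
def merge_rows(base, ours, theirs):
    """Merge two versions of a row against their common ancestor.

    Non-conflicting changes to separate columns are both applied; if
    both branches changed the same column, "ours" wins, mirroring the
    rule that one branch's changes take priority.
    """
    merged = dict(base)
    for col in base:
        ours_changed = ours[col] != base[col]
        theirs_changed = theirs[col] != base[col]
        if ours_changed:
            merged[col] = ours[col]      # ours wins on conflict
        elif theirs_changed:
            merged[col] = theirs[col]    # only theirs changed
    return merged

# Rows share an immutable primary key, so a merge pairs rows by key.
base   = {"key": 0, "a": 0, "b": 0}
ours   = {"key": 0, "a": 9, "b": 0}   # one branch's version
theirs = {"key": 0, "a": 0, "b": 9}   # another branch's version
assert merge_rows(base, ours, theirs) == {"key": 0, "a": 9, "b": 9}
```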
This paper describes three physical realizations of Decibel's logical data model: tuple-first, version-first, and hybrid. In the tuple-first representation, the tuples of every branch are stored together in a single table, and bitmaps record which tuples belong to which branches. In a tuple-oriented approach, each of the N tuples comes with a B-bit bitmap for the B branches; in a branch-oriented approach, there are instead B N-bit bitmaps, one per branch. The tuple-first representation is good for multi-version queries, the version-first representation is good for single-version queries, and the hybrid approach is good at both.
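Here is a toy sketch of the two bitmap layouts (the data structures and function names are my own illustration, not Decibel's implementation). Note how a single-version scan reads one contiguous N-bit bitmap in the branch-oriented layout but must probe every tuple's bitmap in the tuple-oriented one:

```python
from typing import List

N, B = 6, 3  # N tuples, B branches

# Tuple-oriented: each of the N tuples carries a B-bit bitmap.
tuple_bitmaps: List[int] = [0b000] * N

# Branch-oriented: each of the B branches carries an N-bit bitmap.
branch_bitmaps: List[int] = [0] * B

def add_to_branch(tuple_id: int, branch_id: int) -> None:
    # Keep both layouts in sync for the comparison below.
    tuple_bitmaps[tuple_id] |= 1 << branch_id
    branch_bitmaps[branch_id] |= 1 << tuple_id

def scan_branch_tuple_oriented(branch_id: int) -> List[int]:
    # Single-version scan tests one bit in every tuple's bitmap.
    return [t for t in range(N) if tuple_bitmaps[t] >> branch_id & 1]

def scan_branch_branch_oriented(branch_id: int) -> List[int]:
    # Single-version scan reads a single N-bit bitmap.
    bm = branch_bitmaps[branch_id]
    return [t for t in range(N) if bm >> t & 1]

add_to_branch(0, 1)
add_to_branch(4, 1)
assert scan_branch_tuple_oriented(1) == scan_branch_branch_oriented(1) == [0, 4]
```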
This paper also presents a benchmarking framework for versioned databases: a set of branching strategies and characteristic queries with which to analyze Decibel and any future versioned database.
Commentary. A git-like versioned relational database seems like a great idea! This paper concentrates on how to implement and analyze such a database efficiently; it spends less time on the higher-level semantics of data versioning. Notably, one unanswered question stuck out to me that seems like an interesting area for future research.
How should merges handle integrity constraints? Consider a relation R(key, a, b) with the invariant that a + b < 10. Imagine branch b1 updates a tuple (0, 0, 0) to (0, 9, 0) and branch b2 updates it to (0, 0, 9). If these branches are merged, the tuple becomes (0, 9, 9), violating the invariant. In systems like git, users can manually inspect and verify the correctness of a merge, but if a large dataset is being merged, this becomes infeasible. I think more research is needed to determine whether this merging strategy is good enough to avoid problems in practice and, if it is insufficient, which merging strategies can be applied to large datasets. Perhaps ideas could be borrowed from the CRDT literature: if all columns were semilattices, the columns could be merged in a sane way.
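To gesture at the CRDT idea: if each column's values form a join-semilattice, a merge can take the least upper bound per column, which makes merges deterministic and symmetric, so no branch needs "priority" (though, as the example above shows, this alone still would not preserve arbitrary invariants like a + b < 10). A minimal sketch, with hypothetical columns using max and set-union as their joins:

```python
# A join-semilattice merge: each column declares a join (least upper
# bound) operator that is associative, commutative, and idempotent.
joins = {
    "last_login": max,              # timestamps: keep the later one
    "tags": lambda x, y: x | y,     # sets: union
}

def lattice_merge(ours, theirs):
    """Merge two versions of a row column-wise via each column's join.

    Because every join is commutative and idempotent, merging b1 into
    b2 and b2 into b1 produce the same row.
    """
    return {col: joins[col](ours[col], theirs[col]) for col in ours}

b1 = {"last_login": 1700000000, "tags": {"clean"}}
b2 = {"last_login": 1700000500, "tags": {"curated"}}
assert lattice_merge(b1, b2) == lattice_merge(b2, b1) == {
    "last_login": 1700000500,
    "tags": {"clean", "curated"},
}
```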