Oftentimes, people have a bunch of legacy data in one schema and want to migrate it into a new system with a new schema. Typically, this schema migration is done with hand-written, one-off scripts that are finicky and hard to optimize. This paper presents an algorithm that semi-automatically infers SQL queries to perform the migration, using a small amount of feedback and interaction from a database administrator. The algorithm is implemented in a system called Clio.
Traditional schema mapping research focuses on generating a new schema to migrate to and simultaneously generating a query to perform the migration. Typically, this approach requires database administrators to specify schema assertions relating sets of values from the input database to sets of values in the target database (e.g. attribute $A$ is a subset of attribute $B$).
This paper takes an alternative approach and assumes that the target schema is already specified. Instead of relying on schema assertions, we rely on value correspondences, which describe how to generate tuples in the target database from tuples in the source database. More formally, a value correspondence is a pair of a correspondence function, which computes a target attribute value from one or more source attribute values, and a filter, which selects the source tuples the function is applied to.
For example, a correspondence function may prepend the '#' character to employee ids, while its filter drops the ids of all the fired employees.
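As a rough sketch, such a correspondence could be expressed in SQL as follows; the Employee relation and its EmpId and Status columns are hypothetical names invented for this illustration.

```sql
-- Sketch only: one way to express the example value correspondence in SQL.
-- Employee(EmpId, Status) is a hypothetical source relation.
SELECT '#' || e.EmpId AS EmpId   -- correspondence function: prepend '#'
FROM   Employee e
WHERE  e.Status <> 'fired';      -- filter: drop fired employees
```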
Value correspondences are arguably easier to specify than schema assertions. Database administrators do not have to know database-wide schema relations in order to perform a migration. They just have to know how to construct tuples in the output database. Moreover, value correspondences are more amenable to a programming-by-example system which recommends schema mappings.
Consider PayRate(Rank, HourlyRate) and WorksOn(Name, Proj, Hours, ProjRank) source relations and a target database with a single salary field. A user can click on the HourlyRate attribute and the Hours attribute and specify that the salary is the product of the hourly rate and the number of hours worked. This value correspondence says how to generate output tuples from input tuples, but it doesn't specify which input tuples should be paired and run through the correspondence function. That part is inferred by the schema mapping generator. In this example, it would infer that PayRate and WorksOn should be joined on Rank = ProjRank.
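As a sketch, the inferred mapping might look like the following SQL query; the target attribute name Sal is assumed here for illustration.

```sql
-- Sketch of the inferred mapping query: join PayRate and WorksOn on the
-- inferred condition, then apply the value correspondence
-- (salary = hourly rate * hours).
SELECT p.HourlyRate * w.Hours AS Sal
FROM   PayRate p, WorksOn w
WHERE  p.Rank = w.ProjRank;
```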
Now consider that there is another Professor(Id, Name, Sal) source relation and that a user specifies an identity value correspondence between Professor.Sal and the salary field in the output database. The schema mapping generator should infer that the results of the previous join should be unioned with the professors' salaries.
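Putting both pieces together, the generated mapping is roughly the union of one SELECT-FROM-WHERE query per group of value correspondences. A sketch, again assuming the target attribute is named Sal (UNION ALL is used so duplicate salary values are preserved):

```sql
-- Sketch of the combined mapping: one SELECT-FROM-WHERE block per group
-- of value correspondences, unioned together.
SELECT p.HourlyRate * w.Hours AS Sal
FROM   PayRate p, WorksOn w
WHERE  p.Rank = w.ProjRank
UNION ALL
SELECT pr.Sal AS Sal             -- identity correspondence on Professor.Sal
FROM   Professor pr;
```

This is exactly the shape of query produced by the algorithm described below: one SELECT-FROM-WHERE query per candidate set, with all the queries unioned together.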
Generalizing a bit, the schema mapping generator should follow two heuristics. First, when the value correspondences for a target relation draw on attributes from multiple source relations, those relations should be combined with a join on related attributes (like Rank = ProjRank above) rather than with a cross product. Second, when different value correspondences populate the same target attribute from alternative source relations, the results of the corresponding queries should be unioned (like the professors' salaries above).
There is a strong theoretical basis for the search space of schema mappings, but the paper omits the details. In short, there are two main classes of mappings:
First, a bit of notation:
Given a set $V$ of value correspondences and a target relation $T$, we only consider the value correspondences $f_i$ for which $Attrs(f_i)$ is an attribute of $T$. We generate multiple candidate sets of value correspondences, where each candidate set is mapped to a single SELECT-FROM-WHERE query. All queries are then unioned together. The procedure proceeds in four phases:
Given a cover $\Gamma$ and an insertion or deletion of a value correspondence $\Delta V$, we want to output a new schema mapping query. The incremental update algorithm proceeds in three phases.
This algorithm can be tweaked to work for output relations that nest relations within a column. Covers for every relation (nested or otherwise) are formed into a tree. Child queries may additionally need to join with their parents. See the paper for details.