Traditionally, query compilation is done using template expansion. Fragments of a SQL query are pattern matched and substituted with one of a set of hand-written machine code fragments. Template expansion
LegoBase is a query engine written in Scala that avoids these pitfalls. LegoBase compiles the heck out of a query plan (also written in Scala) into a C program that can execute the query plan. LegoBase performs a number of optimizations that other query compilers do not do. Plus, it's written in a high-level language!
Consider the following SQL query:
SELECT *
FROM (SELECT S.D,
SUM(1-S.B) AS E,
SUM(S.A*(1-S.B)),
SUM(S.A*(1-S.B)*(1+S.C))
FROM S
GROUP BY S.D) T, R
WHERE R.Z = T.E AND R.B = 3
A traditional query compiler will likely miss the following optimization opportunities:
1-S.B
, S.A*(1-S.B)
) which can be eliminated.T
. Then, if the join is a hash join, another hash table will be created to store T
. We can materialize T
into a single hash table rather than materializing it twice.T
into a hash table, we can play some tricks and use an array instead.LegoBase makes heavy use of Lightweight Modular Staging (LMS): a library for compiling and optimizing (at runtime) high-level Scala fragments into lower-level Scala fragments. For example, this Scala code:
def snippet(x: Rep[Int]) = {
def compute(b: Boolean): Rep[Int] = {
if (b) 1 else x
}
compute(true)+compute(1==1)
}
is compiled to this:
def snippet(x1:Int): Int = {
2
}
and this code:
def snippet(x: Rep[Int]) = {
def compute(b: Rep[Boolean]): Rep[Int] = {
if (b) 1 else x
}
compute(x==1)
}
is compiled to this:
def apply(x3:Int): Int = {
val x4 = x3 == 1
val x5 = if (x4) { 1 } else { x3 }
x5
}
In short, LMS partially evaluates as much of the Scala code as it can while leaving the expressions of Rep[T]
to be computed by the compiled code.
LMS compiles Scala to Scala. LegoBase extends LMS by compiling LMS-produces Scala into C code. Moreover, LegoBase introduces a set of database-specific optimizations.
A LegoBase-equipped database begins by converting a SQL query into a query plan using an off-the-shelf query optimizer. It then takes the query plan and replaces each operator with a LegoBase operator. It then compiles the heck out of the query plan using LMS and other tricks to produce a C program which can execute the query.
This architecture also allows LegoBase to re-compile queries based on changing runtime information. For example, the compiled code does not need to have any conditionals checking whether a log
flag is set. Instead, LegoBase can recompile the code every time the log
flag is toggled.
Lowering and compiling LMS-produced Scala code is reasonably straightforward. High-level features like objects and inheritance are compiled away, while low-level features like loops and conditionals are translated in the obvious way. There are two bits of complication though:
Next, we take a look at the optimizations LegoBase provides.
Some have argued that a Volcano-style pull based model of execution incurs unnecessary overhead. A faster approach is for data to be repeatedly pushed upwards through a tree until it is materialized (e.g. in a hash table somewhere). LegoBase converts pull based operators into push based operators.
First, callees have to be converted to callers. Instead of the next
method returning a single tuple, next
instead repeatedly (i.e. in a while loop) generates a tuple t
and calls parent.next(t)
. Second, callers have to be converted to callees. The next
method is converted to a next(t: Record)
method which accepts a tuple pushed by a child.
As mentioned earlier, naive implementations of certain query plans can unnecessarily materialize a tuple in multiple places. LegoBase allows you to pattern match on a query plan and eliminate redundant materializations. For example, one pattern might match hash joins whose left child is an aggregate and replace it with a single operator that only materializes the aggregates once.
It's convenient to write operators using hash tables, but hash tables sometimes be a bit slow. LegoBase can convert hash tables to arrays whose size is set based on runtime estimates of query size. It can also inline hash and equality functions used by the hash table.
LegoBase can automatically convert operators that use a row-oriented data layout into operators that use a column-oriented data layout. If all queries are known ahead of time, the optimizer can even remove unneeded columns.