Data analysts who interact with OLAP databases issue a lot of GROUP BY queries over huge amounts of data. These aggregates are often too expensive to compute interactively, so analysts resort to computing them ahead of time.
Analysts can do so using the CUBE operator, which builds a family of aggregates all at once. Given a set of grouping columns and a set of aggregates, the CUBE operator performs a GROUP BY on every subset of the grouping columns. For example, the following query:
```
CUBE Product, Year, Customer
BY SUM(Sales)
```
will compute the sum of sales for all eight subsets of Product, Year, and Customer. Each of these eight GROUP BYs is called a cuboid. The GROUP BY on all three attributes is called the base cuboid. This paper presents a heuristic-based algorithm to efficiently compute a cube. It also proves that finding an optimal way to compute a cube is NP-hard.
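To make this concrete, here's a minimal Python sketch of what the query above computes: one GROUP BY SUM per subset of the grouping columns. The `cube` function and the toy `sales` rows are illustrative, not from the paper.

```python
from collections import defaultdict
from itertools import combinations

def cube(rows, group_cols, agg_col):
    """Naively compute every cuboid: one GROUP BY SUM per subset of group_cols."""
    cuboids = {}
    for k in range(len(group_cols) + 1):
        for cols in combinations(group_cols, k):
            groups = defaultdict(int)
            for row in rows:
                key = tuple(row[c] for c in cols)
                groups[key] += row[agg_col]
            cuboids[cols] = dict(groups)
    return cuboids

sales = [
    {"Product": "pen", "Year": 2020, "Customer": "ann", "Sales": 3},
    {"Product": "pen", "Year": 2021, "Customer": "bob", "Sales": 4},
    {"Product": "cup", "Year": 2020, "Customer": "ann", "Sales": 5},
]
cuboids = cube(sales, ("Product", "Year", "Customer"), "Sales")
assert len(cuboids) == 8  # one cuboid per subset, including the empty GROUP BY
```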
Throughout this paper we assume that aggregate functions $F$ are homomorphic. That is, given two disjoint sets of tuples $X$ and $Y$, $F(X \cup Y) = F(X) \oplus F(Y)$ for some combining operator $\oplus$. For example, sum is homomorphic, but median is not.
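Here's a quick illustration of the difference, using Python lists to stand in for disjoint sets of tuples:

```python
import statistics

X = [1, 2, 3]
Y = [10, 20]

# sum is homomorphic: combining the per-set sums with + gives the total sum.
assert sum(X + Y) == sum(X) + sum(Y)

# median is not: in general no operator can combine the two sub-medians.
# median(X + Y) = 3, but median([median(X), median(Y)]) = median([2, 15]) = 8.5.
assert statistics.median(X + Y) != statistics.median(
    [statistics.median(X), statistics.median(Y)]
)
```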
There are three methods of computing cuboids, but this paper only discusses how to compute cuboids using sorting. Sorting is desirable because a cuboid derived from a sorted parent cuboid is itself sorted. Hash-based methods exist too but are not discussed here.
First, a bit of terminology. Consider a cuboid $S = (A_1, \ldots, A_{l-1}, A_{l+1}, \ldots, A_j)$ computed from a parent $B = (A_1, \ldots, A_j)$ by stripping attribute $A_l$. A partition is a maximal set of tuples of $B$ that agree on the common prefix $A_1, \ldots, A_{l-1}$.
Note that cuboids are formed as a union of disjoint partitions, and because cuboids are totally ordered, each partition appears as a contiguous run of tuples.
The overlap method computes and sorts the base cuboid. It then derives every other cuboid from a parent. If a cuboid's partitions fit in memory, we can create the cuboid in a single pass, partition by partition. For example, consider the following parent cuboid $(A, B, C)$ from which we want to compute the cuboid $(A, C)$.
A | B | C |
---|---|---|
1 | 1 | 3 |
1 | 1 | 4 |
1 | 2 | 1 |
1 | 2 | 2 |
2 | 3 | 4 |
2 | 3 | 9 |
We scan the first partition (the tuples with $A = 1$), project it onto $(A, C)$ to get $(1, 3)$, $(1, 4)$, $(1, 1)$, $(1, 2)$, and sort it to get $(1, 1)$, $(1, 2)$, $(1, 3)$, and $(1, 4)$. We then scan and sort the second partition, and so on.
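Here's a sketch of this single-pass, partition-by-partition derivation, assuming each parent tuple carries a partial SUM(Sales) aggregate (the aggregate values are made up for illustration):

```python
from itertools import groupby

# Parent cuboid (A, B, C), sorted, with each tuple's SUM(Sales) aggregate.
parent = [
    (1, 1, 3, 10), (1, 1, 4, 20), (1, 2, 1, 30), (1, 2, 2, 40),
    (2, 3, 4, 50), (2, 3, 9, 60),
]

child = []
# Each partition is a maximal run of tuples agreeing on A, the common prefix.
for _, partition in groupby(parent, key=lambda t: t[0]):
    acc = {}
    for a, b, c, agg in partition:
        # SUM is homomorphic, so partial aggregates merge with +.
        acc[(a, c)] = acc.get((a, c), 0) + agg
    # The partition fits in memory: sort it and emit it immediately.
    child.extend(sorted(acc.items()))

print(child)  # [((1, 1), 30), ((1, 2), 40), ((1, 3), 10), ((1, 4), 20), ...]
```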
Every cuboid (except the base cuboid) typically has multiple parents to choose from. Ideally, we pick the parent cuboid that yields the smallest partitions, since small partitions are more likely to fit in memory. For example, $(A, C, D)$ is a much better parent than $(A, B, C)$ for building the cuboid $(A, C)$: it partitions on the prefix $(A, C)$ rather than just $(A)$. As a heuristic, we choose the parent whose stripped attribute is furthest right.
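Here's a sketch of the heuristic, assuming cuboids are represented as tuples of attribute names listed in base-cuboid order (`best_parent` is an illustrative name):

```python
def best_parent(base, cuboid, parents):
    """Pick the candidate parent whose stripped attribute (the one attribute
    in the parent but not in the cuboid) is furthest right in the base."""
    def stripped_position(parent):
        stripped = (set(parent) - set(cuboid)).pop()
        return base.index(stripped)
    return max(parents, key=stripped_position)

base = ("A", "B", "C", "D")
# (A, C, D) strips D (furthest right), so it beats (A, B, C), which strips B.
assert best_parent(base, ("A", "C"), [("A", "B", "C"), ("A", "C", "D")]) == ("A", "C", "D")
```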
As mentioned above, if the largest partition of a cuboid fits in memory, then we can compute the cuboid in a single pass, partition by partition. We build up a sorted partition and update aggregates along the way. Once a partition is built, we can immediately emit it to the cuboid's children.
If a cuboid's partitions do not fit in memory, then we instead use a single buffer page to write sorted runs to disk. Later, we merge all the sorted runs together using an external sort.
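Here's a sketch of the sorted-run fallback. For brevity the runs live in memory; in the actual algorithm each run would be flushed to disk through the single buffer page, and `derive_with_sorted_runs` is an illustrative name:

```python
import heapq

def derive_with_sorted_runs(pairs, run_size):
    """Sort fixed-size chunks of (key, aggregate) pairs into runs, then merge
    the runs and combine equal keys, as in an external merge sort."""
    runs = [sorted(pairs[i:i + run_size]) for i in range(0, len(pairs), run_size)]
    out = []
    for key, agg in heapq.merge(*runs):
        if out and out[-1][0] == key:
            out[-1] = (key, out[-1][1] + agg)  # SUM is homomorphic: merge with +
        else:
            out.append((key, agg))
    return out

# (A, C) keys with partial sums, in the order they stream out of the parent.
pairs = [((1, 3), 10), ((1, 4), 20), ((1, 1), 30), ((1, 2), 40)]
print(derive_with_sorted_runs(pairs, run_size=2))
# [((1, 1), 30), ((1, 2), 40), ((1, 3), 10), ((1, 4), 20)]
```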
Given a fixed amount of memory, we have to decide which cuboids to compute on a given pass and whether to compute them with in-memory partitions or with sorted runs. Finding an optimal schedule is NP-hard (see the next section). As a heuristic, we can perform a breadth-first search on the parent tree, going from left to right.
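Here's a sketch of that traversal, again with cuboids as tuples of attribute names. It builds the parent tree using the stripped-attribute heuristic from above and ignores the memory-budgeting decision:

```python
from collections import defaultdict, deque
from itertools import combinations

def bfs_schedule(base):
    """Attach every cuboid to the parent whose stripped attribute is furthest
    right, then walk the resulting parent tree breadth first, left to right."""
    levels = [list(combinations(base, k)) for k in range(len(base) + 1)]
    children = defaultdict(list)
    for k in range(len(base)):
        for c in levels[k]:
            parents = [p for p in levels[k + 1] if set(c) <= set(p)]
            # The stripped attribute is the one attribute in the parent but not c.
            parent = max(parents, key=lambda p: base.index((set(p) - set(c)).pop()))
            children[parent].append(c)

    order, queue = [], deque([base])
    while queue:
        cur = queue.popleft()
        order.append(cur)
        queue.extend(sorted(children[cur]))  # visit siblings left to right
    return order

print(bfs_schedule(("A", "B", "C")))
# [('A','B','C'), ('A','B'), ('A','C'), ('B','C'), ('A',), ('B',), ('C',), ()]
```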
The paper performs a reduction from the knapsack problem to a version of the knapsack problem with $2^m$ objects for some $m$. It then reduces this version of knapsack to the problem of finding an optimal schedule for cuboid computation. See the paper for details; it's complicated.