- Overview
- Note: This paper describes how to efficiently materialize an entire cube. The “Implementing Data Cubes Efficiently” paper discusses how to decide what subset of a cube to materialize given a set of constraints. They are different.
- The query
CUBE Product, Year, Customer BY SUM(Sales)
calculates a group by query for every subset of {Product
, Year
, Customer`}.
- One of these group bys is called a cuboid. The group by with all attributes is called the base cuboid.
- Options for Computing the CUBE
- Note that the aggregate functions for the cube must be homomorphic (i.e. $F(X \cup Y) = F(X) \cup F(Y)$).
- Multiple Independent Group-By Queries (Independent Method): We compute and materialize the base cuboid and then compute all other cuboids using it. We can use a standard sort based or hash based group-by implementation.
- Hierarchy of Group-By Queries (Parent Method): We can compute a cuboid on attributes $X$ from a cuboid on attributes $Y \supset X$. In this method, we independently compute each cuboid using a parent cuboid.
- Overlap Method: We use the parent method but concurrently compute multiple cuboids at once. This method is the focus of the paper.
- Overview of the Overlap Method
- Consider the following cuboid (A, B, C) and assume we want to compute the cuboid (A, C). If memory permits, we will read in one partition of the cuboid into memory at a time, sort the cuboid, and then append it to disk. A partition is just a sequence of the cuboid that shares the same prefix up to but not including the dropped column in the child cuboid (i.e. B) with the dropped column dropped. Below, partitions are color coded.
- If a partition does not fit into memory, we can read in the partitions sorted run by sorted run. A sorted run is a subsequence of a partition that shares the same prefix up to and including the dropped column in the child cuboid (i.e. B) with the dropped column dropped. We write out each sorted run and then externally sort them to form a sorted partition. Below, sorted runs within a partition are separated with a thick black line.
A
|
B
|
C
|
1
|
1
|
2
|
1
|
1
|
3
|
1
|
2
|
2
|
2
|
1
|
3
|
2
|
3
|
2
|
3
|
3
|
1
|
- Choosing a Parent to Compute a Cuboid
- To compute the cuboid (B, C), we prefer the parent (B, C, D) over (A, B,
- to minimize the partition size.
- Thus, we choose the parent whose dropped attribute is furthest to the right.
- Choosing a Set of Cuboids for Overlapped Computation
- If a cuboid is evaluated partition-by-partition, then some of its children can be evaluated concurrently.
- Given a limited amount of memory, we need to choose which cuboids to evaluate in parallel. Finding an optimum schedule is NP-hard, but an eager approach walks breadth first left-to-right down the subset lattice.
- Some Important Issues
- Incorrect estimation: partition sizes may have been estimated incorrectly. At runtime,the memory allocated to each cuboid can be adjusted dynamically.
- Limiting the number of sorted runs: to limit the number of sorted run files, we an append sorted runs from one partition onto sorted runs from previous partitions. We need only limit the number of distinct values of the parent’s dropped column.
- Choosing an initial sort order: We want smaller cuboids to be on the right since they have bigger partitions. Or, we can put the attributes with the fewest number of distinct values on the right to limit the number of sorted runs.
- Questions
- Q: Design a scheme using hashing to compute a child cuboid from a parent cuboid.
- A: ???