Back in the day, like way back in the day, people thought parallel databases were never going to take off and that specialized hardware was the way of the future. They were wrong.
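The relational model permits two forms of parallelism which parallel databases take advantage of: pipelined parallelism, in which the operators of a query plan run concurrently and stream tuples to one another, and partitioned parallelism, in which multiple instances of the same operator run concurrently on different partitions of the data.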
There are two metrics by which we typically evaluate a parallel database.
Speedup: with n times the number of machines, a task will take n times less time.
Scaleup: with n times the number of machines, the database can handle n times the amount of load.
There are three main overheads which prevent perfect speedup and scaleup: startup, interference, and skew.
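To make the two metrics concrete, here is a minimal sketch that computes them from elapsed times. The function names and the timings are hypothetical, chosen only for illustration; the ratios follow the usual definitions (old elapsed time over new elapsed time), under which perfect speedup with n times the machines is n and perfect scaleup is 1.

```python
def speedup(small_system_seconds: float, big_system_seconds: float) -> float:
    """Same task on both systems: how many times faster is the big system?"""
    return small_system_seconds / big_system_seconds


def scaleup(small_system_small_load_seconds: float,
            big_system_big_load_seconds: float) -> float:
    """The big system runs an n-times larger load; 1.0 means the elapsed
    time did not grow, i.e. the system absorbed the extra load."""
    return small_system_small_load_seconds / big_system_big_load_seconds


# Hypothetical measurements for a system grown from 1 machine to 4 machines.
print(speedup(120.0, 30.0))   # 4.0: perfect (linear) speedup
print(speedup(120.0, 40.0))   # 3.0: sub-linear, some overhead got in the way
print(scaleup(120.0, 120.0))  # 1.0: perfect scaleup on a 4x larger load
```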
We are unable to build an infinitely large, infinitely fast computer. The goal of parallelization is to simulate such a machine with an infinite number of finite machines. There are currently three approaches: shared-memory, shared-disk, and shared-nothing.
The first two approaches require complex hardware and do not scale as well as the third. Thus, for large problems, shared-nothing architectures reign supreme.
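The relational operator model of executing SQL queries lends itself nicely to pipeline parallelism, but unfortunately, there are some limitations: relational pipelines are rarely very long, some operators (such as sort and aggregation) cannot emit any output until they have consumed all of their input, and one operator in a pipeline often costs far more than the others, which limits the achievable speedup.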
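As a rough illustration of the operator pipeline, here is a sketch using Python generators. Generators interleave on a single thread, so nothing here actually runs in parallel; the point is only the dataflow that pipeline parallelism exploits. The operator names (`scan`, `select_even`, `sort_desc`) and the event log are hypothetical. The streaming operators hand each row downstream as soon as it is produced, while the sort has to drain its whole input before emitting anything.

```python
def scan(rows, log):
    """Leaf operator: emits rows one at a time."""
    for row in rows:
        log.append(f"scan {row}")
        yield row


def select_even(rows, log):
    """Streaming operator: passes matching rows on as soon as it sees them."""
    for row in rows:
        if row % 2 == 0:
            log.append(f"select {row}")
            yield row


def sort_desc(rows, log):
    """Blocking operator: must consume all of its input before emitting."""
    buffered = sorted(rows, reverse=True)
    for row in buffered:
        log.append(f"sort emits {row}")
        yield row


log = []
result = list(sort_desc(select_even(scan([1, 2, 3, 4], log), log), log))
print(result)  # [4, 2]
for event in log:
    print(event)
# scan and select events interleave, but every "sort emits" event
# appears only after the final scan event.
```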
There are three common ways to partition data across a number of nodes: round robin partitioning, hash partitioning, and range partitioning. Round robin partitioning is really only good if the only thing we're doing is full table scans. Hash partitioning helps spread data evenly and avoid data skew, but it doesn't help with range queries, since the tuples in any given range end up scattered across every node. Range partitioning supports range queries but is susceptible to data skew.
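The three schemes can be sketched in a few lines each. This is a minimal in-memory illustration, not how any particular database implements partitioning: the function names are hypothetical, "nodes" are just Python lists, and `zlib.crc32` over the key's `repr` stands in for whatever hash function a real system would use.

```python
import bisect
import zlib


def round_robin_partition(rows, num_nodes):
    """Row i goes to node i mod num_nodes, regardless of its contents."""
    parts = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        parts[i % num_nodes].append(row)
    return parts


def hash_partition(rows, num_nodes, key):
    """Rows with equal keys always land on the same node."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        digest = zlib.crc32(repr(key(row)).encode("utf-8"))
        parts[digest % num_nodes].append(row)
    return parts


def range_partition(rows, boundaries, key):
    """boundaries is sorted; node i holds keys k with
    boundaries[i-1] < k <= boundaries[i], and the last node holds the rest."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect.bisect_left(boundaries, key(row))].append(row)
    return parts


rows = [1, 2, 3, 50, 51, 52, 53, 54, 99]
print(round_robin_partition(rows, 3))
# [[1, 50, 53], [2, 51, 54], [3, 52, 99]]
print(range_partition(rows, [33, 66], key=lambda r: r))
# [[1, 2, 3], [50, 51, 52, 53, 54], [99]]
```

In the range-partitioned output, a query for keys between 50 and 54 only has to touch the middle node, but that node also holds more than half of the data, which is the skew problem in miniature.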
Many parallel databases inject distributed query operators into their query plans to orchestrate data across multiple machines. The merge operator (really a union operator) merges the results of a subquery being run on multiple nodes. The split operator takes a set of data and distributes it across multiple machines. For example, imagine joining two relations on a common attribute. The split operator might hash partition both relations to a set of nodes based on the join columns, so that matching tuples meet on the same node. These operators often implement some form of push back (flow control) to avoid overloading a downstream operator.
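The join example can be sketched as follows. This is a single-process simulation under stated assumptions: the "nodes" are list slots visited in a loop rather than separate machines, the function names are hypothetical, the local join is a simple hash join, and push back is not modeled at all. It also assumes the join keys of both relations have the same type, since the split hashes each key's `repr`.

```python
import zlib


def split(rows, num_nodes, key):
    """Split operator: hash partition rows on the join key."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        node = zlib.crc32(repr(key(row)).encode("utf-8")) % num_nodes
        parts[node].append(row)
    return parts


def local_hash_join(left, right, left_key, right_key):
    """Join run independently on each node over its own partitions."""
    index = {}
    for left_row in left:
        index.setdefault(left_key(left_row), []).append(left_row)
    return [(left_row, right_row)
            for right_row in right
            for left_row in index.get(right_key(right_row), [])]


def merge(per_node_results):
    """Merge operator: union of the per-node outputs."""
    merged = []
    for result in per_node_results:
        merged.extend(result)
    return merged


def parallel_join(left, right, left_key, right_key, num_nodes):
    left_parts = split(left, num_nodes, left_key)
    right_parts = split(right, num_nodes, right_key)
    # Matching keys hash to the same node, so each node can join locally.
    return merge(local_hash_join(lp, rp, left_key, right_key)
                 for lp, rp in zip(left_parts, right_parts))


users = [(1, "ann"), (2, "bob"), (3, "cy")]          # (user_id, name)
orders = [(10, 1), (11, 1), (12, 3), (13, 4)]        # (order_id, user_id)
joined = parallel_join(users, orders, lambda u: u[0], lambda o: o[1], 3)
print(sorted(joined))
# [((1, 'ann'), (10, 1)), ((1, 'ann'), (11, 1)), ((3, 'cy'), (12, 3))]
```

Because both relations are split with the same hash function on the join column, no node ever needs a tuple held by another node, which is what lets the per-node joins proceed independently.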
There are many state-of-the-art (as of 1992) parallel databases, including Teradata, Tandem NonStop SQL, Gamma, Bubba, and Super Database Computer. Refer to the paper for details.
There is a lot of future work (in 1992) in mixed OLTP/OLAP workloads, distributed query optimization, parallel programs (e.g. Spark), and online data reorganization.