Chapter 1 -- Introduction. Database textbooks often focus on data structures and algorithms in the context of a single database component. This paper, as opposed to most database textbooks, focuses instead on database architecture: the design of, and best practices for, modern databases that are otherwise undocumented or exist only as tribal knowledge.
Modern relational database management systems (RDBMS) comprise five components, most of which are discussed in detail in this paper:
Chapter 2 -- Process Models. Database management systems have to handle multiple user requests concurrently. The process manager is responsible for mapping logical DBMS workers, each of which handles a single DBMS client's requests, to OS processes, OS threads, user-level threads, or some combination of the three. To simplify matters, this chapter discusses process models only for uniprocessor, single-machine systems; parallel architectures are deferred to the next chapter. There are three main process models:
The process manager is also responsible for admission control: the process by which a request is not serviced until sufficient resources are available. Without admission control, a database can start thrashing; for example, imagine a situation in which the working set of the database is larger than the buffer pool, and all I/Os become cache misses. Admission control provides graceful degradation; throughput should not decrease, and latency should scale proportionally with the number of requests. Typically a two-tier admission control policy is used:
First, a simple admission control policy at the dispatcher limits the number of concurrent client connections to a threshold below which the system performs well. Second, the execution admission controller runs after a query has been planned and delays execution until there are enough resources available to satisfy the query optimizer's estimated resource requirements.
The memory footprint is particularly important because it is most commonly the cause of thrashing.
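A minimal sketch of the execution-tier admission controller in Python, assuming a single memory budget; the class and its names are illustrative, not taken from any particular DBMS:

```python
import threading

class ExecutionAdmissionController:
    """Gates query execution on the optimizer's estimated memory footprint.

    A sketch of the second (execution) tier of admission control: a query
    is not dispatched until its estimated footprint fits in the remaining
    budget. Queries that do not fit wait instead of causing thrashing.
    """

    def __init__(self, memory_budget_bytes):
        self._budget = memory_budget_bytes
        self._in_use = 0
        self._cond = threading.Condition()

    def admit(self, estimated_bytes):
        # Block until the estimated footprint fits. This waiting is what
        # provides graceful degradation: excess queries queue up rather
        # than turning every buffer pool access into a cache miss.
        with self._cond:
            while self._in_use + estimated_bytes > self._budget:
                self._cond.wait()
            self._in_use += estimated_bytes

    def release(self, estimated_bytes):
        # Called when the query finishes; wake up any waiting queries.
        with self._cond:
            self._in_use -= estimated_bytes
            self._cond.notify_all()
```

A query that would overflow the budget simply blocks in `admit` until an earlier query calls `release`, so throughput stays at its peak while latency grows with the backlog.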
Chapter 3 -- Parallel Architecture: Processes and Memory Coordination. Parallel hardware is ubiquitous. This chapter builds on the previous one and explores process models for database systems with multiple cores and multiple machines.
Shared-Nothing. A shared-nothing system is a networked cluster of independent machines that share, well, nothing. In a shared-nothing system, all coordination and communication is left to the DBMS. Typically, tables are horizontally partitioned between machines. That is, each machine is assigned a subset of the tuples in each table using range partitioning, hash partitioning, round-robin partitioning, or some sort of hybrid partitioning. Each machine uses a shared memory model and receives queries, creates query plans, and executes queries as usual. The big difference is that queries are now evaluated on multiple machines at once, and the DBMS has to orchestrate the exchange of control and data messages. The database also has to implement very challenging distributed protocols like distributed deadlock detection and two-phase commit. Worse, by virtue of being a distributed system, shared-nothing architectures can experience partial failure, which can be handled in one of many ways.
Despite the complexities that arise from a shared-nothing architecture, shared-nothing systems achieve unparalleled scalability.
3. Shared-Disk. In a shared-disk system, all processors have access to a shared disk with roughly equal performance; they do not share RAM. Shared-disk systems are much less complicated than shared-nothing systems because the failure of any single machine does not render data unavailable. Still, shared-disk systems require explicit coordination for in-memory data sharing, including distributed lock managers, distributed buffer pools, etc.
4. NUMA. NUMA systems provide a shared memory model over a cluster of independent shared-nothing machines. NUMA clusters are dead, but the idea of non-uniform memory access lives on in shared-memory multiprocessors. In order to scale to a large number of cores, shared-memory NUMA processors organize CPUs and memories into pods where intra-pod memory access is fast but inter-pod memory access is slow. Databases may be able to ignore the non-uniformity of a NUMA multiprocessor, or they can employ certain optimizations:
- Memory should be allocated local to the processor that will use it.
- Tasks should be scheduled on the same processor they previously ran on.
Almost all databases support shared memory systems, and most support either shared-disk or shared-nothing architectures as well.
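The horizontal partitioning used by shared-nothing systems can be sketched as follows; the partition boundaries and the choice of hash are illustrative assumptions, not any real system's scheme:

```python
import hashlib

def hash_partition(key, num_machines):
    """Hash partitioning: assign a tuple to a machine by hashing its
    partitioning key. A stable hash (md5 here) keeps the mapping
    consistent across runs; real systems use faster hashes and must
    handle repartitioning when machines are added or removed."""
    digest = hashlib.md5(repr(key).encode()).hexdigest()
    return int(digest, 16) % num_machines

# Range partitioning: each machine owns a contiguous key range.
# These boundaries are illustrative.
RANGES = [(0, 100), (100, 200), (200, 300)]

def range_partition(key):
    """Range partitioning: find the machine whose range contains the key."""
    for machine, (lo, hi) in enumerate(RANGES):
        if lo <= key < hi:
            return machine
    raise ValueError("key outside all partitions")
```

Range partitioning is what enables the predicate-based server pruning discussed in Chapter 4: a query's range predicate can rule out entire machines before any message is sent.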
Chapter 4 -- Relational Query Processor. The relational query processor is responsible for converting a textual query into an optimized dataflow query plan that's ready to be executed.
The first step in processing a query is that of query parsing and authorization. The query parser must check the syntactic well-formedness of a textual query, convert the text into an internal query representation, type check the query by resolving table and column references against information in the catalog, and perform any necessary authorization. Certain forms of authorization must be deferred to query execution. For example, a database may restrict a user's access to tuples from a table that satisfy a certain predicate. This row-level security depends on the values of the tuples and must be deferred to query execution. In fact, some authorization which could be performed during query parsing is deferred anyway. For example, deferring authorization checks to execution time allows for queries to be cached and reused between multiple clients with varying privileges.
Next, a query processor performs query rewrites: logical transformations that simplify and normalize a query without altering its semantics. Query rewrites include:
- Predicate simplification. The predicate 1 + R.a + 2 > 3 can be simplified to R.a > 0. Unsatisfiable predicates (e.g. R.a < 0 AND R.a > 10) can be replaced with FALSE, which enables further simplifications and optimizations. In some distributed databases that horizontally partition tables, predicates can be used to reduce the number of servers that are contacted. For example, if a server is responsible for a partition of a table R holding all tuples where 0 < R.a < 100, then it need not be contacted for a query like SELECT R.a FROM R WHERE R.a > 10000. Finally, certain transitive predicates can be deduced. For example, the predicates R.a = S.b AND S.b = 100 imply R.a = 100.
- Semantic optimization. Semantic information from the database catalog can be used to further simplify queries. For example, consider an Emp (employee) relation that has a foreign key deptno into a Department relation. Assuming E.deptno is non-null and referential integrity holds, every employee matches exactly one department, so the join cannot add or remove rows and the query
SELECT E.name
FROM Emp E, Department D
WHERE E.deptno = D.deptno
can be simplified to
SELECT E.name
FROM Emp E
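The predicate rewrites above can be sketched for the special case of range predicates on a single column, where simplification reduces to interval intersection; the predicate representation is an illustrative assumption:

```python
def simplify_range_conjunction(predicates):
    """Simplify a conjunction of strict range predicates on one column
    by intersecting intervals. Each predicate is a (op, value) pair with
    op in {'<', '>'}. Returns the surviving (lo, hi) open interval, or
    None if the conjunction is unsatisfiable and can be replaced with
    FALSE. A toy sketch: real rewriters handle arbitrary expressions."""
    lo, hi = float("-inf"), float("inf")
    for op, value in predicates:
        if op == ">":
            lo = max(lo, value)   # column > value tightens the lower bound
        elif op == "<":
            hi = min(hi, value)   # column < value tightens the upper bound
        else:
            raise ValueError("unsupported operator: " + op)
    if lo >= hi:
        return None  # e.g. R.a < 0 AND R.a > 10 is unsatisfiable
    return (lo, hi)
```

The same interval logic drives the partition pruning mentioned above: if a server's partition interval does not intersect the query's interval, the server need not be contacted.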
Finally, a query is optimized. System R compiled queries into executable code. Later, System R developers regarded this as a mistake. Ingres compiled queries into interpretable dataflow diagrams. Later, Ingres developers somewhat ironically also regarded this as a mistake. Both compilation targets have their merits, but most modern databases have gone the way of Ingres to ensure portability. Typically, this involves optimizing individual SELECT-FROM-WHERE blocks into relational operator trees before stitching them all together. Optimizations involve:
Query optimizers also have to deal with query caching and recompilation. Many databases allow for queries to be parsed, compiled, and stored ahead of time. These prepared queries can also include placeholders that are filled in at runtime. Prepared statements occasionally need to be re-optimized when, for example, an index is dropped. Certain databases avoid re-optimization to ensure predictability over performance; others aggressively re-optimize to ensure the best performance. Prepared queries can improve the performance of an application, but preparing queries ahead of time can be burdensome. Consequently, databases also support query caching to reuse (parts of) queries without necessitating ahead-of-time preparation.
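As a concrete illustration, Python's sqlite3 module exposes prepared-statement-style placeholders, and it keeps an internal statement cache so repeated executions of the same query text reuse the compiled statement rather than re-parsing it (the table and data here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (a INTEGER)")
conn.executemany("INSERT INTO R (a) VALUES (?)", [(1,), (50,), (200,)])

# The `?` placeholder is filled in at execution time. Executing the same
# text repeatedly with different parameters hits sqlite3's statement
# cache, so the query is parsed and compiled only once.
query = "SELECT a FROM R WHERE a > ?"
for threshold in (0, 100):
    rows = conn.execute(query, (threshold,)).fetchall()
```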
Once a query is parsed, rewritten, and optimized into a dataflow plan, it must be executed. Typically, query plans are implemented as a tree of iterators with exchange nodes thrown in for parallelism. These iterators typically operate over tuple references: tuples of tuple pointers and column offsets. The actual tuple data is either stored in the buffer pool (BP-tuples) or copied from the buffer pool into the heap (M-tuples). Using BP-tuples avoids copies but is hard to implement correctly and may lead to a page being pinned in the buffer pool for prohibitively long. Using M-tuples can lead to unnecessary copies but is much simpler to implement.
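The tree-of-iterators (Volcano-style) model can be sketched with Python generators; the operators below are a minimal illustration, not a complete relational algebra:

```python
def scan(table):
    """Leaf iterator: yields tuples from a base table."""
    for row in table:
        yield row

def select(child, predicate):
    """Filter iterator: pulls rows from its child, emitting those that
    satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row

def project(child, columns):
    """Projection iterator: keeps only the requested columns."""
    for row in child:
        yield tuple(row[c] for c in columns)

# SELECT a FROM R WHERE b > 10, built as a tree of pull-based iterators.
R = [{"a": 1, "b": 5}, {"a": 2, "b": 20}]
plan = project(select(scan(R), lambda r: r["b"] > 10), ["a"])
result = list(plan)  # tuples flow up the tree one at a time
```

Each operator only ever asks its child for the next tuple, which is what makes it easy to splice exchange nodes into the tree for parallelism without changing the operators themselves.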
Data modification statements (e.g. INSERT, UPDATE, etc) are typically compiled into simple linear dataflow diagrams. However, care must be taken to avoid things like the Halloween problem in which updates invalidate the index iterators used to perform the updates.
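A toy simulation of the Halloween problem, assuming an index scan that repeatedly visits the lowest-salary qualifying row; both functions are illustrative:

```python
def give_raises_naive(salaries, threshold, pct):
    """Give a raise to every salary below the threshold by scanning an
    ascending 'index' on salary. Each update moves the row within the
    index, so the scan encounters it again: the Halloween problem.
    Every salary ends up pushed past the threshold."""
    salaries = list(salaries)
    while True:
        qualifying = [i for i, s in enumerate(salaries) if s < threshold]
        if not qualifying:
            return salaries
        # The 'index scan' always sees the smallest qualifying salary next,
        # including rows it already updated.
        i = min(qualifying, key=lambda i: salaries[i])
        salaries[i] = salaries[i] * (1 + pct)

def give_raises_correct(salaries, threshold, pct):
    """One common fix: separate the read phase from the write phase so
    each row is considered exactly once. Real systems may instead spool
    qualifying record ids before updating."""
    return [s * (1 + pct) if s < threshold else s for s in salaries]
```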
Chapter 5 -- Storage Management. TODO
Chapter 6 -- Transactions: Concurrency Control and Recovery. TODO
Chapter 7 -- Shared Components. TODO