A History and Evaluation of System R (1981)

Summary

Ed Codd proposed the relational model in 1970. As opposed to the navigational data models that came before it, the relational model boasted data independence: the ability for data storage and access methods to change independently of applications. Some worried that data independence necessitated poor performance. System R was one of the first relation databases and proved that the relational model could be implemented efficiently.

System R development proceeded in three phases. Phase 0 (1974-1975) was a single-user PL/I interpreter prototype that processed a subset of SQL (e.g. no joins) using the XRM access method. The Phase 0 prototype was always intended to be thrown away. Instead, the focus was on tuning the user interface SQL. User studies and interviews were performed to increase the usability and understandability of SQL. Every tuple in the database was labelled with a TID which contained a page number. Each tuple contained pointers into separate domains, and inversions existed to map domain values to TIDs. The Phase 0 query optimizer aimed to minimize the number of fetched tuples and would perform tricks like TID intersection to evaluate conjunctions. The prototype also introduced the design that the system catalog should be stored as relations. Phase 0 brought about the following ideas:

The optimizer should consider more than the cost of fetching tuples. It should also take into account the costs of TID manipulation, data fetching, etc.
Number of I/Os would have been a better metric than the number of tuples fetched. This would have also exposed the deficiency of the XRM access method.
The Phase 0 optimizer was CPU bound! This encouraged the later optimizer to be a weighted cost of CPU and I/O.
SQL joins are very important.
The query optimizer was complicated; more care should be given towards simpler and more common queries.

Phase 1 ranged from 1976 to 1977 and included the implementation of a full blown multi-user relational database. Phase 1 was divided into two pieces:

The Relational Data System (RDS) was an optimizing SQL processor responsible for query optimization.
The Research Storage System (RSS) was the access method that replaced XRM and was responsible for things like locking and logging.

Users could query System R using interactive queries or by embedding SQL queries in PL/I or Cobol. A preprocessor would compile the embedded SQL queries into an access module using a repository of hand-compiled fragments. Of course, the compiled query plan could be invalidated over time. For example, the query plan could use an index which is later dropped. Thus, each query's dependencies were put in the system catalog and queries were recompiled when their dependencies were invalidated.

Unlike the XRM, the RSS stored data directly in the tuples. This meant that certain column values were stored redundantly, but an entire row could be read in a single I/O. RSS also supported B+ tree indexes, tuple links, index scans, full table scans, link scans, tuple nested loop joins, index nested loop joins, and sort merge joins.

The query optimizer minimized a weighted sum of RSS calls and I/Os using a dynamic programming approach. It avoided using some of the TID list intersection tricks that the Phase 0 optimizer used.

Views were stored as parse trees and merged back into the SQL queries used to query them. Updates were only allowed on single-table views. Views were the atomic unit of authorization using a grant/revoke mechanism.

System R used a combination of logging and shadow pages to implement recovery. During recovery, pages were restored to their old shadow pages, and the log was processed backwards.

Since Phase 1 was a multi-user database, it introduced multiple granularity locking in the form of intension locks. Originally, it had predicate locking, but this was abandoned because it was (1) difficult to check for predicate disjointness, (2) predicates were sometimes falsely marked as overlapping, and (3) predicate locking broke the abstraction barrier of the RSS.

Phase 2 was a two-year period in which System R was evaluated. Users generally enjoyed the uniformity of SQL, and their recommendations led to the introduction of EXISTS, LIKE, prepared statements, and outer joins. The query optimizer was evaluated assuming that data was uniformly distributed and that all columns were independent. Shadow pages led to poor locality, extra bookkeeping, and semi-expensive page swapping. System R provided read uncommitted, read committed, and full serializable transactions. Read uncommitted wasn't implemented as fast as it should have been. Read committed had more overhead than expected. Serializable transactions ended up being the most commonly used.

Commentary

System R introduced a bevy of influential and perennial ideas in the field of databases. Unix introduced a bevy of influential and perennial ideas in the field of operating systems. It's no coincidence that there are a striking number of system design principles that System R and Unix---as presented in The Unix Time-Sharing System---share:

Unified Abstractions. Unix unified the file and I/O device abstraction into a single interface. System R unified the table and catalog/metadata API into a single interface (i.e. everything is a relation). System R also unifed SQL as the query language used for ad-hoc queries, program-embeded queries, and view definitions. System R's decision to use relations to represent the catalog can also be seen as a form of dogfooding.
Simple is Better. Unix started as Ken Thompson's pet project as an effort to make development simpler and more enjoyable. Unix's simplicity stuck and was one of its selling points. Similarly, System R spent a considerable amount of effort simplifying the SQL interface to make it as easy to use as possible. If a system's interface is too complicated, nobody will use it.
Performance isn't Everything. Thompson and Ritchie implemented Unix in C instead of assembly despite the fact that the kernel size increased by one third because C greatly improved the readability and maintainability of the system. Similarly, the System R paper comments that the relational model may never exceed the performance of a hand-optimized navigational database, but the abstraction it provides is worth the cost. Somewhat comically, today's compilers and query optimizers are so good, compiled C is likely smaller than hand-written assembly and optimized queries are likely faster than hand-optimized ones. This is an example of another systems principle of favoring higher-level declarative APIs which leave room for optimization.