Speedy Transactions in Multicore In-Memory Databases (2013)

tl;dr Silo is a relational database that uses Masstrees and a variant of OCC to scale extremely well with the number of available cores by avoiding shared-memory contention unless it’s absolutely necessary.

Computers nowadays have a lot of cores. AWS and Google Cloud Platform both offer instances with 96 vCPUs! Unfortunately, having 96x the number of vCPUs often doesn’t mean that our code will run 96x as fast. Many existing databases don’t scale well with the number of cores. Silo is a single-node main-memory database designed to scale extremely well with the number of cores. It does so using a novel optimistic concurrency control protocol that avoids shared-memory contention whenever possible. It scales so well that it achieves nearly 700,000 transactions per second on a 32-core machine running the TPC-C benchmark! That’s fast.

Data Model and Storage

Silo is a fully serializable relational database. Silo only supports one-shot transactions: transactions which can be executed completely by the database as soon as they are received, without the need to interact with the client who issued the transaction. Silo stores data in a highly concurrent B+ tree variant, called a Masstree, keyed by a primary key. Secondary indexes, also Masstrees, map secondary attributes to a corresponding set of primary keys (as opposed to record ids). Masstrees have the nice property that reads do not require any shared-memory writes. Masstrees are not inherently versioned, but we’ll see that Silo stores multiple versions of a record in a Masstree to allow for snapshot transactions.
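
To make this layout concrete, here’s a minimal sketch with plain Python dicts standing in for Masstrees. The table, keys, and values are hypothetical; real Masstrees are ordered, highly concurrent trees:

# Hypothetical sketch: dicts stand in for Masstrees.

# Primary index: primary key -> record (the row data itself).
customers = {
    "customer-42": {"name": "Alice", "balance": 100},
    "customer-43": {"name": "Bob", "balance": 250},
}

# Secondary index: secondary attribute -> set of primary keys
# (not record ids or direct pointers to records).
customers_by_name = {
    "Alice": {"customer-42"},
    "Bob": {"customer-43"},
}

def lookup_by_name(name):
    # A secondary-index read takes two probes: one into the secondary
    # index for the primary keys, then one into the primary index per
    # matching key.
    return [customers[pk] for pk in customers_by_name.get(name, ())]

print(lookup_by_name("Alice"))  # [{'name': 'Alice', 'balance': 100}]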

Concurrency Control

Silo assigns each transaction a 64-bit transaction id, and this transaction id is stored with every record that the transaction modifies (records also store pointers to previous versions of the same record). However, “transaction id” is a bit of a misnomer: Silo transaction ids carry quite a bit more than a simple id. They are divided into three segments (a sketch of one possible bit packing follows the list):

  1. Silo divides time into a sequence of epochs. The high-order bits of a transaction id include the epoch in which the transaction committed.
  2. The middle bits include a monotonically increasing transaction id.
  3. The low-order bits are status bits indicating whether the record is locked, whether it is tombstoned, and whether it is the latest version of the record.
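
Here’s a hedged sketch of how such a word might be packed and unpacked. The three segments (and the three status bits) come from the paper, but the concrete epoch and sequence widths below are my own choice for illustration:

# Hypothetical TID bit layout: 32 epoch bits | 29 sequence bits | 3 status bits.
# The segment order matches the paper; the widths are illustrative.
STATUS_BITS = 3
SEQ_BITS = 29

LOCK_BIT      = 1 << 0  # record is write-locked
LATEST_BIT    = 1 << 1  # this is the latest version of the record
TOMBSTONE_BIT = 1 << 2  # record is logically deleted

def pack_tid(epoch, seq, status=0):
    return (epoch << (SEQ_BITS + STATUS_BITS)) | (seq << STATUS_BITS) | status

def unpack_tid(tid):
    status = tid & ((1 << STATUS_BITS) - 1)
    seq = (tid >> STATUS_BITS) & ((1 << SEQ_BITS) - 1)
    epoch = tid >> (SEQ_BITS + STATUS_BITS)
    return epoch, seq, status

# Example: an unlocked, latest version committed in epoch 7.
tid = pack_tid(epoch=7, seq=1234, status=LATEST_BIT)
assert unpack_tid(tid) == (7, 1234, LATEST_BIT)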

Silo implements a variant of optimistic concurrency control. In the read phase, transactions perform all their reads and buffer all their writes. When a record is read, its transaction id is recorded for later. The validation and write phases then proceed as follows:

# Phase 1: lock every record in the write set, in a global order
# (sorted by key) to prevent deadlock.
for record in sorted(write_set):
    lock(record)

# The current global epoch is read at this point, after all write
# locks are taken (with a memory fence in the real implementation).

# Phase 2: validate the read set. Abort if any record read earlier
# has since changed, is no longer the latest version, or is locked
# by some other transaction.
for record, read_tid in read_set:
    if (record.tid != read_tid or
        not record.latest or
        (record.locked and record not in write_set)):
        abort()

commit_tid = generate_commit_tid(read_set, write_set, epoch)

# Phase 3: install the new values, stamping each record with the
# commit tid, then release the write locks.
for record, new_value in write_set:
    write(record, new_value, commit_tid)
    unlock(record)

The paper argues that this concurrency control mechanism is correct because it simulates two-phase locking. By checking that every record read in the read phase is untouched at validation time, it verifies that if we had been running two-phase locking, we could have grabbed the read locks in the read phase and held them through validation. I think this is a pretty good intuitive explanation, but I’m guessing a detailed proof of correctness would be quite a bit more complicated.

A thread generates a commit transaction id that is guaranteed to be larger than the tid of any record in its read or write set, larger than any transaction id the thread has previously generated, and to fall in the current global epoch. Within an epoch, transaction ids don’t reflect the serial order of transactions, but across epochs they do.
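
As a rough sketch of that rule, reusing the hypothetical pack_tid/unpack_tid helpers from earlier (the real implementation keeps last_tid in per-thread state rather than a global):

last_tid = 0  # per-thread in Silo; a single global here for simplicity

def generate_commit_tid(read_set, write_set, epoch):
    global last_tid
    # The new tid must exceed every tid observed in the read set, every
    # tid being overwritten in the write set, and this thread's
    # previously generated tid.
    floor = last_tid
    for record, read_tid in read_set:
        floor = max(floor, read_tid)
    for record, _ in write_set:
        floor = max(floor, record.tid)
    # Choose the smallest sequence number in the current epoch whose
    # tid clears that floor.
    floor_epoch, floor_seq, _ = unpack_tid(floor)
    seq = floor_seq + 1 if floor_epoch == epoch else 0
    last_tid = pack_tid(epoch, seq)
    return last_tid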

Other Database Bits