Speedy Transactions in Multicore In-Memory Databases (2013)

tl;dr Silo is a relational database that uses Masstrees and a variant of OCC to scale extremely well with the number of available cores by avoiding shared-memory contention unless it’s absolutely necessary.

Computers nowadays have a lot of cores. AWS and Google Cloud Platform both offer instances with 96 vCPUs! Unfortunately, having 96x the number of vCPUs often doesn’t mean that our code will run 96x as fast. Many existing databases don’t scale well with the number of cores. Silo is a single-node main-memory database designed to scale extremely well with the number of cores. It does so using a novel optimistic concurrency control protocol that avoids shared-memory contention whenever possible. It scales so well that it achieves nearly 700,000 transactions per second on a 32-core machine running the TPC-C benchmark! That’s fast.

Data Model and Storage

Silo is a fully serializable relational database. Silo only supports one-shot transactions: transactions which can be executed completely by the database as soon as they are received, without the need to interact with the client who issued the transaction. Silo stores data in a highly concurrent B+ tree variant, called a Masstree, keyed by a primary key. Secondary indexes, also Masstrees, map secondary attributes to a corresponding set of primary keys (as opposed to record ids). Masstrees have the nice property that reads do not require any shared-memory writes. Masstrees are not inherently versioned, but we’ll see that Silo stores multiple versions of a record in a Masstree to allow for snapshot transactions.
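
To make this layout concrete, here’s a minimal sketch with plain Python dicts standing in for Masstrees. The table, keys, and values are hypothetical; real Masstrees are ordered, highly concurrent trees:

# Hypothetical sketch: dicts stand in for Masstrees.

# Primary index: primary key -> record (the row data itself).
customers = {
    "customer-42": {"name": "Alice", "balance": 100},
    "customer-43": {"name": "Bob", "balance": 250},
}

# Secondary index: secondary attribute -> set of primary keys
# (not record ids or direct pointers to records).
customers_by_name = {
    "Alice": {"customer-42"},
    "Bob": {"customer-43"},
}

def lookup_by_name(name):
    # A secondary-index read takes two probes: one into the secondary
    # index for the primary keys, then one into the primary index per
    # matching key.
    return [customers[pk] for pk in customers_by_name.get(name, ())]

print(lookup_by_name("Alice"))  # [{'name': 'Alice', 'balance': 100}]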

Concurrency Control

Silo assigns each transaction a 64-bit transaction id, and this transaction id is stored with every record that the transaction modifies (records also store pointers to previous versions of the same record). However, “transaction id” is a bit of a misnomer: Silo transaction ids carry quite a bit more than a simple id. They are divided into three segments (a sketch of one possible bit packing follows the list):

  1. Silo divides time into a sequence of epochs. The high-order bits of a transaction id include the epoch in which the transaction committed.
  2. The middle bits include a monotonically increasing transaction id.
  3. The low-order bits are status bits indicating whether the record is locked, whether it is tombstoned, and whether it is the latest version of the record.
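
Here’s a hedged sketch of how such a word might be packed and unpacked. The three segments (and the three status bits) come from the paper, but the concrete epoch and sequence widths below are my own choice for illustration:

# Hypothetical TID bit layout: 32 epoch bits | 29 sequence bits | 3 status bits.
# The segment order matches the paper; the widths are illustrative.
STATUS_BITS = 3
SEQ_BITS = 29

LOCK_BIT      = 1 << 0  # record is write-locked
LATEST_BIT    = 1 << 1  # this is the latest version of the record
TOMBSTONE_BIT = 1 << 2  # record is logically deleted

def pack_tid(epoch, seq, status=0):
    return (epoch << (SEQ_BITS + STATUS_BITS)) | (seq << STATUS_BITS) | status

def unpack_tid(tid):
    status = tid & ((1 << STATUS_BITS) - 1)
    seq = (tid >> STATUS_BITS) & ((1 << SEQ_BITS) - 1)
    epoch = tid >> (SEQ_BITS + STATUS_BITS)
    return epoch, seq, status

# Example: an unlocked, latest version committed in epoch 7.
tid = pack_tid(epoch=7, seq=1234, status=LATEST_BIT)
assert unpack_tid(tid) == (7, 1234, LATEST_BIT)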

Silo implements a variant of optimistic concurrency control. In the read phase, transactions perform all their reads and buffer all their writes. When a record is read, its transaction id is recorded for later. The validation and write phases then proceed as follows:

# Phase 1: lock every record in the write set, in a global order
# (sorted by key) to prevent deadlock.
for record in sorted(write_set):
    lock(record)

# The current global epoch is read at this point, after all write
# locks are taken (with a memory fence in the real implementation).

# Phase 2: validate the read set. Abort if any record read earlier
# has since changed, is no longer the latest version, or is locked
# by some other transaction.
for record, read_tid in read_set:
    if (record.tid != read_tid or
        not record.latest or
        (record.locked and record not in write_set)):
        abort()

commit_tid = generate_commit_tid(read_set, write_set, epoch)

# Phase 3: install the new values, stamping each record with the
# commit tid, then release the write locks.
for record, new_value in write_set:
    write(record, new_value, commit_tid)
    unlock(record)

The paper argues that this concurrency control mechanism is correct because it simulates two-phase locking. By checking that every record read in the read phase is untouched at validation time, it verifies that if we had been running two-phase locking, we could have grabbed the read locks in the read phase and held them through validation. I think this is a pretty good intuitive explanation, but I’m guessing a detailed proof of correctness would be quite a bit more complicated.

A thread generates a commit transaction id that is guaranteed to be larger than the tid of any record in its read or write set, larger than any transaction id the thread has previously generated, and to fall in the current global epoch. Within an epoch, transaction ids don’t reflect the serial order of transactions, but across epochs they do.
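
As a rough sketch of that rule, reusing the hypothetical pack_tid/unpack_tid helpers from earlier (the real implementation keeps last_tid in per-thread state rather than a global):

last_tid = 0  # per-thread in Silo; a single global here for simplicity

def generate_commit_tid(read_set, write_set, epoch):
    global last_tid
    # The new tid must exceed every tid observed in the read set, every
    # tid being overwritten in the write set, and this thread's
    # previously generated tid.
    floor = last_tid
    for record, read_tid in read_set:
        floor = max(floor, read_tid)
    for record, _ in write_set:
        floor = max(floor, record.tid)
    # Choose the smallest sequence number in the current epoch whose
    # tid clears that floor.
    floor_epoch, floor_seq, _ = unpack_tid(floor)
    seq = floor_seq + 1 if floor_epoch == epoch else 0
    last_tid = pack_tid(epoch, seq)
    return last_tid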

Other Database Bits