Realtime Data Processing at Facebook (2016)

There's an enormous number of stream (a.k.a. real-time, interactive) processing systems in the wild: Twitter Storm, Twitter Heron, Google Millwheel, LinkedIn Samza, Spark Streaming, Apache Flink, etc. While all similar, the stream processing systems differ in their ease of use, performance, fault-tolerance, scalability, correctness, etc. In this paper, Facebook discusses the design decisions that go into developing a stream processing system and discusses the decisions they made with three of their real-time processing systems: Puma, Swift, and Stylus.

Systems Overview. - Scribe. Scribe is a persistent messaging system for log data. Data is organized into categories, and Scribe buckets are the basic unit of stream processing. The data pushed into scribe is persisted in HDFS. - Puma. Puma is a real-time data processing system in which applications are written in a SQL-like language with user defined functions written in Java. The system is designed for compiled, rather than ad-hoc, queries. It used to compute aggregates and to filter Scribe streams. - Swift. Swift is used to checkpoint Scribe streams. Checkpoints are made every N strings or every B bytes. Swift is used for low-throughput applications. - Stylus. Stylus is a low-level general purpose stream processing system written in C++ and resembles Storm, Millwheel, etc. Stream processors are organized into DAGs and the system provides estimated low watermarks. - Laser. Laser is a high throughput, low latency key-value store built on RocksDB. - Scuba. Scuba supports ad-hoc queries for debugging. - Hive. Hive is a huge data warehouse which support SQL queries.

Example Application. Imagine a stream of events, where each event belongs to a single topic. Consider a streaming application which computes the top k events for each topic over 5 minute windows composed of four stages:

  1. Filterer. The filterer filter events and shards events based on their dimension id.
  2. Joiner. The joiner looks up dimension data by dimension id, infers the topic of the event, and shards output by (event, topic).
  3. Scorer. The scorer maintains a recent history of event counts per topic as well as some long-term counts. It assigns a score for each event and shards output by topic.
  4. Ranker. The ranker computes the top k events per topic.

The filterer and joiner are stateless; the scorer and ranker are stateful. The filterer and ranker can be implemented in Puma. All can be implemented in Stylus.

Language Paradigm. The choice of the language in which users write applications can greatly impact a system's ease of use:

Puma uses SQL, Swift uses Python, and Stylus uses C++.

Data Transfer. Data must be transferred between nodes in a DAG:

Facebook connects its systems with Scribe for the following benefits:

Processing Semantics. Stream processors:

  1. Proccess inputs,
  2. Generate output, and
  3. checkpoint state, stream offsets, and outputs for recovery.

Each node has

For state semantics, we can achieve

For output semantics, we can achieve

At-least-once semantics is useful when low latency is more important than duplicate records. At most once is useful when loss is preferred over duplication. Puma guarantees at least once state and output semantics, and Stylus supports a whole bunch of combinations.

State-saving Mechanisms. Node state can be saved in one of many ways:

Stylus can save to a local RocksDB instance with data asynchronously backed up to HDFS. Alternatively, it can store to a remote database. If a processing unit forms a monoid (identity element with associate operator), then input data can be processed and later merged into the remote DB.

Backfill Processing. Being able to re-run old jobs or run new jobs on old data is useful for a number of reasons:

To re-run processing on old data, we have three choices:

Puma and Stylus code can be run as either streaming or batch applications.