Discretized Streams: Fault-Tolerant Streaming Computation at Scale (2013)

Summary. Many big data applications require streaming (aka real-time, interactive, or low-latency) data processing. For example, a social network may want to detect trending conversation topics within minutes, and a service operator may want to mine program logs for failures within seconds.

There are many streaming data processing systems (e.g., Storm, TimeStream, MapReduce Online, S4), and most are based on the continuous operator model. In this model, a computation is structured as a network of long-lived, stateful operators that receive data from upstream operators, update their internal state, and send data to downstream operators. Continuous operator systems typically employ one of two recovery mechanisms.

  1. Hot replication. Operators are replicated and data is sent to every one of the replicas. Synchronization between replicas can ensure they produce equivalent results. Replication has reasonably fast recovery, but it doubles deployment cost (at least) and operator synchronization can slow down processing.
  2. Upstream backup. Operators buffer their outputs until downstream consumers checkpoint. When a downstream operator fails, a new operator is launched and its buffered inputs are sent from the upstream backup. Upstream backup doesn't greatly increase deployment cost, but recovery can be slow.
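
To make the trade-off concrete, here is a toy Scala sketch of upstream backup under some simplifying assumptions (a single downstream consumer and in-order delivery; the class and method names are invented for illustration, not taken from any of the systems above). It also shows why recovery can be slow: a single replacement operator must serially re-process the whole buffer.

```scala
import scala.collection.mutable

// Toy model of upstream backup: the upstream operator retains every record
// it emits until the downstream consumer acknowledges a checkpoint, and
// replays the retained records to a replacement operator after a failure.
class UpstreamOperator[T] {
  private val buffer = mutable.Queue.empty[(Long, T)] // (sequence number, record)
  private var nextSeq = 0L

  def emit(record: T, send: T => Unit): Unit = {
    buffer.enqueue((nextSeq, record))
    nextSeq += 1
    send(record)
  }

  // Downstream has checkpointed through `ack`: those records are safe to drop.
  def onCheckpoint(ack: Long): Unit =
    while (buffer.nonEmpty && buffer.head._1 <= ack) buffer.dequeue()

  // Downstream failed: serially replay everything since its last checkpoint.
  def replay(send: T => Unit): Unit =
    buffer.foreach { case (_, record) => send(record) }
}
```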

In this paper, Zaharia et al. introduce a new stream processing model, discretized streams (D-Streams), that recovers quickly, handles stragglers, scales to hundreds of nodes, and aims to support latencies of 0.5 to 2 seconds. In this model, a streaming computation is structured as a series of stateless, deterministic batch computations on small time intervals. Determinism and lineage tracking allow for parallel recovery. Moreover, D-Streams are implemented in Spark Streaming, which provides a unified interface for batch and streaming computation: something of great practical value.

In the discretized streams model, all of the data that arrives during a given time interval is collected into an RDD. Tasks then operate on that RDD, much as they would in ordinary Spark, to produce output (possibly written to external systems) or intermediate state that is fed into the computation for the next time interval.
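
As a concrete illustration, here is a minimal Spark Streaming word count in Scala; the socket source, host/port, batch interval, and checkpoint path are arbitrary placeholders, not prescribed by the paper. Each one-second interval of input becomes one RDD of the DStream, and `updateStateByKey` threads per-key state from one interval into the next.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")
    // Every 1-second interval of input is collected into one RDD of the DStream.
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/dstream-checkpoints") // required for stateful operators

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Carry per-word running counts from each interval's RDD into the next.
    val running = counts.updateStateByKey[Int] { (newCounts: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + newCounts.sum)
    }
    running.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```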

The Spark Streaming architecture consists of three main components:

  1. The master tracks lineage and schedules tasks.
  2. The workers receive data, store partitions, and execute tasks.
  3. The client injects data into the system.

Data is either periodically read from a data store like HDFS or pushed into the system by a client. Workers manage RDD partitions in an LRU block store and coordinate with the master, which tracks which nodes hold which blocks. The system employs traditional data processing optimizations and also introduces optimizations to I/O, pipelining, scheduling, etc. Moreover, since streaming computations run 24/7, Spark Streaming had to introduce master recovery: the master periodically writes its state to HDFS, from which a replacement master can recover it.
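
In the Spark Streaming API, master (driver) recovery is exposed through checkpointing. Below is a minimal sketch, assuming a hypothetical HDFS checkpoint directory: `StreamingContext.getOrCreate` rebuilds the context from the checkpoint if one exists, and otherwise constructs it from scratch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableApp {
  // Hypothetical checkpoint directory; the paper only says state goes to HDFS.
  val checkpointDir = "hdfs:///streaming/checkpoints"

  // Build a fresh StreamingContext; runs only when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("RecoverableApp")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir) // periodically persist master state to HDFS
    ssc.socketTextStream("localhost", 9999).print() // placeholder DStream graph
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover master state from HDFS if a checkpoint exists, else start fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```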