Summary. Many big data applications necessitate streaming (aka real-time, interactive, or low-latency) data processing. For example, computing site activity statistics, monitoring clusters, and detecting spam all require processing large volumes of data as it arrives.
There are many streaming data processing systems out there (e.g. Storm, TimeStream, MapReduce Online, S4), and most are based on the continuous operator model. In this model, computation is structured as a network of stateful operators: each operator receives records from upstream operators, updates its internal state, and sends records to downstream operators. Continuous operator systems typically employ one of two recovery mechanisms: replication, in which each operator has a hot standby processing the same input, or upstream backup, in which upstream nodes buffer their output and replay it to a restarted copy of a failed operator. Replication doubles hardware cost, upstream backup recovers slowly, and neither handles stragglers.
In this paper, Zaharia et al. introduce a new stream processing model, discretized streams (D-Streams), that recovers quickly, handles stragglers, scales to hundreds of nodes, and aims to support latencies of 0.5-2 seconds. In this model, a stream is divided into a series of stateless, deterministic batch computations over small time intervals. Determinism and lineage tracking allow for parallel recovery. Moreover, D-Streams are implemented in Spark Streaming, allowing a unified interface between batch and streaming computation: something of great practical value.
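The parallel recovery idea can be sketched in a toy lineage graph (all names here are illustrative, not the actual Spark internals): each partition records the deterministic function and parent partitions that produced it, so a lost partition can be recomputed from its parents. Because every interval's partitions are independent, many lost partitions can be recomputed in parallel on different nodes.

```python
class Partition:
    """Toy RDD partition that remembers its lineage: the deterministic
    function and parent partitions that produced it."""

    def __init__(self, compute, parents=()):
        self.compute = compute   # deterministic function of parent data
        self.parents = parents   # partitions this one was derived from
        self.data = None         # None means lost or not yet materialized

    def materialize(self):
        # If the data is missing (e.g. its node failed), recompute it
        # by recursively materializing the parents and reapplying compute.
        if self.data is None:
            parent_data = [p.materialize() for p in self.parents]
            self.data = self.compute(*parent_data)
        return self.data


# Example: a source partition and a derived (mapped) partition.
src = Partition(lambda: [1, 2, 3])
src.materialize()
doubled = Partition(lambda xs: [2 * x for x in xs], parents=(src,))
doubled.materialize()

# Simulate losing the derived partition on a failed node, then recover it.
doubled.data = None
recovered = doubled.materialize()   # recomputed from lineage
```

Determinism is what makes this safe: recomputing a partition always yields the same result, so recovery needs no coordination beyond knowing the lineage.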
In the discretized streams model, all the data arriving within a given time interval is collected into an RDD. Tasks then operate on that RDD, much as they would in Spark, to produce outputs (possibly to external systems) or intermediate state that is fed into the next time interval.
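A minimal sketch of this interval-by-interval processing, using a running word count (the function names and fixed-size batching are illustrative stand-ins; the real system batches by time and operates on distributed RDDs):

```python
from collections import Counter


def micro_batches(records, interval_size):
    """Group an incoming record stream into fixed-size 'intervals'
    (a toy stand-in for time-based batching)."""
    batch = []
    for r in records:
        batch.append(r)
        if len(batch) == interval_size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_word_count(stream, interval_size=3):
    """Process each interval as a deterministic batch job, carrying the
    running counts forward as the state fed into the next interval."""
    state = Counter()   # state produced by interval t, input to interval t+1
    outputs = []
    for batch in micro_batches(stream, interval_size):
        interval_counts = Counter(batch)   # stateless batch computation
        state = state + interval_counts    # new state for the next interval
        outputs.append(dict(state))        # output emitted for this interval
    return outputs
```

For the stream `["a", "b", "a", "b", "b", "c"]` with intervals of three records, this emits the cumulative counts `{"a": 2, "b": 1}` after the first interval and `{"a": 2, "b": 3, "c": 1}` after the second.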
The Spark Streaming architecture is divided into three main components: a master that tracks the D-Stream lineage graph and schedules tasks, worker nodes that receive data, store RDD partitions, and execute tasks, and a client library used to send data into the system.
Data is either periodically read from a data store like HDFS or sent into the system by a client. Workers run an LRU block store to manage RDD partitions and coordinate with the master, which tracks which nodes hold which blocks. The system employs traditional data processing optimizations and also introduces optimizations to I/O, pipelining, scheduling, etc. Moreover, since streaming computations run 24/7, Spark Streaming had to introduce master recovery: the master periodically writes its state to HDFS, from which a standby master can recover it.
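Master recovery can be sketched as periodic state snapshots (a hypothetical `Master` class; the real system writes to HDFS, and a local file stands in here):

```python
import json
import os


class Master:
    """Minimal sketch of master fault tolerance via periodic snapshots.
    The state fields are illustrative, not Spark Streaming's actual schema."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.state = {"current_interval": 0, "block_locations": {}}

    def advance_interval(self):
        self.state["current_interval"] += 1

    def checkpoint(self):
        # Write to a temp file, then rename atomically, so a crash
        # mid-write leaves the previous snapshot intact.
        tmp = self.checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.checkpoint_path)

    @classmethod
    def recover(cls, checkpoint_path):
        # A standby master reads the last snapshot and resumes from there.
        m = cls(checkpoint_path)
        with open(checkpoint_path) as f:
            m.state = json.load(f)
        return m
```

Because interval processing is deterministic, a recovered master can simply recompute any work scheduled after the last snapshot rather than needing an exact replay of the failed master's actions.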