BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud (2010)

Summary. Programming distributed systems is hard. Really hard. This paper conjectures that data-centric programming done with declarative programming languages can lead to simpler distributed systems that are more correct with less code. To support this conjecture, the authors implement an HDFS and Hadoop clone in Overlog, dubbed BOOM-FS and BOOM-MR respectively, using orders of magnitude fewer lines of code that the original implementations. They also extend BOOM-FS with increased availability, scalability, and monitoring.

An HDFS cluster consists of NameNodes responsible for metadata management and DataNodes responsible for data management. BOOM-FS reimplements the metadata protocol in Overlog; the data protocol is implemented in Java. The implementation models the entire system state (e.g. files, file paths, heartbeats, etc.) as data in a unified way by storing them as collections. The Overlog implementation of the NameNode then operates on and updates these collections. Some of the data (e.g. file paths) are actually views that can be optionally materialized and incrementally maintained. After reaching (almost) feature parity with HDFS, the authors increased the availability of the NameNode by using Paxos to introduce a hot standby replicas. They also partition the NameNode metadata to increase scalability and use metaprogramming to implement monitoring.

BOOM-MR plugs in to the existing Hadoop code but reimplements two MapReduce scheduling algorithms: Hadoop's first-come first-server algorithm, and Zaharia's LATE policy.

BOOM Analytics was implemented in order of magnitude fewer lines of code thanks to the data-centric approach and the declarative programming language. The implementation is also almost as fast as the systems they copy.