Distributed processing of large data sets is an area that has received much attention from researchers and practitioners over the last few years. In this context, several proposals exist that leverage the observation that data sets often evolve over time, and that there is substantial overlap between the inputs to consecutive runs of a data processing job. This allows the programmers of these systems to devise efficient logic to update the output upon an input change. However, most of these systems lack compatibility with existing models and require the programmer to implement an application-specific dynamic algorithm, which increases algorithm and code complexity. In this chapter, we describe our previous work on building a platform called Incoop, which allows for running MapReduce computations incrementally and transparently. Incoop detects changes between two files that are used as inputs to consecutive MapReduce jobs, and efficiently propagates those changes until the new output is produced. The design of Incoop is based on memoizing the results of previously run tasks, and reusing these results whenever possible. Doing this efficiently introduces several technical challenges that are overcome with novel concepts, such as a large-scale storage system that efficiently computes deltas between two inputs, a contraction phase to break up the work of the Reduce phase, and an affinity-based scheduling algorithm. This chapter presents the motivation and design of Incoop, as well as a complete evaluation using several application benchmarks. Our results show significant performance improvements without changing a single line of application code.
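The core reuse mechanism described above can be illustrated with a minimal sketch: map-task results are memoized under a content hash of each input split, so when a job reruns on a mostly unchanged input, only the modified splits are re-processed. This is a toy illustration under assumed names (`IncrementalMapRunner`, `wc_map` are hypothetical), not Incoop's actual implementation, which operates inside Hadoop's task scheduler and storage layer.

```python
import hashlib

class IncrementalMapRunner:
    """Toy sketch of task-level memoization: each map result is cached
    under a content hash of its input split, so unchanged splits are
    not re-processed on subsequent runs."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.memo = {}  # content hash of split -> cached map output

    def run(self, splits):
        output, reused = [], 0
        for split in splits:
            key = hashlib.sha256(split.encode()).hexdigest()
            if key in self.memo:
                reused += 1  # unchanged split: reuse memoized result
            else:
                self.memo[key] = self.map_fn(split)
            output.extend(self.memo[key])
        return output, reused

# A word-count map function as the running example.
def wc_map(split):
    return [(w, 1) for w in split.split()]

runner = IncrementalMapRunner(wc_map)
out1, reused1 = runner.run(["a b", "c d"])  # first run: nothing to reuse
out2, reused2 = runner.run(["a b", "c e"])  # second run: first split reused
```

In the second run only the modified split (`"c d"` → `"c e"`) is mapped again; the first split's result comes from the cache. Incoop additionally addresses the hard parts this sketch ignores: detecting stable splits despite content shifts, breaking up large Reduce tasks via contraction, and scheduling tasks near their memoized results.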
| Title of host publication | Large Scale and Big Data |
| Subtitle of host publication | Processing and Management |
| Editors | Sherif Sakr, Mohamed Medhat Gaber |
| Place of Publication | Boca Raton, Florida, USA |
| Publication status | Published - 2014 |