A Generate-Test-Aggregate parallel programming library for systematic parallel programming

Yu Liu,Kento Emoto,Zhenjiang Hu

doi:10.1016/j.parco.2013.11.002

Abstract

The Generate-Test-Aggregate (GTA for short) algorithm is modeled following a simple and straightforward programming pattern, for combinatorial problems. First, generate all candidates; second, test and filter out invalid ones; finally, aggregate valid ones to make the final result. These three processing steps can be specified by three building blocks namely, generator, tester, and aggregator. Despite the simplicity of algorithm design, implementing the GTA algorithm naively following the three processing steps, i.e., brute-force, will result in an exponential-cost computation, and thus it is impractical for processing large data. The theory of GTA illustrates that if the definitions of generator, tester, and aggregator satisfy certain conditions, an efficient (usually near-linear cost) MapReduce program can be automatically derived from the GTA algorithm. The principle of GTA is attractive but how to make it being practically useful, remains as an important and challenge problem due to the complexity of GTA program transformations. In this paper, we report on our studying and implementation of a practical GTA library (written in the functional language Scala) which provides a systematic parallel programming approach for big-data analysis with MapReduce. The library provides a simple functional style programming interface and hides all the internal transformations. With this library, users can write parallel programs in a sequential manner in terms of the GTA algorithm, and the efficiency of the generated MapReduce programs is guaranteed systematically. Therefore, parallel programming for many problems could become no more a tough job. We demonstrate the usefulness of our GTA library on some interesting problems involving large data and show that lots of applications can be easily and efficiently solved by using our library.

Full Text