Abstract

Complex data mining algorithms are processed in multiple iterations, where the output of one iteration is used as the input for subsequent iterations. Existing parallel programming frameworks, e.g., MapReduce, Pregel, and Spark, adopt a breadth-first search (BFS) strategy to process such iterative jobs: they invoke the user-defined functions on every key-value pair or vertex to produce all possible intermediate results for the next iteration. This BFS strategy incurs high I/O overheads, because the size of the intermediate results is typically exponential in the size of the original data, making it impossible to keep them in memory. In this paper, we present a new type of parallel programming model, the stack-centric model, in which all computations are defined over a stack maintained in distributed shared memory. The stack can be adaptively split into multiple stacks and disseminated to different compute nodes for parallel processing. The most distinctive feature of the stack-centric model is its support for depth-first search (DFS) algorithms, which incur much lower memory overhead than their BFS counterparts. The maximal memory usage of a DFS algorithm is determined by the height of its search tree, and hence the computation of a DFS algorithm can be conducted mostly in memory. Our stack-centric model is not a pure DFS framework; it supports hybrid BFS and DFS algorithms by tuning the trade-off between memory usage and parallelism. To show the advantages of the stack-centric model, we implement two algorithms, a frequent pattern mining algorithm and a DNA sequence matching algorithm, on both the stack-centric model and Spark. The memory usage of the stack-centric model is 10 times lower than that of Spark, resulting in a significant performance improvement.
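To make the memory/parallelism trade-off concrete, below is a minimal, self-contained sketch (not the paper's actual API; all names such as `expand`, `dfs`, and `split_stack` are illustrative) of an explicit-stack DFS over a small search tree, together with a split step that hands part of the stack to a second worker, in the spirit of the adaptive stack splitting described above.

```python
def expand(state):
    """Return the child states of `state`; empty list for a leaf.
    Toy search tree: enumerate all subsets of {0..limit-1} as increasing sequences,
    where a state records only the last element chosen."""
    last, limit = state
    return [(i, limit) for i in range(last + 1, limit)]

def dfs(stack, results):
    """Depth-first search driven by an explicit stack.
    Peak memory grows with the stack depth (tree height times branching factor),
    not with the full BFS frontier of a whole iteration."""
    while stack:
        state = stack.pop()
        results.append(state)
        stack.extend(expand(state))  # push children, explore them depth-first

def split_stack(stack):
    """Split off half of the stack entries for another worker.
    More stacks means more parallelism, but also more frontier state held in
    memory overall: this is the trade-off a hybrid BFS/DFS scheme tunes."""
    half = len(stack) // 2
    return stack[:half], stack[half:]

if __name__ == "__main__":
    root = (-1, 4)            # toy root: no element chosen yet, universe {0,1,2,3}
    results, stack = [root], []
    stack.extend(expand(root))                 # expand the root once (BFS-like step)
    worker2_stack, stack = split_stack(stack)  # adaptively split the stack
    for s in (stack, worker2_stack):           # each stack could run on its own node
        dfs(s, results)
    print(len(results), "states visited")      # 16 = all subsets of a 4-element set
```

In this sketch, each worker's memory is bounded by its own stack rather than by the exponentially large set of all intermediate results, which is the property the abstract attributes to the DFS side of the stack-centric model.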
