Collection-Oriented Data Flow Support for Scientific Workflows

Jun Qin, Thomas Fahringer

doi:10.1007/978-3-642-30715-7_6

Abstract

While the control flow aspects of scientific workflows have been well studied, the data flow perspective, especially the collection-oriented data flow that scientific workflows often require, is not well supported in existing work. For example, when processing datasets in parallel loops, existing approaches commonly provide only simple methods that distribute entire datasets onto parallel loop iterations, which frequently leads to performance losses due to unnecessary data transfers. In this chapter, we present a solution to this problem by introducing a data collection concept and corresponding collection distribution constructs, which are inspired by High Performance Fortran but applied to scientific workflow applications. Based on these constructs, data flow can be modeled and controlled more accurately, for example by mapping a portion of a dataset to an activity, or by independently distributing multiple collections, which need not have the same number of elements, onto parallel loop iterations. Our solution reduces data duplication, optimizes data transfers, and simplifies porting scientific workflow applications onto distributed systems. These concepts have been included in AWDL, and the corresponding runtime support has been implemented in ASKALON. We demonstrate our solution by applying it to real-world scientific workflow applications and reporting the performance results.

Full Text