Abstract

Analysis of large, geographically distributed scientific datasets, also referred to as distributed data-intensive science, has emerged as an important area in recent years. An application that processes data from a remote repository needs to be broken into several stages, including a data retrieval task at the data repository, a data movement task, and a data processing task at a computing site. Because of the volume of data involved and the amount of processing required, it is desirable that both the data repository and the computing site be clusters. This can further complicate the development of such data processing applications. In this paper, we present a middleware, FREERIDE-G (FRamework for Rapid Implementation of Datamining Engines in Grid), which supports a high-level interface for developing data mining and scientific data processing applications that involve data stored in remote repositories. In particular, we had the following goals in designing the FREERIDE-G middleware: 1) support high-end processing, i.e., use parallel configurations for both hosting and processing the data; 2) ease the use of parallel configurations, i.e., support a high-level API for specifying the processing; and 3) hide the details of data movement and caching. We have evaluated our system using three popular data mining algorithms and two scientific data analysis applications. The main observations from our experiments are as follows. First, FREERIDE-G scales the processing extremely well when the numbers of data server and compute nodes are scaled evenly. Second, when only the number of compute nodes is scaled, our target class of applications achieves modest additional speedups. Finally, for applications that involve multiple passes over the dataset, caching remote data provides significant improvement.
