Abstract

Scientific distributed applications have an increasing need to process and move large amounts of data across wide area networks. Existing systems either closely couple computation and data movement, or they require substantial human involvement during the end‐to‐end process. We propose a framework that enables scientists to build reliable and efficient data transfer and processing pipelines. Our framework provides a universal interface to different data transfer protocols and storage systems. It has sophisticated flow control and recovers automatically from network, storage system, software and hardware failures. We successfully used data pipelines to replicate and process three terabytes of the DPOSS astronomy image dataset and several terabytes of the WCER educational video dataset. In both cases, the entire process was performed without any human intervention and the data pipeline recovered automatically from various failures. Copyright © 2005 John Wiley & Sons, Ltd.
