Wide-area Data Transfers Research Articles

Many large science projects rely on remote clusters for (near) real-time data processing, and thus demand reliable wide-area data transfer performance for smooth end-to-end workflow execution. However, data transfers are often exposed to performance variations due to changing network (e.g., background traffic) and dataset (e.g., average file size) conditions, necessitating adaptive solutions to meet the stringent performance requirements of delay-sensitive streaming workflows. In this article, we propose FStream++ to provide reliable transfer performance for large streaming science applications by dynamically adjusting transfer settings to adapt to changing transfer conditions. FStream++ combines three optimization methods, dynamic tuning, online profiling, and historical analysis, to swiftly and accurately discover optimal transfer settings that can meet workflow requirements. Dynamic tuning uses a heuristic model to predict the values of transfer parameters based on dataset characteristics and network settings. Since heuristic models fall short of incorporating many important factors such as I/O throughput and resource interference, we complement dynamic tuning with online profiling, which executes a real-time search over a subset of transfer settings. Finally, historical analysis takes advantage of the long-running nature of streaming workflows by storing and analyzing previous performance observations to shorten the execution time of online profiling. We evaluate the performance of FStream++ by transferring several synthetic and real-world workloads in high-performance production networks and show that it offers up to a 3.6x performance improvement over legacy transfer applications and up to 24% over our previous work, FStream.
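The interplay of the three methods described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from FStream++: the single tuned parameter (`concurrency`), the bandwidth-delay-product heuristic, the hill-climbing search, the `HistoryStore` class, and the `measure` callback (a hypothetical function returning the observed throughput for a given setting).

```python
import math


def heuristic_guess(avg_file_size_bytes, bandwidth_bps, rtt_s, max_concurrency=32):
    """Hypothetical heuristic: enough concurrent files to fill the
    bandwidth-delay product. Not the model used by FStream++."""
    bdp_bytes = bandwidth_bps / 8 * rtt_s
    guess = math.ceil(bdp_bytes / avg_file_size_bytes)
    return min(max(guess, 1), max_concurrency)


def online_profile(measure, start, max_concurrency=32):
    """Hill-climb from `start`, probing neighbors with measure(n) -> throughput.
    Returns (best_setting, number_of_probes)."""
    best, best_tp, probes = start, measure(start), 1
    improved = True
    while improved:
        improved = False
        for cand in (best + 1, best - 1):
            if 1 <= cand <= max_concurrency:
                tp = measure(cand)
                probes += 1
                if tp > best_tp:
                    best, best_tp, improved = cand, tp, True
                    break
    return best, probes


class HistoryStore:
    """Remembers the best setting seen for each (hypothetical) condition key."""

    def __init__(self):
        self._best = {}

    def lookup(self, key):
        return self._best.get(key)

    def record(self, key, setting):
        self._best[key] = setting


def choose_setting(key, measure, history, avg_file_size_bytes, bandwidth_bps, rtt_s):
    """Seed the search from history if available, else from the heuristic."""
    start = history.lookup(key)
    if start is None:
        start = heuristic_guess(avg_file_size_bytes, bandwidth_bps, rtt_s)
    best, probes = online_profile(measure, start)
    history.record(key, best)
    return best, probes
```

On a second call with the same key, the search starts from the remembered setting and needs fewer probes, which mirrors the stated purpose of historical analysis: shortening online profiling. The sketch assumes noise-free measurements; a real search would need to average or smooth repeated probes.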

Wide area data transfer may be a major bottleneck for the end-to-end performance of distributed applications. A practical way of increasing wide area throughput at the application layer is to use multiple parallel streams. Although an increased number of parallel streams may yield much better performance than a single stream, overwhelming the network by opening too many streams may have the opposite effect: the congestion created by the excess streams may cause the achieved throughput to drop. Hence, it is important to decide on the optimal number of streams without congesting the network. Predicting this "optimum" number is not straightforward, since it depends on many parameters specific to each individual transfer. Generic models that try to predict this number either rely too much on historical information or fail to achieve accurate predictions. In this paper, we present a set of new models which aim to approximate the optimal number with the least historical information and the lowest prediction overhead. We introduce an algorithm that selects the best combination of historical information for the prediction, both for evaluation purposes and to optimize the prediction by reducing the error rate. We measure the feasibility and accuracy of the proposed prediction models by comparing them against actual GridFTP data transfers using little historical information, and find that we can predict the throughput of parallel streams accurately and closely approximate the optimal stream number.
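To make the idea of predicting the optimal stream count from a few measurements concrete, here is a minimal sketch. It assumes a throughput curve of the form Th(n) = n / sqrt(a·n² + b·n + c); this functional form, the three-sample fit, and all function names are assumptions chosen for illustration and are not stated in the abstract. Under that assumption, setting dTh/dn = 0 gives a peak at n = -2c/b.

```python
import math


def fit_throughput_model(samples):
    """Fit a, b, c in Th(n) = n / sqrt(a*n^2 + b*n + c)
    from exactly three (stream_count, throughput) samples."""
    if len(samples) != 3:
        raise ValueError("exactly three samples are required")
    # Linearize: n^2 / Th^2 = a*n^2 + b*n + c, one equation per sample.
    rows = [[n * n, n, 1.0, (n * n) / (th * th)] for n, th in samples]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) < 1e-12:
            raise ValueError("samples do not determine a unique model")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(3):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return tuple(rows[i][3] / rows[i][i] for i in range(3))


def predicted_throughput(n, a, b, c):
    """Throughput predicted by the assumed model for n streams."""
    return n / math.sqrt(a * n * n + b * n + c)


def optimal_stream_count(a, b, c, max_streams=64):
    """Integer stream count with the highest predicted throughput."""
    if b >= 0:
        # No interior peak under this model: throughput keeps rising.
        return max_streams
    peak = -2.0 * c / b  # root of dTh/dn = 0
    candidates = {
        min(max(1, k), max_streams)
        for k in (math.floor(peak), math.ceil(peak))
    }
    return max(candidates, key=lambda n: predicted_throughput(n, a, b, c))
```

As a usage example, three probe transfers at 1, 2, and 4 streams would supply the `samples` list, and `optimal_stream_count(*fit_throughput_model(samples))` would return the predicted best stream count. With only three samples the fit is exact, so measurement noise carries straight into the prediction; fitting over more samples would be more robust.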


Articles published on Wide-area Data Transfers

Reliable Wide-Area Data Transfers for Streaming Workflows

DLS: a cloud-hosted data caching and prefetching service for distributed metadata access

Prediction of Optimal Parallelism Level in Wide Area Data Transfers
