Abstract

Data-intensive end-user analyses in high energy physics require high data throughput to reach short turnaround cycles. This poses enormous challenges for storage and network infrastructure, especially in view of the tremendously increasing amount of data to be processed during High-Luminosity LHC runs. Including opportunistic resources with volatile storage systems into the traditional HEP computing facilities makes this situation more complex. Bringing data close to the computing units is a promising approach to overcome throughput limitations and improve the overall performance. We focus on coordinated distributed caching, i.e. coordinating workflows to the most suitable hosts in terms of cached files. This allows optimizing the overall processing efficiency of data-intensive workflows and using the limited cache volume efficiently by reducing the replication of data across distributed caches. We developed the NaviX coordination service at KIT, which realizes coordinated distributed caching using an XRootD cache proxy server infrastructure and an HTCondor batch system. In this paper, we present the experience gained in operating coordinated distributed caches on cloud and HPC resources. Furthermore, we show benchmarks of a dedicated high-throughput cluster, the Throughput-Optimized Analysis-System (TOpAS), which is based on the above-mentioned concept.

Highlights

  • The performance of data-intensive workflows is limited by the data transfer rate [1]

  • We show benchmarks of a dedicated high throughput cluster, the Throughput-Optimized Analysis-System (TOpAS), which is based on the above-mentioned concept

  • If the throughput for accessing remote data is limited, repeatedly accessing cached input data leads to an overall optimization of data throughput and of the CPU efficiency of workflows waiting for data


Introduction

The performance of data-intensive workflows is limited by the data transfer rate [1].

Coordinating distributed caches

We focus on increasing the efficiency of the WLCG Tier 3 computing infrastructure for HEP workflows, including institute and opportunistic resources. By scheduling jobs to resources, we directly influence the placement of data in caches and thereby coordinate the data in distributed caches. The NaviX coordination service adds job and data coordination to an HTCondor batch system [4] using an XRootD [5] caching infrastructure. We showed that NaviX successfully coordinates jobs to the most suitable hosts in terms of data locality. It reduced the duplication of data in distributed caches and improved the overall data throughput by steering jobs to the cached data. We present benchmark results of a dedicated high-throughput cluster, an HPC cluster, and a cloud resource when using distributed caches coordinated by NaviX.
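The idea of steering jobs to the host holding most of their cached inputs can be sketched as a simple ranking rule. The following is a minimal illustration, not the actual NaviX implementation: all host names, file paths, and function names are invented for the example, and a real deployment would express such a rank inside the HTCondor negotiator rather than in standalone code.

```python
def cache_score(job_inputs, host_cache):
    """Fraction of a job's input files already present in a host's cache."""
    if not job_inputs:
        return 0.0
    cached = sum(1 for f in job_inputs if f in host_cache)
    return cached / len(job_inputs)

def best_host(job_inputs, caches):
    """Pick the host whose cache covers the largest share of the inputs.

    Sending the job there maximizes cached reads and avoids re-caching
    (i.e. duplicating) files that another host already holds.
    """
    return max(caches, key=lambda host: cache_score(job_inputs, caches[host]))

# Hypothetical cache contents of two worker nodes:
caches = {
    "worker01": {"/store/run1/a.root", "/store/run1/b.root"},
    "worker02": {"/store/run2/c.root"},
}
job = ["/store/run1/a.root", "/store/run1/b.root", "/store/run2/c.root"]
print(best_host(job, caches))  # worker01: 2 of 3 inputs are cached there
```

In this toy setup the job is placed on `worker01`, where two of its three input files are already cached; only the third file needs to be fetched and cached there, instead of replicating all three.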
