Active Job Monitoring in Pilots

Eileen Kuehn, Manuel Giffels, Andreas Petzold, Max Fischer, Christopher Jung

doi:10.1088/1742-6596/664/5/052019

Open Access

https://doi.org/10.1088/1742-6596/664/5/052019

Abstract

Recent developments in high energy physics (HEP), including multi-core jobs and multi-core pilots, require data centres to gain a deep understanding of their systems in order to monitor, design, and upgrade computing clusters. Networking is a critical component. In particular, the increased usage of data federations, for example in diskless computing centres or as a fallback solution, relies on WAN connectivity and availability. The specific demands of different experiments and communities, as well as the need to identify misbehaving batch jobs, require active monitoring. Existing monitoring tools cannot measure fine-grained information at batch job level, which complicates network-aware scheduling and optimisation. In addition, pilots add another layer of abstraction. They behave like batch systems themselves by managing and executing the payloads of jobs internally. The number of real jobs being executed is unknown, as the original batch system has no access to internal information about the scheduling process inside the pilots. Therefore, the comparability of jobs and pilots for predicting run-time behaviour or network performance cannot be ensured. Hence, identifying the actual payload is important. At the GridKa Tier 1 centre, a dedicated tool is in use that allows network traffic information to be monitored at batch job level. This contribution presents the current monitoring approach and discusses recent efforts to identify pilots and their substructures inside the batch system, and why this identification matters. It also shows how to determine the monitoring data of specific jobs from identified pilots. Finally, the approach is evaluated.

Highlights

The monitoring of resources to evaluate and maintain the quality of service, detect anomalies, or identify root causes for severe issues is an important topic in data centres operating a batch system
The work undertaken at the GridKa data and computing centre addresses the monitoring of data streams at batch job level
The implementation is independent of the targeted batch system

Recent citations

- A scalable architecture for online anomaly detection of WLCG batch jobs, E Kuehn et al.

21st International Conference on Computing in High Energy and Nuclear Physics (CHEP2015), IOP Publishing. Journal of Physics: Conference Series 664 (2015) 052019, doi:10.1088/1742-6596/664/5/052019

Introduction

Logger Process

Findings

Interface Category