Abstract

Problematic I/O patterns are the major cause of inefficient high-energy physics (HEP) jobs. In a cluster with tens of thousands of job slots, locating the source of an anomalous workload is a nontrivial task. Automatic anomaly detection can substantially shorten recovery time in these situations and reduce the manpower needed for problem diagnosis. This paper presents a data-driven approach to this problem. We design and implement an anomaly detection system based on Isolation Forest, an efficient and scalable machine learning algorithm for spatial anomaly detection in high-dimensional spaces. Historical I/O monitoring data collected from the Lustre file system provide a sufficient statistical basis for model training and updating. Routine model updates and job tagging ensure adaptability and promptness. Web-based visualization and sorting tools facilitate the validation of model predictions. With the detection system, the time spent tracing the source of a problematic workload can be reduced from hours to minutes.
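
For readers unfamiliar with Isolation Forest, the following is a minimal, standard-library-only sketch of the general technique, not the system described in this paper. The three per-job features (read MB/s, write MB/s, metadata operations per second), the synthetic training data, and the parameter values (100 trees, subsamples of 256) are all illustrative assumptions. The idea is that points which random axis-aligned splits isolate after only a few steps receive scores closer to 1.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup
    over n points; used to normalise path lengths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return ("leaf", len(rows))
    dim = rng.randrange(len(rows[0]))
    lo = min(r[dim] for r in rows)
    hi = max(r[dim] for r in rows)
    if lo == hi:
        # Simplification: stop here instead of retrying another feature.
        return ("leaf", len(rows))
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, x, depth=0):
    """Number of splits needed to reach a leaf, plus the leaf-size correction."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, dim, split, left, right = tree
    child = left if x[dim] < split else right
    return path_length(child, x, depth + 1)


def fit_forest(rows, n_trees=100, sample_size=256, seed=0):
    """Build n_trees isolation trees, each on a random subsample of rows."""
    rng = random.Random(seed)
    psi = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(rows, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, psi


def anomaly_score(forest, x):
    """Score in (0, 1); higher values mean x is isolated more easily."""
    trees, psi = forest
    mean_depth = sum(path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c_factor(psi))


# Hypothetical per-job features: (read MB/s, write MB/s, metadata ops/s).
data_rng = random.Random(42)
history = [(data_rng.gauss(50, 10), data_rng.gauss(20, 5), data_rng.gauss(100, 10))
           for _ in range(500)]
forest = fit_forest(history)

typical_job = (50.0, 20.0, 100.0)   # resembles the synthetic history
suspect_job = (5.0, 2.0, 5000.0)    # little data moved, very many metadata ops

typical_score = anomaly_score(forest, typical_job)
suspect_score = anomaly_score(forest, suspect_job)
print(f"typical: {typical_score:.3f}  suspect: {suspect_score:.3f}")
```

In this sketch the suspect job should score noticeably higher than the typical one, because it lies at the edge of every feature range and is separated from the bulk of the data after only a few random splits. Ranking jobs by such a score is one way a sorting tool could bring candidate anomalies to the top for inspection.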
