Abstract
System logs are an invaluable resource for understanding system behavior and detecting anomalies on high performance computing (HPC) systems. As HPC systems continue to grow in both scale and complexity, the sheer volume of system logs and the complex interactions among system components make traditional manual problem diagnosis, and even automated line-by-line log analysis, infeasible or ineffective. Sequence mining techniques aim to identify important patterns among a set of objects, which can help discover regularity among events, detect anomalies, and predict events in HPC environments. However, existing sequence mining algorithms are compute-intensive and inefficient at processing the overwhelming number of system events, which have complex interactions and dependencies. In this paper, we present a novel, topology-aware sequence mining method (named TSM) and explore its use for event analysis and anomaly detection on production HPC systems. TSM is resource-efficient and capable of producing long and complex event patterns from log messages, which makes it suitable for online monitoring and diagnosis of large-scale systems. We evaluate the performance of TSM using system logs collected from a production supercomputer. Experimental results show that TSM is highly efficient in identifying event sequences on single and multiple nodes without any prior knowledge. We apply verification functions and requirements and prove the correctness of the event patterns produced by TSM.