Automating fault detection in communication networks and distributed systems is a challenging process that usually requires the involvement of supporting tools and the expertise of system operators. Automated event monitoring and correlating systems produce event data that is forwarded to system operators for analyzing error events and creating fault reports. Machine learning methods help not only analyzing event data more precisely but also forecasting possible error events by learning from existing faults. This study introduces an automated fault detection system that assists system operators in detecting and forecasting faults. This system is characterized by the capability of exploiting bug knowledge resources at various online repositories, log events and status parameters from the monitored system; and applying bug analysis and event filtering methods for evaluating events and forecasting faults. The system contains a fault data model to collect bug reports, a feature and semantic filtering method to correlate log events, and machine learning methods to evaluate the severity, priority and relation of log events and forecast the forthcoming critical faults of the monitored system. We have evaluated the prototyping implementation of the proposed system on a high performance computing cluster system and provided analysis with lessons learned.
Read full abstract