Abstract
As the capacity of hardware systems has grown and workloads have been consolidated, the volume of performance metrics and diagnostic data streams has outgrown the ability of people to manage these systems using traditional methods. As work of different types (such as database, batch, and Web processing), each in its own monitoring silo, runs concurrently on a single image (operating-system instance), both the complexity and the business consequences of a single-image failure have increased. This paper presents two techniques for generating actionable information from the overwhelming amount of performance and diagnostic data available to human analysts. Failure scoring identifies high-risk failure events that may be obscured among the myriad system events; it replaces human expertise in scanning tens of thousands of records per day and produces a short, prioritized list for action by systems staff. Adaptive thresholding drives predictive and descriptive machine-learning-based modeling to isolate and identify misbehaving processes and transactions. The attraction of this technique is that it requires no human intervention and can be reapplied continually, yielding models that are not brittle. Both techniques reduce the quantity and increase the relevance of data available for programmatic and human processes.
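The abstract does not give the paper's algorithms, but the core idea of adaptive thresholding — deriving alert limits from recent metric history rather than fixing them by hand, and recomputing them continually so they do not go stale — can be sketched as follows. This is a minimal illustration, not the paper's method; the window size, the multiplier `k`, and the function names are assumptions chosen for clarity.

```python
from statistics import mean, stdev

def adaptive_threshold(history, k=3.0):
    """Upper alert limit derived from recent history (mean + k * stdev).

    Because the limit is recomputed from a sliding window rather than set
    once by a human, it adapts as the workload drifts -- the property the
    abstract credits with keeping models from becoming brittle.
    NOTE: the mean + k*stdev rule is an illustrative assumption, not the
    paper's published technique.
    """
    return mean(history) + k * stdev(history)

def flag_anomalies(series, window=20, k=3.0):
    """Return indices of points exceeding the limit learned from the
    preceding `window` observations."""
    flagged = []
    for i in range(window, len(series)):
        limit = adaptive_threshold(series[i - window:i], k)
        if series[i] > limit:
            flagged.append(i)
    return flagged
```

Applied to a metric stream, such a routine turns a raw flood of samples into a short list of outlier indices, in the spirit of the abstract's goal of reducing data volume while raising its relevance.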