Operation and maintenance are critical activities throughout the life cycle of modern online software systems, and anomaly detection is a crucial step in these activities. Recent studies mainly develop deep learning techniques for this task. Although these techniques have achieved promising results in experimental evaluations, several practicality gaps still prevent them from being applied successfully in real-world online systems: the scalability gap, the availability gap, and the alignment gap. To bridge these gaps, we propose ShareAD, an anomaly detection framework based on a pre-train-and-align paradigm. Specifically, we argue that pre-training a shared model for anomaly detection is an effective way to bridge the scalability gap and the availability gap. To support this argument, we systematically study the necessity and feasibility of model sharing for online system maintenance. We further propose a novel model built on Transformer encoder layers and Base layers, which works well for anomaly detection pre-training. Then, to bridge the alignment gap, we propose ShareAD alignment, which aligns the pre-trained model with operator preference by jointly considering the local observation context and the sensitivity of each monitor entity. Extensive experiments on two real-world large-scale datasets demonstrate the effectiveness and practicality of ShareAD.