In an era of intelligent systems built on information technology, analyzing and replacing outliers has become particularly critical for databases that face massive sequential data streams. To improve the effectiveness and practicality of outlier detection, this paper determines the anomaly scores of data points in a given data set using a k-sigma algorithm based on the Monte Carlo method, adaptively adjusts the k value according to those scores, and marks the anomalous points. The traditional k-sigma algorithm has the advantages of fast operation and a theoretical basis, but its disadvantage is that it evaluates each index dimension independently. In industrial production, however, time series data have multiple index dimensions, and the indicators of a data set are related to each other. Whether a data point is abnormal must therefore be judged by considering it together with the data of the other indicators. In addition, the traditional k-sigma algorithm takes the mean and variance of the data set as its parameters, so even when the data set contains no outliers, the limitations of the data set itself cause some data points to be mislabeled as outliers. The k-sigma algorithm based on the Monte Carlo method can effectively solve these problems. It generates a normal distribution according to the distribution of the original data points and draws a large number of samples from that distribution to form a Monte Carlo data set. The Mahalanobis distance li from each sample point to the mean in the Monte Carlo data set is calculated and compared with the Mahalanobis distance lj from each sample to be detected to the mean in the original data set.
The value of the parameter kj is then determined adaptively from the number of sample points in the Monte Carlo data set that satisfy li < lj, and the outliers are identified accordingly. We implemented the k-sigma algorithm based on the Monte Carlo method in Python, evaluated its effectiveness by accuracy, recall, and F1 score, and compared it with several machine learning algorithms. The results verify the feasibility and effectiveness of the algorithm, which can be used for real-time anomaly detection in energy management databases.
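The procedure described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not state how kj is derived from the count of Monte Carlo points with li < lj, so the sketch assumes the count is turned into a fraction and mapped to an equivalent k through the two-sided standard normal quantile, and that a point is flagged when kj exceeds a fixed threshold of 3. The function names, the `n_samples`, `k_threshold`, and `seed` parameters, and the use of a pseudo-inverse for the covariance matrix are all hypothetical choices made for illustration.

```python
import numpy as np
from statistics import NormalDist


def mahalanobis_to_mean(points, mean, cov_inv):
    """Mahalanobis distance from each row of `points` to `mean`."""
    diff = points - mean
    sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))  # guard against tiny negative round-off


def monte_carlo_k_sigma(data, n_samples=100_000, k_threshold=3.0, seed=0):
    """Return (k_j, is_outlier) for each row of `data` (n_points x n_features).

    Assumption: k_j is the standard normal two-sided quantile of the
    fraction of Monte Carlo points with l_i < l_j; the abstract does not
    specify this mapping.
    """
    data = np.asarray(data, dtype=float)
    if data.ndim == 1:
        data = data.reshape(-1, 1)
    rng = np.random.default_rng(seed)

    # Fit a normal distribution to the original data points.
    mean = data.mean(axis=0)
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    cov_inv = np.linalg.pinv(cov)

    # Draw the Monte Carlo data set and compute its distances l_i.
    mc = rng.multivariate_normal(mean, cov, size=n_samples)
    l_mc = np.sort(mahalanobis_to_mean(mc, mean, cov_inv))

    # Distances l_j of the samples to be detected.
    l_data = mahalanobis_to_mean(data, mean, cov_inv)

    # Number of Monte Carlo points with l_i < l_j, for every j.
    counts = np.searchsorted(l_mc, l_data, side="left")

    # Dividing by n_samples + 1 keeps the fraction strictly below 1,
    # so the quantile function is always defined.
    frac = counts / (n_samples + 1)
    std_normal = NormalDist()
    k = np.array([std_normal.inv_cdf((1.0 + p) / 2.0) for p in frac])
    return k, k > k_threshold
```

Calling `monte_carlo_k_sigma(data)` on an array with one row per observation and one column per indicator returns the adaptive kj values and a boolean outlier mask. Because the Mahalanobis distance uses the full covariance matrix, a point can be flagged when its combination of indicator values is unusual even if each indicator alone lies within its own k-sigma band.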