Study on Data Cleaning Based on Improved K-Means Clustering and Error Analysis

Xiaoxuan Guo, Lei Yan, Wanlu Wu, Leping Sun, Shuai Han

doi:10.1109/ei250167.2020.9347009

Abstract

Building an integrated energy system requires collecting and analyzing energy data. Because of unavoidable factors such as system failures and line maintenance, some data go missing or become abnormal during collection and cannot be converted into useful information. Such abnormal data fall into two categories: bad data, which do not match the energy characteristics, and missing data. To address this problem, this paper proposes a data cleaning method based on improved K-Means clustering and error feedback. For bad data, it proposes an abnormal data recognition method based on improved K-Means clustering, in which a clustering validity test using the Davies-Bouldin (DB) index determines the optimal number of clusters. For missing data, it proposes a combined interpolation method based on error feedback, which determines the imputation weights for a given type of user from calculations on a sample set. To verify the effectiveness of the proposed method, 40 sets of electricity consumption data from 5 users in a park, collected over four months, are selected. The first 20 sets serve as the sample set for determining the imputation weights, and the last 20 sets serve as the verification set for validation and comparison. The results show that the proposed data recognition and interpolation method has higher stability and reliability.
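As an illustration of the cluster-number selection step mentioned in the abstract, the following is a minimal sketch of choosing the number of clusters by minimizing the Davies-Bouldin index. It is not the paper's improved K-Means: it uses plain Lloyd's K-Means, and the deterministic farthest-point initialization, the function names (`farthest_point_init`, `kmeans`, `davies_bouldin`, `select_k`), and the candidate range of k are all assumptions made for this example.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def farthest_point_init(points, k):
    """Assumed initialization: start at points[0], then repeatedly add
    the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=100):
    """Plain Lloyd's K-Means; returns (centroids, labels)."""
    centroids = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(mean(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def davies_bouldin(points, centroids, labels):
    """DB index: mean over clusters of the worst (s_i + s_j) / d_ij ratio,
    where s is the average member-to-centroid distance. Lower is better."""
    k = len(centroids)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        s = sum(dist(p, centroids[j]) for p in members) / len(members) if members else 0.0
        scatter.append(s)
    total = 0.0
    for i in range(k):
        worst = 0.0
        for j in range(k):
            if j == i:
                continue
            d = dist(centroids[i], centroids[j])
            if d == 0.0:
                return math.inf  # coincident centroids: treat as invalid
            worst = max(worst, (scatter[i] + scatter[j]) / d)
        total += worst
    return total / k


def select_k(points, k_values):
    """Run K-Means for each candidate k and return (best_k, scores),
    where best_k has the smallest DB index."""
    scores = {}
    for k in k_values:
        centroids, labels = kmeans(points, k)
        scores[k] = davies_bouldin(points, centroids, labels)
    return min(scores, key=scores.get), scores
```

A call such as `select_k(points, range(2, 6))`, with `points` given as a list of equal-length tuples (for example, per-user load feature vectors), returns the candidate k with the lowest DB index together with the score for every candidate. Samples far from their assigned centroid could then be flagged as candidate bad data, although the specific recognition rule used in the paper is not stated in the abstract.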

Full Text