Abstract

In order to improve the data quality, the big data cleaning method of distribution network was studied in this paper. First, the Local Outlier Factor (LOF) algorithm based on DBSCAN clustering was used to detect outliers. However, due to the difficulty in determining the LOF threshold, a method of dynamically calculating the threshold based on the transformer districts and time was proposed. Besides, the LOF algorithm combines the statistical distribution method to reduce the misjudgment rate. Aiming at the diversity and complexity of data missing forms in power big data, this paper improved the Random Forest imputation algorithm, which can be applied to various forms of missing data, especially the blocked missing data and even some horizontal or vertical data completely missing. The data in this paper were from real data of 44 transformer districts of a certain 10kV line in distribution network. Experimental results showed that outlier detection was accurate and suitable for any shape and multidimensional power big data. The improved Random Forest imputation algorithm was suitable for all missing forms, with higher imputation accuracy and better model stability. By comparing the network loss prediction between the data using this data cleaning method and the data removing outliers and missing values, it was found that the accuracy of network loss prediction had been improved by nearly 4 percentage points using the data cleaning method mentioned in this paper. Additionally, as the proportion of bad data increased, the difference between the prediction accuracy of cleaned data and that of uncleaned data was greater.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call