A Method of Pruning and Random Replacing of Known Values for Comparing Missing Data Imputation Models for Incomplete Air Quality Time Series

Luis Alfonso Menéndez García,Marta Menéndez Fernández,David Fernández López,Violetta Sokoła-Szewioła,Antonio Bernardo Sánchez,Almudena Ortiz Marqués,Laura Álvarez De Prado

doi:10.3390/app12136465

Luis Alfonso Menéndez García, Marta Menéndez Fernández + Show 5 more

Open Access

https://doi.org/10.3390/app12136465

Copy DOI

Abstract

The data obtained from air quality monitoring stations, which are used to carry out studies using data mining techniques, present the problem of missing values. This paper describes a research work on missing data imputation. Among the most common methods, the method that best imputes values to the available data set is analysed. It uses an algorithm that randomly replaces all known values in a dataset once with imputed values and compares them with the actual known values, forming several subsets. Data from seven stations in the Silesian region (Poland) were analyzed for hourly concentrations of four pollutants: nitrogen dioxide (NO2), nitrogen oxides (NOx), particles of 10 μm or less (PM10) and sulphur dioxide (SO2) for five years. Imputations were performed using linear imputation (LI), predictive mean matching (PMM), random forest (RF), k-nearest neighbours (k-NN) and imputation by Kalman smoothing on structural time series (Kalman) methods and performance evaluations were performed. Once the comparison method was validated, it was determine that, in general, Kalman structural smoothing and the linear imputation methods best fitted the imputed values to the data pattern. It was observed that each imputation method behaves in an analogous way for the different stations The variables with the best results are NO2 and SO2. The UMI method is the worst imputer for missing values in the data sets.

Full Text