Abstract

Missing data is a frequently encountered problem in environmental research. To facilitate the analysis and management of air quality data (in this study, PM2.5 concentration), a commonly adopted strategy for handling missing values is to generate a complete data set using imputation methods. Many imputation methods based on temporal or spatial correlation have been developed for this purpose in the literature. These methods differ in the mathematical model they use to characterize the dependence among data samples, which is crucial for missing data imputation. In this paper, we propose two novel and principled imputation methods based on the nuclear norm of a matrix, since it measures such dependence in a global fashion. The first method, termed global nuclear norm minimization (GNNM), imputes missing values by directly minimizing the nuclear norm of the whole sample matrix, thereby maximizing the linear dependence of the samples. The second method, called local nuclear norm minimization (LNNM), concentrates on each sample and its most similar samples, which are estimated from the imputation results of the first method. In this way, nuclear norm minimization is performed on highly correlated samples only, instead of on the whole sample matrix as in GNNM, which reduces the adverse impact of irrelevant samples. The two methods are evaluated on a data set of PM2.5 concentration measured hourly by 22 monitoring stations. Missing values are simulated at different percentages, and the imputed values are compared with the ground truth values to evaluate the imputation performance of the different methods. The experimental results verify the effectiveness of our methods, especially LNNM, for missing air quality data imputation.
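
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which solver is used, so the sketch assumes a Soft-Impute-style iteration (repeated singular value soft-thresholding) as one common way to approach nuclear norm minimization. The function names, the threshold `tau`, the neighbour count `k`, and the Euclidean similarity measure are all illustrative assumptions.

```python
import numpy as np

def svt(M, tau):
    """Soft-threshold the singular values of M (proximal step for tau * nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def nuclear_norm_impute(X, mask, tau=5.0, n_iter=300, tol=1e-6, init=None):
    """Global pass (assumed solver). X: (samples, stations); mask: True = observed."""
    Z = np.zeros(X.shape) if init is None else np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # Keep observed values, use the current estimate elsewhere, then shrink.
        Z_new = svt(np.where(mask, X, Z), tau)
        done = np.linalg.norm(Z_new - Z) <= tol * max(np.linalg.norm(Z), 1.0)
        Z = Z_new
        if done:
            break
    return np.where(mask, X, Z)

def local_impute(X, mask, k=10, tau=5.0):
    """Local pass: re-impute each incomplete sample using only its k most similar samples."""
    G = nuclear_norm_impute(X, mask, tau)      # global result, used to measure similarity
    out = G.copy()
    for i in np.flatnonzero(~mask.all(axis=1)):
        d = np.linalg.norm(G - G[i], axis=1)   # Euclidean distance is an assumption
        d[i] = -1.0                            # force the sample itself to sort first
        idx = np.argsort(d)[: k + 1]
        sub = nuclear_norm_impute(X[idx], mask[idx], tau, init=G[idx])
        out[i] = sub[0]                        # row 0 of the submatrix is sample i
    return out
```

Here `X` may hold `NaN` at missing positions, since those entries are never read. The local pass is warm-started from the global estimate, which is a convenience of this sketch rather than something the abstract specifies.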

Highlights

  • Over the last decades, a large amount of air quality data reflecting significant pollutant concentrations has been collected by air quality monitoring stations distributed over a certain area

  • The proposed global nuclear norm minimization (GNNM) and local nuclear norm minimization (LNNM) are compared with typical station mean (SM) and NN imputation methods

  • Different imputation methods are applied to the incomplete data so that the missing values can be estimated from the observed values

Summary

Introduction

A large amount of air quality data reflecting significant pollutant concentrations has been collected by air quality monitoring stations distributed over a certain area. Because of many uncontrollable factors, such as instrument faults and communication and processing errors, these data often suffer from missing values or incomplete samples [1, 2] in varying proportions, causing serious difficulties for subsequent data analysis and decision making. According to [7, 8], missing data mechanisms can be categorized into three cases: (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR). For MCAR, the missing values are completely independent of each other and appear as a few isolated points. For MNAR, the occurrence of missing values follows specific patterns, for example, the pattern caused by a long-term malfunction of a monitoring station. We mainly focus on the first two cases since the last case is too restrictive in practice [9].
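
To make the contrast between isolated missing points and a long station outage concrete, the following sketch builds both kinds of mask and scores an imputation against held-out ground truth. It is an illustration only: this excerpt does not give the study's simulation protocol, so the helper names, the exact-count sampling, and the use of RMSE as the error measure are assumptions. A contiguous block mimics only the shape that the text associates with a malfunction; mask shape alone does not determine the statistical mechanism.

```python
import numpy as np

def isolated_mask(shape, frac, rng):
    """Observed-mask (True = observed) with a fraction `frac` of entries removed
    uniformly at random, giving scattered, isolated gaps."""
    n = shape[0] * shape[1]
    n_missing = int(round(frac * n))
    mask = np.ones(n, dtype=bool)
    mask[rng.choice(n, size=n_missing, replace=False)] = False
    return mask.reshape(shape)

def outage_mask(shape, station, start, length):
    """Observed-mask with one station (column) missing for `length` consecutive
    time steps, as a long malfunction would produce."""
    mask = np.ones(shape, dtype=bool)
    mask[start:start + length, station] = False
    return mask

def rmse_on_missing(truth, imputed, mask):
    """Root mean squared error computed over the simulated-missing entries only."""
    diff = (truth - imputed)[~mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```

For example, `isolated_mask((24, 22), 0.10, np.random.default_rng(0))` hides 10% of one day of hourly readings from 22 stations, and the hidden entries can then be compared with the imputed ones through `rmse_on_missing`.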
