Missing data is a common problem in Internet-of-Things applications. This article proposes a missing type-aware interpolation framework (IMA) for data loss in city-wide environmental monitoring systems that contain many scattered stations. To interpolate data as accurately as possible, IMA considers three kinds of information (spatiotemporal correlation, all attributes of a single measurement, and all values) and accordingly develops three methods to estimate the missing data. First, we develop an improved multiviewer method, which uses the spatiotemporal correlation of data from neighboring stations to estimate randomly missing values. Second, we propose a new multi-eXtreme Gradient Boosting (multi-XGBoost) method that uses the values of co-occurring, correlated, and correctly recorded attributes to predict the value of the missing attribute. Third, we use matrix factorization to estimate the missing parts, provided that the data in the interpolation matrix are not all missing. To avoid the influence of uncorrelated data, IMA calculates Pearson's correlation coefficient between the data of each pair of stations and uses the data from each station's k most highly correlated neighbors to form an interpolation matrix. Furthermore, because missing cases are complex, IMA assigns a confidence level to each of the three prediction methods. For example, if the multiviewer method fails, IMA weights all valid results by their confidence levels. We conduct experiments on two real-world datasets from air quality monitoring stations in Beijing, both of which contain numerous missing measurements. Experimental results show that IMA outperforms counterpart methods in interpolating the missing measurements, in terms of accuracy and effectiveness. Compared with the most closely related method, IMA improves the interpolation accuracy from 0.818 to 0.849 on the small dataset and from 0.214 to 0.759 on the large one.
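The neighbor-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the time-by-station array layout, the use of pairwise-complete observations to handle gaps, and the ranking by signed (rather than absolute) correlation are all assumptions made for illustration.

```python
import numpy as np


def pearson_pairwise(x, y):
    """Pearson's r over the time steps where both series are observed."""
    mask = ~np.isnan(x) & ~np.isnan(y)
    if mask.sum() < 2:
        return np.nan  # too few shared observations to correlate
    xs, ys = x[mask], y[mask]
    xc, yc = xs - xs.mean(), ys - ys.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else np.nan


def top_k_neighbors(readings, target, k):
    """Indices of the k stations most correlated with column `target`.

    `readings` is a (time steps x stations) array with NaN for missing
    values. Ranking by signed r is an assumption, not from the source.
    """
    scores = []
    for j in range(readings.shape[1]):
        if j == target:
            continue
        r = pearson_pairwise(readings[:, target], readings[:, j])
        if not np.isnan(r):
            scores.append((r, j))
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in scores[:k]]


def interpolation_matrix(readings, target, k):
    """Stack the target station with its top-k neighbors, target first."""
    cols = [target] + top_k_neighbors(readings, target, k)
    return readings[:, cols]
```

Calling `interpolation_matrix(readings, target=0, k=2)` returns a three-column array whose first column is the target station, gaps included, so a downstream estimator such as matrix factorization only sees data from correlated stations.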