Abstract

Missing data in a database frequently complicates its processing and analysis. This research presents a new methodology for missing data imputation based on multivariate adaptive regression splines (MARS). The performance of the proposed method is evaluated on a database built from the hourly records of environmental stations located in the city of Madrid (Spain). The data analyzed correspond to hourly measurements from 10th February 2004 to 31st May 2010. The proposed methodology has three variants. The first uses all the available information to fit different MARS models that predict the missing values from the available data. In the second, the MARS models are trained after removing the 1% of most extreme cases according to their Mahalanobis distances, as these are considered outliers. Finally, the third variant uses only the information corresponding to the previous month to fit the MARS models for missing data prediction. The results obtained outperformed those given by multivariate imputation by chained equations (MICE) when applied to the same data sets. For a data set with 20% of its information missing, the proposed algorithm outperforms MICE in at least 65.5% of cases in terms of RMSE, 75.2% in terms of MAE and 76% in terms of MAPE.

Keywords: Missing Data Imputation; Multivariate Adaptive Regression Splines (MARS); Multiple Imputation by Chained Equations (MICE); Pollutants
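
As an illustration of the outlier-removal step in the second variant, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function name `trim_mahalanobis`, the use of the sample mean and sample covariance, and the restriction to complete rows (no missing values) are all assumptions made for this example. The abstract only states that the 1% of most extreme cases by Mahalanobis distance are removed before training.

```python
import numpy as np

def trim_mahalanobis(X, frac=0.01):
    """Drop the `frac` most extreme rows of X by squared Mahalanobis distance.

    Hypothetical helper: assumes X holds complete rows (no NaN) with at
    least two columns, and uses the sample mean and covariance.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    diff = X - mean
    # Squared distance per row: diff_i^T * inv_cov * diff_i
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    n_drop = int(round(frac * len(X)))
    # Keep the rows with the smallest distances, preserving original order
    keep = np.sort(np.argsort(d2)[: len(X) - n_drop])
    return X[keep], d2

# Usage on synthetic data with one planted extreme row
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
X[0] = [50.0, 50.0, 50.0]
trimmed, d2 = trim_mahalanobis(X, frac=0.01)
```

The trimmed array would then serve as the training set for the regression models, while the removed rows are simply excluded from fitting.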
