Correcting PM2.5 data from low-cost sensors using machine learning techniques

Pratyush Agrawal, Padmavati Kulkarni, Meenakshi Kushwaha, Srishti Srishti, Hrishikesh Gautam, Pratima Singh, Sreekanth Vakacherla

doi:10.5194/egusphere-egu23-3110

Abstract

Low-cost sensors (LCSs) used for measuring air quality have become popular because of their portability, affordability, and ease of operation. However, LCS data often have accuracy and bias issues that need to be addressed before the data are used for research. LCSs are, therefore, collocated with reference-grade instruments, and various statistical and machine learning (ML) approaches are used to correct the observed bias in the data. In this study, collocation experiments were conducted in Bengaluru, India, for about nine months (December 2021 to August 2022). We used nine PM2.5 LCSs that were collocated with a beta attenuation monitor (BAM), which is certified by the United States Environmental Protection Agency (USEPA). Hourly averaged data from the LCSs and the BAM were used to train various ML correction models. The LCSs included in the study (Airveda, Atmos, Prana Air, BlueSky, Aurassure, Aerogram, PurpleAir, and Prkruti) are widely available in the Indian market. The ML models include support vector regression (SVR), decision tree (DT), random forest (RF), and eXtreme gradient boosting (XGBoost). For the LCSs used in the study, a total of 170 ML models were built to identify the best-performing correction model for each sensor. Model performance was evaluated using the following metrics: mean absolute error (MAE), root mean square error (RMSE), and normalised RMSE (NRMSE). During the study period, the average hourly BAM concentration was ~32 µg/m3. Hourly averaged PM2.5 from the LCSs and the BAM exhibited a linear relationship. The NRMSE values of the raw (uncorrected) LCS PM2.5 with respect to BAM PM2.5 varied between 0.26 and 0.89 across sensors. The Plantower-based LCSs (Atmos I, PurpleAir, and Aerogram) performed better, characterised by the lowest RMSE/NRMSE values. SVR was found to be the best-performing model for most of the sensors in correcting raw LCS PM2.5 data. The NRMSE of the ML-corrected LCS PM2.5 was reduced by 46% to 74% across sensors compared to the uncorrected LCS PM2.5. As a case study, we also added black carbon (BC) data to our ML models, but no significant change in performance was observed (a 6% improvement in RMSE).
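The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the study's code. The abstract does not say how the RMSE is normalised, so dividing by the mean reference concentration is an assumption here. The simple least-squares line is only a stand-in motivated by the reported linear LCS-BAM relationship, not one of the SVR, DT, RF, or XGBoost models. All function names and the hourly values are hypothetical.

```python
import math


def mae(ref, est):
    """Mean absolute error of est against the reference series ref."""
    return sum(abs(e - r) for r, e in zip(ref, est)) / len(ref)


def rmse(ref, est):
    """Root mean square error of est against the reference series ref."""
    return math.sqrt(sum((e - r) ** 2 for r, e in zip(ref, est)) / len(ref))


def nrmse(ref, est):
    """RMSE divided by the mean of the reference series.

    Assumption: the abstract does not define the normalisation;
    normalising by the reference mean is one common convention.
    """
    return rmse(ref, est) / (sum(ref) / len(ref))


def fit_linear(x, y):
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up hourly PM2.5 values (µg/m3), for illustration only.
bam = [12.0, 20.0, 28.0, 35.0, 41.0, 56.0]  # reference monitor
lcs = [20.0, 33.0, 45.0, 60.0, 66.0, 95.0]  # raw low-cost sensor

slope, intercept = fit_linear(lcs, bam)
corrected = [slope * v + intercept for v in lcs]

print("raw:      ", mae(bam, lcs), rmse(bam, lcs), nrmse(bam, lcs))
print("corrected:", mae(bam, corrected), rmse(bam, corrected), nrmse(bam, corrected))
```

The sketch scores the correction on the same points used to fit it, which flatters the result; a realistic evaluation would fit on one part of the collocation period and compute the metrics on held-out hours.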
