Enhancing outlier detection in air quality index data using a stacked machine learning model

Abdoul Aziz Diallo, Lawrence Nderu, Bonface Miya Malenje, Gideon Mutie Kikuvi

doi:10.1002/eng2.12936

Journal: Engineering Reports
Publication Date: May 30, 2024
License type: CC BY 4.0

Abstract

The air quality index (AQI) is a commonly used metric for evaluating air quality across locations and time periods. Like other environmental datasets, AQI data can contain outliers: data points that diverge markedly from the norm and indicate exceptionally good or exceptionally poor air quality. Detecting them is crucial for identifying and understanding severe pollution episodes that have far-reaching environmental and public health implications. This study uses daily air quality data collected in Shanghai, China, from January 1, 2014, to January 31, 2021, as the experimental dataset. The dataset includes daily AQI measurements together with the concentrations of six pollutants: particulate matter (PM2.5 and PM10), sulfur dioxide (SO2), nitrogen dioxide (NO2), ozone (O3), and carbon monoxide (CO). Each pollutant's concentration is measured in micrograms per cubic meter (µg/m³). The dataset is preprocessed by cleaning and normalization, and K-means clustering is then applied to discover distinct patterns. A stacked ensemble machine learning model that incorporates K-means clustering, random forest (RF), and gradient boosting classifier (GBC) is developed and compared with decision tree, support vector machine, K-nearest neighbor, and Naive Bayes algorithms. Performance in identifying outliers is evaluated using accuracy, precision, recall, and F1-score. The stacked model outperformed all of the comparison models, achieving accuracy, precision, recall, and F1-score of 0.99, 0.99, 0.97, and 0.99, respectively.
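
The pipeline described in the abstract can be illustrated with a minimal scikit-learn sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: the data are synthetic rather than the Shanghai records, outlier labels are derived by flagging the 5% of days farthest from their nearest K-means centroid, the number of clusters is set to 3, and a logistic regression serves as the meta-learner on top of the RF and GBC base models.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for daily readings of six pollutants
# (PM2.5, PM10, SO2, NO2, O3, CO); NOT the dataset used in the paper.
n_days = 600
X_raw = rng.gamma(shape=2.0, scale=20.0, size=(n_days, 6))
X_raw[:30] *= 4.0  # inject some extreme "pollution episode" days

# Normalize the features before clustering.
X = StandardScaler().fit_transform(X_raw)

# Assumption: label a day as an outlier when its distance to the nearest
# K-means centroid is in the top 5% of all distances.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
nearest_dist = kmeans.transform(X).min(axis=1)
y = (nearest_dist > np.quantile(nearest_dist, 0.95)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Stack RF and GBC; the logistic-regression meta-learner is an assumption.
stacked = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gbc", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stacked.fit(X_train, y_train)
y_pred = stacked.predict(X_test)

# The four metrics named in the abstract, computed for the outlier class.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the labels here come from the same clustering step that defines the outliers, the scores printed by this sketch say nothing about the results reported for the real data; the example only shows how the stages fit together.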

Full Text