This study classifies air quality from PM1.0, PM2.5, and PM10 measurements using a Big Data Analytics approach, with the Gradient-Boosted Tree Classifier (GBT) algorithm implemented in PySpark. The dataset was downloaded from OpenAQ and covers the period from April 14, 2021, to April 16, 2023, for a total of 1,048,154 entries. The research process comprises data preprocessing to address imbalance in the data, splitting the dataset into training and testing sets, and hyperparameter tuning with grid search and cross-validation to optimize model performance. Using PySpark’s parallel processing of large datasets, the GBT model achieved an accuracy of 98.87%, precision of 99.00%, recall of 98.87%, and an F1-score of 98.90%. These results show that Big Data Analytics can support efficient and accurate air quality classification, and can contribute to real-time monitoring systems that inform air pollution mitigation and data-driven policy-making.
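The abstract does not state how precision, recall, and F1 were averaged across classes. One reading, offered here only as an assumption, is support-weighted averaging: under that scheme weighted recall always equals accuracy, which would match the identical 98.87% figures reported for both. Spark's `MulticlassClassificationEvaluator` offers metrics of this kind (for example `weightedPrecision` and `weightedRecall`). The sketch below computes the four metrics in plain Python on a small set of hypothetical labels; the function name `weighted_metrics`, the class names, and the data are illustrative and are not taken from the study.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    support = Counter(y_true)      # number of true instances per class
    predicted = Counter(y_pred)    # number of predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = true_pos[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n         # class share of the true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(true_pos.values()) / n
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, NOT drawn from the study's OpenAQ dataset.
y_true = ["good"] * 6 + ["moderate"] * 3 + ["unhealthy"] * 1
y_pred = ["good"] * 5 + ["moderate"] * 4 + ["unhealthy"] * 1

metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

On any input, the `accuracy` and `recall` values returned by this function coincide, because each class's recall is weighted by its share of the true labels. Precision and F1 can differ from accuracy, as they do in the figures reported above.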