Air pollution has become a global issue that has drawn attention from many countries, and Jakarta, Indonesia's capital city, is not exempt from it. This study uses five pollutant parameters, PM10, SO2, CO, O3, and nitrogen dioxide (NO2), to categorize Jakarta's air quality. The data are daily measurements taken from the Air Quality Monitoring Station in Jakarta throughout 2020. The methods used include SVM, Random Forest, Logistic Regression, KNN, CART, and a Stacking Algorithm. At the data preparation stage, we found missing values, outliers, and class imbalance. Before applying the machine learning methods and evaluating accuracy, we applied data pre-processing techniques: the MissForest method, median substitution, and ADASYN. The results show that the original dataset yields higher accuracy scores (0.882 to 0.977) than the balanced dataset (0.829 to 0.976). According to the evaluation results, the Random Forest method has the highest accuracy score on both the original and the balanced datasets. The overall result is better than that of a comparable earlier study, which reported 96.61% accuracy using a neural network. This suggests that preprocessing steps such as handling missing values and handling class imbalance are essential to classification performance.
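As an illustration of the median substitution step, the following minimal sketch replaces outliers in a single pollutant column with that column's median. It rests on assumptions not stated in the abstract: that median substitution is applied to outliers, and that outliers are detected with the common 1.5 × IQR rule. The function name `replace_outliers_with_median` and the PM10 readings are hypothetical and are not taken from the study's data.

```python
from statistics import median, quantiles


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the column median.

    Assumption: the IQR rule is used only for illustration; the abstract does
    not say how outliers were detected.
    """
    # Quartiles of the column (quantiles() sorts the data internally).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Median of the full column, outliers included.
    med = median(values)
    return [med if (v < low or v > high) else v for v in values]


# Hypothetical daily PM10 readings (made-up numbers, not from the study).
pm10 = [52.0, 48.0, 55.0, 60.0, 47.0, 51.0, 250.0, 53.0]
cleaned = replace_outliers_with_median(pm10)
print(cleaned)
```

In this made-up example only the reading of 250.0 falls outside the IQR fences, so it is the only value replaced; the remaining readings pass through unchanged.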