Abstract

Pollutant forecasting is an important problem in the environmental sciences. Data mining is an approach to discovering knowledge from large data sets. This paper uses data mining methods to forecast the concentration level of PM2.5, an important air pollutant. Several tree-based classification algorithms are available in data mining, such as CART, C4.5, Random Forest (RF) and C5.0. RF and C5.0 are popular ensemble methods: RF builds on CART with bagging, and C5.0 builds on C4.5 with boosting. This paper builds predictive models of the PM2.5 concentration level based on RF and C5.0 using R packages. The data set covers the period 2000-2011 in a new town of Hong Kong. The PM2.5 concentration is divided into two levels, with the critical point at 25 µg/m^3 (24-hour mean). Under 100 repetitions of 10-fold cross-validation, the RF model gives the best testing accuracy, at around 0.845 to 0.854.
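The two-level labeling and the repeated cross-validation protocol mentioned in the abstract can be sketched as follows. This is a minimal illustration in plain Python, not the authors' R code: the function names `pm25_level` and `repeated_kfold` are hypothetical, and the treatment of a value exactly at 25 µg/m^3 as "low" is an assumption, since the abstract does not say which side the critical point falls on.

```python
import random

THRESHOLD = 25.0  # µg/m^3, 24-hour mean, the critical point given in the abstract


def pm25_level(concentration):
    """Map a 24-hour mean PM2.5 concentration to one of two levels."""
    # Assumption: a value exactly at the threshold is labeled "low".
    return "high" if concentration > THRESHOLD else "low"


def repeated_kfold(n_samples, k=10, repeats=100, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        # Reshuffle before every repetition so each one uses a new partition.
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Stride slicing gives k folds whose sizes differ by at most one.
        folds = [indices[i::k] for i in range(k)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, train_idx, test_idx
```

With the defaults, the generator produces 100 x 10 = 1000 train/test splits. A classifier would be fitted on each `train_idx` and scored on the matching `test_idx`, and the resulting accuracies summarized, for example as a range.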

Highlights

  • Air pollution has been a major problem for some time

  • Under 100 repetitions of 10-fold cross-validation, the Random Forest (RF) model gives the best testing accuracy, at around 0.845 to 0.854

  • Because the target data come from a new town in Hong Kong, where many people live, a stricter air pollution standard is needed for such an area


Summary

INTRODUCTION

Air pollution has been a major problem for some time. Various organic and inorganic pollutants from all aspects of human activity are added to the air daily. One of the most important pollutants is particulate matter. Particulate matter (PM) can be defined as a mixture of fine particles and droplets in the air, which can be characterized by particle size. Because the target data come from a new town in Hong Kong, where many people live, a stricter air pollution standard is needed for such an area. We try to build models for predicting a day's concentration level using two popular tree-based classification algorithms, namely Random Forest (RF) [4-. RF and C5.0 are ensemble methods based on CART and C4.5, respectively, and each model contains many basic decision trees.
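To illustrate what "many basic decision trees in the model" means, here is a minimal sketch of bagging with majority voting in plain Python. It is deliberately simplified and is not the paper's method: the base learner is a one-split stump on a single feature rather than a full CART tree, the random feature selection that Random Forest adds is omitted, and the boosting used by C5.0 is not shown. The names `fit_stump` and `bagged_stumps` are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a one-split threshold classifier on a single numeric feature."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # Each side predicts its majority label; an empty right side
        # falls back to the left label.
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label


def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train n_trees stumps on bootstrap samples and predict by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw n row indices with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))

    def predict(x):
        votes = Counter(tree(x) for tree in trees)
        return votes.most_common(1)[0][0]

    return predict
```

Each stump sees a different resample of the data, so individual stumps may disagree, and the vote across them is what gives the ensemble its stability.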

Methods
METHODOLOGY
DATA PREPARATION
EXPERIMENTS
RESULT
Comparison
Findings
CONCLUSION
