Abstract

The decision tree is one of the most popular tools in data mining and machine learning for extracting useful information from stored data. However, data repositories may contain noise, i.e., random error in the data. Noise in a data set can take different forms, such as wrongly labeled instances, erroneous attribute values, or contradictory and duplicate instances with different labels. The serious effect of noise is that it can mislead the learning algorithm into producing a long and complex model, because the algorithm attempts to fit every training instance, including noisy ones, into the model description. This is a major cause of the overfitting problem. Most decision tree induction algorithms apply pre-pruning or post-pruning techniques during the tree induction phase to avoid growing the tree so deep that it covers the noisy data. We, in contrast, design a loosely coupled approach to dealing with noise: our noise handling takes place in a phase separate from tree induction. Both corrupted and uncorrupted data are clustered and heuristically selected before the tree induction module is applied. We observe from our experiments that, when learning from highly corrupted data, our approach performs better than the conventional decision tree induction method.
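The abstract describes clustering the data and heuristically selecting instances before tree induction, without giving the exact heuristic. The following is a minimal sketch of that two-phase idea, under the assumption that the selection heuristic drops instances whose label disagrees with the majority label of their cluster; the function name and the choice of k-means are illustrative, not the paper's actual method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def filter_noise_by_clustering(X, y, n_clusters, seed=0):
    """Phase 1 (assumed heuristic): cluster all instances, then keep only
    those whose label matches the majority label of their cluster."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(X)
    keep = np.zeros(len(y), dtype=bool)
    for c in range(n_clusters):
        members = clusters == c
        if not members.any():
            continue
        vals, counts = np.unique(y[members], return_counts=True)
        majority = vals[np.argmax(counts)]
        keep |= members & (y == majority)
    return X[keep], y[keep]

# Synthetic demo: two well-separated classes with 10% of labels flipped.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
y_noisy = y.copy()
flip = rng.choice(200, 20, replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

# Phase 1: noise filtering; Phase 2: ordinary tree induction.
Xf, yf = filter_noise_by_clustering(X, y_noisy, n_clusters=2)
tree = DecisionTreeClassifier(random_state=0).fit(Xf, yf)
print(f"kept {len(yf)} of 200 instances, tree depth {tree.get_depth()}")
```

Because the noise handling is a separate phase, the induction step itself is untouched: any standard tree learner can be plugged in after the filter, which is the loose coupling the abstract emphasizes.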
