Abstract

In predictive tasks such as classification, Information Gain (IG) based Decision Trees are widely used. However, the IG criterion has some inherent problems: it is biased toward choosing nominal attributes with a large number of distinct values as the splitting attribute, and it performs poorly on imbalanced datasets. Most real-world datasets contain many nominal attributes, and those attributes may take many distinct values. In this paper, we highlight these characteristics of the datasets while discussing the performance of our proposed approach. Our approach is a variant of the traditional Decision Tree model and uses a new technique called Dispersion_Ratio, a modification of the existing Correlation Ratio (CR) method. The approach consists of two phases: first, the dataset is discretised by a discretization module; second, the preprocessed dataset is used to build a Dispersion Ratio based Decision Tree model. The proposed method neither favours attributes with many unique values nor is sensitive to the class distribution. It performs better than the previously proposed CR based Decision Tree (CRDT) model because an efficient discretization module has been added to it. We evaluate our approach on several benchmark datasets from various domains to demonstrate its effectiveness, and we also compare our model against Information Gain, Gain Ratio and Gini Index based models. The results show that the proposed model outperforms the other models in the majority of the cases considered in our experiment.
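The abstract does not define Dispersion_Ratio itself, but it describes it as a modification of the classical Correlation Ratio (CR). As a rough illustration only, the sketch below computes the standard correlation ratio (eta squared, the ratio of between-class variance to total variance of a discretised attribute) and uses it to rank candidate splitting attributes; the function names and the attribute-selection loop are hypothetical, and the actual Dispersion_Ratio formula is defined in the paper itself.

```python
def correlation_ratio(classes, values):
    """Classical correlation ratio (eta squared):
    between-class sum of squares / total sum of squares."""
    overall_mean = sum(values) / len(values)
    groups = {}
    for c, v in zip(classes, values):
        groups.setdefault(c, []).append(v)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2
        for g in groups.values()
    )
    ss_total = sum((v - overall_mean) ** 2 for v in values)
    return ss_between / ss_total if ss_total > 0 else 0.0

def best_split_attribute(dataset, class_labels):
    """Hypothetical selection step: pick the attribute whose
    (discretised) values are most strongly associated with the class."""
    return max(dataset, key=lambda a: correlation_ratio(class_labels, dataset[a]))
```

Unlike Information Gain, this score does not grow merely because an attribute has many distinct values, which is the bias the abstract says the proposed criterion avoids.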
