Abstract

In training a Decision Tree (DTr), the choice of the criterion by which the attribute is selected at each node is a key point. The Chi-Square (CS) measure is used in one of the best-known DTr algorithms, CHAID. However, this criterion tends to favour attributes with more values. In this paper I try to show that a change to this criterion can improve its performance. I present the results of experiments performed with DTr (unpruned and pruned) on seven databases. Alongside the choice of splitting criterion, the method of pruning the DTr is perhaps just as important. For this reason, I wanted to highlight which of the three types of DTr (unpruned, pessimistically pruned, or error-based pruned) behaves better on classification and prediction problems in the Data Science field. The experiments presented in the paper show that the modified version of the CS criterion systematically achieves a better classification error rate on the test data (CERTD). At the same time, the performance achieved by DTr pruning based on confidence intervals (error-based pruning) systematically exceeds that of the other two DTr variants.
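The abstract does not spell out the formulas, so the following is only a rough sketch of what a CS-based split score looks like: the chi-square statistic of the contingency table between an attribute's values and the class labels at a node. The normalize option (division by the table's degrees of freedom) is one plausible illustration of a correction for the bias toward multi-valued attributes; it is an assumption for illustration, not the modification actually proposed in the paper.

    import numpy as np

    def chi_square_split_score(attribute_values, class_labels, normalize=False):
        """Chi-square statistic of the (attribute value x class) contingency table.

        If normalize is True, the statistic is divided by its degrees of freedom,
        a hypothetical correction for the bias toward attributes with many values.
        """
        attribute_values = np.asarray(attribute_values)
        class_labels = np.asarray(class_labels)
        values = np.unique(attribute_values)
        classes = np.unique(class_labels)
        n = class_labels.size

        # Observed counts: rows = attribute values, columns = classes.
        observed = np.array(
            [[np.sum((attribute_values == v) & (class_labels == c)) for c in classes]
             for v in values],
            dtype=float,
        )

        # Expected counts under independence of attribute and class.
        row_tot = observed.sum(axis=1, keepdims=True)
        col_tot = observed.sum(axis=0, keepdims=True)
        expected = row_tot * col_tot / n

        # Sum (O - E)^2 / E over cells with non-zero expected count.
        mask = expected > 0
        chi2 = float((((observed - expected) ** 2)[mask] / expected[mask]).sum())

        if normalize:
            dof = (len(values) - 1) * (len(classes) - 1)
            chi2 /= max(dof, 1)
        return chi2

    # Toy usage: compare the raw CHAID-style score with the normalized variant.
    attr = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
    cls = ["no", "no", "yes", "no", "yes", "yes"]
    print(chi_square_split_score(attr, cls))
    print(chi_square_split_score(attr, cls, normalize=True))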
