Global data analysis and the fragmentation problem in decision tree induction

Ricardo Vilalta, Larry Rendell, Gunnar Blix

doi:10.1007/3-540-62858-4_95

Abstract

We investigate an inherent limitation of top-down decision tree induction, known as the fragmentation problem, in which the continuous partitioning of the instance space progressively lessens the statistical support of every partial (i.e., disjunctive) hypothesis. We show, both theoretically and empirically, how the fragmentation problem adversely affects predictive accuracy as variation ∇ (a measure of concept difficulty) increases. Applying feature-construction techniques at every tree node, which we implement in the decision tree inducer DALI, is proved to solve the fragmentation problem only partially. Our study illustrates how a more robust solution must also assess the value of each partial hypothesis by drawing on all available training data, an approach we name global data analysis, which decision tree induction alone is unable to accomplish. The value of global data analysis is evaluated by comparing modified versions of C4.5 rules with C4.5 trees and DALI, on both artificial and real-world domains. Empirical results suggest the importance of combining feature construction and global data analysis to solve the fragmentation problem.

Keywords: Predictive Accuracy; Tree Node; Splitting Function; Target Concept; Feature Construction
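
The fragmentation effect described in the abstract can be illustrated with a small numerical sketch. It is not the authors' analysis. It assumes an idealized tree in which every split divides its examples evenly among its branches, and the function name `expected_support` and its parameters are hypothetical, introduced only for this illustration.

```python
def expected_support(n_examples: int, depth: int, branching: int = 2) -> float:
    """Average number of training examples reaching a node at `depth`.

    Assumption (idealization): every split divides its examples evenly
    among `branching` children, so support shrinks geometrically.
    """
    if depth < 0 or branching < 1:
        raise ValueError("depth must be >= 0 and branching >= 1")
    return n_examples / (branching ** depth)


if __name__ == "__main__":
    # Show how quickly the support behind each partial hypothesis shrinks
    # as the tree grows deeper, starting from 1000 training examples.
    for depth in range(7):
        print(f"depth={depth}  examples per node={expected_support(1000, depth):.1f}")
```

Under these assumptions, the examples available to evaluate a split fall off geometrically with depth, which is the loss of statistical support that the abstract attributes to continuous partitioning.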
