Abstract

Decision tree-based classification is a popular approach for pattern recognition and data mining. Most decision tree induction methods assume that the training data resides at one central location. Given the growth of distributed databases at geographically dispersed locations, methods for decision tree induction in distributed settings are gaining importance. This paper extends two well-known decision tree methods from centralized data to distributed data settings. The first method is an extension of the CHAID algorithm and generates single-feature, multi-way split decision trees. The second method is based on Fisher's linear discriminant (FLD) function and generates multi-feature binary trees. Both methods aim to generate compact trees and can handle multiple classes. The proposed extensions for distributed environments are compared to their centralized counterparts and to each other. Theoretical analysis and experimental tests demonstrate the effectiveness of the extensions. In addition, the side-by-side comparison highlights the advantages and deficiencies of the two methods under different settings of the distributed environments.
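To make the second method concrete: an FLD-based tree typically computes, at each internal node, a linear projection that best separates two groups of classes and thresholds the projected values to form a binary split. The sketch below shows one such node-level split for two classes; the function name, the regularization term, and the midpoint threshold rule are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fld_split(X, y):
    """Compute a Fisher linear discriminant binary split for two classes.

    Returns a projection direction w and a threshold t; samples with
    X @ w <= t are routed to the left child, the rest to the right.
    """
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix, lightly regularized so it is invertible
    # even when features are degenerate (illustrative choice).
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    Sw += 1e-6 * np.eye(X.shape[1])
    # Fisher direction: w = Sw^{-1} (m1 - m0).
    w = np.linalg.solve(Sw, m1 - m0)
    # Threshold at the midpoint of the projected class means.
    t = 0.5 * (m0 @ w + m1 @ w)
    return w, t

# Small demonstration on two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
Xa = rng.normal([0.0, 0.0], 0.3, size=(50, 2))
Xb = rng.normal([3.0, 3.0], 0.3, size=(50, 2))
X = np.vstack([Xa, Xb])
y = np.array([0] * 50 + [1] * 50)

w, t = fld_split(X, y)
pred = (X @ w > t).astype(int)
accuracy = (pred == y).mean()
```

Because the split is a hyperplane over all features rather than an axis-parallel cut on one feature, FLD-based trees tend to need fewer nodes than single-feature trees on obliquely separable data, which matches the paper's emphasis on compact trees.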
