Abstract

In this work, we present new ways of handling categorical attributes, in particular their use in binary decision trees. We consider two main operations. The first uses the joint distribution of two or more categorical attributes to improve the final performance of the decision tree. The second, and more important, operation extracts a small number of predictive binary attributes from a categorical attribute, especially when the latter has a large number of values. With more than two classes to predict, most existing binary decision tree software must test an exponential number of binary attributes for each categorical attribute, which can be prohibitive. Our method, ARCADE, is independent of the number of classes to be predicted. It first reduces significantly the number of values of the initial categorical attribute by clustering those values with a hierarchical classification method. Each cluster of values then represents one value of a new categorical attribute, which is used in the decision tree in place of the initial one. Moreover, only the predictive binary attributes associated with this new attribute are retained. The reduction in the complexity of the search for the best binary split is therefore enormous, as will be seen in the application we consider: the old and still lively problem of protein secondary structure prediction.
