Combining dissimilarity spaces for text categorization

Roberto H. W. Pinheiro, George D. C. Cavalcanti, Ing Ren Tsang

doi:10.1016/j.ins.2017.04.025

Abstract

Text categorization systems are designed to classify documents into a fixed number of predefined categories. Bag-of-words is one of the most widely used approaches for representing a document. However, it generates a high-dimensional, sparse data matrix with a high feature-to-instance ratio. Aggressive feature selection can alleviate these drawbacks, but such selection degrades the classifier’s performance. In this paper, we propose an approach for text categorization based on Dissimilarity Representation and multiple classifier systems. The proposed system, Combined Dissimilarity Spaces (CoDiS), is composed of multiple classifiers trained on data from different dissimilarity spaces. Each dissimilarity space is a transformation of the original space that reduces the dimensionality, the feature-to-instance ratio, and the sparseness. Experiments on forty-seven text categorization databases show that CoDiS outperforms systems from the literature.
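
The general idea summarized in the abstract can be illustrated with a minimal sketch. It is not the CoDiS configuration itself: the abstract does not say how prototypes are chosen, which dissimilarity measure is used, which base classifier is trained, or how the outputs are combined. The sketch therefore assumes random prototype subsets, cosine dissimilarity, a nearest-centroid classifier, and majority voting, and every name in it (`to_dissimilarity_space`, `train_ensemble`, `DOCS`, and so on) is hypothetical. Each document is mapped to a vector of dissimilarities to a few prototype documents, so its dimensionality drops from the vocabulary size to the number of prototypes.

```python
import math
import random
from collections import Counter


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity of two equal-length term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def to_dissimilarity_space(doc, prototypes):
    """Represent a document by its dissimilarity to each prototype."""
    return [cosine_dissimilarity(doc, p) for p in prototypes]


class NearestCentroid:
    """Assumed base classifier: assign the class with the closest mean vector."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        return min(self.centroids, key=lambda c: math.dist(x, self.centroids[c]))


def train_ensemble(docs, labels, n_spaces=5, n_prototypes=2, seed=0):
    """Train one classifier per dissimilarity space.

    Assumption: each space is defined by a random subset of training
    documents used as prototypes.
    """
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_spaces):
        prototypes = rng.sample(docs, n_prototypes)
        X = [to_dissimilarity_space(d, prototypes) for d in docs]
        ensemble.append((prototypes, NearestCentroid().fit(X, labels)))
    return ensemble


def predict(ensemble, doc):
    """Combine the per-space predictions by majority vote (assumed rule)."""
    votes = Counter(
        clf.predict_one(to_dissimilarity_space(doc, prototypes))
        for prototypes, clf in ensemble
    )
    return votes.most_common(1)[0][0]


# Toy bag-of-words counts over a six-term vocabulary (made-up data).
DOCS = [
    [3, 2, 0, 0, 1, 0], [2, 3, 1, 0, 0, 0], [4, 1, 0, 0, 0, 0],
    [0, 0, 3, 2, 0, 1], [0, 1, 2, 3, 0, 0], [0, 0, 1, 4, 0, 0],
]
LABELS = ["sports", "sports", "sports", "finance", "finance", "finance"]

ensemble = train_ensemble(DOCS, LABELS)
prediction = predict(ensemble, [3, 3, 0, 0, 0, 0])
```

In this toy setting each document goes from six bag-of-words features to two dissimilarity features, and the resulting vectors are dense rather than sparse, which mirrors the reductions the abstract describes.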

Full Text