Document Categorization Based on Minimum Loss of Reconstruction Information

Juan Carlos Gomez, Marie-Francine Moens

doi:10.1007/978-3-642-37798-3_9

Abstract

In this paper we present and validate a novel approach for single-label multi-class document categorization. The proposed approach relies on a statistical property of Principal Component Analysis (PCA): it minimizes the reconstruction error of the training documents used to compute a low-rank category transformation matrix. This matrix projects the original training documents of a given category to a new low-rank space and then optimally reconstructs them in the original space with a minimum loss of information. The proposed method, called the Minimum Loss of Reconstruction Information (mLRI) classifier, extends this property and applies it to unseen documents. Several experiments on three well-known multi-class datasets for text categorization are conducted to highlight the stable and generally better performance of the proposed approach in comparison with other popular categorization methods.

Keywords: Support Vector Machine, Singular Value Decomposition, Linear Discriminant Analysis, Reconstruction Error, Text Categorization
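
The reconstruction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' mLRI implementation: it assumes that each category's low-rank matrix is taken from the top right singular vectors of that category's (uncentered) training matrix, and that an unseen document is assigned to the category whose matrix reconstructs it with the smallest Euclidean error. The function names, the toy data, and the `rank` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_category_bases(X, y, rank):
    """Compute one low-rank basis per category (assumed: uncentered SVD)."""
    bases = {}
    for label in np.unique(y):
        Xc = X[y == label]  # training documents of this category
        # Rows of Vt are the right singular vectors of the category matrix.
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        bases[label] = Vt[:rank].T  # shape: (n_features, at most rank)
    return bases

def reconstruction_error(x, W):
    """Project x onto the span of W, map it back, and measure the loss."""
    x_hat = W @ (W.T @ x)
    return float(np.linalg.norm(x - x_hat))

def predict(x, bases):
    """Pick the category whose basis loses the least information about x."""
    return min(bases, key=lambda label: reconstruction_error(x, bases[label]))

# Toy term-weight vectors: category 0 uses the first two terms,
# category 1 uses the last two (illustrative data only).
X = np.array([[1.0, 2.0, 0.0, 0.0],
              [2.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 3.0],
              [0.0, 0.0, 3.0, 1.0]])
y = np.array([0, 0, 1, 1])

bases = fit_category_bases(X, y, rank=1)
label_a = predict(np.array([3.0, 3.0, 0.0, 0.0]), bases)
label_b = predict(np.array([0.0, 0.0, 2.0, 2.0]), bases)
```

In this toy setting, a document that lies in the subspace spanned by one category's training documents is reconstructed almost exactly by that category's basis and poorly by the other, which is the intuition behind classifying by minimum reconstruction loss.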

Full Text