A Feature Selection and Classification Technique for Text Categorization

M R Girgis, A A Aly

doi:10.1142/s0218843003000826

Abstract

Text categorization is the automated assignment of documents to predefined categories based on their contents. It involves two main tasks: feature selection and document classification. This paper discusses the weak points of the text categorization technique developed by Maron and modified by Lewis. It then introduces a technique for text categorization that uses new formulas for feature selection and document classification, designed to overcome the weak points of Maron's and Lewis' techniques. The paper also describes the design of an experimental text categorization system composed of the same set of processes as the MAXCAT system developed by Lewis. It presents and analyses the results of applying the system to a set of training and test documents, using both Lewis' formulas and the proposed ones. In addition, a method for separately evaluating the effectiveness of feature selection is given. Finally, the impact of feature set size on the effectiveness of the classification system is investigated by applying one of the proposed classification formulas with different feature set sizes.

Full Text