Abstract

This paper discusses feature selection algorithms for text categorization. Feature selection is critical: the choice of algorithm can make or break a categorization engine. The algorithms discussed in this paper are Document Frequency, Information Gain, Chi-Squared, Mutual Information, the NGL (Ng-Goh-Low) coefficient, and the GSS (Galavotti-Sebastiani-Simi) coefficient. The general idea behind any feature selection algorithm is to score the importance of words with some measure that retains informative words and removes uninformative ones, thereby helping the text-categorization engine assign a document, D, to a category, C. These feature selection methods are explained, implemented, and evaluated in this paper. The paper also describes how we gathered and constructed the training and testing data, along with the setup and storage techniques we used.
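As a rough illustration of what such a measure looks like (a generic sketch, not the paper's own implementation), the chi-squared statistic scores a term against a category from a 2x2 contingency table of document counts; the function and variable names below are illustrative:

```python
def chi_squared(a, b, c, d):
    """Chi-squared score for one (term, category) pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d  # total number of documents
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or category is empty
    return n * (a * d - c * b) ** 2 / denom

# A term that appears in every in-category document and no others
# scores high; a term spread evenly across categories scores zero.
print(chi_squared(10, 0, 0, 10))  # perfectly predictive term
print(chi_squared(5, 5, 5, 5))    # uninformative term -> 0.0
```

Ranking all terms by such a score and keeping only the top-scoring ones is the common pattern shared by the measures listed above; they differ in how the contingency counts are combined.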
