A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm

Harun Uğuz

doi:10.1016/j.knosys.2011.04.014

Abstract

Text categorization is widely used for organizing documents in digital form. Due to the increasing number of documents in digital form, automated text categorization has become more promising in the last ten years. A major problem of text categorization is its large number of features, most of which are irrelevant noise that can mislead the classifier. Therefore, feature selection is often used in text categorization to reduce the dimensionality of the feature space and to improve performance. In this study, two-stage feature selection and feature extraction is used to improve the performance of text categorization. In the first stage, each term within the document is ranked according to its importance for classification using the information gain (IG) method. In the second stage, a genetic algorithm (GA) for feature selection and principal component analysis (PCA) for feature extraction are applied separately to the terms ranked in decreasing order of importance, and a dimension reduction is carried out. Thereby, during text categorization, terms of less importance are ignored, and the feature selection and extraction methods are applied only to the terms of highest importance; thus, the computational time and complexity of categorization are reduced. To evaluate the effectiveness of the dimension reduction methods in the proposed model, experiments are conducted using the k-nearest neighbour (KNN) and C4.5 decision tree algorithms on the Reuters-21578 and Classic3 dataset collections for text categorization. The experimental results show that the proposed model is able to achieve high categorization effectiveness as measured by precision, recall and F-measure.
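The first stage described in the abstract, ranking terms by information gain, can be illustrated with a minimal sketch. It is not the author's implementation: it assumes binary term presence per document, and the toy corpus, the labels and the function names (`entropy`, `information_gain`, `rank_terms`) are hypothetical and chosen only for illustration. The second stage (GA or PCA on the top-ranked terms) is not shown.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(term) = H(class) - H(class | term present or absent)."""
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    h_class = entropy(Counter(labels).values())
    h_cond = 0.0
    for part in (with_term, without_term):
        size = sum(part.values())
        if size:  # skip an empty partition (term in all or no documents)
            h_cond += (size / n) * entropy(part.values())
    return h_class - h_cond


def rank_terms(docs, labels):
    """Return terms sorted by decreasing information gain, plus the scores."""
    vocab = sorted(set().union(*docs))
    scores = {t: information_gain(docs, labels, t) for t in vocab}
    ranked = sorted(vocab, key=lambda t: scores[t], reverse=True)
    return ranked, scores


# Hypothetical toy corpus: each document is a set of terms (assumption).
docs = [
    {"oil", "price", "market"},
    {"oil", "barrel", "market"},
    {"wheat", "harvest", "market"},
    {"wheat", "grain", "market"},
]
labels = ["crude", "crude", "grain", "grain"]

ranked, scores = rank_terms(docs, labels)
top_terms = ranked[:2]  # keep only the highest-ranked terms for stage two
```

In this toy corpus, "market" appears in every document and therefore carries no information about the class, while "oil" and "wheat" separate the two classes perfectly. A second-stage method would then operate only on `top_terms` rather than on the full vocabulary, which is the source of the reduced computation the abstract describes.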

Journal: Knowledge-Based Systems
Publication Date: Apr 29, 2011
Citations: 418

Similar Papers

A hybrid approach for text categorization by using χ² statistic, principal component analysis and particle swarm optimization
Scientific Research and Essays | VOL. 8
04 Oct 2013

Analysis and Evaluation of Feature Selection and Feature Extraction Methods
Rubén E Nogales ... Marco E Benalcázar
International Journal of Computational Intelligence Systems | VOL. 16
20 Sep 2023

Experiments on the Use of Feature Selection and Machine Learning Methods in Automatic Malay Text Categorization
Hamood Alshalabi ... Sabrina Tiun
Procedia Technology | VOL. 11
01 Jan 2013

Information-theoretic feature selection with segmentation-based folded principal component analysis (PCA) for hyperspectral image classification
Md Palash Uddin ... Md Ali Hossain
International Journal of Remote Sensing | VOL. 42
10 Nov 2020
