A Multistage Feature Selection Model for Document Classification Using Information Gain and Rough Set

Mrs Leena, Dr Mohammed

doi:10.14569/ijarai.2014.031103

Abstract

The number of documents is increasing rapidly, so organizing them in digitized form makes text categorization a challenging issue. A major difficulty in text categorization is the large number of features. Most of these features are noisy, irrelevant, or redundant and may mislead the classifier. Hence, it is important to reduce the dimensionality of the data to a smaller subset of features that provides the most information gain. Feature selection techniques reduce the dimensionality of the feature space and also improve overall accuracy and performance, so feature selection is considered an efficient technique for overcoming these issues. We therefore propose a multistage feature selection model to improve the overall accuracy and performance of classification. In the first stage, document preprocessing is performed. In the second, each term in the documents is ranked according to its importance for classification using information gain. In the third, the rough set technique is applied to the highly ranked terms and feature reduction is carried out. Finally, document classification is performed on the core features using Naive Bayes and KNN classifiers. Experiments are carried out on three UCI datasets: Reuters 21578, Classic 04, and Newsgroup 20. The results show better accuracy and performance for the proposed model.

Highlights

In the field of data mining, it has been observed that the data grow rapidly
The reduct and core are important concepts of rough set theory, generally used for feature selection and attribute reduction
A multistage feature selection model is proposed for document classification using information gain and rough sets

Summary

INTRODUCTION

In the field of data mining, it has been observed that data grow rapidly. With this rapid growth and the availability of an increasing number of electronic documents, classification becomes a key task [1]. It is important to reduce the dimensionality of the data to a smaller set of relevant features in order to decrease storage cost and processing time [6], [13]. Several dimension reduction techniques such as PCA, GA, and IG [19] have been applied; still, time complexity and text categorization [16], [24] can be improved. In this work we propose a multistage approach consisting of document preprocessing, feature selection, and feature reduction, which together reduce the high dimensionality of the feature space. The approach removes redundant and irrelevant attributes, thereby decreasing the computational complexity of the machine learning process and increasing classification performance. The results show that the proposed model achieves high categorization effectiveness as measured by precision, recall, and F-measure.
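To make the term-ranking stage concrete, the following is a minimal sketch of information gain computed for binary term presence. It is an illustration under stated assumptions, not the authors' implementation: each document is assumed to be a set of already preprocessed tokens, the function names (`entropy`, `information_gain`, `rank_terms`) are hypothetical, and the four-document corpus is invented purely for demonstration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Class entropy that remains after splitting on presence/absence of the term.
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def rank_terms(docs, labels):
    """Return (term, score) pairs sorted by decreasing information gain."""
    vocabulary = set().union(*docs)
    scores = {t: information_gain(docs, labels, t) for t in vocabulary}
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus (hypothetical): each document is a set of preprocessed tokens.
docs = [{"oil", "price"}, {"oil", "export"}, {"match", "goal"}, {"goal", "price"}]
labels = ["trade", "trade", "sport", "sport"]

ranking = rank_terms(docs, labels)
for term, score in ranking:
    print(f"{term}: {score:.3f}")
```

In this toy corpus, "oil" and "goal" separate the two classes perfectly and rank highest, while "price" appears equally often in both classes and scores zero. Only the top-ranked terms would then be passed on to the rough set stage.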

PROPOSED MULTISTAGE MODEL

Feature Selection

A C can be defined as A positive region in the relation as
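As an illustration of the positive region, the sketch below assumes the standard rough-set definition: the positive region of a decision attribute D with respect to a set of condition attributes C is the set of objects whose C-equivalence class falls entirely inside a single decision class. The notation may differ from the paper's own. The helper names (`partition`, `positive_region`, `dependency`) and the small decision table are hypothetical.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group object indices into equivalence classes that agree on all attrs."""
    blocks = defaultdict(set)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())


def positive_region(rows, cond_attrs, decision_attr):
    """Objects that cond_attrs classify unambiguously into one decision class."""
    pos = set()
    for block in partition(rows, cond_attrs):
        decisions = {rows[i][decision_attr] for i in block}
        if len(decisions) == 1:  # the whole block shares one decision value
            pos |= block
    return pos


def dependency(rows, cond_attrs, decision_attr):
    """Fraction of objects in the positive region (degree of dependency)."""
    return len(positive_region(rows, cond_attrs, decision_attr)) / len(rows)


# Toy decision table (hypothetical): term presence per document plus a class.
rows = [
    {"oil": 1, "price": 1, "cls": "trade"},
    {"oil": 1, "price": 0, "cls": "trade"},
    {"oil": 0, "price": 1, "cls": "sport"},
    {"oil": 0, "price": 0, "cls": "sport"},
]

full = dependency(rows, ["oil", "price"], "cls")
only_oil = dependency(rows, ["oil"], "cls")
only_price = dependency(rows, ["price"], "cls")
print(full, only_oil, only_price)
```

Here the attribute "oil" alone preserves the dependency of the full attribute set, so "price" can be dropped without losing classification ability. A reduct search repeats this check to find a minimal attribute subset with the same dependency as the full set.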

CLASSIFIER

Naive Bayes

K-Nearest Neighbor

EXPERIMENTAL EVALUATION

CONCLUSION