Abstract

This work addresses document classification, a supervised learning task that requires a labeled document set for training and a test set of documents to be classified. Document categorization proceeds through a sequence of steps: text preprocessing, feature extraction, and classification. In every experiment, a self-made data set was used to train the classifiers. The work compares accuracy, average precision, precision, and recall for two classifiers (KNN and Naive Bayes), with and without combinations of several feature selection techniques. The results indicate that the Naive Bayes classifier performed better in many situations.

Highlights

  • In text classification, the dimensionality of the feature vector is usually huge because input documents contain large amounts of data and many terms [1, 2]

  • Feature clustering is one effective technique in feature reduction, where similar features are grouped into one cluster and each cluster is treated as a feature [11, 12]

  • In text categorization, labels from predefined categories are assigned to some documents


Summary

INTRODUCTION

The dimensionality of the feature vector is usually huge because input documents contain large amounts of data and many terms [1, 2]. Feature extraction approaches are computationally more intensive, and more effective, than feature selection methods [9, 10]. The number of digital documents on the web is increasing, and the number of terms (i.e. features) in those documents is quite large, yet only a few of them are informative. This is a severe problem that degrades the efficiency of Information Retrieval (IR) procedures. A better feature selection procedure improves both classification effectiveness and computational efficiency.
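To make the idea of keeping only the informative terms concrete, the following is a minimal sketch of term scoring with the chi-square statistic and the NGL coefficient, the two measures named in the outline below. It is an illustration under stated assumptions, not the authors' implementation: the function names (`chi_square`, `ngl`, `score_terms`, `select_top_k`), the toy documents, and the top-k cutoff are all hypothetical. The formulas are the standard 2x2 contingency-table definitions, with the NGL coefficient taken as the signed square root of chi-square.

```python
import math


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/category contingency table.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # term or category is constant, so it carries no signal
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def ngl(n11, n10, n01, n00):
    """NGL coefficient: signed square root of chi-square.

    Positive values mark terms that indicate membership in the category,
    negative values mark terms that indicate non-membership.
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return math.sqrt(n) * (n11 * n00 - n10 * n01) / math.sqrt(denom)


def score_terms(docs, labels, category, scorer=chi_square):
    """Return {term: score} for one category over tokenized documents."""
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        n11 = n10 = n01 = n00 = 0
        for tokens, label in zip(docs, labels):
            has_term = term in tokens
            in_cat = label == category
            if has_term and in_cat:
                n11 += 1
            elif has_term:
                n10 += 1
            elif in_cat:
                n01 += 1
            else:
                n00 += 1
        scores[term] = scorer(n11, n10, n01, n00)
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data (hypothetical): each document is a set of already-preprocessed terms.
DOCS = [
    {"goal", "match", "team"},
    {"team", "score", "goal"},
    {"election", "vote", "team"},
    {"vote", "party", "election"},
]
LABELS = ["sport", "sport", "politics", "politics"]

if __name__ == "__main__":
    chi_scores = score_terms(DOCS, LABELS, "sport")
    print(select_top_k(chi_scores, 3))
    ngl_scores = score_terms(DOCS, LABELS, "sport", scorer=ngl)
    print(select_top_k(ngl_scores, 3))
```

Because chi-square is squared, it ranks terms that signal absence from the category (here "election" and "vote" for the "sport" category) as highly as terms that signal membership, while the signed NGL coefficient ranks only positively associated terms at the top. Either score can be used to truncate the vocabulary before a classifier such as KNN or Naive Bayes is trained.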

RELATED WORK
FEATURE WEIGHTING
Chi Square
NGL Coefficient
Data Set 1
Data Set 2
Classifiers Performance and Results
CONCLUSION