Abstract

Feature selection is an essential preprocessing step for classifiers trained on high-dimensional corpora. Features for text categorization include words, phrases, sentences, and word distributions. Classifying documents into closely related categories is harder than classifying them into unrelated categories. A feature selection algorithm based on the chi-square statistic is proposed for the Naive Bayes classifier. The proposed method identifies the features related to a class and determines the type of dependency between each feature and the category. In this paper, the method selects related phrases and words as features, and it is compared with the conventional chi-square method. Experiments were conducted with randomly chosen training documents from one unrelated and five closely related categories of the 20 Newsgroups benchmark. The proposed method achieves better precision and recall than the conventional chi-square method.
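
For orientation, the conventional chi-square score that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration, not the proposed method: the function names (`chi_square`, `dependency`, `score_features`) are hypothetical, features are assumed to be binary (a term either occurs in a document or it does not), and reading the sign of `a*d - c*b` as the "type of dependency" is an assumption about one plausible interpretation. Tokens may be words or phrases, as long as each is represented as a single string.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A row or column is empty, so the table carries no evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def dependency(a, b, c, d):
    """Assumed reading of dependency type: the sign of a*d - c*b."""
    diff = a * d - c * b
    if diff > 0:
        return "positive"  # term is over-represented in the category
    if diff < 0:
        return "negative"  # term is under-represented in the category
    return "independent"


def score_features(docs, labels, target):
    """Score every term against one target category.

    docs is a list of token lists, labels the matching category names.
    Returns {term: (chi_square_score, dependency_type)}.
    """
    n_target = sum(1 for y in labels if y == target)
    n_other = len(labels) - n_target
    in_target, in_other = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        counter = in_target if y == target else in_other
        for term in set(tokens):  # count document frequency, not term frequency
            counter[term] += 1
    scores = {}
    for term in set(in_target) | set(in_other):
        a, b = in_target[term], in_other[term]
        c, d = n_target - a, n_other - b
        scores[term] = (chi_square(a, b, c, d), dependency(a, b, c, d))
    return scores
```

Under these assumptions, a selector could keep only the highest-scoring terms with a positive dependency for each category before training the Naive Bayes classifier. Whether the paper filters features this way is not stated in the abstract.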
