Abstract

Feature selection is important for speeding up Automatic Text Document Classification (ATDC). At present, the most common approach to selecting discriminating features is the Global Filter-based Feature Selection Scheme (GFSS). GFSS assigns a score to each feature based on its discriminating power and selects the top-N features from the feature set, where N is an empirically determined number. As a result, the features of some classes may be discarded either partially or completely. The Improved Global Feature Selection Scheme (IGFSS) addresses this issue by selecting an equal number of representative features from every class. However, it struggles with unbalanced datasets that have a large number of classes, where the distribution of features across classes is highly variable. In this case, choosing an equal number of features from each class may exclude important features from classes that contain a higher number of features. To overcome this problem, we propose a novel Variable Global Feature Selection Scheme (VGFSS) that selects a variable number of features from each class based on the distribution of terms in the classes. It also ensures that a minimum number of terms is selected from each class. Numerical results on benchmark datasets show the effectiveness of the proposed VGFSS algorithm over classical information science methods and IGFSS.
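
The contrast between the three selection strategies can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not state the exact allocation rule used by VGFSS, so the proportional quota with a floor in `variable_per_class` is an assumption made purely for illustration. The function names, the `min_per_class` parameter, the pre-computed `scores`, the `feature_class` mapping of each feature to a single class, and the toy data are all hypothetical.

```python
from collections import defaultdict


def top_n_global(scores, n):
    """GFSS-style: keep the n highest-scoring features regardless of class."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def group_by_class(scores, feature_class):
    """Group features by their assigned class, highest score first."""
    groups = defaultdict(list)
    for feat in sorted(scores, key=scores.get, reverse=True):
        groups[feature_class[feat]].append(feat)
    return groups


def equal_per_class(scores, feature_class, n):
    """IGFSS-style: take the same number of features from every class."""
    groups = group_by_class(scores, feature_class)
    quota = n // len(groups)
    return [f for feats in groups.values() for f in feats[:quota]]


def variable_per_class(scores, feature_class, n, min_per_class=1):
    """Variable quota per class.

    ASSUMPTION: the quota is proportional to the number of features
    assigned to the class, with a floor of min_per_class. The abstract
    does not specify the actual VGFSS rule.
    """
    groups = group_by_class(scores, feature_class)
    total = sum(len(feats) for feats in groups.values())
    selected = []
    for feats in groups.values():
        quota = max(min_per_class, n * len(feats) // total)
        selected.extend(feats[:quota])
    return selected


# Hypothetical toy data: class A has many strong features, C has one weak one.
scores = {"a1": 0.9, "a2": 0.8, "a3": 0.7, "a4": 0.6,
          "b1": 0.5, "b2": 0.1, "c1": 0.05}
feature_class = {"a1": "A", "a2": "A", "a3": "A", "a4": "A",
                 "b1": "B", "b2": "B", "c1": "C"}

print(top_n_global(scores, 4))
print(equal_per_class(scores, feature_class, 4))
print(variable_per_class(scores, feature_class, 4))
```

On this toy data, the global top-N selection keeps only class A features, the equal-quota selection keeps one feature per class, and the variable-quota selection keeps more features from the larger class while still retaining at least one from each of the others. Because of the floor and the integer division, the number of features returned by the variable scheme is not guaranteed to equal N exactly.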

