Abstract

This paper proposes a hybrid approach to dimensionality reduction in text classification that addresses the curse of dimensionality. The approach combines Feature Selection (FS) and Feature Extraction (FE) methods, which consider different aspects of feature relevance, to effectively reduce the dimension of large text datasets. Combining several FS methods prevents the selection from being biased in favor of any single one. Several FS methods, namely Term Variance, Document Frequency, Information Gain, Shannon's Entropy, Mean-Median and Mean Absolute Difference, were implemented, and their performance within the hybrid system was compared. The features selected by the individual FS methods are merged using three approaches: Union, Intersection and Modified Union. The resulting feature sublists then undergo Feature Extraction by Principal Component Analysis (PCA), and the reduced feature sublists are clustered with k-means. Finally, the sentiment score of each cluster is calculated using the SentiWordNet database, which gives the polarity of the data. The experiments were conducted on the benchmark datasets Reuters-21578 and Classic4. Performance evaluation using precision, recall, F-score and accuracy shows that the proposed method performs better than competing methods.
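The merging step can be illustrated with a minimal sketch. It is not the authors' implementation: the toy document-term matrix, the helper names (`term_variance`, `document_frequency`, `top_k`) and the choice of keeping the top `k` terms per method are all assumptions made for illustration. Only two of the six FS methods are shown, and Modified Union is omitted because the abstract does not define it. The later PCA, k-means and SentiWordNet stages are not covered here.

```python
from statistics import pvariance

# Toy document-term count matrix (rows: documents, columns: terms).
# The vocabulary and counts are invented for illustration only.
terms = ["market", "stock", "trade", "wheat"]
docs = [
    [5, 0, 1, 0],
    [1, 3, 1, 0],
    [4, 0, 1, 1],
    [0, 0, 1, 1],
]

def term_variance(matrix):
    """Population variance of each term's count across documents."""
    return [pvariance(col) for col in zip(*matrix)]

def document_frequency(matrix):
    """Number of documents in which each term occurs at least once."""
    return [sum(1 for v in col if v > 0) for col in zip(*matrix)]

def top_k(scores, names, k):
    """Set of the k highest-scoring term names (hypothetical selection rule)."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return {names[i] for i in ranked[:k]}

# Each FS method selects its own feature sublist.
tv_selected = top_k(term_variance(docs), terms, k=2)
df_selected = top_k(document_frequency(docs), terms, k=2)

# Merge the sublists so that no single FS method dominates the result.
union_features = tv_selected | df_selected
intersection_features = tv_selected & df_selected

print(sorted(union_features))
print(sorted(intersection_features))
```

On this toy data the two methods disagree: Term Variance favors terms whose counts fluctuate across documents, while Document Frequency favors terms that appear in many documents. Union keeps every term either method selects, and Intersection keeps only the terms both agree on.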
