A Hybrid Dimension Reduction Technique for Document Clustering

Cynthia Marea Nebu, Sumy Joseph

doi:10.1007/978-3-319-28031-8_35

Abstract

The paper proposes a hybrid approach to dimension reduction in text classification problems, aimed at overcoming the curse of dimensionality. The approach combines Feature Selection (FS) and Feature Extraction (FE) methods that consider different aspects of feature relevance, so as to reduce the dimension of large text datasets effectively. It prevents the selected feature set from being biased in favor of any single FS method. Several FS methods, including Term Variance, Document Frequency, Information Gain, Shannon's entropy measure, Mean-Median and Mean Absolute Difference, were implemented, and their performance within the hybrid system was compared. The features selected by the individual FS methods are merged using three approaches: Union, Intersection and Modified Union. The merged feature sublists then undergo Feature Extraction by PCA, and the reduced feature sublist is clustered with k-means. Finally, the sentiment score of each cluster is calculated using the SentiWordNet database, which gives the polarity of the data. The experiments were conducted on the benchmark datasets Reuters-21578 and Classic4. Performance evaluation using precision, recall, F-score and accuracy shows that the proposed method performs better than competing methods.
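
To make the merging step in the abstract concrete, the following is a minimal Python sketch rather than the authors' implementation. It scores terms with two of the named FS methods (Term Variance and Document Frequency), keeps the top-ranked terms from each, and merges the two selections by Union and by Intersection. The toy document-term matrix, the term names, the helper names, the "higher score is better" ranking and the cutoff k=3 are all assumptions made for illustration. Modified Union is not sketched because the abstract does not define it.

```python
from statistics import pvariance

# Toy document-term matrix (rows = documents, columns = terms).
# The terms and counts are invented for illustration only.
terms = ["market", "oil", "the", "price", "bank"]
docs = [
    [2, 0, 5, 1, 0],
    [0, 3, 5, 2, 0],
    [1, 0, 5, 0, 4],
    [0, 0, 5, 3, 0],
]

def term_variance(matrix):
    """Population variance of each term's count across documents."""
    return [pvariance(col) for col in zip(*matrix)]

def document_frequency(matrix):
    """Number of documents in which each term occurs at least once."""
    return [sum(1 for v in col if v > 0) for col in zip(*matrix)]

def top_k(scores, names, k):
    """Names of the k highest-scoring terms (ties keep original order)."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return {names[i] for i in ranked[:k]}

# Each FS method selects its own feature sublist (k=3 is an assumed cutoff).
tv_selected = top_k(term_variance(docs), terms, k=3)
df_selected = top_k(document_frequency(docs), terms, k=3)

# Merge the sublists so no single FS method dominates the final selection.
union_features = tv_selected | df_selected
intersection_features = tv_selected & df_selected

print("Term Variance:     ", sorted(tv_selected))
print("Document Frequency:", sorted(df_selected))
print("Union:             ", sorted(union_features))
print("Intersection:      ", sorted(intersection_features))
```

On this toy data the two methods disagree: Term Variance favors terms concentrated in a few documents, while Document Frequency favors terms that appear in many. Union keeps everything either method selected, and Intersection keeps only the terms both agree on. In the pipeline described in the abstract, the merged sublist would then be passed to PCA and k-means.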

Similar Papers

Research on Feature Selection and kNN Classification Method in Chinese Text Classification
Chao Xiao ... Ping Wu
01 Jan 2015

A comparative study on feature selection in Chinese Spam Filtering
Yan Xu
01 Oct 2012

A Hybrid Feature Method for Handling Redundant Features in a Sentinel-2 Multidate Image for Mapping Parthenium Weed
Zolo Kiala ... John Odindi
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing | VOL. 13
01 Jan 2020

A novel ensemble feature selection method through Type I fuzzy
Nazanin Zahra Joodaki ... Mohammad Bagher Dowlatshahi
02 Mar 2022
