Feature Selection in Text Clustering Applications of Literary Texts: A Hybrid of Term Weighting Methods

Abdulfattah Omar

doi:10.14569/ijacsa.2020.0110214

Abstract

Recent years have witnessed an increasing use of automated text clustering approaches, and more particularly Vector Space Clustering (VSC) methods, in the computational analysis of literary data, including genre classification, theme analysis, stylometry, and authorship attribution. Despite the effectiveness of VSC methods in resolving different problems in these disciplines and providing evidence-based research findings, feature selection remains a challenging problem. For reliable text clustering applications, a clustering structure should be based on all and only the most distinctive features within a corpus. Although different term weighting approaches have been developed, identifying the most distinctive variables within a corpus remains difficult, especially in document clustering applications of literary texts. To address this, the study proposes a hybrid of statistical measures, namely variance analysis, term frequency-inverse document frequency (TF-IDF), and Principal Component Analysis (PCA), for selecting all and only the most distinctive features, with the aim of generating more reliable document clusterings for authorship attribution tasks. The study is based on a corpus of 74 novels written by 18 novelists representing different literary traditions. Results indicate that the proposed model was effective in extracting the most distinctive features within the datasets and thus in generating reliable clustering structures that can be used in different computational applications of literary texts.

Highlights

With the increasing access to e-texts and the availability and power of computational tools, there has been a growing body of humanities computing literature on text analysis and interpretation.
This study proposes applying a hybrid of statistical measures, namely variance analysis, term frequency-inverse document frequency (TF-IDF), and Principal Component Analysis (PCA), successively to select all and only the most distinctive features, with the aim of generating more reliable document clusterings for authorship attribution tasks.
This study addresses this gap in the literature by proposing a model that combines three statistical methods, namely variance, TF-IDF, and PCA.

Summary

INTRODUCTION

With the increasing access to e-texts and the availability and power of computational tools, there has been a growing body of humanities computing literature on text analysis and interpretation. Studies of this kind are generally classified under the broad heading of computer-assisted text analysis (CATA). For reliable text clustering applications, a clustering structure should be based on all and only the most distinctive features within a corpus. To this end, the study proposes applying a hybrid of statistical measures, namely variance analysis, term frequency-inverse document frequency (TF-IDF), and Principal Component Analysis (PCA), successively to select all and only the most distinctive features, with the aim of generating more reliable document clusterings for authorship attribution tasks. The study is based on a corpus of 74 novels written by 18 novelists representing different literary traditions.
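The successive use of variance, TF-IDF, and PCA can be illustrated with a small sketch. This is not the author's implementation: the page gives no thresholds or formulas, so the sketch assumes whitespace tokenisation, a top-k cut-off at each stage, a summed tf * log(N / df) score per term, and ranking by absolute loading on the first principal component (computed here by power iteration). All function names, the `k_*` parameters, and the four-document toy corpus are hypothetical.

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, matrix); matrix[d][t] is the count of vocab[t] in docs[d]."""
    token_lists = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in token_lists for tok in toks})
    matrix = []
    for toks in token_lists:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

def variances(matrix):
    """Population variance of each term (column) across documents."""
    n = len(matrix)
    result = []
    for col in zip(*matrix):
        mean = sum(col) / n
        result.append(sum((x - mean) ** 2 for x in col) / n)
    return result

def tfidf_scores(matrix):
    """Assumed scoring: total term frequency times log(N / document frequency)."""
    n = len(matrix)
    scores = []
    for col in zip(*matrix):
        df = sum(1 for x in col if x > 0)
        idf = math.log(n / df) if df else 0.0
        scores.append(sum(col) * idf)
    return scores

def first_pc_loadings(matrix, iterations=200):
    """Absolute loadings on the first principal component (power iteration)."""
    n, m = len(matrix), len(matrix[0])
    means = [sum(col) / n for col in zip(*matrix)]
    centred = [[x - mu for x, mu in zip(row, means)] for row in matrix]
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        # Apply X^T X to v without building the covariance matrix explicitly.
        proj = [sum(a * b for a, b in zip(row, v)) for row in centred]
        w = [sum(proj[i] * centred[i][j] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [abs(x) for x in v]

def keep_top(indices, scores, k):
    """Keep the k indices with the highest scores; ties go to the lower index."""
    ranked = sorted(indices, key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])

def select_features(docs, k_variance, k_tfidf, k_pca):
    """Filter terms by variance, then TF-IDF, then first-component loading."""
    vocab, matrix = term_document_matrix(docs)
    idx = keep_top(range(len(vocab)), variances(matrix), k_variance)
    idx = keep_top(idx, tfidf_scores(matrix), k_tfidf)
    reduced = [[row[i] for i in idx] for row in matrix]
    local = keep_top(range(len(idx)), first_pc_loadings(reduced), k_pca)
    return [vocab[idx[j]] for j in local]

# Toy corpus (hypothetical): two "sea" documents and two "society" documents.
docs = [
    "the whale and the sea whale ship",
    "the whale and the sea whale harpoon",
    "the ball and the estate ball marriage",
    "the ball and the estate ball letter",
]
selected = select_features(docs, k_variance=8, k_tfidf=6, k_pca=3)
print(selected)
```

In this toy corpus, "the" and "and" occur equally often in every document, so their variance is zero and the first stage drops them. The later stages then favour the terms that separate the two groups of documents. A real application would need principled thresholds in place of the arbitrary `k_*` values used here.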

LITERATURE REVIEW

Methods

Procedures

ANALYSIS

CONCLUSION

Journal: International Journal of Advanced Computer Science and Applications	Publication Date: Jan 1, 2020
Citations: 4	License type: cc-by
