A Feature Selection Method for Classifying Highly Similar Text Documents

Jeenyoung Kim, Daiki Min

doi:10.7232/iems.2021.20.2.148

Abstract

In the era of big data, the importance of data classification is increasing. However, when it comes to classifying text documents, several obstacles degrade classification performance. These include multi-class documents, high levels of similarity between classes, class size imbalance, a high-dimensional representation space, and a low frequency of unique and discriminative features. To overcome these obstacles and improve classification performance, this paper proposes a novel feature selection method that effectively utilizes both unique and overlapping features. In general, feature selection methods have ignored unique features, which occur in only one class, because of their low frequency, even though they provide greater discriminative power. Conversely, overlapping features, which are found in several classes with high frequency, have also been less preferred because of their low discriminative power. The proposed feature selection method uses these two types of features in a complementary way, with the aim of improving overall classification performance for highly similar text documents. Extensive numerical analyses were conducted on three benchmark datasets with a support vector machine (SVM) classifier. The proposed method outperformed conventional feature selection methods, such as the global feature set, local feature set, discriminative feature set, and information gain, both on the classes with high similarity and in overall classification performance.
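
The distinction the abstract draws between unique features (terms that occur in only one class) and overlapping features (terms found in several classes) can be illustrated with a small sketch. This is not the authors' selection method, which the abstract does not specify; it only partitions a vocabulary according to the two definitions above. The function name `split_features`, the toy documents, and the class labels are all hypothetical.

```python
from collections import defaultdict

def split_features(docs, labels):
    """Partition the vocabulary into unique and overlapping terms.

    docs   -- list of token lists (hypothetical pre-tokenized documents)
    labels -- one class label per document
    """
    classes_of = defaultdict(set)  # term -> set of classes it occurs in
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            classes_of[term].add(label)

    # Unique terms occur in exactly one class; map each to that class.
    unique = {t: next(iter(c)) for t, c in classes_of.items() if len(c) == 1}
    # Overlapping terms occur in two or more classes.
    overlapping = {t for t, c in classes_of.items() if len(c) > 1}
    return unique, overlapping

# Toy data (assumed for illustration only).
docs = [
    ["engine", "wheel", "sedan"],
    ["engine", "wheel", "helmet"],
    ["engine", "wing", "runway"],
]
labels = ["car", "motorcycle", "aircraft"]

unique, overlapping = split_features(docs, labels)
print(sorted(unique.items()))
print(sorted(overlapping))
```

In this toy example "engine" and "wheel" are shared across classes, while "sedan", "helmet", "wing", and "runway" each identify a single class. How the two groups are then weighted and combined is what the paper itself addresses.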

More From: Industrial Engineering & Management Systems

Similar Papers

Hybridized Feature Set for Accurate Arabic Dark Web Pages Classification
Thabit Sabbah ... Ali Selamat
01 Jan 2015

Novel feature selection method based on harmony search for email classification
Youwei Wang ... Xiaodong Zhu
Knowledge-Based Systems | VOL. 73
23 Oct 2014

Identification of clinical factors related to prediction of alcohol use disorder from electronic health records using feature selection methods
Ali Ebrahimi ... Kjeld Andersen
BMC Medical Informatics and Decision Making | VOL. 22
23 Nov 2022

Density Based Feature Selection Method for Medical Datasets
Manonmani M ... Dr Sarojini Balakrishnan
International Journal of Innovative Technology and Exploring Engineering | VOL. 8
30 Oct 2019
