Two new feature selection metrics for text classification

Durmuş Özkan Şahin, Erdal Kılıç

doi:10.1080/00051144.2019.1602293

Abstract

Extracting meaningful information from data has become a central problem, and data mining techniques have gained importance as a result. Text classification is one of the most commonly studied areas of data mining. Its main difficulty is that large data sizes increase the time required and reduce classification success. The main purpose of this study is to determine suitable feature selection methods for text classification. Frequently used feature selection metrics, such as Chi-square and Information Gain, were applied to different data sets and their performance was measured. In addition, two filter-based feature selection metrics are proposed as alternatives to the existing ones. The first is the Relevance Frequency Feature Selection metric, obtained by adding new parameters to the Relevance Frequency method that is used for term weighting in text classification. The second is the Alternative Accuracy2 metric, obtained by changing the parameters of the Accuracy2 metric. The proposed Relevance Frequency Feature Selection and Alternative Accuracy2 metrics were observed to produce results as successful as those of the commonly used existing metrics.

Highlights

The internet becomes more widespread every day, and smartphone and tablet use is increasing along with it.
The first proposed metric is the Relevance Frequency Feature Selection metric, obtained by adding new parameters to the Relevance Frequency method that is used for term weighting in text classification.
All metrics except Relevance Frequency (RF) generated a potential feature related to the acq category.

Summary

Introduction

The internet becomes more widespread every day, and smartphone and tablet use is increasing along with it. This growth increases the amount of data created and stored in text form, such as e-books, emails, and Facebook and Twitter posts. The most important of the studies in this area is the rule-based expert text classification system developed by Carnegie Group over the Reuters data set [4]. As hardware components such as memory and CPUs have become more advanced and cheaper, machine-learning algorithms have come into wider use and have been applied to text classification problems. The main problem in text classification is the excessive size of the data. It is therefore important to select terms with high discriminative power rather than using all terms.
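To make the idea of ranking terms by discriminative power concrete, the following is a minimal Python sketch of two filter metrics named in the abstract, Chi-square and Information Gain, computed from a 2x2 term/category contingency table. It uses the standard textbook definitions rather than code from the paper. The function names, the `(tp, fp, fn, tn)` argument order, and the example terms and counts are assumptions made for illustration only.

```python
import math


def chi_square(tp, fp, fn, tn):
    """Chi-square score of a term for one category.

    tp: positive-class documents containing the term
    fp: negative-class documents containing the term
    fn: positive-class documents without the term
    tn: negative-class documents without the term
    """
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    if denom == 0:
        return 0.0  # degenerate table: term or class never (or always) occurs
    return n * (tp * tn - fp * fn) ** 2 / denom


def entropy(*counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(tp, fp, fn, tn):
    """Reduction in class entropy obtained by knowing whether the term occurs."""
    n = tp + fp + fn + tn
    if n == 0:
        return 0.0
    prior = entropy(tp + fn, fp + tn)
    with_term = (tp + fp) / n * entropy(tp, fp)
    without_term = (fn + tn) / n * entropy(fn, tn)
    return prior - with_term - without_term


def select_top_k(term_counts, score, k):
    """Return the k terms with the highest score (filter-style selection)."""
    ranked = sorted(term_counts, key=lambda t: score(*term_counts[t]), reverse=True)
    return ranked[:k]


# Hypothetical counts, for illustration only: (tp, fp, fn, tn) per term.
example_counts = {
    "merger": (10, 0, 0, 10),
    "the": (5, 5, 5, 5),
    "shares": (8, 2, 2, 8),
}
top_terms = select_top_k(example_counts, chi_square, 2)
```

In this toy example a term that appears equally often in both classes scores zero under both metrics, while a term that appears only in the positive class receives the highest score.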

Contribution and motivation

Organization

Related works

Document frequency thresholding metric

Chi-Squared metric

Information gain

Acc and Acc2 metrics
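As a rough illustration, Acc and Acc2 are commonly defined in the feature selection literature (for example in Forman's empirical study) as the difference between true and false positive counts, and the absolute difference between true and false positive rates, respectively. The Python sketch below follows those common definitions. The function names and example counts are assumptions, and the paper's own notation and its Alternative Accuracy2 variant are not reproduced here.

```python
def acc(tp, fp):
    """Acc: positive-class documents containing the term minus
    negative-class documents containing the term (raw counts)."""
    return tp - fp


def acc2(tp, fp, fn, tn):
    """Acc2: absolute difference between the true positive rate and the
    false positive rate, which compensates for class-size imbalance."""
    pos = tp + fn  # total positive-class documents
    neg = fp + tn  # total negative-class documents
    tpr = tp / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    return abs(tpr - fpr)


# Hypothetical skewed data: 10 positive and 200 negative documents.
# The term occurs in 8 positive and 20 negative documents.
raw_difference = acc(8, 20)
rate_difference = acc2(8, 20, 2, 180)
```

With these illustrative counts the raw difference is negative because the negative class is much larger, whereas the rate-based score still shows that the term occurs far more often, proportionally, in the positive class.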

Proposed metrics

Used data sets

Reuters data set

Experimental settings

Experimental results

Comparison of features obtained via metrics

Classification successes of metrics

Conclusion and future works

Journal: Automatika
Publication Date: Apr 3, 2019
Citations: 25
License type: open-access
