Abstract
Feature selection is critical to reducing the size of data and improving classifier accuracy by selecting an optimal subset of the overall features. Traditionally, each feature is given a score against a particular category (for example, using Mutual Information), and the task of feature selection comes down to choosing the top $k$ ranked features with the best average score across all categories. However, this approach has two major drawbacks. Firstly, the maximum or average score of a feature with respect to a class does not necessarily determine its discriminating strength among samples of other classes. Secondly, most feature selection methods use only the scores to select the discriminating features from the corpus, without taking into account the redundancy of information provided by the selected features. In this paper, we propose a new feature ranking measure called the Discriminative Mutual Information (DMI) score. This score helps to select features that distinguish samples of one category from all other categories. Moreover, a Non-Redundant Feature Selection (NRFS) heuristic is proposed that explicitly takes the problem of feature redundancy into account when selecting the feature set. The performance of our approach is investigated and compared with other feature selection techniques on datasets derived from high-dimensional text corpora using multiple classification algorithms. The results show that the proposed method leads to a better classification micro-F1 score than other state-of-the-art methods. In particular, the proposed method shows a marked improvement when the number of selected features is small, as well as greater overall robustness to label noise.
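To make the setup concrete, the following is a minimal sketch of the traditional pipeline the abstract contrasts against: scoring each feature by Mutual Information with the labels and keeping the top $k$, plus a generic mRMR-style greedy loop that penalizes redundancy between selected features. This is an illustration of the baseline ideas only, not the paper's DMI score or NRFS heuristic; all function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats between two
    discrete sequences of equal length (e.g. a binary term-presence
    column and the class labels)."""
    n = len(ys)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), rearranged to avoid
        # repeated division by n.
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

def top_k_features(X, y, k):
    """Traditional baseline: rank every feature by its MI score with
    the labels and return the indices of the k best."""
    n_feat = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_feat)]
    scores = [(j, mutual_information(cols[j], y)) for j in range(n_feat)]
    scores.sort(key=lambda t: -t[1])          # stable: ties keep index order
    return [j for j, _ in scores[:k]]

def greedy_select(X, y, k):
    """Generic redundancy-aware greedy selection (mRMR-style, a stand-in
    for the idea behind NRFS, not the paper's actual heuristic): at each
    step pick the unselected feature maximizing
    relevance(j) - mean MI(feature_j, already-selected features)."""
    n_feat = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_feat)]
    relevance = [mutual_information(cols[j], y) for j in range(n_feat)]
    selected = []
    while len(selected) < k:
        best_j, best_score = None, float("-inf")
        for j in range(n_feat):
            if j in selected:
                continue
            red = (sum(mutual_information(cols[j], cols[s]) for s in selected)
                   / len(selected)) if selected else 0.0
            if relevance[j] - red > best_score:
                best_j, best_score = j, relevance[j] - red
        selected.append(best_j)
    return selected
```

On a toy matrix whose second column is an exact duplicate of the first, plain top-$k$ ranking picks both copies, while the greedy loop skips the duplicate in favor of a weaker but non-redundant feature; this is precisely the redundancy problem the abstract identifies.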
Highlights
With the rapid increase in our capacity to generate data, storing and retrieving data efficiently has become increasingly difficult
We provide an empirical analysis of several popular feature selection techniques employed in text categorization
It is usually not straightforward to determine which feature selection technique is better, since different feature subsets may have different inherent characteristics and may be suited to particular tasks
Summary
With the rapid increase in our capacity to generate data, storing and retrieving data efficiently has become increasingly difficult. Only a subset of features (terms/words in this case) is helpful in discriminating between categories of documents, while the overwhelming majority are usually quite generic in nature: connecting phrases, emphasizing a verb or a noun, etc.