A Feature Subset Selection Technique for High Dimensional Data Using Symmetric Uncertainty

Bharat Singh, Om Prakash Vyas, Nidhi Kushwaha

doi:10.4236/jdaip.2014.24012

Abstract

With the abundance of exceptionally High Dimensional data, feature selection has become an essential element of the Data Mining process. In this paper, we investigate the problem of efficient feature selection for classification on High Dimensional datasets. We present a novel filter-based approach to feature selection that sorts the features by a score, and we then measure the performance of four different Data Mining classification algorithms on the resulting data. In the proposed approach, we partition the sorted feature list and search for important features in both the forward and the reverse direction, starting simultaneously from the first and the last feature in the sorted list. The proposed approach is highly scalable and effective because it parallelizes over both attributes and tuples simultaneously, allowing us to evaluate many potential features for High Dimensional datasets. Experiments on real and synthetic High Dimensional datasets show that the proposed framework improves the precision of the selected features. We also compare its classification accuracy against that of various other feature selection methods.

Highlights

Data Mining is a multidisciplinary task that aims to uncover hidden nuggets of information in data
We propose an algorithm for feature subset selection on High Dimensional datasets
We use a correlation-based feature ranking measure, symmetric uncertainty (SU), which forms the basis of our approach
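
To make the ranking step concrete, the sketch below computes symmetric uncertainty for discrete features using its standard definition, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), where H is Shannon entropy and I is mutual information, and then sorts features by their SU score against the class label. It illustrates only the scoring and sorting step, not the authors' full partitioned forward and reverse search. The function names (`entropy`, `symmetric_uncertainty`, `rank_features`) and the dictionary-of-columns input are assumptions made for this example, and continuous features are assumed to have been discretized beforehand.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); ranges from 0 to 1."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        # Both variables are constant, so there is no information to share.
        return 0.0
    # I(X; Y) = H(X) + H(Y) - H(X, Y), with H(X, Y) taken over value pairs.
    h_xy = entropy(list(zip(x, y)))
    mutual_info = h_x + h_y - h_xy
    return 2.0 * mutual_info / (h_x + h_y)


def rank_features(columns, labels):
    """Return (name, SU) pairs sorted from most to least relevant.

    `columns` is a hypothetical mapping from feature name to its list of values.
    """
    scores = {name: symmetric_uncertainty(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    columns = {
        "copies_label": [0, 0, 1, 1],   # fully determines the class
        "independent": [0, 1, 0, 1],    # carries no class information
    }
    for name, score in rank_features(columns, labels):
        print(f"{name}: {score:.3f}")
```

A sorted list of this kind is the input that a subsequent search stage would traverse from both ends; normalizing by H(X) + H(Y) keeps the score from favoring features with many distinct values, which is the usual reason for preferring SU over raw mutual information.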

Summary

Introduction

Data Mining is a multidisciplinary task that aims to uncover hidden nuggets of information in data. Feature selection has been an active field of research and development since the 1970s and spans several disciplines, including statistical pattern recognition [2] [3], machine learning [4]-[7], and Data Mining [8]-[10]. It is extensively applied in fields such as text categorization [11] [12], image retrieval [13] [14], genomics analysis [7] [15] [16], and CRM [17]. Feature selection is decisive in biological applications, e.g. DNA microarrays, genomics, proteomics, and mass spectrometry. These applications are generally characterized by high dimensionality; the goal is to find a small output set of highly uncorrelated variables on which biomedical and Data Mining experts will subsequently invest considerably less time and research effort.

Literature Review and Background

Mutual Information

Symmetric Uncertainty

Relevant Feature and F-Correlation

A Correlation Based Feature Subset Selection Algorithm

Proposed Framework and Algorithm

Computational Complexity of Proposed Approach

Experimental Result and Discussion

Conclusion


Journal: Journal of Data Analysis and Information Processing	Publication Date: Jan 1, 2014
Citations: 74	License type: CC BY 4.0


