Abstract
Feature selection, as a pre-processing step to machine learning, is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. There are two main approaches to feature selection: wrapper methods, in which features are selected using the supervised learning algorithm itself, and filter methods, in which the selection of features is independent of any learning algorithm. However, most of these techniques rely on feature-scoring algorithms that make basic assumptions about the distribution of the data, such as normality, a balanced distribution of classes, or a dense (non-sparse) dataset. Data generated in the real world rarely satisfy such strict criteria. In some domains, such as digital advertising, the generated data matrix is in fact very sparse and follows no distinct distribution. We therefore propose a new approach to feature selection for datasets that do not satisfy the above assumptions; our methodology also addresses the problem of skewness in the data. The efficiency and effectiveness of our methods are demonstrated by comparison with well-known statistical techniques such as ANOVA, mutual information, KL divergence, Fisher score, Bayes' error, and chi-square. The dataset used for validation is a real-world user browsing-history dataset used for ad-campaign targeting; it is both very high-dimensional and highly sparse. Our approach reduces the number of features to a significant degree without compromising the accuracy of the final predictions.
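To make the filter-method baseline concrete, below is a minimal sketch of per-feature chi-square scoring for a binary feature matrix with binary labels, the kind of sparse 0/1 data described in the abstract. This is an illustration of the standard chi-square filter score, not the paper's proposed method; the function name and the example data are hypothetical.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-square score of each binary feature against a binary label.

    X: (n_samples, n_features) 0/1 matrix; y: (n_samples,) 0/1 labels.
    Higher score indicates stronger dependence between feature and label.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    # 2x2 contingency table per feature:
    a = X[y == 1].sum(axis=0)   # feature present, positive class
    b = X[y == 0].sum(axis=0)   # feature present, negative class
    n_pos = (y == 1).sum()
    c = n_pos - a               # feature absent, positive class
    d = (n - n_pos) - b         # feature absent, negative class
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    # Features with a degenerate table (constant feature or label) score 0.
    return np.where(den > 0, num / np.maximum(den, 1), 0.0)

# Example: feature 0 tracks the label perfectly, feature 1 is independent.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
scores = chi2_scores(X, y)  # → [4.0, 0.0]
top_k = np.argsort(scores)[::-1][:1]  # indices of the best-scoring features
```

Ranking features by such a score and keeping the top k is the generic filter-method recipe the abstract compares against.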
Highlights
High-dimensional data, in terms of number of features, is increasingly common these days in different domains
Almost all the algorithms used for feature selection or for introducing separability in the data make a few basic assumptions [2] about the data characteristics, such as: the dataset follows a normal distribution; the dataset does not have a very high class imbalance (i.e., if the dataset is labeled, the classes occur in roughly equal proportions); and the dataset is non-sparse (i.e., most columns of every row hold a valid value)
We propose new techniques that use common statistical methods to quantify the separability of features, making feature selection and dimensionality reduction easier