Abstract

Feature selection, as a pre-processing step to machine learning, is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. There are two main approaches to feature selection: wrapper methods, in which features are selected using the supervised learning algorithm itself, and filter methods, in which feature selection is independent of any learning algorithm. However, most of these techniques rely on feature-scoring algorithms that make basic assumptions about the data, such as normality, a balanced distribution of classes, or a non-sparse (dense) dataset. Data generated in the real world rarely satisfy such strict criteria. In some domains, such as digital advertising, the generated data matrix is extremely sparse and follows no distinct distribution. We therefore propose a new approach to feature selection for datasets that do not satisfy the above assumptions. Our methodology also addresses the problem of skewness in the data. The efficiency and effectiveness of our methods are demonstrated by comparison with well-known statistical techniques such as ANOVA, mutual information, KL divergence, the Fisher score, Bayes error, and the chi-square test. The dataset used for validation is a real-world user-browsing-history dataset used for ad-campaign targeting; it is both very high-dimensional and highly sparse. Our approach reduces the number of features substantially without compromising the accuracy of the final predictions.
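
For reference, the snippet below is a minimal sketch, assuming scikit-learn and a synthetic sparse matrix, of how baseline filter methods mentioned above (chi-square and mutual information) score each feature independently of any learner. It illustrates the kind of baselines we compare against, not the proposed method itself; all shapes, densities, and the choice of k are illustrative assumptions.

```python
# Minimal sketch (not the paper's method): scoring features of a synthetic,
# highly sparse user-browsing matrix with two baseline filter methods,
# chi-square and mutual information. Shapes and densities are illustrative.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

rng = np.random.default_rng(0)

# 5,000 users x 2,000 browsing features, ~0.5% non-zero cells
X = sparse_random(5_000, 2_000, density=0.005, format="csr", random_state=0)
X.data[:] = 1.0                     # binarise: page visited / not visited
y = rng.integers(0, 2, size=5_000)  # synthetic click / no-click labels

# Filter methods score each feature independently of any learner,
# then keep the k highest-scoring columns.
chi2_selector = SelectKBest(chi2, k=200).fit(X, y)
mi_selector = SelectKBest(mutual_info_classif, k=200).fit(X, y)

print(chi2_selector.transform(X).shape)  # (5000, 200)
print(mi_selector.transform(X).shape)    # (5000, 200)
```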

Highlights

  • High-dimensional data, in terms of the number of features, is increasingly common across many domains

  • Almost all algorithms used for feature selection, or for introducing separability into the data, make a few basic assumptions [2] about the data characteristics: the dataset follows a normal distribution; the dataset does not have a very high class imbalance, i.e., if the dataset is labeled, the different classes occur in roughly equal proportions; and the dataset is non-sparse, i.e., the majority of the columns in every row hold a valid value

  • We propose new techniques that use common statistical methods to quantify the separability of features, making feature selection and dimensionality reduction easier

Introduction

High-dimensional data, in terms of the number of features, is increasingly common across many domains. Almost all algorithms used for feature selection, or for introducing separability into the data, make a few basic assumptions [2] about the data characteristics: the dataset follows a normal distribution; the dataset does not have a very high class imbalance, i.e., if the dataset is labeled, the different classes occur in roughly equal proportions; and the dataset is non-sparse, i.e., the majority of the columns in every row hold a valid value.
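
To make these assumptions concrete, the sketch below (illustrative only, not taken from the paper) computes three simple diagnostics for a labelled data matrix X with labels y: the fraction of features that pass a normality test, the class-balance ratio, and the overall sparsity. The helper name and the 0.05 significance threshold are assumptions for illustration.

```python
# Illustrative sketch: quick diagnostics for the three assumptions above.
# `dataset_diagnostics` is a hypothetical helper, not part of the paper.
import numpy as np
from scipy import sparse, stats

def dataset_diagnostics(X, y, n_rows=1000):
    # 1. Normality: D'Agostino-Pearson test per feature on a row sample
    #    (near-constant columns may give NaN p-values, counted as non-normal).
    sample = X[:n_rows].toarray() if sparse.issparse(X) else np.asarray(X[:n_rows])
    _, pvals = stats.normaltest(sample, axis=0)
    frac_normal = float(np.mean(pvals > 0.05))

    # 2. Class balance: ratio of the rarest class to the most common class.
    counts = np.bincount(np.asarray(y).astype(int))
    balance = counts[counts > 0].min() / counts.max()

    # 3. Sparsity: fraction of cells with no valid (non-zero) value.
    nnz = X.nnz if sparse.issparse(X) else np.count_nonzero(X)
    sparsity = 1.0 - nnz / (X.shape[0] * X.shape[1])

    return {"frac_features_normal": frac_normal,
            "class_balance_ratio": float(balance),
            "sparsity": float(sparsity)}
```

On a user-browsing matrix like the one described in the abstract, such a check would typically report a sparsity close to 1, a strongly skewed class-balance ratio, and very few normally distributed features, which is precisely the regime in which the standard scoring methods' assumptions break down.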
