Abstract

Due to the extensive use of high-dimensional data and its application in a wide range of scientific fields of research, dimensionality reduction has become a major part of the preprocessing step in machine learning. Feature selection is one procedure for reducing dimensionality. In this process, instead of using the whole set of features, a subset is selected to be used in the learning model. Feature selection (FS) methods are divided into three main categories: filters, wrappers, and embedded approaches. Filter methods depend only on the characteristics of the data and do not rely on the learning model at hand. Divergence functions, as measures of the difference between probability distributions, can be used as filter methods for feature selection. In this paper, the performance of several divergence functions, such as Jensen-Shannon (JS) divergence and Exponential divergence (EXP), is compared with that of some of the best-known filter feature selection methods, such as Information Gain (IG) and Chi-Squared (CHI). The comparison is based on the accuracy rate and F1-score of classification models trained after applying these feature selection methods.
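As a concrete illustration of the idea described above, the sketch below scores a single discretized feature by the Jensen-Shannon divergence between its class-conditional distributions, so that a larger score suggests a more discriminative feature. This is only a minimal sketch under the assumptions of a binary target and pre-binned features; the function names (js_divergence, js_filter_score) are illustrative and not taken from the paper.

```python
# Minimal sketch: Jensen-Shannon divergence as a filter score for one feature.
# Assumes x_binned and y are NumPy arrays, the feature is already discretized,
# and the target is binary (0/1). Names here are illustrative only.
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_filter_score(x_binned, y):
    """Score a discretized feature by the JS divergence between its
    class-conditional distributions; larger means more discriminative."""
    bins = np.unique(x_binned)
    c0 = np.array([(x_binned[y == 0] == b).sum() for b in bins])
    c1 = np.array([(x_binned[y == 1] == b).sum() for b in bins])
    return js_divergence(c0, c1)

# Usage sketch: rank the columns of an already-binned matrix X against labels y.
# scores = [js_filter_score(X[:, j], y) for j in range(X.shape[1])]
```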

Highlights

  • In recent years, handling high-dimensional data has become a major part of machine learning and statistics, including classification problems

  • We review several filter feature selection (FS) methods based on divergence functions, alongside Information Gain (IG) and Chi-Squared (CHI)

  • Based on Figures 1-7 and Figure 8, it can be inferred that although the selected features vary across FS methods, the number of selected features is similar most of the time


Summary

Introduction

In recent years, dealing with high-dimensional data has become a major part of machine learning and statistics, including classification problems. There are multiple ways to perform FS, but in general the procedure falls into three main categories [11]: filters, wrappers, and embedded methods. It can be seen from Eq. (1) that if the joint probability distribution p(x, y) and the product of the marginal distributions of X and Y, i.e. p(x)p(y), are close to each other, IG(Y, X) approaches zero. This means that we gain little information about Y from observing X. Similarly, in Eq. (3), if p(x_i, y_j) and p(x_i)p(y_j) are close to each other for all i and j, CHI(X, Y) tends to zero. This method is based on another divergence function, called Kagan's divergence [26], which can be formulated analogously.
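To make this relationship concrete, the sketch below computes both scores from the contingency table of a discrete feature X and label Y, showing that IG (mutual information) and the chi-squared statistic both compare the joint distribution p(x, y) with the product of marginals p(x)p(y) and both approach zero when the two are close. It is a minimal illustration of the independence argument above, not the paper's implementation; ig_and_chi is an assumed helper name.

```python
# Sketch: IG (mutual information) and chi-squared from a contingency table.
# Assumes counts[i, j] = number of samples with X = x_i and Y = y_j, and that
# no row or column of the table is entirely zero.
import numpy as np

def ig_and_chi(counts):
    n = counts.sum()
    p_xy = counts / n                      # joint p(x_i, y_j)
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x_i)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y_j)
    prod = p_x * p_y                       # product of marginals p(x_i)p(y_j)
    nz = p_xy > 0
    ig = np.sum(p_xy[nz] * np.log(p_xy[nz] / prod[nz]))  # mutual information
    chi = n * np.sum((p_xy - prod) ** 2 / prod)           # chi-squared statistic
    return ig, chi

# When X and Y are (nearly) independent, p_xy ≈ prod and both scores vanish:
# ig_and_chi(np.array([[25, 25], [25, 25]]))  ->  (0.0, 0.0)
```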

Related works
Divergence functions
Bregman’s divergences
Experimental Design
Average number of selected features
F1-Score
Conclusion