Abstract
Feature selection, as a pre-processing step to machine learning, is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. There are two main approaches to feature selection: wrapper methods, in which features are selected using the supervised learning algorithm itself, and filter methods, in which the selection of features is independent of any learning algorithm. However, most of these techniques rely on feature-scoring algorithms that make basic assumptions about the distribution of the data, such as normality, a balanced distribution of classes, or a dense (non-sparse) dataset. Data generated in the real world rarely satisfy such strict criteria. In some domains, such as digital advertising, the generated data matrix is in fact very sparse and follows no distinct distribution. We therefore propose a new approach to feature selection for datasets that do not satisfy the above assumptions; our methodology also addresses the problem of skewness in the data. The efficiency and effectiveness of our methods are demonstrated by comparison with well-known statistical techniques such as ANOVA, mutual information, KL divergence, Fisher score, Bayes' error, and chi-square. The dataset used for validation is a real-world user browsing-history dataset used for ad-campaign targeting; it is both very high-dimensional and highly sparse. Our approach reduces the number of features to a significant degree without compromising the accuracy of the final predictions.
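To make the filter-method baseline concrete, below is a minimal sketch of per-feature chi-square scoring for a binary feature matrix with binary labels, the kind of sparse 0/1 data described in the abstract. This is an illustration of the standard chi-square filter score, not the paper's proposed method; the function name and the example data are hypothetical.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-square score of each binary feature against a binary label.

    X: (n_samples, n_features) 0/1 matrix; y: (n_samples,) 0/1 labels.
    Higher score indicates stronger dependence between feature and label.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    # 2x2 contingency table per feature:
    a = X[y == 1].sum(axis=0)   # feature present, positive class
    b = X[y == 0].sum(axis=0)   # feature present, negative class
    n_pos = (y == 1).sum()
    c = n_pos - a               # feature absent, positive class
    d = (n - n_pos) - b         # feature absent, negative class
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    # Features with a degenerate table (constant feature or label) score 0.
    return np.where(den > 0, num / np.maximum(den, 1), 0.0)

# Example: feature 0 tracks the label perfectly, feature 1 is independent.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
scores = chi2_scores(X, y)  # → [4.0, 0.0]
top_k = np.argsort(scores)[::-1][:1]  # indices of the best-scoring features
```

Ranking features by such a score and keeping the top k is the generic filter-method recipe the abstract compares against.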
Highlights
High-dimensional data, in terms of number of features, is increasingly common these days in different domains
Almost all the algorithms used for feature selection or for introducing separability in the data make a few basic assumptions [2] about the data characteristics, such as: the dataset follows a normal distribution; the dataset does not have a very high class imbalance (i.e., if the dataset is labeled, the classes occur in roughly equal proportions); and the dataset is non-sparse (i.e., most columns of every row hold a valid value)
We propose new techniques that use common statistical methods to quantify the separability of features, making feature selection and dimensionality reduction easier