Abstract

Feature selection is a common dimensionality reduction technique of fundamental importance in big data. A common approach for reducing the running time of feature selection is to perform it in two stages. In the first stage, a fast and simple filter is applied to select good candidates. In the second stage, the number of candidates is further reduced by an accurate algorithm that may run significantly slower. There are two main variants of feature selection: unsupervised and supervised. In the supervised variant, features are selected for predicting labels, while the unsupervised variant does not use labels at all. We describe a general framework that can use an arbitrary off-the-shelf unsupervised algorithm for the second stage. The algorithm is applied to the columns selected in the first stage, after they have been weighted appropriately. Our main technical result is a method for calculating weights for the columns that need to be selected in the second stage. We show that these weights can be computed as the solution to a constrained quadratic optimization problem. The solution is deterministic, and improves on previously published studies that use probabilistic ideas to compute similar weights. To the best of our knowledge, our approach is the first technique for converting a supervised feature selection problem into an unsupervised problem. Complexity analysis shows that the proposed technique is very fast, can be implemented in a single pass over the data, and can take advantage of data sparsity. Experimental results show that the accuracy of the proposed method is comparable to that of much slower techniques.
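
The two-stage structure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the method of the paper: the first-stage filter ranks columns by absolute correlation with the labels, the second stage is a simple greedy unsupervised column selection, and the weights are a placeholder (the filter scores) standing in for the weights that the paper obtains from a constrained quadratic optimization problem, which the abstract does not specify. The function names `filter_stage`, `greedy_unsupervised_select`, and `two_stage_select` are hypothetical.

```python
import numpy as np


def filter_stage(X, y, k1):
    """Stage 1 (assumed filter): keep the k1 columns whose absolute
    correlation-style score with the labels y is largest."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for constant columns
    scores = np.abs(Xc.T @ yc) / norms
    idx = np.argsort(scores)[::-1][:k1]
    return idx, scores[idx]


def greedy_unsupervised_select(A, k, tol=1e-10):
    """Stage 2 stand-in: a label-free greedy column selection.
    Repeatedly pick the column with the largest residual norm,
    then project that direction out of all columns."""
    R = np.array(A, dtype=float)
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)
        j = int(np.argmax(norms))
        if norms[j] <= tol:  # nothing left to explain
            break
        chosen.append(j)
        q = R[:, j] / norms[j]
        R -= np.outer(q, q @ R)
    return chosen


def two_stage_select(X, y, k1, k):
    """Filter to k1 candidates, weight them, then run the
    unsupervised selector on the weighted candidates."""
    cand, scores = filter_stage(X, y, k1)
    Xc = X[:, cand] - X[:, cand].mean(axis=0)
    # Placeholder weights: the filter scores. The paper instead derives
    # weights from a constrained quadratic optimization problem.
    weighted = Xc * scores
    picked = greedy_unsupervised_select(weighted, k)
    return [int(cand[j]) for j in picked]
```

The point of the sketch is only the shape of the pipeline: labels are used to choose candidates and to compute column weights, after which the second-stage selector sees nothing but a weighted matrix and can be swapped for any off-the-shelf unsupervised algorithm.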
