Abstract

Ph.D., Electrical Engineering – Drexel University, 2015

A growing number of machine learning applications encounter data of a scale that was almost unimaginable just a few years ago, and many current algorithms cannot handle, i.e., do not scale to, today's extremely large volumes of data. Each observation in such data is described by a large set of features, and the complexity of a predictive model tends to grow not only with the number of observations but also with the number of features. Fortunately, not all of the features carry meaningful information for prediction, so irrelevant features should be filtered from the data before a model is built. This process of removing features to produce a smaller subset is commonly referred to as feature subset selection.

In this work, we present two new filter-based feature subset selection algorithms that (i) handle potentially large and distributed data sets, and (ii) scale to very large feature sets. Our first algorithm, Neyman-Pearson Feature Selection (NPFS), uses a statistical hypothesis test derived from the Neyman-Pearson lemma to determine whether a feature is statistically relevant. The approach can be applied as a wrapper around any feature selection algorithm, regardless of the selection criterion used, to decide whether a feature belongs in the relevant set. Perhaps more importantly, the procedure efficiently determines the number of relevant features from an initial starting point, and it fits a computationally attractive MapReduce model. We also describe a sequential learning framework for feature subset selection (SLSS) that scales with both the number of features and the number of observations. SLSS uses multi-armed bandit algorithms to process features and build a measure of importance for each one; feature selection is performed independently of the optimization of any classifier, which avoids unnecessary complexity. We demonstrate the capabilities of NPFS and SLSS on synthetic and real-world data sets. In addition, we present a new classifier-dependent approach: an online learning algorithm that easily handles large numbers of missing feature values in a data stream.

Many real-world applications can benefit from scalable feature subset selection; one such area is the study of the microbiome, i.e., micro-organisms and their influence on the environments they inhabit. Feature subset selection algorithms can sift through the massive amounts of data collected in the genomic sciences and help microbial ecologists identify the micro-organisms that are the best indicators of a phenotype, such as healthy or unhealthy. In this work, we provide insights into data collected by the American Gut Project and deliver open-source software implementations for feature selection with biological data formats.
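
As a rough illustration of the NPFS idea (a minimal sketch, not the dissertation's reference implementation), the procedure can be read as: run a base feature selector on many bootstrap samples, count how often each feature is selected, and keep the features whose counts exceed a binomial critical value, i.e., a Neyman-Pearson style test against selection by chance. The base selector here (SelectKBest with an F-test) and all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import binom
from sklearn.feature_selection import SelectKBest, f_classif

def npfs_sketch(X, y, k=10, n_bootstraps=100, alpha=0.01):
    """Flag features whose selection frequency across bootstrap runs
    exceeds what chance alone would predict (binomial null)."""
    n, d = X.shape
    counts = np.zeros(d, dtype=int)
    rng = np.random.default_rng(0)
    for _ in range(n_bootstraps):
        idx = rng.integers(0, n, size=n)            # bootstrap sample
        selector = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
        counts[selector.get_support()] += 1         # tally selected features
    # Under the null, a feature is selected with probability k/d per run.
    critical = binom.ppf(1.0 - alpha, n_bootstraps, k / d)
    return np.flatnonzero(counts > critical)        # indices deemed relevant
```

Because each bootstrap run is independent, the loop body maps naturally onto parallel map tasks, with the count aggregation and threshold test serving as the reduce step; this is the computationally attractive MapReduce structure noted above.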

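In the same spirit, below is a minimal sketch of a bandit-based feature scorer consistent with the SLSS description: each feature is an arm of a UCB1 bandit, and pulling an arm measures that feature's relevance on a random mini-batch of observations. The mini-batch mutual-information reward and the UCB1 policy are assumptions for illustration; the dissertation's actual relevance measure and bandit policy may differ.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def slss_sketch(X, y, budget=2000, batch=200):
    """Score features with a UCB1 bandit: each arm is a feature,
    each pull measures that feature's relevance on a random mini-batch."""
    n, d = X.shape
    pulls = np.zeros(d)
    means = np.zeros(d)
    rng = np.random.default_rng(0)

    def reward(j):
        idx = rng.choice(n, size=min(batch, n), replace=False)
        mi = mutual_info_classif(X[idx, j:j + 1], y[idx], random_state=0)[0]
        return min(mi, 1.0)        # clip so rewards stay in [0, 1]

    for j in range(d):             # initialize: pull every arm once
        means[j], pulls[j] = reward(j), 1
    for t in range(d, budget):
        ucb = means + np.sqrt(2.0 * np.log(t + 1) / pulls)
        j = int(np.argmax(ucb))    # pull the most promising arm
        r = reward(j)
        pulls[j] += 1
        means[j] += (r - means[j]) / pulls[j]  # incremental mean update
    return means                   # per-feature importance scores
```

Each pull touches a single feature column and only a small batch of observations, so the per-step cost does not depend on the full data size; that property is what lets a sequential scorer of this kind scale in both the number of features and the number of observations.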