Sparse generalized linear model with L0 approximation for feature selection and prediction with big omics data

Zhenqiu Liu, Fengzhu Sun, Dermot P McGovern

doi:10.1186/s13040-017-0159-z


Open Access

https://doi.org/10.1186/s13040-017-0159-z


Abstract

Background: Feature selection and prediction are the most important tasks for big data mining. The common strategies for feature selection in big data mining are L1, SCAD and MC+. However, none of the existing algorithms optimizes L0, which penalizes the number of nonzero features directly.

Results: In this paper, we develop a novel sparse generalized linear model (GLM) with L0 approximation for feature selection and prediction with big omics data. The proposed approach approximates the L0 optimization directly. Even though the original L0 problem is non-convex, the proposed algorithm approximates it by a sequence of convex optimizations. The proposed method is easy to implement with only several lines of code. Novel adaptive ridge algorithms (L0ADRIDGE) for L0 penalized GLM with ultra-high-dimensional big data are developed. The proposed approach outperforms other cutting-edge regularization methods, including SCAD and MC+, in simulations. When it is applied to integrated analysis of mRNA, microRNA, and methylation data from TCGA ovarian cancer, multilevel gene signatures associated with suboptimal debulking are identified simultaneously. The biological significance and potential clinical importance of those genes are further explored.

Conclusions: The developed software L0ADRIDGE in MATLAB is available at https://github.com/liuzqx/L0adridge.
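To make the idea of "sequential convex optimizations" concrete, the following is a minimal Python sketch of a generic adaptive ridge iteration for the Gaussian (linear regression) case. It is not the authors' MATLAB code: the function name `l0_adaptive_ridge`, the parameters `n_iter`, `tol` and `thresh`, the all-ones starting weights, and the objective scaling 0.5·||y − Xβ||² + (λ/2)·||β||₀ are all assumptions made for illustration. Each pass solves a weighted ridge problem whose weights are the squared coefficients from the previous pass, so that β_j²/η_j approaches the indicator of a nonzero β_j.

```python
import numpy as np


def l0_adaptive_ridge(X, y, lam, n_iter=100, tol=1e-8, thresh=1e-4):
    """Illustrative adaptive ridge approximation of an L0 penalty (linear model).

    Hypothetical helper, not the L0ADRIDGE implementation. Each iteration
    solves the convex problem
        0.5 * ||y - X b||^2 + (lam / 2) * sum_j b_j^2 / eta_j
    with eta_j set to the squared coefficient from the previous iteration.
    """
    n, p = X.shape
    eta = np.ones(p)      # assumed start: plain ridge weights
    beta = np.zeros(p)
    for _ in range(n_iter):
        XD = X * eta      # same as X @ diag(eta), shape (n, p)
        # Only an n-by-n system is solved, which helps when p >> n:
        # beta = D X^T (X D X^T + lam I)^{-1} y, with D = diag(eta)
        K = XD @ X.T + lam * np.eye(n)
        beta_new = XD.T @ np.linalg.solve(K, y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
        eta = beta ** 2   # reweight with the current squared coefficients
    beta[np.abs(beta) < thresh] = 0.0   # zero out numerically negligible entries
    return beta
```

The n-by-n form of the ridge solve is algebraically equivalent to the usual p-by-p form (XᵀX + λD⁻¹)⁻¹Xᵀy, but avoids dividing by weights that have shrunk to zero. A call such as `l0_adaptive_ridge(X, y, lam=np.log(X.shape[0]))` would correspond to a BIC-style choice of λ under the scaling assumed here.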

Highlights

Feature selection and prediction are the most important tasks for big data mining
The proposed method is compared with glmnet
We compare the performance of our approach with L1, SCAD and MC+ using the popular BIC (λ = log(N)) criterion

Summary

Introduction

Feature selection and prediction are the most important tasks for big data mining. The common strategies for feature selection in big data mining are L1, SCAD and MC+. None of the existing algorithms optimizes L0, which penalizes the number of nonzero features directly (Liu et al., BioData Mining (2017) 10:39). The huge number of features makes it neither practical nor feasible to predict clinical outcomes with all omics features directly. Selecting a small subset of informative features (biomarkers) to conduct association studies and clinical predictions has become an important step toward effective big data mining. It is computationally impossible to perform an exhaustive search when analyzing omics data sets with millions of features, and L0 penalized optimization is known to be NP-hard in general (Lin et al., 2010).
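For reference, an L0 penalized problem of the kind discussed above can be written as follows. The notation is an assumption chosen for illustration and does not appear in this summary: ℓ(β) denotes the GLM log-likelihood, p the number of features, and the factor λ/2 is one common scaling under which λ = log(N) matches the BIC.

```latex
\min_{\beta}\; -\ell(\beta) + \frac{\lambda}{2}\,\lVert \beta \rVert_0,
\qquad
\lVert \beta \rVert_0 = \sum_{j=1}^{p} \mathbb{1}(\beta_j \neq 0)
```

An exhaustive search over this objective would have to compare up to 2^p candidate feature subsets, which is why it is out of reach when p is in the millions.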

Methods

Results

Conclusion