Ultra High-Dimensional Nonlinear Feature Selection for Big Biological Data

Makoto Yamada, Yi Chang, Raunak Shrestha, Cenk Sahinalp, Hua Ouyang, Jiliang Tang, Filippo Menczer, Predrag Radivojac, Hiroshi Mamitsuka, Dawei Yin, Ermin Hodzic, Jose Lugo-Martinez, Avishek Saha

doi:10.1109/tkde.2018.2789451

Abstract

Machine learning methods are used to discover complex nonlinear relationships in biological and medical data. However, sophisticated learning models are computationally infeasible for data with millions of features. Here we introduce the first feature selection method for nonlinear learning problems that can scale up to large, ultra-high dimensional biological data. More specifically, we scale up the novel Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso) to handle millions of features with tens of thousands of samples. The proposed method is guaranteed to find an optimal subset of maximally predictive features with minimal redundancy, yielding higher predictive power and improved interpretability. Its effectiveness is demonstrated through applications to classify phenotypes based on module expression in human prostate cancer patients and to detect enzymes among protein structures. We achieve high accuracy with as few as 20 out of one million features, a dimensionality reduction of 99.998%. Our algorithm can be implemented on commodity cloud computing platforms. The dramatic reduction of features may lead to the ubiquitous deployment of sophisticated prediction models in mobile health care applications.
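The abstract does not spell out the optimization problem, so the following is only a minimal sketch of the basic HSIC Lasso idea under stated assumptions: each feature and the output are turned into a centered, Frobenius-normalized Gaussian Gram matrix, and a non-negative lasso regresses the output matrix on the feature matrices. Features that depend on the output receive large coefficients, while the shared residual discourages redundant features. The helper names (`centered_gram`, `hsic_lasso`), the parameters (`sigma`, `lam`, `sweeps`), the toy data, and the plain coordinate-descent solver are all illustrative assumptions. This is not the scaled-up algorithm the paper proposes, and it would not handle millions of features.

```python
import math
import random


def dot(u, v):
    """Inner product of two equally long flat vectors."""
    return sum(a * b for a, b in zip(u, v))


def centered_gram(values, sigma=1.0):
    """Gaussian Gram matrix of a 1-D sample, double-centered (H K H),
    flattened to a vector and scaled to unit Frobenius norm."""
    n = len(values)
    gram = [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
            for a in values]
    row_mean = [sum(row) / n for row in gram]
    grand_mean = sum(row_mean) / n
    # The Gram matrix is symmetric, so column means equal row means.
    flat = [gram[i][j] - row_mean[i] - row_mean[j] + grand_mean
            for i in range(n) for j in range(n)]
    norm = math.sqrt(dot(flat, flat))
    return [v / norm for v in flat]


def hsic_lasso(features, target, lam=0.05, sweeps=50):
    """Non-negative lasso over Gram matrices (illustrative solver):
    minimize 0.5 * ||L - sum_k alpha_k K_k||_F^2 + lam * sum_k alpha_k,
    subject to alpha_k >= 0, by cyclic coordinate descent."""
    L = centered_gram(target)
    Ks = [centered_gram(f) for f in features]
    alpha = [0.0] * len(Ks)
    residual = L[:]  # equals L - sum_k alpha_k K_k throughout
    for _ in range(sweeps):
        for k, Kk in enumerate(Ks):
            # Each K_k has unit norm, so the coordinate-wise minimizer is a
            # soft-thresholded inner product, clipped at zero.
            rho = dot(Kk, residual) + alpha[k]
            new = max(0.0, rho - lam)
            delta = new - alpha[k]
            if delta != 0.0:
                residual = [r - delta * v for r, v in zip(residual, Kk)]
                alpha[k] = new
    return alpha


# Toy data (assumption): only feature 0 drives the output, and it does so
# nonlinearly, so a linear correlation filter would tend to miss it.
random.seed(0)
n, d = 60, 6
features = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(d)]
target = [x * x + 0.1 * random.gauss(0.0, 1.0) for x in features[0]]

alpha = hsic_lasso(features, target)
ranking = sorted(range(d), key=lambda k: -alpha[k])
print("coefficients:", [round(a, 3) for a in alpha])
print("features ranked by coefficient:", ranking)
```

In this sketch the regularization weight `lam` controls how many coefficients stay at zero: raising it keeps fewer features. The cost grows with the square of the sample count for every feature, which hints at why a scalable variant is needed for data of the size the abstract describes.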

Journal: IEEE Transactions on Knowledge and Data Engineering
Publication Date: Jul 1, 2018
Citations: 95
License type: publisher-specific, author manuscript

Similar Papers

Research of Medical High-Dimensional Imbalanced Data Classification Ensemble Feature Selection Algorithm with Random Forest
Min Zhu ... Bo Su
01 May 2017

SI(FS)[formula omitted]: Fast simultaneous instance and feature selection for datasets with many features
Nicolás García-Pedrajas ... Gonzalo Cerruela-García
Pattern Recognition | VOL. 111
24 Oct 2020

From big biological data to big discovery: The past decade and the future
Xuegong Zhang ... Jin Gu
Chinese Science Bulletin (Chinese Version) | VOL. 61
23 Nov 2016

Minimax sparse logistic regression for very high-dimensional feature selection.
Mingkui Tan ... Ivor W Tsang
IEEE transactions on neural networks | VOL. 24
01 Oct 2013
