MODEL-BASED FEATURE SELECTION AND CLUSTERING OF RNA-SEQ DATA FOR UNSUPERVISED SUBTYPE DISCOVERY.

David K Lim, Naim U Rashid, Joseph G Ibrahim

doi:10.1214/20-aoas1407

Open Access

https://doi.org/10.1214/20-aoas1407

Abstract

Clustering is a form of unsupervised learning that aims to uncover latent groups within data based on similarity across a set of features. A common application in biomedical research is delineating novel cancer subtypes from patient gene expression data, given a set of informative genes. However, it is typically unknown a priori which genes are informative in discriminating between clusters, and what the optimal number of clusters is. Few methods exist for performing unsupervised clustering of RNA-seq samples, and none currently adjust for between-sample global normalization factors, select cluster-discriminatory genes, or account for potential confounding variables during clustering. To address these issues, we propose Feature Selection and Clustering of RNA-seq (FSCseq), a model-based clustering algorithm that utilizes a finite mixture of regression (FMR) model and the quadratic penalty method with a Smoothly-Clipped Absolute Deviation (SCAD) penalty. The maximization is done by a penalized Classification EM algorithm, allowing us to include normalization factors and confounders in our modeling framework. Given the fitted model, our framework allows for subtype prediction in new patients via posterior probabilities of cluster membership, even in the presence of batch effects. Based on simulations and real data analysis, we show the advantages of our method relative to competing approaches.
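
Two ingredients named in the abstract can be illustrated with a minimal Python sketch. This is not the FSCseq implementation: `scad_penalty` follows the standard SCAD definition (with the conventional default a = 3.7), while `nb_log_pmf` and `posterior_membership` are hypothetical helpers that assume a negative binomial count model with log2-scale cluster means and a per-sample size factor, none of which the abstract specifies.

```python
import math


def scad_penalty(theta, lam, a=3.7):
    """Standard SCAD penalty for one coefficient (requires a > 2, lam > 0)."""
    t = abs(theta)
    if t <= lam:
        # Linear (lasso-like) region near zero.
        return lam * t
    if t <= a * lam:
        # Quadratic transition region.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Constant region: large coefficients are not penalized further.
    return lam * lam * (a + 1) / 2


def nb_log_pmf(y, mu, phi):
    """Log pmf of a negative binomial with mean mu and variance mu + phi * mu^2.

    ASSUMPTION: the count distribution is illustrative, not taken from the abstract.
    """
    r = 1.0 / phi
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def posterior_membership(counts, size_factor, log2_means, mix_props, phi):
    """Posterior cluster probabilities for one new sample.

    counts      : observed counts, one per gene
    size_factor : normalization factor for this sample (hypothetical input)
    log2_means  : log2_means[k][j] is the log2 mean of gene j in cluster k
    mix_props   : mixing proportions, one per cluster
    phi         : dispersion, one per gene
    """
    log_post = []
    for k, pi_k in enumerate(mix_props):
        ll = math.log(pi_k)
        for j, y in enumerate(counts):
            mu = size_factor * 2.0 ** log2_means[k][j]
            ll += nb_log_pmf(y, mu, phi[j])
        log_post.append(ll)
    # Normalize on the log scale to avoid underflow.
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, `posterior_membership([100, 5], 1.0, [[math.log2(100), math.log2(5)], [math.log2(5), math.log2(100)]], [0.5, 0.5], [0.1, 0.1])` returns two probabilities that sum to one, with nearly all mass on the first cluster. A new patient would be assigned to the cluster with the largest posterior probability.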

Journal: The Annals of Applied Statistics
Publication Date: Mar 1, 2021
Citations: 7
License type: CC BY-ND

Similar Papers

Quadratic Approximation via the SCAD Penalty with a Diverging Number of Parameters
Mingqiu Wang ... Xiaoguang Wang
Communications in Statistics - Simulation and Computation | VOL. 45
09 Jun 2014

Tuning Parameter Selector for the Penalized Likelihood Method in Multivariate Generalized Linear Models
Xiaoguang Wang ... Jie Cui
Communications in Statistics - Theory and Methods | VOL. 42
02 Nov 2013

Inferential GANs and Deep Feature Selection with Applications
15 Jun 2020

Probing the existence of medium pulmonary crackles via model-based clustering
Mete Yeginer ... Yasemin P Kahya
Computers in Biology and Medicine | VOL. 40
21 Aug 2010
