Gene set selection via LASSO penalized regression (SLPR).

H Robert Frost, Christopher I Amos

doi:10.1093/nar/gkx291

Abstract

Gene set testing is an important bioinformatics technique that addresses the challenges of power, interpretation and replication. To better support the analysis of large and highly overlapping gene set collections, researchers have recently developed a number of multiset methods that jointly evaluate all gene sets in a collection to identify a parsimonious group of functionally independent sets. Unfortunately, current multiset methods all use binary indicators for gene and gene set activity and assume that a gene is active if any containing gene set is active. This simplistic model limits performance on many types of genomic data. To address this limitation, we developed gene set Selection via LASSO Penalized Regression (SLPR), a novel mapping of multiset gene set testing to penalized multiple linear regression. The SLPR method assumes a linear relationship between continuous measures of gene activity and the activity of all gene sets in the collection. As we demonstrate via simulation studies and the analysis of TCGA data using MSigDB gene sets, the SLPR method outperforms existing multiset methods when the true biological process is well approximated by continuous activity measures and a linear association between genes and gene sets.
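The regression mapping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the solver is a hand-rolled coordinate descent for the LASSO objective (1/(2n))·||y − Xb||² + lam·||b||₁, and the names `X` (gene-by-set membership matrix), `y` (continuous gene-level activity), `lam` and `lasso_cd` are assumptions made for illustration. The toy data are made up, and the sketch omits an intercept and any selection of the penalty.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1 (no intercept).

    X: n rows (genes), each a list of m set-membership values.
    y: n continuous gene-level activity values.
    """
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    resid = list(y)  # residual y - X beta, with beta starting at zero
    col_sq = [sum(X[i][k] ** 2 for i in range(n)) / n for k in range(m)]
    for _ in range(n_iter):
        for k in range(m):
            if col_sq[k] == 0.0:
                continue  # empty gene set: coefficient stays at zero
            # Covariance of column k with the residual that excludes set k
            rho = sum(X[i][k] * (resid[i] + X[i][k] * beta[k]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[k]
            delta = new - beta[k]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][k] * delta
                beta[k] = new
    return beta


# Toy collection (made-up data): 6 genes, 3 gene sets; sets 0 and 1 overlap.
X = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
# Gene activity driven only by membership in set 0.
y = [2.0, 2.0, 2.0, 0.0, 0.0, 0.0]

beta = lasso_cd(X, y, lam=0.1)
selected = [k for k, b in enumerate(beta) if b != 0.0]
print(selected)  # -> [0]
```

In this toy case the penalty leaves only set 0 with a nonzero coefficient even though set 1 shares two of its three genes with set 0, which is the kind of parsimonious selection among overlapping sets that the abstract describes.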

Highlights

Existing multiset methods GenGO by Lu et al [2], Markov chain ontology analysis (MCOA) by Frost and McCray [3], model-based gene set analysis (MGSA) by Bauer et al [4, 5] and multifunctional analysis (MFA) by Wang et al [6] all share a similar generative model for the observed genetic data in terms of gene set activation.
MGSA and MFA differ in the form of the prior distribution assumed for S and the constraints placed on gene activation given gene set activation (MFA uses a prior consistent with a more restrictive activation hypothesis that requires a gene set to be active if all member genes are active).
It is assumed that prior knowledge allows the genomic variables to be grouped into a collection of m overlapping sets, where each set is associated with a specific biological function, e.g., Gene Ontology (GO) terms.

Summary

Methods

Existing multiset methods GenGO by Lu et al [2], Markov chain ontology analysis (MCOA) by Frost and McCray [3], model-based gene set analysis (MGSA) by Bauer et al [4, 5] and multifunctional analysis (MFA) by Wang et al [6] all share a similar generative model for the observed genetic data in terms of gene set activation. Note that according to Eq (4) it is possible to compute a value for the log-likelihood specified in Eq (7) given the gene set definitions in A and the true gene set activity states in S. Given this model, the goal of the multiset methods GenGO, MCOA, MGSA and MFA is to estimate the true activity states of all gene sets, S, and genes, T, based on the observed data O and the gene set definitions A. MCOA uses a modified version of the GenGO penalized MLE approach in which the regularization constant is computed using the eigenvector centrality from a Markov chain model of the gene sets and genomic variables. Both MGSA and MFA take a Bayesian approach to estimate the posterior distribution of gene set activation, with the maximum posterior used for S and T. MGSA and MFA differ in the form of the prior distribution assumed for S and the constraints placed on gene activation given gene set activation (MFA uses a prior consistent with a more restrictive activation hypothesis that requires a gene set to be active if all member genes are active).
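The binary activation assumption shared by these methods (a gene is treated as active when any gene set containing it is active) can be sketched as follows. The symbols A, S and T follow the text; the function name `gene_activity`, the sets-by-genes orientation of A and the toy data are assumptions made for this illustration. The model linking T to the observed data O (Eq (4) and Eq (7)) is not reproduced here.

```python
def gene_activity(A, S):
    """Return binary gene states T: gene j is active if any active set contains it.

    A: list of m rows (one per gene set), each a list of p 0/1 membership flags.
       The m x p orientation is an assumption of this sketch.
    S: list of m 0/1 gene set activity states.
    """
    m, p = len(A), len(A[0])
    return [int(any(S[i] and A[i][j] for i in range(m))) for j in range(p)]


# Toy collection (made-up data): 3 overlapping gene sets over 5 genes.
A = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
]
S = [1, 0, 0]  # only the first gene set is active

T = gene_activity(A, S)
print(T)  # -> [1, 1, 0, 0, 0]
```

Because T is a logical OR over the active sets, a gene shared by several sets carries no information about which of them is active or how strongly, which is the limitation the abstract attributes to binary indicator models.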

Mathematical details of SLPR method

Required data

SLPR model

SLPR estimation

SLPR extensions

Model assessment

SLPR implementation

Benchmark methods

Simulation study design

Real data analysis design

Real data concordance analysis

Log-transformed outcome

Inter-gene correlation

Simulation model assessment results

REACTOME APOPTOTIC CLEAVAGE OF CELL ADHESION PROTEINS:

KEGG COMPLEMENT AND COAGULATION CASCADES:

10. REACTOME COLLAGEN FORMATION

REACTOME EXTRACELLULAR MATRIX ORGANIZATION:

KEGG MELANOMA:

10. KEGG ENDOMETRIAL CANCER

TCGA concordance results

TCGA model assessment results

TCGA results using CAMERA method

SLPR R Code

> # Internal methods

Journal: Nucleic Acids Research	Publication Date: May 2, 2017
Citations: 46	License type: CC BY 4.0
