Abstract

Background: The development of a disease is a complex process that may result from the joint effects of multiple genes. In this article, we propose the overlapping group screening (OGS) approach for identifying active genes and gene-gene interactions by incorporating prior pathway information. The OGS method is developed to overcome two challenges in genome-wide data analysis: the number of genes and gene-gene interactions is far greater than the sample size, and the pathways generally overlap with one another. We further apply the OGS method to the prediction of patients’ survival based on gene expression data.

Results: Simulation studies demonstrate that the OGS approach performs well in identifying the true main and interaction effects, and that the survival prediction accuracy of OGS with the Lasso penalty is better than that of the ordinary Lasso method. In real data analysis, we identify several significant genes and/or epistasis interactions that are associated with clinical survival outcomes of diffuse large B-cell lymphoma (DLBCL) and non-small-cell lung cancer (NSCLC), using prior pathway information from the KEGG pathway and the GO biological process databases, respectively.

Conclusions: The OGS approach is useful for selecting important genes and epistasis interactions in an ultra-high-dimensional feature space. The prediction ability of OGS with the Lasso penalty is better than that of existing methods. The OGS approach is generally applicable to various types of outcome data (quantitative, qualitative, and censored event-time data) and regression models (e.g. linear, logistic, and Cox regression models).

Highlights

  • In the following simulations, we investigate the performance of the proposed overlapping group screening (OGS) approach in variable selection, estimation, and prediction, and compare it with that of the “Oracle”, “Univariate Selection”, “Ordinary Least absolute shrinkage and selection operator (Lasso)”, and “two-stage grouped sure independence screening (TS-GSIS) Lasso” methods

  • The “TS-GSIS Lasso” method is essentially the one proposed by Fang et al [2], except that we apply the sequence kernel association test (SKAT) to obtain the group-specific significance
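
The general idea of screening overlapping gene groups before modelling interactions can be illustrated with a small sketch. This is not the authors' OGS procedure: the group score used here (mean squared marginal correlation) is a simple stand-in for a group-level test such as SKAT, no Lasso step is included, and all names (`screen_groups`, `candidate_features`, the toy pathways `P1` to `P3`, the genes `g1` to `g6`) are hypothetical.

```python
import itertools
import random
import statistics


def pearson(x, y):
    """Sample Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5


def screen_groups(expr, outcome, pathways, keep):
    """Rank pathways by mean squared marginal correlation; keep the top `keep`.

    Assumption: this score is only a placeholder for a proper group test.
    """
    score = {
        name: statistics.fmean([pearson(expr[g], outcome) ** 2 for g in genes])
        for name, genes in pathways.items()
    }
    return sorted(score, key=score.get, reverse=True)[:keep]


def candidate_features(pathways, kept):
    """Union of genes in the kept pathways, plus within-pathway gene pairs.

    A gene shared by several pathways appears once in the union.
    """
    genes = sorted({g for name in kept for g in pathways[name]})
    pairs = sorted({
        pair
        for name in kept
        for pair in itertools.combinations(sorted(pathways[name]), 2)
    })
    return genes, pairs


# Toy data: 6 genes, 200 samples; only g1 and g2 drive the outcome.
rng = random.Random(0)
n = 200
expr = {f"g{j}": [rng.gauss(0, 1) for _ in range(n)] for j in range(1, 7)}
outcome = [
    2 * expr["g1"][i] + expr["g2"][i] + 0.1 * rng.gauss(0, 1) for i in range(n)
]

# Pathways overlap: g2 belongs to both P1 and P2.
pathways = {"P1": ["g1", "g2", "g3"], "P2": ["g2", "g4"], "P3": ["g5", "g6"]}

kept = screen_groups(expr, outcome, pathways, keep=2)
genes, pairs = candidate_features(pathways, kept)
print(kept, genes, pairs)
```

In this toy setting, the candidate interaction terms are limited to gene pairs that share a retained pathway, rather than all 15 pairs among the six genes. A penalized regression such as the Lasso could then be fitted on the reduced feature set.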

Summary

Introduction

The development of a disease is a complex process that may result from the joint effects of multiple genes. Discovering the important pathways, genes, and gene-gene interactions that account for the phenotype of interest remains a key challenge in genome-wide expression analysis [1]. In this high-dimensional setting, commonly used single- and multiple-biomarker (e.g. gene) tests usually have limited power to detect causal biomarkers associated with the clinical phenotypes. To identify causal interaction effects of single-nucleotide polymorphisms (SNPs) on a quantitative or disease trait, Fang et al [2] developed a two-stage grouped sure independence screening (TS-GSIS) procedure using gene-based SNP sets. A potential drawback of the TS-GSIS method is that it is developed in

