Abstract

Background: Bioinformatics tools have been developed to interpret gene expression data at the gene set level, and these gene set based analyses improve biologists’ ability to discover the functional relevance of their experimental design. However, gene sets are usually examined individually, and associations between gene sets are rarely taken into consideration. Deep learning, an emerging machine learning technique in computational biology, can be used to generate an unbiased combination of gene sets and to determine the biological relevance and analysis consistency of these combined gene sets by leveraging large genomic data sets.

Results: In this study, we proposed the gene superset autoencoder (GSAE), a multi-layer autoencoder model that incorporates a priori defined gene sets and retains the crucial biological features in the latent layer. We introduced the concept of the gene superset, an unbiased combination of gene sets with weights trained by the autoencoder, where each node in the latent layer is a superset. Training the model on genomic data from TCGA and evaluating it with the accompanying clinical parameters, we showed that gene supersets can discriminate tumor subtypes and carry prognostic information. We further demonstrated the biological relevance of the top component gene sets in the significant supersets.

Conclusions: Using an autoencoder model with gene supersets at its latent layer, we demonstrated that gene supersets retain sufficient biological information with respect to tumor subtypes and clinical prognostic significance. Supersets also provide high reproducibility in survival analysis and accurate prediction of cancer subtypes.
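
The architecture described in the abstract can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation. It assumes that "incorporating a priori defined gene sets" means each gene-set node receives input only from its member genes (a masked connection), that the latent superset layer is fully connected to the gene-set layer, and that the decoder mirrors the encoder without a mask. The gene names, set memberships, layer sizes, tanh activation, and every function name below are hypothetical, and training is omitted.

```python
import math
import random

# Toy, made-up data: genes and the a priori gene sets they belong to.
GENES = ["g1", "g2", "g3", "g4", "g5"]
GENE_SETS = {
    "set_A": {"g1", "g2"},
    "set_B": {"g2", "g3", "g4"},
    "set_C": {"g5"},
}
N_SUPERSETS = 2  # size of the latent layer (hypothetical choice)


def build_mask(genes, gene_sets):
    """mask[i][j] is 1.0 if gene i is a member of gene set j, else 0.0."""
    return [[1.0 if g in members else 0.0 for members in gene_sets.values()]
            for g in genes]


def init_weights(n_in, n_out, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]


def layer(x, weights, mask=None):
    """Compute tanh(x @ (weights * mask)); no mask means fully connected."""
    n_in, n_out = len(weights), len(weights[0])
    out = []
    for j in range(n_out):
        total = 0.0
        for i in range(n_in):
            m = 1.0 if mask is None else mask[i][j]
            total += x[i] * weights[i][j] * m
        out.append(math.tanh(total))
    return out


def make_params(n_genes, n_sets, n_supersets, rng):
    """Random, untrained weights for the four layers of the sketch."""
    return {
        "gene_to_set": init_weights(n_genes, n_sets, rng),
        "set_to_superset": init_weights(n_sets, n_supersets, rng),
        "superset_to_set": init_weights(n_supersets, n_sets, rng),
        "set_to_gene": init_weights(n_sets, n_genes, rng),
    }


def forward(x, params, mask):
    """Encode expression values into supersets, then reconstruct them."""
    gene_set_scores = layer(x, params["gene_to_set"], mask)        # masked layer
    supersets = layer(gene_set_scores, params["set_to_superset"])  # latent layer
    decoded_sets = layer(supersets, params["superset_to_set"])     # assumed decoder
    reconstruction = layer(decoded_sets, params["set_to_gene"])
    return supersets, reconstruction


if __name__ == "__main__":
    rng = random.Random(0)
    mask = build_mask(GENES, GENE_SETS)
    params = make_params(len(GENES), len(GENE_SETS), N_SUPERSETS, rng)
    supersets, reconstruction = forward([0.2, 1.1, -0.4, 0.0, 0.7], params, mask)
    print(len(supersets), len(reconstruction))
```

Under these assumptions, each latent node is a weighted combination of gene-set scores, which corresponds to the "superset" idea in the text, and the mask keeps a gene-set score from depending on genes outside that set.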

Highlights

  • Bioinformatics tools have been developed to interpret gene expression data at the gene set level, and these gene set based analyses improve biologists’ ability to discover the functional relevance of their experimental design

  • We proposed Gene Superset AutoEncoder (GSAE), a multi-layer autoencoder model that incorporates a priori defined gene sets to preserve the crucial biological features from combining gene sets in the latent layer

  • We introduced the concept of the gene superset, an unbiased combination of gene sets, with weights trained by the autoencoder, where each node in the latent layer is termed a superset

Summary

Introduction

Gene set based analysis has become an essential step in interpreting gene expression data, and a variety of bioinformatics tools have been developed to extract biological insights from different aspects. Functional enrichment, such as Gene Set Enrichment Analysis (GSEA) [1], is the most common gene set based analysis and determines classes of genes that are associated with disease phenotypes. Gene set analysis improves biologists’ ability to interpret the functional impact of their experimental design. Lau et al. showed that there are only minimal overlaps between the
