Classification of gene signatures for their information value and functional redundancy

Laura Cantini, Loredana Martignetti, Nils Blüthgen, Mattias Rydenfelt, Andrei Zinovyev, Laurence Calzone, Emmanuel Barillot

doi:10.1038/s41540-017-0038-8

Abstract

Gene signatures are increasingly used to interpret the results of omics data analyses, but they suffer from compositional (large overlap) and functional (correlated read-outs) redundancy. Moreover, many gene signatures rarely come out as significant in statistical tests. Based on pan-cancer data analysis, we construct a restricted set of 962 signatures defined as informative and demonstrate that they have a higher probability of appearing enriched in comparative cancer studies. We show that the majority of informative signatures conserve the weights of the genes composing the signature (eigengenes) from one cancer type to another. Finally, we construct InfoSigMap, an interactive online map of these signatures and their cross-correlations. This map highlights the structure of compositional and functional redundancies between informative signatures, and it charts the territories of biological functions. InfoSigMap can be used to visualize the results of omics data analyses and suggests a rearrangement of existing gene sets.

Highlights

The majority of studies exploring gene expression data result in one or more gene signatures, i.e., lists of genes sharing a common pattern of expression that can be employed to classify groups of samples in any independent dataset
A large compendium of gene expression data from The Cancer Genome Atlas (TCGA), derived from 32 solid cancer types, was employed to restrict the input collection of 12,096 gene signatures to 962 informative ones
Compendia of gene signatures pose two main challenges, related to the reliability and the redundancy of the collected gene sets

Summary

Introduction

The majority of studies exploring gene expression data result in one or more gene signatures, i.e., lists of genes sharing a common pattern of expression that can be employed to classify groups of samples in any independent dataset. Not all signatures collected in compendia of gene sets are informative, and the number of gene sets representing the same biological process is unbalanced. These two phenomena affect the results of classical transcriptomic data analysis: heavy p-value corrections produce a high number of false negatives. Two signatures may represent two different transcriptional read-outs of the same biological process; we refer to such signatures as functionally redundant. The existence of multiple functionally redundant signatures affects the results of classical transcriptomic data analysis because multiple gene sets belonging to analogous or related biological processes score highly. These multiple comparisons of redundant signatures can potentially hide relevant hits. Any estimation of functional redundancy is conditioned by the context (e.g., a certain cancer type) and depends on the type of data used to evaluate the redundancy.
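The distinction between compositional and functional redundancy can be made concrete with a small sketch. The code below is an illustration only, not the authors' pipeline: the function names, the toy expression matrix, and the gene labels are hypothetical. It assumes that compositional redundancy is measured by Jaccard overlap of gene lists, and that functional redundancy is measured by the absolute Pearson correlation between per-sample signature scores taken as the first principal component of each signature's genes.

```python
import numpy as np


def jaccard(sig_a, sig_b):
    """Compositional redundancy: shared genes / genes in either signature."""
    a, b = set(sig_a), set(sig_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def pc1_score(expr, genes, signature):
    """Per-sample score of a signature along its first principal component.

    expr is a (genes x samples) matrix; genes lists the row labels.
    """
    idx = [genes.index(g) for g in signature if g in genes]
    sub = expr[idx, :]
    sub = sub - sub.mean(axis=1, keepdims=True)  # center each gene
    _, s, vt = np.linalg.svd(sub, full_matrices=False)
    return s[0] * vt[0]  # sample coordinates on PC1


def functional_redundancy(expr, genes, sig_a, sig_b):
    """Absolute correlation of the two signature scores across samples.

    The absolute value is used because the sign of a principal
    component is arbitrary.
    """
    ra = pc1_score(expr, genes, sig_a)
    rb = pc1_score(expr, genes, sig_b)
    return abs(np.corrcoef(ra, rb)[0, 1])


# Toy data (hypothetical): one latent process drives genes g1..g6,
# while g7 and g8 are pure noise.
rng = np.random.default_rng(0)
n_samples = 50
latent = rng.normal(size=n_samples)
genes = [f"g{i}" for i in range(1, 9)]
expr = np.vstack(
    [latent + 0.3 * rng.normal(size=n_samples) for _ in range(6)]
    + [rng.normal(size=n_samples) for _ in range(2)]
)

sig_a = ["g1", "g2", "g3"]
sig_b = ["g4", "g5", "g6"]  # no genes shared with sig_a
sig_c = ["g1", "g2", "g7"]  # shares two genes with sig_a

overlap_ab = jaccard(sig_a, sig_b)
overlap_ac = jaccard(sig_a, sig_c)
redundancy_ab = functional_redundancy(expr, genes, sig_a, sig_b)
```

In this toy setting, sig_a and sig_b share no genes yet their scores are strongly correlated, which is the situation described above as functional redundancy without compositional redundancy. As the text notes, such an estimate depends on the dataset used, so the same pair of signatures could score differently in another cancer type.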

Methods

Results

Conclusion