Abstract

Background: Constructing gene coexpression networks is a powerful approach for analyzing high-throughput gene expression data towards module identification, gene function prediction, and disease-gene prioritization. While optimal workflows for constructing coexpression networks, including good choices for data pre-processing, normalization, and network transformation, have been developed for microarray-based expression data, such well-tested choices do not exist for RNA-seq data. Almost all studies that compare data processing and normalization methods for RNA-seq focus on the end goal of determining differential gene expression.

Results: Here, we present a comprehensive benchmarking and analysis of 36 different workflows, each with a unique set of normalization and network transformation methods, for constructing coexpression networks from RNA-seq datasets. We test these workflows on both large, homogenous datasets and small, heterogeneous datasets from various labs. We analyze the workflows in terms of aggregate performance, individual method choices, and the impact of multiple dataset experimental factors. Our results demonstrate that between-sample normalization has the biggest impact, with counts adjusted by size factors producing networks that most accurately recapitulate known tissue-naive and tissue-aware gene functional relationships.

Conclusions: Based on this work, we provide concrete recommendations on robust procedures for building an accurate coexpression network from an RNA-seq dataset. In addition, researchers can examine all the results in great detail at https://krishnanlab.github.io/RNAseq_coexpression to make appropriate choices for coexpression analysis based on the experimental factors of their RNA-seq dataset.

Highlights

  • Constructing gene coexpression networks is a powerful approach for analyzing high-throughput gene expression data towards module identification, gene function prediction, and disease-gene prioritization

  • The Genotype-Tissue Expression (GTEx) data was critical for investigating the impact of experimental factors such as sample size, which we examined through multiple rounds of random sampling from GTEx datasets

  • We focused on three key stages of data processing and network building: (a) within-sample normalization: counts per million (CPM), transcripts per million (TPM), and reads per kilobase per million (RPKM); (b) between-sample normalization: quantile (QNT), trimmed mean of M values (TMM), and upper quartile (UQ); in addition, we tested two new variations of TMM and UQ, counts adjusted with TMM factors (CTF) and counts adjusted with upper quartile factors (CUF), that directly adjust counts by the size factors but do not correct by library size; and (c) network transformation: weighted topological overlap (WTO) and context likelihood of relatedness (CLR)
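
As an illustration of the within-sample normalizations named in (a), the following minimal Python sketch computes CPM, RPKM, and TPM for a single sample from their standard definitions. The function names, the toy counts, and the gene lengths are hypothetical and are not taken from the study's code. The between-sample methods and network transformations are not covered here.

```python
def cpm(counts):
    """Counts per million: scale each gene's count by the sample's library size."""
    library_size = sum(counts)
    return [c / library_size * 1e6 for c in counts]


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: library-size scaling, then gene-length scaling."""
    per_million = sum(counts) / 1e6
    return [c / per_million / (length / 1e3) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: gene-length scaling first, then rescale to sum to 1e6."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


# Hypothetical toy sample: three genes with raw counts and lengths in base pairs.
toy_counts = [10, 20, 70]
toy_lengths = [1000, 2000, 1000]

cpm_values = cpm(toy_counts)
rpkm_values = rpkm(toy_counts, toy_lengths)
tpm_values = tpm(toy_counts, toy_lengths)
```

CPM and TPM values each sum to one million within a sample, whereas RPKM values generally do not, which is one reason the three can behave differently when samples are later compared.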

Summary

Introduction

Constructing gene coexpression networks is a powerful and widely used approach for analyzing high-throughput gene expression data from microarray and RNA-seq technologies [1]. Multiple experimental factors impact the quantification of the expression of individual genes and the coexpression between pairs of genes, making it necessary to normalize and transform high-throughput gene expression data before downstream analysis. Appropriately normalizing and transforming RNA-seq data along with adequately transforming the coexpression strengths should yield more accurate estimates of gene-gene coexpression that best capture functional relationships between genes.
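
As one example of what transforming coexpression strengths can look like, the sketch below applies a context likelihood of relatedness (CLR) style rescaling to a small symmetric similarity matrix. Each score is converted to a z-score against the background of each of its two genes, negative z-scores are set to zero, and the two are combined as sqrt(z_i^2 + z_j^2). This is a minimal sketch under stated assumptions: the function name `clr_transform`, the toy matrix, the exclusion of the diagonal from each gene's background, and the use of the population standard deviation are illustrative choices, not the study's implementation.

```python
import math
from statistics import mean, pstdev


def clr_transform(sim):
    """CLR-style rescaling of a symmetric gene-gene similarity matrix (list of lists)."""
    n = len(sim)
    # Background of each gene: mean and standard deviation of its scores
    # against all other genes (diagonal excluded; an assumption of this sketch).
    background = []
    for i in range(n):
        others = [sim[i][j] for j in range(n) if j != i]
        background.append((mean(others), pstdev(others)))

    def z_score(value, gene):
        mu, sigma = background[gene]
        if sigma == 0:
            return 0.0
        return max(0.0, (value - mu) / sigma)  # negative z-scores are clipped to zero

    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            zi = z_score(sim[i][j], i)
            zj = z_score(sim[i][j], j)
            out[i][j] = out[j][i] = math.sqrt(zi * zi + zj * zj)
    return out


# Hypothetical toy similarity matrix for three genes.
toy_similarity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.5],
    [0.1, 0.5, 1.0],
]
clr_scores = clr_transform(toy_similarity)
```

The effect is that a gene pair scores highly only when its similarity stands out relative to what is typical for both genes, rather than being high in absolute terms.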
