Abstract

BackgroundSince experimental elucidation of gene function is often laborious, various in silico methods have been developed to predict gene function of uncharacterized genes. Since functionally related genes are often expressed in the same tissues, conditions and developmental stages (co-expressed), functional annotation of characterized genes can be transferred to co-expressed genes lacking annotation. With genome-wide expression data available, the construction of co-expression networks, where genes are nodes and edges connect significantly co-expressed genes, provides unprecedented opportunities to predict gene function. However, the construction of such networks requires large volumes of high-quality data, multiple processing steps and a considerable amount of computation power. While efficient tools exist to process RNA-Seq data, pipelines which combine them to construct co-expression networks efficiently are currently lacking.ResultsLSTrAP (Large-Scale Transcriptome Analysis Pipeline), presented here, combines all essential tools to construct co-expression networks based on RNA-Seq data into a single, efficient workflow. By supporting parallel computing on computer cluster infrastructure, processing hundreds of samples becomes feasible as shown here for Arabidopsis thaliana and Sorghum bicolor, which comprised 876 and 215 samples respectively. The former was used here to show how the quality control, included in LSTrAP, can detect spurious or low-quality samples. The latter was used to show how co-expression networks are able to group known photosynthesis genes and imply a role in this process of several, currently uncharacterized, genes.ConclusionsLSTrAP combines the most popular and performant methods to construct co-expression networks from RNA-Seq data into a single workflow. This allows large amounts of expression data, required to construct co-expression networks, to be processed efficiently and consistently across hundreds of samples. LSTrAP is implemented in Python 3.4 (or higher) and available under MIT license from https://github.molgen.mpg.de/proost/LSTrAP

Highlights

  • Since experimental elucidation of gene function is often laborious, various in silico methods have been developed to predict gene function of uncharacterized genes

  • We present the LargeScale Transcriptome Analysis Pipeline (LSTrAP), which pre-processes RNA-seq data, maps it to the genome, performs quality control and produces co-expression networks, along with functional and comparative genomics data to enable a host of downstream analyses

  • To avoid inclusion of such samples, LSTrAP indicates which samples are potentially problematic based on two metrics; the percentage of reads TopHat2 [21] is able to map to the genome and the fraction of those reads that HTSeq-Count maps to coding sequences

Read more

Summary

Introduction

Since experimental elucidation of gene function is often laborious, various in silico methods have been developed to predict gene function of uncharacterized genes. While efficient tools exist to process RNA-Seq data, pipelines which combine them to construct co-expression networks efficiently are currently lacking. Determining a gene’s function is laborious and time consuming, numerous in silico methods have emerged to predict gene function [1]. Proost et al BMC Bioinformatics (2017) 18:444 transferred to uncharacterized neighbours in the network. This principle has successfully been used to predict gene function in various species from various kingdoms [10, 11]. RNA sequencing (RNAseq) which does allow determining expression levels of a near-complete set of genes has become the norm, as RNA-Seq has become increasingly affordable

Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call