Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes.

Lisa K Johnson, C Titus Brown, Harriet Alexander

doi:10.1093/gigascience/giy158


Open Access


Abstract

Background: De novo transcriptome assemblies are required prior to analyzing RNA sequencing data from a species without an existing reference genome or transcriptome. Despite the prevalence of transcriptomic studies, the effects of using different workflows, or “pipelines,” on the resulting assemblies are poorly understood. Here, a pipeline was programmatically automated and used to assemble and annotate raw transcriptomic short-read data collected as part of the Marine Microbial Eukaryotic Transcriptome Sequencing Project. The resulting transcriptome assemblies were evaluated and compared against assemblies that were previously generated with a different pipeline developed by the National Center for Genome Research.

Results: New transcriptome assemblies contained the majority of previous contigs as well as new content. On average, 7.8% of the annotated contigs in the new assemblies were novel gene names not found in the previous assemblies. Taxonomic trends were observed in the assembly metrics. Assemblies from the Dinoflagellata showed a higher number of contigs and unique k-mers than transcriptomes from other phyla, while assemblies from Ciliophora had a lower percentage of open reading frames compared to other phyla.

Conclusions: Given current bioinformatics approaches, there is no single “best” reference transcriptome for a particular set of raw data. As the optimum transcriptome is a moving target, improving (or not) with new tools and approaches, automated and programmable pipelines are invaluable for managing the computationally intensive tasks required for re-processing large sets of samples with revised pipelines and ensuring a common evaluation workflow is applied to all samples. Thus, re-assembling existing data with new tools using automated and programmable pipelines may yield more accurate identification of taxon-specific trends across samples in addition to novel and useful products for the community.

Highlights

The analysis of gene expression from high-throughput nucleic acid sequence data relies on the presence of a high-quality reference genome or transcriptome.
The first step of the pipeline applied to the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) is to download the raw data: raw RNA sequence (RNA-seq) datasets were obtained from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under BioProject PRJNA231566.
Differences in the available evaluation metrics between the National Center for Genome Research (NCGR) and Lab for Data Intensive Biology (DIB) assemblies were variable. The majority of transcriptome evaluation metrics collected for each sample were higher in the Trinity-based DIB re-assemblies than in the ‘cds’ versions of the Trans-ABySS-based NCGR assemblies (Table 1).
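
Two of the metrics named in the abstract, the number of unique k-mers and the percentage of annotated contigs carrying gene names absent from the previous assembly, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code, which this summary does not show. The function names, the use of canonical k-mers (a k-mer and its reverse complement counted once), the skipping of k-mers with ambiguous bases, and the toy inputs are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def unique_kmers(contigs: Iterable[str], k: int) -> Set[str]:
    """Collect distinct canonical k-mers across contigs (hypothetical helper).

    k-mers containing anything other than A, C, G, or T are skipped.
    """
    seen: Set[str] = set()
    for seq in contigs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                seen.add(canonical(kmer))
    return seen


def percent_novel_contigs(new_annotations: Dict[str, str],
                          old_gene_names: Iterable[str]) -> float:
    """Percentage of annotated contigs whose gene name is absent from the old assembly.

    new_annotations maps contig ID -> gene name (assumed input shape).
    """
    if not new_annotations:
        return 0.0
    old = set(old_gene_names)
    novel = sum(1 for name in new_annotations.values() if name not in old)
    return 100.0 * novel / len(new_annotations)


if __name__ == "__main__":
    # Toy data only; real assemblies would be read from FASTA and annotation files.
    print(len(unique_kmers(["ACGTAC", "GTACGT"], k=3)))  # 2
    toy_new = {"c1": "geneA", "c2": "geneB", "c3": "geneC", "c4": "geneA"}
    print(percent_novel_contigs(toy_new, ["geneA", "geneB"]))  # 25.0
```

Storing every k-mer in a Python set is only practical for small inputs. For hundreds of assemblies, a dedicated k-mer counting tool or a sketching approach would likely be needed.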

Summary

Introduction

The analysis of gene expression from high-throughput nucleic acid sequence data relies on the presence of a high-quality reference genome or transcriptome. When there is no reference genome or transcriptome for an organism of interest, raw RNA sequence (RNA-seq) data must be assembled de novo into a transcriptome [1]. This type of analysis is ubiquitous across many fields, including evolutionary developmental biology [2], cancer biology [3], agriculture [4, 5], ecological physiology [6, 7], and biological oceanography [8]. As the optimum transcriptome is a moving target, improving (or not) with new tools and approaches, automated and programmable pipelines are invaluable for managing the computationally intensive tasks required for re-processing large sets of samples with revised pipelines and ensuring a common evaluation workflow is applied to all samples. Re-assembling existing data with new tools using automated and programmable pipelines may yield more accurate identification of taxon-specific trends across samples in addition to novel and useful products for the community.

Methods

Results

Discussion

Conclusion