Abstract

Background

With rapid advances in next-generation sequencing technology, high-throughput RNA sequencing has emerged as a powerful and cost-effective approach to transcriptome study. De novo assembly of transcripts provides an important solution to transcriptome analysis for organisms with no reference genome. However, there was little understanding of how different variables affected assembly outcomes, and no consensus on how to approach an optimal solution by selecting a software tool and a suitable strategy based on the properties of the RNA-Seq data.

Results

To reveal the performance of different programs for transcriptome assembly, this work analyzed several important factors, including k-mer values, genome complexity, coverage depth, and directional reads. Seven program conditions were tested: four single k-mer assemblers (SK: SOAPdenovo, ABySS, Oases and Trinity) and three multiple k-mer methods (MK: SOAPdenovo-MK, trans-ABySS and Oases-MK). While small and large k-mer values performed better for reconstructing lowly and highly expressed transcripts, respectively, the MK strategy worked well for almost all ranges of expression quintiles. Among SK tools, Trinity performed well across various conditions but had the longest running time. Oases consumed the most memory, whereas SOAPdenovo had the shortest runtime but reconstructed full-length CDS poorly. ABySS showed a good balance between resource usage and assembly quality.

Conclusions

Our work compared the performance of publicly available transcriptome assemblers and analyzed important factors affecting de novo assembly. Some practical guidelines for transcript reconstruction from short-read RNA-Seq data were proposed. De novo assembly of the C. sinensis transcriptome was greatly improved using optimized methods.

Highlights

  • With rapid advances in next-generation sequencing technology in recent years, massively parallel cDNA sequencing (RNA-Seq) has emerged as a powerful and cost-effective approach to transcriptome study

  • The RNA-Seq data sets used in this study were all publicly available and could be retrieved from the NCBI SRA database. They included a standard Illumina data set from fruit fly, D. melanogaster, a strand-specific data set from fission yeast, S. pombe, and a standard data set from tea plant, C. sinensis

  • Trans-ABySS was developed by the ABySS team and applies the MK strategy to ABySS


Summary

Introduction

With rapid advances in next-generation sequencing technology in recent years, massively parallel cDNA sequencing (RNA-Seq) has emerged as a powerful and cost-effective approach to transcriptome study. De novo assembly of short sequence reads into transcripts allows researchers to reconstruct the sequences of the full transcriptome, identify and catalog all expressed genes, separate isoforms, and capture the expression levels of transcripts. It provides an important solution to transcriptome analysis for organisms with no reference genome. Assemblers must be tuned to handle conditions that are not present in genome assembly. Among those conditions, transcripts are expressed at both low and high levels, spanning a ten-thousand-fold range. There was little understanding of how different variables affected assembly outcomes, and no consensus on how to approach an optimal solution by selecting a software tool and a suitable strategy based on the properties of the RNA-Seq data.
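
The abstract's observation that small k-mer values favor lowly expressed transcripts while large values favor highly expressed ones can be illustrated with a toy example. The sketch below is not how any of the assemblers named above work internally. It only counts the k-mers shared by two hypothetical reads (`read_a`, `read_b`, both invented for illustration) that overlap by six bases. As k grows, fewer k-mers link the reads, so sparse coverage of a lowly expressed transcript is more likely to leave gaps. At small k, the same k-mer can recur within a read, which is one source of ambiguity that larger k values reduce.

```python
from collections import Counter


def kmers(read, k):
    """Return all overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def shared_kmers(read_a, read_b, k):
    """Count distinct k-mers that occur in both reads."""
    return len(set(kmers(read_a, k)) & set(kmers(read_b, k)))


def repeated_kmers(read, k):
    """Count distinct k-mers that occur more than once within one read."""
    return sum(1 for n in Counter(kmers(read, k)).values() if n > 1)


# Hypothetical reads: the last 6 bases of read_a equal the first 6 of read_b.
read_a = "ACGTTGCATGCA"
read_b = "CATGCAGGTACC"

for k in (4, 6, 7):
    print(k, shared_kmers(read_a, read_b, k), repeated_kmers(read_a, k))
# k=4: 3 shared k-mers, 1 repeated k-mer in read_a
# k=6: 1 shared k-mer, 0 repeated
# k=7: 0 shared k-mers, 0 repeated (the overlap is shorter than k)
```

A multiple k-mer (MK) strategy, in this simplified view, amounts to running the assembly at several values of k and merging the results, so that both sparsely and densely covered transcripts have a chance of being reconstructed.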
