Abstract

With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resemble underlying transcripts, but patterns such as those seen between co-evolving sites, or mapped read counts, become obscured. We have created a de Bruijn based de novo assembler for RNA-Seq data that utilizes a classification system to describe the complexity of underlying graphs from which contigs are created. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist. A by-product of this is information on the range of complexity of the underlying gene families present. As a demonstration of CStones ability to assemble high-quality contigs, and to label them in this manner, both simulated and real data were used. For simulated data, ten million read pairs were generated from cDNA libraries representing four species, Drosophila melanogaster, Panthera pardus, Rattus norvegicus and Serinus canaria. These were assembled using CStone, Trinity and rnaSPAdes; the latter two being high-quality, well established, de novo assembers. For real data, two RNA-Seq datasets, each consisting of ≈30 million read pairs, representing two adult D. melanogaster whole-body samples were used. The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism. Here we describe the details of CStones assembly and classification process, and propose that similar classification systems can be incorporated into other de novo assembly tools. Within a related side study, we explore the effects that chimera's within reference sets have on the identification of differentially expression genes. CStone is available at: https://sourceforge.net/projects/cstone/.

Highlights

  • In the field of transcriptomics, awareness of chimeric sequences has been present for many years [1,2], but with the expansion of short-read sequencing technologies [3], and the associated exponential growth of sequence information stored [4], chimera quantification has become essential

  • Within transcriptome reference sets, such as the cDNA databases available from Ensembl representing various species [5], or those that are de novo assembled from short-read RNA-Seq data, non-chimeric sequences are direct representations of transcribed genes, while artificially generated chimeric ones are mosaics of two or more pieces of DNA incorrectly pieced together

  • The latter occurring during library preparation [6,7], or during the de novo assembly process [8,9], where there is a requirement to traverse paths across graphs constructed from read data that ranges in complexity depending on the nature of the gene families being represented [10,11,12]

Read more

Summary

Author summary

Non-chimeric sequences are representations of transcribed genes, while artificially generated chimeric ones are mosaics of two or more pieces of DNA incorrectly pieced together. One area where such sets are utilized is in the quantification of gene expression patterns; where RNA-Seq reads are mapped to the sequences within, and subsequent count values reflect expression levels. Contigs created by CStone are labelled in a manner that indicates whether or not they are non-chimeric This encourages contig dependent results to be presented with increased objectivity by maintaining the context of ambiguity associated with the assembly process. We have quantified the relationship between chimeras within reference sets and the identification of differentially expressed genes

Introduction
Results
Findings
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.