Pincho: A Modular Approach to High Quality De Novo Transcriptomics.

Randy Ortiz, Juan C Santos, Priyanka Gera, Christopher Rivera

doi:10.3390/genes12070953

Abstract

Transcriptomic reconstructions without reference (i.e., de novo) are common for data samples derived from non-model biological systems. These assemblies involve massively parallel short-read sequence reconstructions from experiments, but they usually employ ad hoc bioinformatic workflows that exhibit limited standardization and customization. The growing number of transcriptome assembly programs has done little to improve standardization, a problem exacerbated by the lack of studies on modularity that compare the effects of assembler synergy. We developed a customizable management workflow for de novo transcriptomics that includes modular units for short-read cleaning, assembly, validation, annotation, and expression analysis by connecting twenty-five individual bioinformatic tools. With our software tool, we were able to compare the assessment scores of 129 distinct single-, bi-, and tri-assembler combinations with diverse k-mer size selections. Our results demonstrate a drastic increase in the quality of transcriptome assemblies with bi- and tri-assembler combinations. We aim for our software to improve de novo transcriptome reconstructions for the ever-growing landscape of RNA-seq data derived from non-model systems. We offer guidance to ensure the most complete transcriptomic reconstructions via the inclusion of modular multi-assembly software controlled from a single master console.

Highlights

Homemade de novo transcriptomic workflows tend to be idiosyncratic to specific study goals, unoptimizable to other studies and, in many cases, left unpublished or buried in supplementary materials
Pincho consists of twenty-five functions that fall under six modules: preprocessing; de novo assembly (ABySS [11,12], Tadpole [13,14], BinPacker [15,16], IDBA-tran [17,18], MEGAHIT [19,20], Oases/Velvet [21,22], rnaSPAdes [3,23], Shannon Cpp [5,24], SPAdes [25,26], Trans-ABySS [27,28], TransLig [29,30], and Trinity [4,31]; Table 1); post-assembly; assembly assessment; annotation using a user reference (NCBI BLASTX, BLASTN, and BLASTP [40,41,42]); and expression analysis
We note a significant increase in average assessment scores (AAS) from single- to bi-assembly approaches across all assemblers

Summary

Introduction

Homemade de novo transcriptomic workflows tend to be idiosyncratic to specific study goals, unoptimizable to other studies and, in many cases, left unpublished or buried in supplementary materials. Arguably, Rnnotator [1], released in 2010, was the first single-assembler transcriptomic pipeline to be publicly available, while the Oyster River Protocol (ORP; [2]), released in 2018, was the first multi-assembler pipeline available. This presumed eight-year gap between single- and multi-assembler approaches is odd, considering that multi-assembler methods have been shown to produce reconstructions with higher degrees of completeness [2]. The closest comparison to our workflow is the ORP, which employs a rigid tri-assembly approach to produce high-quality transcriptomes via rnaSPAdes (k55, k75; [3]), Trinity (k25; [4]) and Shannon (k75; [2,5]). Pincho [6], in contrast, allows the user to design and customize their own k-mer list and number of assemblers, among other parameters.
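To make the idea of a user-defined assembler set and k-mer list concrete, the following is a minimal sketch of how single-, bi-, and tri-assembler combinations could be enumerated and expanded into individual runs. It is not Pincho's code or interface: the function names `assembler_combinations` and `plan_runs` are hypothetical, and only the assembler names and the ORP k-mer settings are taken from the text. The naive count it produces is not the set of 129 combinations assessed in this study, which also depends on k-mer size selections.

```python
from itertools import combinations

# Assembler names as listed in the text (de novo assembly module).
ASSEMBLERS = [
    "ABySS", "Tadpole", "BinPacker", "IDBA-tran", "MEGAHIT", "Oases/Velvet",
    "rnaSPAdes", "Shannon Cpp", "SPAdes", "Trans-ABySS", "TransLig", "Trinity",
]

# k-mer settings quoted in the text for the ORP tri-assembly approach.
ORP_KMERS = {"rnaSPAdes": [55, 75], "Trinity": [25], "Shannon": [75]}


def assembler_combinations(assemblers, max_size=3):
    """Yield every combination of 1..max_size distinct assemblers (hypothetical helper)."""
    for size in range(1, max_size + 1):
        yield from combinations(assemblers, size)


def plan_runs(combo, kmers_by_assembler):
    """Expand one assembler combination into (assembler, k) runs (hypothetical helper)."""
    return [(name, k) for name in combo for k in kmers_by_assembler[name]]


if __name__ == "__main__":
    combos = list(assembler_combinations(ASSEMBLERS))
    print(len(combos), "single-, bi- and tri-assembler combinations")
    for run in plan_runs(("rnaSPAdes", "Trinity", "Shannon"), ORP_KMERS):
        print(run)
```

In this sketch the fixed ORP design is simply one entry in a much larger space of combinations; a customizable workflow would let the user supply their own dictionary of k-mer lists in place of `ORP_KMERS`.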

Methods

Results

Discussion

Conclusion