Abstract

The unprecedented volume of genomic and transcriptomic data analyzed by software pipelines makes verifying inferences based on such data challenging, although theoretically possible. The availability of intermediate data can greatly aid re-validation efforts. One such example is the transcriptome, assembled from raw RNA-seq reads, which is frequently used to annotate and quantify transcribed genes. The quality of the assembled transcripts influences the accuracy of inferences based on them. Here, the publicly available transcriptome of Cicer arietinum (ICC4958; Desi chickpea, http://www.nipgr.res.in/ctdb.html)[1] was analyzed using YeATS[2]. The analysis revealed that a majority of the highly expressed transcripts (HETs) encoded multiple genes, strongly indicating that the counts may have been biased by the merging of different transcripts. For example, TC00004 ranks among the top five HETs in all five tissues analyzed here, and encodes both a retinoblastoma-binding-like protein (E-value = 0) and a senescence-associated protein (E-value = 5e-108). Fragmented transcripts are another source of error. The ribulose bisphosphate carboxylase small chain (RBCSC) protein is split into two transcripts, TC13991 and TC23009, which share the overlapping amino acid sequence "ASNGGRVHC" and have lengths of 201 and 332 nucleotides and expression counts of 17.90 and 1403.8, respectively. The huge difference in counts indicates an error in the normalization algorithm used to determine counts. RBCSC is well known to be highly expressed and, as expected, TC23009 ranks fifth among HETs in the shoot. Furthermore, some transcripts are split into open reading frames that map to the same protein, although this should not have any significant bearing on the counts.
It is proposed that studies analyzing differential expression based on the transcriptome should account for these artifacts, and that providing intermediate assembled transcriptomes should be mandatory, possibly with a link to the raw sequence data (BioProject).
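
The fragmented-transcript case described above, where two transcripts encode pieces of one protein that share an overlapping stretch of amino acids, can be illustrated with a minimal sketch. This is not the YeATS implementation. The function names are hypothetical, and the toy peptides are invented for illustration: only the shared segment "ASNGGRVHC" comes from the abstract, while the flanking residues and the order of the two fragments are assumptions.

```python
def suffix_prefix_overlap(left: str, right: str, min_len: int = 5) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`.

    Overlaps shorter than `min_len` residues are ignored, since very short
    matches between two peptides are likely to occur by chance.
    """
    max_len = min(len(left), len(right))
    for k in range(max_len, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return left[-k:]
    return ""


def merge_fragments(a: str, b: str, min_len: int = 5):
    """Try both fragment orders; return (merged_peptide, overlap) or None."""
    for left, right in ((a, b), (b, a)):
        overlap = suffix_prefix_overlap(left, right, min_len)
        if overlap:
            # Keep the overlapping residues once when joining the fragments.
            return left + right[len(overlap):], overlap
    return None


# Toy peptides (hypothetical): only "ASNGGRVHC" is taken from the abstract.
frag_a = "MKLTPW" + "ASNGGRVHC"   # assumed N-terminal fragment
frag_b = "ASNGGRVHC" + "QEDFYK"   # assumed C-terminal fragment

result = merge_fragments(frag_b, frag_a)
if result is not None:
    merged, overlap = result
    print(f"overlap={overlap} merged={merged}")
```

In practice the peptides would come from translating the open reading frames of the two transcripts, and a shared overlap would be treated as a hint that the fragments belong to one gene, to be confirmed by alignment against a reference protein.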

Highlights

  • The lack of reproducibility of results in biology is a contentious subject[3,4]

  • Several online resources exist for chickpea genomes and transcriptomes

  • There were 60 unmapped transcripts, some of which are mitochondrial transcripts, some are contamination, and the rest have no match in the complete BLAST ‘nt’ database

Introduction

The lack of reproducibility of results in biology is a contentious subject[3,4]. The problem is compounded by recent technological advances that generate “Big Data” and involve multiple programs and pipelines[5,6]. Inferences based on these results should not be subject to the same, or ideally any, unpredictability. The availability of the software used at each stage and of the intermediate data generated is key to enabling subsequent researchers to debug and track the veracity of results[7]. Chickpea (Cicer arietinum L.) is an important pulse crop with numerous nutritional and health benefits[8]. Several online resources exist for chickpea genomes and transcriptomes. The 68th United Nations General Assembly declared 2016 the International Year of Pulses (IYP).

