Abstract

Background

RNA-seq based on short reads generated by next generation sequencing technologies has become the main approach to study differential gene expression. Until now, the main applications of this technique have been to study the variation of gene expression in a whole organism, tissue or cell type under different conditions or at different developmental stages. However, RNA-seq also has great potential to be used in evolutionary studies to investigate gene expression divergence in closely related species.

Results

We show that the published genomes and annotations of the three closely related Drosophila species D. melanogaster, D. simulans and D. mauritiana have limitations for inter-specific gene expression studies. This is due to missing gene models in at least one of the genome annotations, unclear orthology assignments and significant gene length differences between the species. A comprehensive evaluation of four statistical frameworks (DESeq2, DESeq2 with length correction, RPKM-limma and RPKM-voom-limma) shows that none of these methods sufficiently accounts for inter-specific gene length differences, which inevitably results in false positive candidate genes. We propose that published reference genomes should be re-annotated before they are used as references for RNA-seq experiments, both to include as many genes as possible and to account for a potential length bias. We present a straightforward reciprocal re-annotation pipeline that allows a reliable comparison of expression for nearly all genes annotated in D. melanogaster.

Conclusions

We conclude that our reciprocal re-annotation of previously published genomes facilitates the analysis of significantly more genes in an inter-specific differential gene expression study. We propose that the established pipeline can easily be applied to re-annotate other genomes of closely related animals and plants to improve comparative expression analyses.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-016-2646-x) contains supplementary material, which is available to authorized users.

Highlights

  • RNA-seq based on short reads generated by next generation sequencing technologies has become the main approach to study differential gene expression

  • Length differences in reference genes introduce biases in differential expression studies. Since we find a high number of gene models with length differences > 49 bp in the published annotations and after the direct re-annotation (Fig. 1a and b; Additional file 2: Table S2), the three Drosophila genomes are excellent models to test whether length differences larger than the read length influence the statistical analysis of differential gene expression

  • After length correction (reads per kilobase per million, RPKM) and normalization with voom, we found a significant correlation between gene length differences and log2-fold changes when the published annotations and the directly re-annotated reference gene sets were used (Figure 2d, Additional file 6: Figure S4, Table 2). This correlation was slightly reduced compared to the RPKM-limma analysis, especially for the D. simulans and D. mauritiana comparison
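
To make the length bias described in the highlights more concrete, the following minimal Python sketch shows how a difference in annotated gene length can, on its own, produce a non-zero log2-fold change after RPKM length correction. It is an illustration under stated assumptions, not the authors' pipeline: the read count, library size and gene lengths are invented, the function names are hypothetical, and it assumes that both species yield the same number of reads for the gene (for example because reads only map to a shared region). The 49 bp threshold is the one mentioned in the highlight above.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene model per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_fold_change(value_a, value_b):
    """log2 ratio of two expression values (both must be > 0)."""
    return math.log2(value_a / value_b)


def exceeds_threshold(length_a_bp, length_b_bp, threshold_bp=49):
    """True if two orthologous gene models differ by more than threshold_bp."""
    return abs(length_a_bp - length_b_bp) > threshold_bp


# Hypothetical ortholog pair: identical read counts and library sizes,
# but the annotated gene model is 500 bp longer in species B.
reads = 500
library_size = 20_000_000
len_species_a = 1000
len_species_b = 1500

rpkm_a = rpkm(reads, len_species_a, library_size)
rpkm_b = rpkm(reads, len_species_b, library_size)
lfc = log2_fold_change(rpkm_a, rpkm_b)

print(f"length difference flagged: {exceeds_threshold(len_species_a, len_species_b)}")
print(f"RPKM A = {rpkm_a:.2f}, RPKM B = {rpkm_b:.2f}, log2FC = {lfc:.3f}")
```

Under these assumptions the log2-fold change equals log2 of the length ratio, so the apparent expression difference comes entirely from the annotation. In real data, read counts can also scale with gene length, so the direction and size of the bias depend on where reads map.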


Summary

Introduction

RNA-seq based on short reads generated by next generation sequencing technologies has become the main approach to study differential gene expression. RNA-seq has great potential to be used in evolutionary studies to investigate gene expression divergence in closely related species. Comparative studies of gene expression have been used to understand the regulation of a wide range of biological processes. The comparison of gene expression between both closely [9,10,11,12,13,14,15,16] and distantly related species [17,18,19,20] has great potential to help understand phenotypic divergence and species adaptations at a mechanistic level [21].
