Abstract

Background: The current growth in DNA sequencing techniques makes genome annotation a crucial task in the genomic era. Traditional gene finders focus on protein-coding sequences, but they are far from exhaustive: the number of annotated protein-coding genes keeps increasing as new experimental data and improved bioinformatics algorithms become available.

Results: In this context, AnABlast represents a novel in silico strategy based on the accumulation of short evolutionary signals identified by low-scoring protein sequence alignments. This strategy can highlight protein-coding regions in genomic sequences regardless of traditional homology or translation signatures. Here, we analyze the evolutionary information contained in the accumulation of these short signals. Using the Drosophila melanogaster genome, we establish optimal parameters for accurate gene prediction with AnABlast and show that this new strategy contributes significantly to identifying genes, exons and pseudogene regions yet to be discovered in both already annotated and new genomes.

Conclusions: AnABlast can be freely used to analyze genomic regions of whole genomes, where it helps to complete the previous annotation.
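The accumulation idea described in the Results can be illustrated with a minimal sketch. This is not AnABlast's implementation. It assumes that low-scoring alignment hits have already been mapped to genomic intervals, and the function names, the 0-based end-exclusive coordinates, and the `min_height` threshold are all hypothetical choices made for illustration. The sketch counts how many hits overlap each position and reports regions where that count reaches a chosen height.

```python
from typing import List, Tuple


def coverage_profile(hits: List[Tuple[int, int]], length: int) -> List[int]:
    """Count how many hits cover each position of a sequence.

    Hits are (start, end) intervals, 0-based and end-exclusive, and are
    assumed to lie within [0, length]. This is an illustrative convention,
    not one taken from the paper.
    """
    # Difference array: +1 where a hit starts, -1 where it ends.
    diff = [0] * (length + 1)
    for start, end in hits:
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives per-position coverage.
    profile, running = [], 0
    for delta in diff[:length]:
        running += delta
        profile.append(running)
    return profile


def call_peaks(profile: List[int], min_height: int) -> List[Tuple[int, int]]:
    """Return maximal runs [start, end) where coverage >= min_height."""
    peaks, start = [], None
    for pos, depth in enumerate(profile):
        if depth >= min_height and start is None:
            start = pos  # a peak opens here
        elif depth < min_height and start is not None:
            peaks.append((start, pos))  # the peak closed just before pos
            start = None
    if start is not None:
        peaks.append((start, len(profile)))  # peak runs to the sequence end
    return peaks


# Toy data: three overlapping hits and one isolated hit on a 16-bp sequence.
toy_hits = [(2, 6), (3, 7), (4, 8), (12, 14)]
toy_profile = coverage_profile(toy_hits, 16)
print(call_peaks(toy_profile, min_height=2))  # [(3, 7)]
```

In this toy example the isolated hit at positions 12 to 14 never reaches the threshold, so only the region where several weak signals pile up is reported.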

Highlights

  • The current growth in DNA sequencing techniques makes genome annotation a crucial task in the genomic era

  • The fine-tuning procedure was tested with an early version of a protein database, and we show that many new genes predicted by this algorithm are true genes that have since been incorporated into the current genome annotation of this organism

  • We show that AnABlast is useful for discovering small open reading frames (ORFs) and fossil sequences that are hidden from conventional gene-finding algorithms, and show how this new strategy can help to discover the complete set of protein-coding regions of a whole genome


Summary

Introduction

The current growth in DNA sequencing techniques makes genome annotation a crucial task in the genomic era. Traditional gene finders focus on protein-coding sequences, but they are far from exhaustive: the number of annotated protein-coding genes keeps increasing as new experimental data and improved bioinformatics algorithms become available. The Drosophila melanogaster (fruit fly) genome was sequenced in 2000, and 13,601 protein-coding genes were initially annotated by integrating the output of two gene finders, which predicted 13,189 and 17,464 genes, respectively [4]. Since that milestone, the number of fruit-fly genes has changed, and numerous significant discrepancies have arisen [5]. The FlyBase database currently puts this number at 14,133 [6], showing that the number of genes keeps increasing over time. A greater increase is expected to come from the discovery of new kinds of genes, such as those encoding proteins shorter than 100 amino acids, which could number in the thousands in the fruit-fly genome [7].

Casimiro-Soriguer et al. BMC Genomics (2020) 21:210
