Abstract

Background: Repetitive elements make up a large part of eukaryotic genomes. For example, about 40 to 50% of the human, mouse and rat genomes are repetitive. Identifying and classifying repeats is therefore an important step in genome annotation. This annotation step is traditionally performed using alignment-based methods, either in a de novo approach or by aligning the genome sequence to a species-specific set of repetitive sequences. Recently, Li (Bioinformatics 35:4408–4410, 2019) developed a novel software tool, dna-brnn, to annotate repetitive sequences using a recurrent neural network trained on sample annotations of repetitive elements.

Results: We have developed the methods of dna-brnn further and engineered a new software tool, DeepGRP. It combines the basic concepts of Li (Bioinformatics 35:4408–4410, 2019) with the attention mechanism, a technique developed for neural machine translation, for the task of nucleotide-level annotation of repetitive elements. An evaluation on the human genome shows a 20% improvement of the Matthews correlation coefficient for the predictions delivered by DeepGRP when compared to dna-brnn. DeepGRP predicts two additional classes of repeats (compared to dna-brnn) and is able to transfer repeat annotations, using RepeatMasker-based training data, to a different species (mouse). Additionally, we could show that DeepGRP predicts repeats that are annotated in the Dfam database but not annotated by RepeatMasker. DeepGRP is highly scalable due to its implementation in the TensorFlow framework. For example, the GPU-accelerated version of DeepGRP is approximately 1.8 times faster than dna-brnn, approximately 8.6 times faster than RepeatMasker and over 100 times faster than HMMER searching for models of the Dfam database.

Conclusions: By incorporating methods from neural machine translation, DeepGRP achieves a consistent improvement of the quality of the predictions compared to dna-brnn. Improved running times are obtained by employing TensorFlow as the implementation framework and by using GPUs. By incorporating two additional classes of repeats, DeepGRP provides more complete annotations, which were evaluated against three state-of-the-art tools for repeat annotation.
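
The Matthews correlation coefficient cited in the Results can be illustrated with a short sketch. It assumes a binary, one-vs-rest view of a single repeat class at nucleotide level; the abstract does not state whether the reported value is computed per class or as a multi-class score, so this is not the authors' evaluation code. The function names `confusion_counts` and `mcc` are hypothetical.

```python
import math


def confusion_counts(truth, pred, positive=1):
    """Count TP, TN, FP, FN for one class over two equal-length label sequences."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    tp = tn = fp = fn = 0
    for t, p in zip(truth, pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion counts.

    Returns 0.0 when the denominator is zero (a common convention,
    adopted here as an assumption).
    """
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0


# Toy per-nucleotide labels: 1 = inside a repeat of the class, 0 = outside.
truth = [0, 1, 1, 0, 1, 0]
pred = [0, 1, 0, 0, 1, 1]
score = mcc(*confusion_counts(truth, pred))
```

The coefficient ranges from -1 (total disagreement) to 1 (perfect agreement), and, unlike plain accuracy, it stays informative when repeats of a class cover only a small fraction of the genome.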

Highlights

  • As of May 2021, the Genomes OnLine Database (GOLD) [1] lists about 35 000 eukaryotic genome sequencing projects

  • In contrast to [11], we have systematically explored the latitude of the model by applying a well-tested hyperparameter optimization technique to determine seven hyperparameters used in DeepGRP

  • These and further details of the training are documented to allow our results to be reproduced. In addition to these technical contributions, we show that the recurrent neural network (RNN)-based approach of DeepGRP can handle two additional classes of repeats, one of which Li [11] considered not accessible by this approach


Introduction

As of May 2021, the Genomes OnLine Database (GOLD) [1] lists about 35 000 eukaryotic genome sequencing projects. The function of repetitive elements has been discussed for a long time [3], and only recently has the importance of repeats in cellular processes begun to become apparent [4]. Repetitive elements are important as binding regions for proteins, for example those involved in cellular replication [5], and they contain signals for transcription, chromatin assembly and nuclear localization [6] or influence the expression of coding sequences [7]. Identifying and classifying repeats is an important step in genome annotation. This annotation step is traditionally performed using alignment-based methods, either in a de novo approach or by aligning the genome sequence to a species-specific set of repetitive sequences. Li (Bioinformatics 35:4408–4410, 2019) developed a novel software tool, dna-brnn, to annotate repetitive sequences using a recurrent neural network trained on sample annotations of repetitive elements.
