Abstract

Background

With rapid advancements in technology, the sequences of thousands of species’ genomes are becoming available. Within the sequences are repeats that comprise significant portions of genomes. Successful annotations thus require accurate discovery of repeats. As species-specific elements, repeats in newly sequenced genomes are likely to be unknown. Therefore, annotating newly sequenced genomes requires tools to discover repeats de novo. However, the currently available de novo tools have limitations concerning the size of the input sequence, ease of use, sensitivities to major types of repeats, consistency of performance, speed, and false positive rate.

Results

To address these limitations, I designed and developed Red, applying Machine Learning. Red is the first repeat-detection tool capable of labeling its training data and training itself automatically on an entire genome. Red is easy to install and use. It is sensitive to both transposons and simple repeats; in contrast, available tools such as RepeatScout and ReCon are sensitive to transposons, and WindowMasker to simple repeats. Red performed consistently well on seven genomes; the other tools performed well only on some genomes. Red is much faster than RepeatScout and ReCon and has a much lower false positive rate than WindowMasker. On human genes with five or more copies, Red was more specific than RepeatScout by a wide margin. When tested on genomes of unusual nucleotide compositions, Red located repeats with high sensitivities and maintained moderate false positive rates. Red outperformed the related tools on a bacterial genome. Red identified 46,405 novel repetitive segments in the human genome. Finally, Red is capable of processing assembled and unassembled genomes.

Conclusions

Red’s innovative methodology and its excellent performance on seven different genomes represent a valuable advancement in the field of repeats discovery.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0654-5) contains supplementary material, which is available to authorized users.

Highlights

  • With rapid advancements in technology, the sequences of thousands of species’ genomes are becoming available

  • Evaluation measures: the following criteria were used in this study to evaluate Red, RepeatScout, ReCon, and WindowMasker: Sensitivity (SN), Specificity (SP), Percentage Predicted (PP), False Positive Length (FPL), Potential Repeats (PR), Time, and Memory

  • Red is much faster than RepeatScout and ReCon; the difference in speed between Red and these two tools is clear when it comes to large genomes


Summary

Results

First, I define the criteria to evaluate Red and the three related tools.

Evaluation measures. The following criteria were used in this study to evaluate Red, RepeatScout, ReCon, and WindowMasker: Sensitivity (SN), Specificity (SP), Percentage Predicted (PP), False Positive Length (FPL), Potential Repeats (PR), Time, and Memory.

On the genomes of Dictyostelium discoideum, Plasmodium falciparum, and Mycobacterium tuberculosis, which have unusual nucleotide compositions, the FPLs of Red were 6–11 times lower than those of WindowMasker. These results show that Red achieved high sensitivity, consistent performance, and high speed while maintaining low to moderate false positive rates.

If repeats detected by Red_sr included non-repetitive sequences, the percentage of predicted repeats and the total length of potential repeats would be much higher, and the SP_exons would be much lower, than those obtained by Red_dm6. This was not the case: both models performed comparably when evaluated according to these three criteria. These results demonstrate the successful application of Red to unassembled genomes.
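The summary names the evaluation measures but does not define them here. As a rough illustration only, the sketch below assumes nucleotide-level definitions for two of them: Sensitivity as the percentage of reference repeat nucleotides covered by the predictions, and Percentage Predicted as the percentage of the genome labeled as repeats. The function names, the half-open interval representation, and these definitions are assumptions for illustration, not the study's implementation.

```python
# Hypothetical sketch: the definitions below are assumptions, not taken from the study.
# Intervals are half-open (start, end) coordinates on a single sequence.

def merge(intervals):
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def total_length(intervals):
    """Number of nucleotides covered by the intervals (overlaps counted once)."""
    return sum(end - start for start, end in merge(intervals))

def overlap_length(a, b):
    """Number of nucleotides covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

def sensitivity(predicted, reference):
    """Assumed SN: percentage of reference repeat nucleotides that were predicted."""
    return 100.0 * overlap_length(predicted, reference) / total_length(reference)

def percentage_predicted(predicted, genome_length):
    """Assumed PP: percentage of the genome labeled as repeats."""
    return 100.0 * total_length(predicted) / genome_length

# Toy coordinates, made up for the example.
predicted = [(0, 10), (5, 20), (50, 60)]
reference = [(10, 30), (55, 65)]
print(sensitivity(predicted, reference))
print(percentage_predicted(predicted, 100))
```

Merging intervals first keeps overlapping predictions from being counted twice, which matters when a tool reports nested or adjacent repeats.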
