It has become clear that repetitive sequences have played multiple roles in eukaryotic genome evolution, including increasing genetic diversity through mutation, altering gene expression, and facilitating the generation of novel genes. However, ab initio identification of repetitive elements can be difficult. Several classical ab initio repeat-finding tools have already been presented and compared, but their completeness and accuracy in detecting repeats remain rather poor. To this end, we propose a new ab initio repeat-finding tool, named HashRepeatFinder, which is based on a hash index and word counting. Furthermore, we assessed the performance of HashRepeatFinder against two well-known tools, RepeatScout and Repeatfinder, on the human genome assembly hg19. The results support the following three conclusions: (1) the completeness of HashRepeatFinder is the best among the three compared tools in almost all chromosomes, especially in chr9 (8 times that of RepeatScout and 10 times that of Repeatfinder); (2) in terms of detecting large repeats, HashRepeatFinder also performed best in all chromosomes, especially in chr3 (24 times that of RepeatScout and 250 times that of Repeatfinder) and chr19 (12 times that of RepeatScout and 60 times that of Repeatfinder); (3) in terms of accuracy, HashRepeatFinder can merge the abundant repeats with high accuracy.
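The abstract names a hash index and word counting as the basis of the tool but does not describe its internals. As a rough illustration of what those two ideas can mean in general, the sketch below indexes every fixed-length word (k-mer) of a sequence in a hash table and keeps the words that occur more than once. It is not HashRepeatFinder's actual algorithm; the function names `build_word_index` and `repeated_words`, the parameters `k` and `min_count`, and the toy sequence are all hypothetical and chosen only for this example.

```python
from collections import defaultdict


def build_word_index(sequence, k):
    """Map each k-length word to the list of positions where it starts.

    Illustrative only: a plain dict serves as the hash index here.
    """
    index = defaultdict(list)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if "N" in word:  # skip words that contain ambiguous bases
            continue
        index[word].append(i)
    return index


def repeated_words(index, min_count=2):
    """Keep only the words that occur at least min_count times."""
    return {w: pos for w, pos in index.items() if len(pos) >= min_count}


if __name__ == "__main__":
    toy = "ACGTACGTTTACGT"  # made-up sequence, not genome data
    idx = build_word_index(toy, 4)
    for word, positions in sorted(repeated_words(idx).items()):
        print(word, positions)
```

On the toy sequence with `k = 4`, the words `ACGT` (positions 0, 4, 10) and `TACG` (positions 3, 9) are reported as repeated. A real repeat finder would still need to extend and merge such seed words into full repeat families, a step this sketch leaves out.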