“One code to find them all”: a perl tool to conveniently parse RepeatMasker output files

Marc Bailly-Bechet, Annabelle Haudry, Emmanuelle Lerat

doi:10.1186/1759-8753-5-13

Abstract

Background

Of the different bioinformatic methods used to recover transposable elements (TEs) in genome sequences, one of the most commonly used procedures is the homology-based method proposed by the RepeatMasker program. RepeatMasker generates several output files, including the .out file, which provides annotations for all detected repeats in a query sequence. However, a remaining challenge consists of identifying the different copies of TEs that correspond to the identified hits. This step is essential for any evolutionary/comparative analysis of the different copies within a family. Different possibilities can lead to multiple hits corresponding to a unique copy of an element, such as the presence of large deletions/insertions or undetermined bases, and distinct consensus corresponding to a single full-length sequence (like for long terminal repeat (LTR)-retrotransposons). These possibilities must be taken into account to determine the exact number of TE copies.

Results

We have developed a perl tool that parses the RepeatMasker .out file to better determine the number and positions of TE copies in the query sequence, in addition to computing quantitative information for the different families. To determine the accuracy of the program, we tested it on several RepeatMasker .out files corresponding to two organisms (Drosophila melanogaster and Homo sapiens) for which the TE content has already been largely described and which present great differences in genome size, TE content, and TE families.

Conclusions

Our tool provides access to detailed information concerning the TE content in a genome at the family level from the .out file of RepeatMasker. This information includes the exact position and orientation of each copy, its proportion in the query sequence, and its quality compared to the reference element. In addition, our tool allows a user to directly retrieve the sequence of each copy and obtain the same detailed information at the family level when a local library with incomplete TE class/subclass information was used with RepeatMasker. We hope that this tool will be helpful for people working on the distribution and evolution of TEs within genomes.
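To make the problem of several hits belonging to one copy concrete, the following Python sketch reads the whitespace-separated columns of a RepeatMasker .out file and joins hits of the same repeat that lie close together on the same strand. It is an illustration only, not the authors' perl tool or its merging rules: the function names, the `max_gap` threshold, and the sample records (`seq1`, `TE_A`, `LTR/Example`) are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    query: str    # query sequence name
    start: int    # begin position in the query
    end: int      # end position in the query
    strand: str   # "+" or "-" (RepeatMasker writes "C" for complement)
    repeat: str   # name of the matching repeat
    family: str   # repeat class/family


def parse_out(lines):
    """Parse data lines of a RepeatMasker .out file into Hit records."""
    hits = []
    for line in lines:
        fields = line.split()
        # Header and blank lines are skipped: data lines start with an
        # integer Smith-Waterman score and have at least 11 columns.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        strand = "-" if fields[8] == "C" else "+"
        hits.append(Hit(fields[4], int(fields[5]), int(fields[6]),
                        strand, fields[9], fields[10]))
    return hits


def merge_nearby(hits, max_gap=100):
    """Join hits of the same repeat, query, and strand that are separated
    by at most max_gap bases. A deliberately naive, hypothetical rule."""
    groups = defaultdict(list)
    for h in hits:
        groups[(h.query, h.repeat, h.strand)].append(h)
    copies = []
    for group in groups.values():
        group.sort(key=lambda h: h.start)
        current = group[0]
        for h in group[1:]:
            if h.start - current.end <= max_gap:
                current = current._replace(end=max(current.end, h.end))
            else:
                copies.append(current)
                current = h
        copies.append(current)
    return sorted(copies, key=lambda h: (h.query, h.start))


# Made-up records laid out like a .out file (three header lines, then hits).
SAMPLE = """\
   SW  perc perc perc  query    position in query     matching repeat       position in repeat
score  div. del. ins.  sequence begin end   (left)    repeat   class/family begin end (left) ID

 1200  5.0  0.0  0.0   seq1     1000  1500  (9000) +  TE_A     LTR/Example  1     500 (300)  1
  900  4.0  1.0  0.5   seq1     1550  1900  (8600) +  TE_A     LTR/Example  520   800 (0)    1
  700  8.0  0.0  0.0   seq1     5000  5400  (5100) C  TE_A     LTR/Example  (0)   800 400    2
"""

if __name__ == "__main__":
    for copy in merge_nearby(parse_out(SAMPLE.splitlines())):
        print(copy.query, copy.start, copy.end, copy.strand, copy.repeat)
```

In the sample, the first two hits are 50 bases apart and are reported as one copy, while the complement-strand hit stays separate. A distance threshold alone cannot handle cases such as an LTR-retrotransposon whose LTRs and internal region match distinct consensus sequences, which is part of why a dedicated tool is needed.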

Highlights

Of the different bioinformatic methods used to recover transposable elements (TEs) in genome sequences, one of the most commonly used procedures is the homology-based method proposed by the RepeatMasker program.
These methods fall into three main types of approaches to recover TE sequences: homology-based approaches that search for a reference sequence in a query genome; structure-based approaches that search for particular structural features of certain TE classes, such as the presence of two long terminal repeats (LTRs) at the extremities of LTR-retrotransposons; and de novo approaches that principally use the repetitive nature of TEs to discover them.
One of the most commonly used procedures to find occurrences of known TEs remains the homology-based method proposed by the RepeatMasker program [13] because it is easy to use, rapid, and efficient [14,15].

Journal: Mobile DNA
Publication Date: May 1, 2014
Citations: 183
License type: cc-by
