Abstract

Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Dfam is an open access database of families of repetitive DNA elements, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). The initial release of Dfam, featured in the 2013 NAR Database Issue, contained 1143 families of repetitive elements found in humans, and was used to produce more than 100 Mb of additional annotation of TE-derived regions in the human genome, with improved speed. Here, we describe recent advances, most notably expansion to 4150 total families including a comprehensive set of known repeat families from four new organisms (mouse, zebrafish, fly and nematode). We describe improvements to coverage, and to our methods for identifying and reducing false annotation. We also describe updates to the website interface. The Dfam website has moved to http://dfam.org. Seed alignments, profile HMMs, hit lists and other underlying data are available for download.

Highlights

  • Annotation of the repetitive content of a genome depends on the initial discovery of repeat families present in that genome (so-called de novo identification, e.g. [1,2,3]), followed by homology-based annotation [4], in which tools are used to seek all recognizable members of those families

  • For each transposable element (TE) family, Dfam contains a multiple sequence alignment and a profile hidden Markov model (HMM) constructed from that alignment

  • The profile HMM search tool nhmmer [11] has been incorporated as a search engine for RepeatMasker, so that the Dfam profile library can be used by RepeatMasker to increase the amount of genomic sequence that can be identified as derived from TE activity

Introduction

Annotation of the repetitive content of a genome depends on the initial discovery of repeat families present in that genome (so-called de novo identification, e.g. [1,2,3]), followed by homology-based annotation [4], in which tools are used to seek all recognizable members of those families.

(iii) Another problem can arise when a low-copy-number element contains similarity to a high-copy-number element, e.g. a complex repeat like SVA, which includes an Alu. Even if RepeatMasker mistakenly annotates only a very small fraction of the high-copy-number element's instances as fragments of the low-copy-number element, this small fraction may overwhelm the count of true instances in that region of the seed alignment.
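
The arithmetic behind this last point can be illustrated with a minimal sketch. All numbers below are hypothetical and chosen only for illustration; they are not taken from Dfam or RepeatMasker output, and the function name `false_fraction` is our own. The sketch computes what share of the sequences covering the shared region of a seed alignment would be misannotated fragments, given the number of true copies of the low-copy-number element, the number of copies of the high-copy-number element, and the rate at which the latter are misannotated.

```python
def false_fraction(true_copies: int, high_copy_count: int,
                   misannotation_rate: float) -> float:
    """Share of sequences in the shared region that are misannotated.

    true_copies        -- genuine instances of the low-copy-number element
    high_copy_count    -- instances of the high-copy-number element
    misannotation_rate -- fraction of high-copy instances wrongly labeled
                          as fragments of the low-copy-number element
    """
    false_hits = high_copy_count * misannotation_rate
    total = true_copies + false_hits
    return false_hits / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    true_copies = 3_000
    high_copy_count = 1_000_000
    for rate in (0.0001, 0.001, 0.01):
        share = false_fraction(true_copies, high_copy_count, rate)
        print(f"misannotation rate {rate:.2%}: "
              f"{share:.1%} of sequences in the region are false")
```

Under these assumed values, a misannotation rate of only 0.1% already makes a quarter of the sequences in the shared region false, and a rate of 1% makes them the majority, which is the effect described above.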
