Combined Evidence Annotation of Transposable Elements in Genome Sequences

Hadi Quesneville, Dominique Anxolabehere, Casey M Bergman, Danielle Nouaud, Delphine Autard, Michael Ashburner, Olivier Andrieu

doi:10.1371/journal.pcbi.0010022

Open Access

Abstract

Transposable elements (TEs) are mobile, repetitive sequences that make up significant fractions of metazoan genomes. Despite their near ubiquity and importance in genome and chromosome biology, most efforts to annotate TEs in genome sequences rely on the results of a single computational program, RepeatMasker. In contrast, recent advances in gene annotation indicate that high-quality gene models can be produced by combining multiple independent sources of computational evidence. To elevate the quality of TE annotations to a level comparable to that of gene models, we have developed a combined evidence-model TE annotation pipeline, analogous to systems used for gene annotation, by integrating results from multiple homology-based and de novo TE identification methods. As proof of principle, we have annotated “TE models” in Drosophila melanogaster Release 4 genomic sequences using the combined computational evidence derived from RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, TE-HMM and the previous Release 3.1 annotation. Our system is designed for use with the Apollo genome annotation tool, allowing automatic results to be curated manually to produce reliable annotations. The euchromatic TE fraction of D. melanogaster is now estimated at 5.3% (cf. 3.86% in Release 3.1), and we found a substantially higher number of TEs (n = 6,013) than previously identified (n = 1,572). Most of the new TEs derive from small fragments a few hundred nucleotides long and from highly abundant families not previously annotated (e.g., INE-1). We also estimated that 518 TE copies (8.6%) are inserted into at least one other TE, forming a nest of elements. The pipeline allows rapid and thorough annotation of even the most complex TE models, including highly deleted and/or nested elements such as those often found in heterochromatic sequences. Our pipeline can be easily adapted to other genome sequences, such as those of the D. melanogaster heterochromatin or other species in the genus Drosophila.

Highlights

Transposable elements (TEs) are mobile, repetitive DNA sequences that constitute a structurally dynamic component of genomes.
We initially tested three methods for TE prediction: (i) BLASTER using BLASTN followed by chaining with MATCHER (BLRn), (ii) RepeatMasker using default parameters (RM), and (iii) RM using default parameters followed by chaining with MATCHER (RMm).
TEs have been predicted anonymously using four different methods: (i) an all-by-all genome comparison with BLASTER using BLASTN followed by chaining with MATCHER and grouping with GROUPER (BLRa), (ii) RECON, using default parameters, (iii) BLASTER using TBLASTX with the entire Repbase Update as the database, followed by chaining with MATCHER (BLRtx), and (iv) a hidden Markov model that detects TE sequences based on nucleotide composition (TEHMM).
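
The idea of combining evidence from several prediction methods can be illustrated with a minimal sketch. It is not the pipeline described here, and it does not reproduce MATCHER or GROUPER; it only assumes that each method reports hits as closed coordinate intervals on one sequence, and it merges overlapping hits into candidate regions while recording which methods support each region. The `Hit` record, the `merge_evidence` function, and all coordinates are hypothetical and chosen for illustration.

```python
from collections import namedtuple

# Hypothetical record: one prediction from one method on a single sequence.
Hit = namedtuple("Hit", ["method", "start", "end"])


def merge_evidence(hits):
    """Merge overlapping hits into regions and track the supporting methods.

    Intervals are treated as closed; two hits are merged when the next hit
    starts at or before the end of the current region.
    """
    regions = []
    for hit in sorted(hits, key=lambda h: (h.start, h.end)):
        if regions and hit.start <= regions[-1]["end"]:
            # Extend the current region and note the additional evidence.
            regions[-1]["end"] = max(regions[-1]["end"], hit.end)
            regions[-1]["methods"].add(hit.method)
        else:
            regions.append(
                {"start": hit.start, "end": hit.end, "methods": {hit.method}}
            )
    return regions


# Made-up coordinates, labeled with the method abbreviations used above.
hits = [
    Hit("RM", 100, 400),
    Hit("BLRn", 350, 900),
    Hit("RECON", 2000, 2300),
    Hit("BLRtx", 850, 1000),
]
regions = merge_evidence(hits)
for region in regions:
    print(region["start"], region["end"], sorted(region["methods"]))
```

In this toy input, the three overlapping hits collapse into one region supported by three methods, while the isolated RECON hit remains a region with a single line of evidence. A curator might treat the two cases differently, which is the kind of decision the manual curation step is meant to support.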

Introduction

Transposable elements (TEs) are mobile, repetitive DNA sequences that constitute a structurally dynamic component of genomes. TEs represent quantitatively important components of genome sequences (e.g., 44.4% of the human genome; [1]), and there is no doubt that modern genomic DNA has evolved in close association with TEs. TEs show high species specificity, and the number and types of TEs can differ quite dramatically between even closely related organisms. Some TE insertions may even have become domesticated to play roles in the normal functions of the host (see [2] for review). Despite their manifold effects, abundance, and ubiquity, we understand very little about most aspects of TE biology.