Abstract

Transposable Elements (TEs) are mobile DNA sequences that make up significant fractions of amniote genomes. However, they are difficult to detect and annotate ab initio because of their variable features, lengths and clade-specific variants. We have addressed this problem by refining and developing a Comprehensive ab initio Repeat Pipeline (CARP) to identify and cluster TEs and other repetitive sequences in genome assemblies. The pipeline begins with a pairwise alignment using krishna, a custom aligner. Single linkage clustering is then carried out to produce families of repetitive elements. Consensus sequences are then filtered for protein coding genes and then annotated using Repbase and a custom library of retrovirus and reverse transcriptase sequences. This process yields three types of family: fully annotated, partially annotated and unannotated. Fully annotated families reflect recently diverged/young known TEs present in Repbase. The remaining two types of families contain a mixture of novel TEs and segmental duplications. These can be resolved by aligning these consensus sequences back to the genome to assess copy number vs. length distribution. Our pipeline has three significant advantages compared to other methods for ab initio repeat identification: 1) we generate not only consensus sequences, but keep the genomic intervals for the original aligned sequences, allowing straightforward analysis of evolutionary dynamics, 2) consensus sequences represent low-divergence, recently/currently active TE families, 3) segmental duplications are annotated as a useful by-product. We have compared our ab initio repeat annotations for 7 genome assemblies to other methods and demonstrate that CARP compares favourably with RepeatModeler, the most widely used repeat annotation package.

Highlights

  • Thousands of genomes have been sequenced thanks to decreased cost and increased speed of DNA sequencing methods

  • The explosion of genome sequences has expanded our knowledge of repetitive DNA, which is an important component of the genomes of almost all eukaryotes

  • comprehensive ab initio repeat pipeline (CARP) is based on whole-genome pairwise local alignment, followed by clustering and consensus generation from clusters. This means all consensus sequences generated by CARP can be traced back to their input sequences and the original genomic sequence intervals of the input sequences

Read more

Summary

Introduction

Thousands of genomes have been sequenced thanks to decreased cost and increased speed of DNA sequencing methods. The explosion of genome sequences has expanded our knowledge of repetitive DNA, which is an important component of the genomes of almost all eukaryotes. Repetitive DNA is made up of sequences that have been duplicated. Superior ab initio identification and characterisation of TEs and segmental duplications has not yet submitted them to a repository. This assembly is available upon request to other investigators wishing to work on transposable elements upon request to Prof Guojie Zhang of Copenhagen University, email:guojie.zhang@bio. This assembly is available upon request to other investigators wishing to work on transposable elements upon request to Prof Guojie Zhang of Copenhagen University, email:guojie.zhang@bio. ku.dk

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call