Abstract

Tandem repeats (TRs) are often present in proteins that have crucial functions, are responsible for resistance or pathogenicity, or are associated with infectious or neurodegenerative diseases. This motivates numerous studies of TRs and their evolution, which require accurate multiple sequence alignment. TRs may be lost or inserted at any position of a TR region by replication slippage or recombination, but current methods assume fixed unit boundaries and yet still have high complexity. We present a new global graph-based alignment method that does not restrict TR unit indels by unit boundaries. TR indels are modeled separately and penalized using the phylogeny-aware alignment algorithm. This improves the accuracy of reconstructed alignments, disentangles TRs, and allows indel events and rates to be measured in a biologically meaningful way. Our method detects not only duplication events but also all changes in TR regions owing to recombination, strand slippage and other events that insert or delete TR units. We evaluate our method on simulations that incorporate TR evolution, either by sampling TRs from a profile hidden Markov model or by mimicking strand slippage with duplications. The new method is illustrated on a family of type III effectors, a pathogenicity determinant in the agriculturally important bacterium Ralstonia solanacearum. We show that TR indel rate variation contributes to the diversification of this protein family.

Highlights

  • Today, accurate multiple sequence alignment (MSA) is frequently needed in genomics and molecular biology studies

  • To assess the advantage of our tandem repeat (TR)-aware alignment algorithm, ProGraphMSA+TR was run with no prior knowledge of TR units, with the true TR units known from the simulation, or with TR information reconstructed by the TR predictor TRUST [24]

  • Performance was measured by (i) the number of correctly aligned character pairs relative to the true reference alignment and (ii) the number of inferred TR unit indels
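
The first performance measure above, the number of correctly aligned character pairs, can be illustrated with a minimal sketch. It assumes alignments are given as lists of equal-length gapped strings with "-" as the gap symbol and with sequences in the same order in both alignments. The function names `aligned_pairs` and `correct_pair_count` and the toy sequences are hypothetical and are not taken from the paper or from ProGraphMSA.

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol


def aligned_pairs(msa):
    """Return the set of residue pairs that share an alignment column.

    Each residue is identified as (sequence index, ungapped position),
    so pairs can be compared between two alignments of the same sequences.
    """
    n_cols = len(msa[0])
    if any(len(row) != n_cols for row in msa):
        raise ValueError("all rows of an alignment must have equal length")
    next_pos = [0] * len(msa)  # next ungapped position in each sequence
    pairs = set()
    for col in range(n_cols):
        present = []
        for s, row in enumerate(msa):
            if row[col] != GAP:
                present.append((s, next_pos[s]))
                next_pos[s] += 1
        # every two residues in the same column form one aligned pair
        pairs.update(combinations(present, 2))
    return pairs


def correct_pair_count(test_msa, reference_msa):
    """Count residue pairs aligned in both the test and reference MSA."""
    return len(aligned_pairs(test_msa) & aligned_pairs(reference_msa))


# Toy example (made-up sequences, not data from the paper)
reference = ["ACD-E", "A-DFE"]
test = ["ACD-E", "AD-FE"]
print(correct_pair_count(test, reference), "of",
      len(aligned_pairs(reference)), "reference pairs recovered")
```

Identifying residues by their ungapped position rather than by column index keeps the comparison independent of where gaps were placed, which is what allows a test alignment to be scored against the true one.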


Summary

Introduction

Today, accurate multiple sequence alignment (MSA) is frequently needed in genomics and molecular biology studies. In an alignment defined by the evolution of sequence residues (rather than by molecular structure), characters in the same column are assumed to be homologous, meaning that they have evolved from a common ancestral character. A similar approach has been applied to next-generation sequences from environmental samples to provide more accurate extensions of reference alignments [2]. The recent drive for biologically more meaningful alignments [3] has included developments that account for special sequence features such as protein domains, repeats, rearrangements and promoter regions [4,5,6,7,8,9]. We focus on improving the strategy for aligning sequences with tandem repeats (TRs).

