Abstract

Background: Many important applications in bioinformatics, including sequence alignment and protein family profiling, employ sequence weighting schemes to mitigate the effects of non-independence of homologous sequences and under- or over-representation of certain taxa in a dataset. These schemes aim to assign high weights to sequences that are ‘novel’ compared to the others in the same dataset, and low weights to sequences that are over-represented.

Results: We formalise this principle by rigorously defining the evolutionary ‘novelty’ of a sequence within an alignment. This results in new sequence weights that we call ‘phylogenetic novelty scores’. These scores have various desirable properties, and we showcase their use by considering, as an example application, the inference of character frequencies at an alignment column, which is important, for example, in protein family profiling. We give computationally efficient algorithms for calculating our scores and, using simulations, show that they are versatile and can improve the accuracy of character frequency estimation compared to existing sequence weighting schemes.

Conclusions: Our phylogenetic novelty scores can be useful when an evolutionarily meaningful system for adjusting for uneven taxon sampling is desired. They have numerous possible applications, including estimation of evolutionary conservation scores and sequence logos, identification of targets in conservation biology, and improving and measuring sequence alignment accuracy.

Highlights

  • Many important applications in bioinformatics, including sequence alignment and protein family profiling, employ sequence weighting schemes to mitigate the effects of non-independence of homologous sequences and under- or over-representation of certain taxa in a dataset

  • Phylogenetic novelty scores (PNS) show similar trends to the earlier weighting schemes of Henikoff and Henikoff [5] (HH94) and Gerstein et al. [9] (GSC94), assigning higher weights to phylogenetically isolated taxa and lower weights to taxa within clades that contain many other closely related taxa (Fig. 2)

  • $R = S_{\max} - S_{\mathrm{obs}} = \log_2 B - \left( -\sum_{j=1}^{B} p(j) \log_2 p(j) \right)$, where $p(j)$ is the frequency of character $j$ at a given alignment column, and $B$ is the number of characters ($B = 4$ for nucleotides and $B = 20$ for amino acids). $S_{\max}$ is the maximum possible entropy at the considered position, equal to $\log_2 B$, while $S_{\mathrm{obs}}$ is the observed value. Typically, the $p(j)$ are inferred from the observed character frequencies at an alignment column; as we have shown, our PNS can significantly improve the inference of these frequencies, and of conservation scores
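
To make the role of sequence weights in the information content R concrete, here is a minimal Python sketch. It estimates p(j) at a single column as a weighted frequency and then evaluates R as defined in the last highlight. The function names, the toy column, and the weight values are all hypothetical and chosen only for illustration; in particular, the weights are not phylogenetic novelty scores, whose calculation is not described in this summary.

```python
import math
from collections import defaultdict


def weighted_frequencies(column, weights):
    """Estimate p(j) at one alignment column as weight-normalised counts."""
    totals = defaultdict(float)
    for char, weight in zip(column, weights):
        totals[char] += weight
    norm = sum(totals.values())
    return {char: total / norm for char, total in totals.items()}


def information_content(freqs, alphabet_size):
    """R = S_max - S_obs, with S_max = log2(B) and S_obs the Shannon entropy."""
    s_obs = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(alphabet_size) - s_obs


# Toy nucleotide column: three sequences share 'A', one carries 'C'.
column = ["A", "A", "A", "C"]

# Equal weights reproduce the plain observed frequencies.
uniform_weights = [1.0, 1.0, 1.0, 1.0]

# Hypothetical weights (NOT real PNS values) that down-weight the three
# similar sequences and up-weight the isolated one.
illustrative_weights = [0.5, 0.5, 0.5, 1.5]

p_uniform = weighted_frequencies(column, uniform_weights)
p_weighted = weighted_frequencies(column, illustrative_weights)

r_uniform = information_content(p_uniform, alphabet_size=4)
r_weighted = information_content(p_weighted, alphabet_size=4)
```

With equal weights the column looks fairly conserved (p(A) = 0.75). With the illustrative weights the two characters become equally frequent and R drops, which shows how the choice of weighting scheme feeds directly into conservation scores.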


Summary

Introduction

Many important applications in bioinformatics, including sequence alignment and protein family profiling, employ sequence weighting schemes to mitigate the effects of non-independence of homologous sequences and under- or over-representation of certain taxa in a dataset. These schemes aim to assign high weights to sequences that are ‘novel’ compared to the others in the same dataset, and low weights to sequences that are over-represented. This method has the advantage of being very fast to calculate, and of giving higher weights to sequences with more rare characters, which are likely to be more distantly related.
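
The summary does not name the fast method it refers to. Assuming it is the position-based scheme of Henikoff and Henikoff (HH94) mentioned in the highlights, the idea can be sketched as follows: at each column, a sequence receives 1 / (k × n), where k is the number of distinct characters in the column and n is the number of sequences sharing its character; contributions are summed over columns and normalised. The function name and the toy alignment are hypothetical, and gaps are treated as ordinary characters for simplicity.

```python
from collections import Counter


def position_based_weights(alignment):
    """Position-based sequence weights in the style of HH94.

    `alignment` is a list of equal-length strings. Simplification: gap
    characters, if present, are counted like any other character.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)  # number of distinct characters in this column
        for i, char in enumerate(column):
            # Rare characters (small counts[char]) contribute more weight.
            weights[i] += 1.0 / (k * counts[char])
    total = sum(weights)
    return [w / total for w in weights]  # normalise so weights sum to 1


# Toy alignment: the last sequence carries a character ('T' in column 1)
# that no other sequence has.
alignment = ["ACGT", "ACGT", "ACGA", "TCGA"]
weights = position_based_weights(alignment)
```

Here the fourth sequence ends up with the largest weight because of its unique character in the first column. The calculation needs only one pass over the alignment and no tree, which is why such a scheme is fast.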
