Abstract

Background

The prediction of ancestral protein sequences from multiple sequence alignments is useful for many bioinformatics analyses. Predicting ancestral sequences is not a simple procedure and relies on accurate alignments and phylogenies. Several algorithms exist based on Maximum Parsimony or Maximum Likelihood methods but many current implementations are unable to process residues with gaps, which may represent insertion/deletion (indel) events or sequence fragments.

Results

Here we present a new algorithm, GASP (Gapped Ancestral Sequence Prediction), for predicting ancestral sequences from phylogenetic trees and the corresponding multiple sequence alignments. Alignments may be of any size and contain gaps. GASP first assigns the positions of gaps in the phylogeny before using a likelihood-based approach centred on amino acid substitution matrices to assign ancestral amino acids. Important outgroup information is used by first working down from the tips of the tree to the root, using descendant data only to assign probabilities, and then working back up from the root to the tips using descendant and outgroup data to make predictions. GASP was tested on a number of simulated datasets based on real phylogenies. Prediction accuracy for ungapped data was similar to three alternative algorithms tested, with GASP performing better in some cases and worse in others. Adding simple insertions and deletions to the simulated data did not have a detrimental effect on GASP accuracy.

Conclusions

GASP (Gapped Ancestral Sequence Prediction) will predict ancestral sequences from multiple protein alignments of any size. Although not as accurate in all cases as some of the more sophisticated maximum likelihood approaches, it can process a wide range of input phylogenies and will predict ancestral sequences for gapped and ungapped residues alike.
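The two-pass idea described in the abstract can be illustrated with a minimal Python sketch for a single alignment column. This is not the GASP implementation: the `Node` class, the `toy_substitution` scores (a stand-in for a real substitution matrix), the rule that an ancestor is gapped only when all of its children are gapped, and the way descendant and parent information are multiplied together are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP = "-"


def toy_substitution(a, b, p_same=0.9):
    """Hypothetical stand-in for a real amino acid substitution matrix."""
    return p_same if a == b else (1.0 - p_same) / (len(AMINO_ACIDS) - 1)


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    residue: str = ""   # observed at tips, predicted at ancestors
    gapped: bool = False
    probs: dict = field(default_factory=dict)


def normalise(weights):
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def expected_score(probs):
    """Score every candidate residue against a neighbour's probabilities."""
    return {a: sum(p * toy_substitution(a, b) for b, p in probs.items())
            for a in AMINO_ACIDS}


def assign_gaps(node):
    """Step 1 (assumed rule): an ancestor is gapped if all children are."""
    if not node.children:
        node.gapped = node.residue == GAP
        return
    for child in node.children:
        assign_gaps(child)
    node.gapped = all(child.gapped for child in node.children)


def down_pass(node):
    """Step 2: tips to root, using descendant data only."""
    if not node.children:
        if not node.gapped:
            node.probs = {node.residue: 1.0}
        return
    for child in node.children:
        down_pass(child)
    if node.gapped:
        return
    weights = {a: 1.0 for a in AMINO_ACIDS}
    for child in node.children:
        if child.gapped:
            continue  # gapped descendants carry no residue information
        for a, score in expected_score(child.probs).items():
            weights[a] *= score
    node.probs = normalise(weights)


def up_pass(node, parent_probs=None):
    """Step 3: root to tips, adding information from outside the clade."""
    if not node.children:
        return
    if node.gapped:
        node.residue = GAP
    else:
        if parent_probs is not None:
            outside = expected_score(parent_probs)
            node.probs = normalise(
                {a: node.probs[a] * outside[a] for a in AMINO_ACIDS})
        node.residue = max(node.probs, key=node.probs.get)
        parent_probs = node.probs
    for child in node.children:
        up_pass(child, parent_probs)


def predict(root):
    assign_gaps(root)
    down_pass(root)
    up_pass(root)


# Toy tree: n2 is ambiguous (S or T) from its own descendants alone.
n1 = Node("n1", [Node("t1", residue="T"), Node("t2", residue="T")])
n2 = Node("n2", [Node("t3", residue="S"), Node("t4", residue="T")])
n3 = Node("n3", [Node("t5", residue=GAP), Node("t6", residue=GAP)])
root = Node("root", [n1, n2, n3])
predict(root)
print({n.name: n.residue for n in (root, n1, n2, n3)})
```

In this toy tree, the descendant-only probabilities at `n2` are tied between S and T after the down pass. The up pass brings in information from the sibling clade through the parent and settles on T, which is the role the abstract attributes to outgroup data. Note that the parent's probabilities already include the clade's own descendants, so this sketch double-counts some evidence; a careful implementation would separate the two.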

Highlights

  • The prediction of ancestral protein sequences from multiple sequence alignments is useful for many bioinformatics analyses

  • Testing the GASP algorithm: the simulated trees and alignments were run through the GASP algorithm

  • Because the 'real' sequence of each simulated node was known, it was possible to determine the accuracy of GASP predictions
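As a rough illustration of the last point, accuracy against a known simulated sequence could be measured as the fraction of aligned positions where the prediction matches. The function name `prediction_accuracy` and the choice to count a correctly predicted gap as a match are assumptions for this sketch, not the scoring scheme used in the paper.

```python
def prediction_accuracy(predicted, actual):
    """Fraction of aligned positions where the predicted residue (or gap)
    equals the known simulated residue (or gap)."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must be aligned to the same length")
    if not actual:
        raise ValueError("sequences must not be empty")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# One mismatch (L vs I) in five aligned positions, including one shared gap.
print(prediction_accuracy("MK-LV", "MK-IV"))
```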


Summary

Introduction

Predicting ancestral protein sequences from a multiple sequence alignment is a useful tool in bioinformatics [1]. Many evolutionary sequence analyses require such predictions in order to map substitutions to a particular lineage. The predicted ancestral sequence alone may provide a more representative functional sequence than a simple consensus sequence constructed from an alignment. Predicting ancestral sequences is not a simple procedure. It relies on a quality alignment plus an accurate (and correctly rooted) phylogenetic tree. Strict consensus methods are quick but can suffer from overrepresentation of larger clades of related sequences, which
