Abstract

Comparative sequence analyses, including such fundamental bioinformatics techniques as similarity searching, sequence alignment and phylogenetic inference, have become a mainstay for researchers studying Human Immunodeficiency Virus type 1 (HIV-1) genome structure and evolution. Implicit in comparative analyses is an underlying model of evolution, and the chosen model can significantly affect the results. In general, evolutionary models describe the probabilities of replacing one amino acid character with another over a period of time. Most widely used evolutionary models for protein sequences have been derived from curated alignments of hundreds of proteins, usually based on mammalian genomes. It is unclear to what extent these empirical models are generalizable to a very different organism, such as HIV-1, the most extensively sequenced organism in existence. We developed a maximum likelihood model-fitting procedure, applied it to a collection of HIV-1 alignments sampled from different viral genes, and inferred two empirical substitution models, suitable for describing between-host and within-host evolution. Our procedure pools the information from multiple sequence alignments, and the provided software implementation can be run efficiently in parallel on a computer cluster. We describe how the inferred substitution models can be used to generate scoring matrices suitable for alignment and similarity searches. Our models had a consistently superior fit relative to the best existing models and to parameter-rich data-driven models when benchmarked on independent HIV-1 alignments, demonstrating evolutionary biases in amino-acid substitution that are unique to HIV and that are not captured by the existing models. The scoring matrices derived from the models showed a marked difference from common amino-acid scoring matrices. The use of an appropriate evolutionary model recovered a known viral transmission history, whereas a poorly chosen model introduced phylogenetic error. We argue that our model derivation procedure is immediately applicable to other organisms with extensive sequence data available, such as Hepatitis C and Influenza A viruses.
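
The step from a substitution model to a scoring matrix can be illustrated with a small sketch. It is not the authors' software: it assumes a hypothetical three-letter alphabet with made-up exchangeabilities and frequencies (`EXCHANGE`, `FREQS`), builds a time-reversible rate matrix Q, computes the transition probabilities P(t) = exp(Qt), and forms the usual log-odds score S_ij = scale * log2(P_ij(t) / pi_j). The helper names and the pure-Python matrix exponential are choices made for this illustration only.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, squarings=10, terms=12):
    """Approximate exp(Q * t) by scaling and squaring a truncated Taylor series."""
    n = len(q)
    step = t / (2 ** squarings)
    a = [[q[i][j] * step for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term now holds a^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def reversible_rate_matrix(exchange, freqs):
    """Build Q with Q_ij = r_ij * pi_j (i != j), rows summing to zero.

    Q is rescaled so that one unit of time corresponds to one expected
    substitution per site.
    """
    n = len(freqs)
    q = [[exchange[i][j] * freqs[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / rate for x in row] for row in q]


def log_odds_scores(q, freqs, t, scale=2.0):
    """Return S_ij = scale * log2(P_ij(t) / pi_j) for divergence t."""
    p = mat_exp(q, t)
    n = len(freqs)
    return [[scale * math.log2(p[i][j] / freqs[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical toy inputs: symmetric exchangeabilities and equilibrium frequencies.
FREQS = [0.5, 0.3, 0.2]
EXCHANGE = [[0.0, 2.0, 0.5],
            [2.0, 0.0, 1.0],
            [0.5, 1.0, 0.0]]

Q = reversible_rate_matrix(EXCHANGE, FREQS)
SCORES = log_odds_scores(Q, FREQS, t=0.5)
```

Because Q is time-reversible, the resulting score matrix is symmetric, and identities score positively while unlikely replacements score negatively. With a real 20-state model, a library routine such as `scipy.linalg.expm` would normally replace the hand-written `mat_exp`.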

Highlights

  • Every computational and statistical method used for comparative gene sequence analysis employs a stochastic model for estimating rates of evolutionary change, either explicitly or implicitly

  • Much like other empirical matrices, the HIV models assigned higher rates to those pairs that are separated by a single nucleotide substitution

  • To formally characterize the similarities in the substitution process across the eight matrices, we computed a neighbor-joining tree on the Markov processes defined by each matrix using the total variation metric (TVM) [31]
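
The second highlight refers to amino-acid pairs separated by a single nucleotide substitution. As a rough illustration, and not the authors' code, the sketch below enumerates such pairs under the standard genetic code. The names `CODON_TABLE` and `single_step_pairs` are introduced here for the example, and stop codons are ignored by assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def single_step_pairs(table=CODON_TABLE):
    """Return the set of amino-acid pairs linked by one nucleotide change."""
    pairs = set()
    for codon, aa in table.items():
        if aa == "*":
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                # Mutate a single position and look up the encoded residue.
                neighbor = codon[:pos] + base + codon[pos + 1:]
                other = table[neighbor]
                if other != "*" and other != aa:
                    pairs.add(frozenset((aa, other)))
    return pairs


PAIRS = single_step_pairs()
```

A pair such as lysine and arginine (AAA to AGA) is in `PAIRS`, whereas tryptophan and aspartate, whose codons differ at every position, is not. A set like this could be compared against the largest entries of an estimated rate matrix.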


Summary

Introduction

Every computational and statistical method used for comparative gene sequence analysis employs a stochastic model for estimating rates of evolutionary change, either explicitly or implicitly. The codon models of Muse-Gaut [2] and Goldman-Yang [3] distinguish amino-acid altering (non-synonymous) from silent (synonymous) substitutions and have formed the basis of a popular and successful suite of methods for the analysis of selective pressures on coding sequences. A PAM matrix is derived from the inferred substitutions along a phylogenetic tree relating homologous sequences, by estimating the probability that any given amino acid residue in a protein will be replaced by any other residue after a pre-specified evolutionary interval. Karlin and Ghandour [6] and George, Barker, and Hunt [7] proposed methods of weighting differences based on chemical, functional, charge and structural properties of amino acids and computing replacement probabilities based on the similarity of the involved residues. A similar method based on pairwise amino-acid differences between homologous genes led to

Academic Editor: Oliver Pybus, University of Oxford, United Kingdom
