Abstract

Background Large-scale molecular evolutionary analyses of protein coding sequences requires a number of preparatory inter-related steps from finding gene families, to generating alignments and phylogenetic trees and assessing selective pressure variation. Each phase of these analyses can represent significant challenges, particularly when working with entire proteomes (all protein coding sequences in a genome) from a large number of species. Methods We present VESPA, software capable of automating a selective pressure analysis using codeML in addition to the preparatory analyses and summary statistics. VESPA is written in python and Perl and is designed to run within a UNIX environment. Results We have benchmarked VESPA and our results show that the method is consistent, performs well on both large scale and smaller scale datasets, and produces results in line with previously published datasets. Discussion Large-scale gene family identification, sequence alignment, and phylogeny reconstruction are all important aspects of large-scale molecular evolutionary analyses. VESPA provides flexible software for simplifying these processes along with downstream selective pressure variation analyses. The software automatically interprets results from codeML and produces simplified summary files to assist the user in better understanding the results. VESPA may be found at the following website: http://www.mol-evol.org/VESPA.

Highlights

  • Estimating selective pressure variation across homologous protein-coding genes from different species is typically done by assessing the ratio of dN/dS, i.e., the number of non-synonymous substitutions per non-synonymous site as a function of the number of synonymous substitutions per synonymous site

  • It is advisable to explore a variety of multiple sequence alignment (MSA) methods for every gene family (Muller et al, 2010), and VESPA facilitates this the user to compare these different approaches

  • The metal_compare function in VESPA allows alignment approaches to be compared and a single MSA of best fit is retained for each gene family (i.e., Gene Family A may have an MSA from MUSCLE while Gene Family B may have an MSA from PRANK)

Read more

Summary

Introduction

Estimating selective pressure variation across homologous protein-coding genes from different species is typically done by assessing the ratio of dN/dS, i.e., the number of non-synonymous substitutions per non-synonymous site (dN) as a function of the number of synonymous substitutions per synonymous site (dS). One of the most popular methods is codeML from the PAML software package (Yang, 2007) The strength of this approach is the application of flexible codon-based models capable of assessing variation in selective pressures at two levels: (i) across sites in an alignment and (ii) across sites in a predefined, or ‘‘foreground’’ lineage on a phylogenetic tree (Yang & Dos Reis, 2011). Large-scale molecular evolutionary analyses of protein coding sequences requires a number of preparatory inter-related steps from finding gene families, to generating alignments and phylogenetic trees and assessing selective pressure variation. Each phase of these analyses can represent significant challenges, when working with entire proteomes (all protein coding sequences in a genome) from a large number of species. VESPA may be found at the following website: http://www.mol-evol.org/VESPA

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call