Abstract

There are numerous sources of variation in the rate of synonymous substitutions inside genes, such as direct selection on the nucleotide sequence, or mutation rate variation. Yet scans for positive selection rely on codon models which incorporate an assumption of effectively neutral synonymous substitution rate, constant between sites of each gene. Here we perform a large-scale comparison of approaches which incorporate codon substitution rate variation and propose our own simple yet effective modification of existing models. We find strong effects of substitution rate variation on positive selection inference. More than 70% of the genes detected by the classical branch-site model are presumably false positives caused by the incorrect assumption of uniform synonymous substitution rate. We propose a new model which is strongly favored by the data while remaining computationally tractable. With the new model we can capture signatures of nucleotide level selection acting on translation initiation and on splicing sites within the coding region. Finally, we show that rate variation is highest in the highly recombining regions, and we propose that recombination and mutation rate variation, such as high CpG mutation rate, are the two main sources of nucleotide rate variation. Although we detect fewer genes under positive selection in Drosophila than without rate variation, the genes which we detect contain a stronger signal of adaptation of dynein, which could be associated with Wolbachia infection. We provide software to perform positive selection analysis using the new model.

Highlights

  • Detecting the selective pressure affecting protein-coding genes is an important component of molecular evolution and evolutionary genomics

  • We have simulated four data sets using various flavors of the M8 model: a data set without rate variation, a data set with site rate variation, a data set with gamma-distributed codon rate variation, and a data set with codon 3-rate variation

  • In the absence of rate variation, the statistical performance of the four methods is very similar, although the M8 model without rate variation has a slightly better ROC and a false positive rate (FPR) closer to the theoretical expectation


Summary

Introduction

Detecting the selective pressure affecting protein-coding genes is an important component of molecular evolution and evolutionary genomics. Codon models are one of the main tools used to infer selection on protein-coding genes (Koonin and Wolf 2010). This is done by comparing the rate of nonsynonymous substitutions (dN), which change the amino acid sequence, with the rate of synonymous substitutions (dS), which do not. While there is overwhelming evidence of negative and positive selection acting on the amino acid sequence of proteins (Boyko et al. 2008), synonymous substitutions in protein-coding genes are assumed to be effectively neutral in most current models. Under this assumption, the synonymous substitution rate can be used as a proxy for the neutral substitution rate, and the comparison between dN and dS can be used to identify selection acting at the level of amino acids (Yang and Bielawski 2000).
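The distinction between synonymous and nonsynonymous changes described above can be illustrated with a minimal sketch. It is not the maximum-likelihood codon model used in this work: it only counts raw differences between two aligned coding sequences, without normalizing by the number of synonymous and nonsynonymous sites, so its output is not an estimate of dN or dS. The function name `count_single_changes` and the choice to skip codons that differ at more than one position are assumptions made for illustration; the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_single_changes(seq_a, seq_b):
    """Count synonymous and nonsynonymous single-nucleotide codon differences.

    Hypothetical helper for illustration only. Codons that contain gaps or
    ambiguous characters, and codons that differ at two or three positions,
    are skipped, because classifying them requires assumptions about the
    mutational pathway.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")

    synonymous = 0
    nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous nucleotide
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1:
            continue  # identical codons or multiple differences
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# CTT -> CTC keeps leucine (synonymous); AAA -> AGA changes lysine to
# arginine (nonsynonymous); ATG is unchanged.
print(count_single_changes("ATGCTTAAA", "ATGCTCAGA"))  # (1, 1)
```

Codon models go further than such counts: they estimate substitution rates along a phylogeny, and the models compared here additionally allow the rate to vary between sites or codons instead of treating the synonymous rate as constant within a gene.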

