Detecting the limits of regulatory element conservation and divergence estimation using pairwise and multiple alignments.

Daniel A Pollard,Alan M Moses,Venky N Iyer,Michael B Eisen

doi:10.1186/1471-2105-7-376

Abstract

BackgroundMolecular evolutionary studies of noncoding sequences rely on multiple alignments. Yet how multiple alignment accuracy varies across sequence types, tree topologies, divergences and tools, and further how this variation impacts specific inferences, remains unclear.ResultsHere we develop a molecular evolution simulation platform, CisEvolver, with models of background noncoding and transcription factor binding site evolution, and use simulated alignments to systematically examine multiple alignment accuracy and its impact on two key molecular evolutionary inferences: transcription factor binding site conservation and divergence estimation. We find that the accuracy of multiple alignments is determined almost exclusively by the pairwise divergence distance of the two most diverged species and that additional species have a negligible influence on alignment accuracy. Conserved transcription factor binding sites align better than surrounding noncoding DNA yet are often found to be misaligned at relatively short divergence distances, such that studies of binding site gain and loss could easily be confounded by alignment error. Divergence estimates from multiple alignments tend to be overestimated at short divergence distances but reach a tool specific divergence at which they cease to increase, leading to underestimation at long divergences. Our most striking finding was that overall alignment accuracy, binding site alignment accuracy and divergence estimation accuracy vary greatly across branches in a tree and are most accurate for terminal branches connecting sister taxa and least accurate for internal branches connecting sub-alignments.ConclusionOur results suggest that variation in alignment accuracy can lead to errors in molecular evolutionary inferences that could be construed as biological variation. These findings have implications for which species to choose for analyses, what kind of errors would be expected for a given set of species and how multiple alignment tools and phylogenetic inference methods might be improved to minimize or control for alignment errors.

Highlights

Molecular evolutionary studies of noncoding sequences rely on multiple alignments
CisEvolver takes an ancestral DNA sequence and evolves it along a mutation guide tree, producing sequences for which we know the true alignment. The utility of such a simulation is that the sequences can be re-aligned using standard alignment tools and the accuracy of the tool alignment as well as the accuracy of any inference from the tool alignment can be measured by comparison with the true alignment
In cases where the error in an inference is due to both alignment error and error in the inference method itself, the contribution of alignment error to the total inference error can be directly measured by com

Summary

Introduction

Molecular evolutionary studies of noncoding sequences rely on multiple alignments. How multiple alignment accuracy varies across sequence types, tree topologies, divergences and tools, and further how this variation impacts specific inferences, remains unclear. New tools have been developed for fast and accurate alignment of long stretches of genomic sequence (reviewed in [2,3,4]) and benchmarking studies have begun to address the accuracy of these pairwise [5,6] and multiple [7,8] alignment tools under various evolutionary scenarios. Knowing the nucleotide-level accuracy of alignment tools greatly informs decisions about which tools to use and which species to compare, but the impact of alignment error on evolutionary studies of noncoding sequences is only just beginning to be explored [6,8]. One would use the set of species that maximize information for an acceptable amount of error in an estimate

Methods

Results

Discussion

Conclusion