Abstract

Background

With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequences. However, it is not clear if such benchmarks truly represent real sequence data from any given group of species, in terms of the difficulty of alignment tasks.

Results

We find that the conventional simulation approach, which relies on empirically estimated values for various parameters such as substitution rate or insertion/deletion rates, is unable to generate synthetic sequences reflecting the broad genomic variation in conservation levels. We tackle this problem with a new method for simulating non-coding sequence evolution, by relying on genome-wide distributions of evolutionary parameters rather than their averages. We then generate synthetic data sets to mimic orthologous sequences from the Drosophila group of species, and show that these data sets truly represent the variability observed in genomic data in terms of the difficulty of the alignment task. This allows us to make significant progress towards estimating the alignment accuracy of current tools in an absolute sense, going beyond only a relative assessment of different tools. We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values. Interestingly, the performance of most tools degrades more rapidly when there are more insertions than deletions in the data set, suggesting an asymmetric handling of insertions and deletions, even though none of the evaluated tools explicitly distinguishes these two types of events. We also examine the accuracy of two existing tools for annotating insertions versus deletions, and find their performance to be close to optimal in Drosophila non-coding sequences if provided with the true alignments.

Conclusion

We have developed a method to generate benchmarks for multiple alignments of Drosophila non-coding sequences, and shown it to be more realistic than traditional benchmarks. Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.
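The contrast between simulating from average parameter values and simulating from genome-wide distributions can be illustrated with a small toy example. The sketch below is not the simulation method described in this work: it models substitutions only (no insertions or deletions), and the beta distribution used as a stand-in for a genome-wide distribution of per-locus substitution rates, as well as all function and variable names, are assumptions chosen for illustration. It shows only that a single averaged rate yields loci with nearly uniform conservation, whereas per-locus rates drawn from a distribution yield a much broader range.

```python
import random
import statistics

BASES = "ACGT"

def mutate(seq, sub_rate, rng):
    """Substitute each site independently with probability sub_rate."""
    out = []
    for base in seq:
        if rng.random() < sub_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def simulate_identities(rates, length, rng):
    """For each per-locus rate, evolve a random ancestor and record identity."""
    identities = []
    for rate in rates:
        ancestor = "".join(rng.choice(BASES) for _ in range(length))
        identities.append(identity(ancestor, mutate(ancestor, rate, rng)))
    return identities

rng = random.Random(0)  # fixed seed so the run is reproducible
n_loci, length = 200, 500

# Hypothetical stand-in for a genome-wide distribution of substitution rates.
locus_rates = [rng.betavariate(2, 8) for _ in range(n_loci)]
mean_rate = sum(locus_rates) / n_loci

# Conventional approach: every locus uses the same averaged rate.
ids_avg = simulate_identities([mean_rate] * n_loci, length, rng)
# Distribution-based approach: each locus uses its own sampled rate.
ids_dist = simulate_identities(locus_rates, length, rng)

print("std. dev. of identity, averaged rate:  ", statistics.pstdev(ids_avg))
print("std. dev. of identity, per-locus rates:", statistics.pstdev(ids_dist))
```

In this toy setting the spread of sequence identity across loci is much larger when rates are sampled per locus, which is the kind of variation in conservation levels that a single averaged parameter cannot reproduce.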

Highlights

  • We find that data sets with an excess of deletions over insertions are more amenable to accurate alignment than those with an excess of insertions, suggesting an implicit bias in how the tools treat indels, even though none of the evaluated tools explicitly makes a distinction between insertions and deletions

Summary

Introduction

With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Numerous attempts have been made to develop accurate and efficient methods to solve the multiple sequence alignment problem (reviewed in [3,4,5,6]), offering us much flexibility, as well as difficulty, in choosing the most appropriate tool(s) for the task. Another important task related to multiple alignment is the annotation of insertions and deletions (indels) in the alignment, a task that has received some attention in recent years [7,8,9,10,11,12] in light of the realization that indels may be responsible for genomic variation as much as nucleotide substitutions are [13], and that indels may affect regional mutation rates [14]. It is not clear how to choose “correct” settings for these parameters and how to assess if the simulated sequences mimic real data well enough for claims about alignment accuracy, both in relative terms (i.e., comparison of tools) and in the absolute, to generalize from the benchmarks to the real-world setting. We address these questions in this work, whose main contributions are the following.
