Abstract
Background: Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed.
Results: We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that γ-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases.
Conclusions: These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours (http://last.cbrc.jp/).
Highlights
Genome sequence alignments form the basis of much research
Reversed genomes are convenient for estimating the rate of spurious alignments [6], because the reversed genome has the same composition and sequence complexity as the actual genome, but has no homology to any real genome
Repeat masking. We have shown that standard E-value calculations predict the rate of spurious alignment quite accurately, if tandem repeats are carefully masked
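The reversed-genome control mentioned in the highlights can be illustrated with a minimal sketch. It assumes the genome is available as FASTA text and that "reversed" means reading each sequence back to front without complementing it, so that base composition is preserved. The function names `reverse_sequence` and `reverse_fasta_records` are hypothetical and are not part of LAST; the actual procedure used in the study may differ.

```python
from collections import Counter


def reverse_sequence(seq: str) -> str:
    """Reverse a sequence WITHOUT complementing it (assumed meaning of 'reversed')."""
    return seq[::-1]


def reverse_fasta_records(lines):
    """Yield FASTA lines with each record's sequence reversed.

    Headers are kept unchanged; multi-line sequences are joined
    before reversal so the whole record is reversed as one string.
    """
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Flush the previous record before starting a new one.
            if header is not None:
                yield header
                yield reverse_sequence("".join(chunks))
            header, chunks = line, []
        elif line:
            chunks.append(line)
    # Flush the final record.
    if header is not None:
        yield header
        yield reverse_sequence("".join(chunks))


# Hypothetical toy input, not real genomic data.
example = [">chr_demo", "ACGTTTGA", "CCA"]
reversed_records = list(reverse_fasta_records(example))

# Composition is unchanged by reversal, as the highlight requires.
original_counts = Counter("".join(example[1:]))
reversed_counts = Counter(reversed_records[1])
```

Aligning a real genome against such a reversed copy should, in principle, produce only spurious alignments, so the number of alignments found at a given score cutoff gives an empirical estimate to compare against the predicted E-value.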
Summary
Genome sequence alignments form the basis of much research. Many genome alignment algorithms have been developed (reviewed in [5]), and all of them require the selection of various mundane but critical parameters, such as how to mask repeats and which score parameters to use. This study aims to reveal the influence of these and other parameters, and to guide their selection for accurate genome alignment. We investigate the following six facets of genome alignment:
Alignment score cutoff. In the classic alignment framework, it is necessary to choose an alignment score cutoff: low enough to find weak homologies, but high enough to avoid too many