Integration of Alignment and Phylogeny in the Whole-Genome Era

Hongying Sun

doi:10.7936/k75m63vq

Abstract

Dissertation: Integration of Alignment and Phylogeny in the Whole-Genome Era, by Hongtao Sun. Doctor of Philosophy in Computer Science, Washington University in St. Louis, 2015. Professor Jeremy Buhler, Chair.

With the development of new sequencing techniques, whole genomes of many species have become available. This huge amount of data gives rise to new opportunities and challenges. These new sequences provide valuable information on relationships among species, e.g., genome rearrangement and conservation. One of the principal ways to investigate such information is multiple sequence alignment (MSA). A large amount of MSA data is now available online, for example in the UCSC genome database, but how to use this information effectively to solve classical and new problems remains underexplored. In this thesis, we explore how to use this information in four problems: sequence similarity search, multiple alignment improvement, short read mapping, and genome rearrangement inference.

The first problem is sequence similarity search, i.e., given a query sequence, find sequences similar to it in a database. The expansion of DNA sequencing capacity has enabled the sequencing of whole genomes from a number of related species. These genomes can be combined in a multiple alignment that provides useful information about the evolutionary history at each genomic locus. One area in which evolutionary information can productively be exploited is in aligning a new sequence to a database of existing, aligned genomes. However, existing high-throughput alignment tools are not designed to work effectively with multiple genome alignments. We introduce PhyLAT, the Phylogenetic Local Alignment Tool, to compute local alignments of a query sequence against a fixed multiple-genome alignment of closely related species.
PhyLAT uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. It combines a probabilistic approach to alignment with seeding and expansion heuristics to accelerate discovery of significant alignments. We provide evidence, using alignments of human chromosome 22 against a 5-species alignment from the UCSC Genome Browser database, that PhyLAT's alignments are more accurate than those of other commonly used programs, including BLAST, POY, MAFFT, MUSCLE, and CLUSTAL. PhyLAT also identifies more alignments in coding DNA than does pairwise alignment alone. Finally, our tool determines the evolutionary relationship of query sequences to the database more accurately than do POY, RAxML, EPA, or pplacer.

The second problem is multiple alignment quality improvement: given a multiple alignment, correct any wrong matches, i.e., matches between non-orthologous characters (bases or residues). This is important to all other data analysis based on multiple alignments. However, existing methods either compute alignments non-iteratively or use complex models that are very time-consuming and risk overfitting. We developed an optimization algorithm that iteratively refines the quality of a multiple alignment. In each iteration, we remove one sequence from the multiple alignment and realign it to the rest of the sequences using our phylogeny-aware alignment framework. We tested several strategies for picking sequences: picking the species most distant from the rest, picking the species closest to the rest, and picking a sequence at random. Experimental results showed that different picking strategies gave very similar results. In other words, our method is largely insensitive to the sequence picking strategy, which makes it a stable algorithm for improving alignments of any number of sequences.
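The leave-one-out refinement loop described above can be sketched as follows. This is a minimal illustration and not the dissertation's implementation: the phylogeny-aware alignment framework is replaced here by a plain sum-of-pairs Needleman-Wunsch alignment of one sequence against the remaining columns, the scores are assumed toy values, and all function names (`realign`, `refine`, `most_distant`) are hypothetical.

```python
import random

GAP = "-"
MATCH, MISMATCH, GAP_PENALTY = 1, -1, -2  # assumed toy scores, not from the thesis


def column_score(base, column):
    """Sum-of-pairs score of one base against one alignment column."""
    return sum(GAP_PENALTY if c == GAP else (MATCH if c == base else MISMATCH)
               for c in column)


def strip_gap_columns(rows):
    """Drop columns that contain only gaps."""
    keep = [j for j in range(len(rows[0])) if any(r[j] != GAP for r in rows)]
    return ["".join(r[j] for j in keep) for r in rows]


def realign(seq, profile):
    """Globally align an ungapped sequence to a fixed set of aligned rows.

    Stand-in for the phylogeny-aware realignment step (assumption).
    Returns (gapped_seq, gapped_profile_rows)."""
    n, m, k = len(seq), len(profile[0]), len(profile)
    cols = [[r[j] for r in profile] for j in range(m)]
    gap_in_seq = [GAP_PENALTY * sum(c != GAP for c in col) for col in cols]
    gap_in_profile = GAP_PENALTY * k  # base opposite a new all-gap column
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + gap_in_profile
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + gap_in_seq[j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + column_score(seq[i - 1], cols[j - 1]),
                          S[i - 1][j] + gap_in_profile,
                          S[i][j - 1] + gap_in_seq[j - 1])
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j, out_seq, out_cols = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + column_score(seq[i - 1], cols[j - 1]):
            out_seq.append(seq[i - 1]); out_cols.append(cols[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap_in_profile:
            out_seq.append(seq[i - 1]); out_cols.append([GAP] * k); i -= 1
        else:
            out_seq.append(GAP); out_cols.append(cols[j - 1]); j -= 1
    out_seq.reverse(); out_cols.reverse()
    rows = ["".join(col[r] for col in out_cols) for r in range(k)]
    return "".join(out_seq), rows


def most_distant(rows):
    """Hypothetical picker: the row with the most mismatches to all other rows."""
    totals = [sum(sum(x != y for x, y in zip(r, o)) for o in rows) for r in rows]
    return totals.index(max(totals))


def refine(alignment, iterations=10, pick=None, seed=0):
    """Repeatedly remove one row and realign it to the rest.

    `pick` chooses the row index; the default is a random choice."""
    rng = random.Random(seed)
    rows = list(alignment)
    for _ in range(iterations):
        idx = pick(rows) if pick else rng.randrange(len(rows))
        seq = rows[idx].replace(GAP, "")
        rest = strip_gap_columns(rows[:idx] + rows[idx + 1:])
        new_seq, new_rest = realign(seq, rest)
        rows = new_rest[:idx] + [new_seq] + new_rest[idx:]
    return rows
```

Calling `refine(["ACGT-A", "ACGTTA", "A-GTTA"], pick=most_distant)` runs the loop with the most-distant picking strategy, while omitting `pick` selects sequences at random. Making the picker a parameter mirrors the comparison of picking strategies in the text; the realignment step is the part a phylogeny-aware model would replace.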
The results showed that our method is more accurate than existing methods, namely MAFFT, Clustal-O, and MAVID, on test data from three sets of species from the UCSC genome database.

The third problem is phylogeny-aware short read mapping using multiple informant sequences. Short read mapping takes a set of short reads produced by next-generation sequencing and maps them back to their orthologous locations in a reference genome. This is a new problem arising with the development of next-generation sequencing techniques. Existing methods cannot deal with indels in alignments and cannot do interspecies mapping. We developed a model, PhyMap, to align a read to a multiple alignment allowing mismatches and indels. PhyMap computes local alignments of a query sequence against a fixed multiple-genome alignment of closely related species. PhyMap uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. We showed theoretically that our model can differentiate orthologous sequences from paralogous sequences, so our algorithm can align short reads to their orthologous positions in reference sequences. Our experimental results support this, showing that our model can differentiate between orthologous and paralogous alignments. Furthermore, we compared our method with other popular short read mapping tools (BWA, BOWTIE, and BLAST) on simulated data, and found that our method maps more reads to their orthologous locations in the genomes of closely related species than any of them.

The fourth problem is genome rearrangement inference, i.e., given a set of orthologous alignments along with the genomic orders in each aligned sequence and a set of new sequences
