Abstract

Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, a multiple sequence alignment method that uses a new machine learning technique, the ensemble of hidden Markov models, which we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at https://github.com/smirarab/sepp.

Electronic supplementary material: The online version of this article (doi:10.1186/s13059-015-0688-z) contains supplementary material, which is available to authorized users.

Highlights

  • Multiple sequence alignments (MSAs) of large datasets, containing several thousand to many tens of thousands of sequences, are used for estimating the gene family tree for multi-copy genes, estimating viral evolution, detecting remote homology, predicting the contact map between proteins [1], and inferring deep evolution [2]; most current MSA methods have poor accuracy on large datasets, especially for high rates of evolution [3, 4]. The difficulty in accurately estimating large MSAs is a major limiting factor in phylogenetic analyses of datasets containing several hundred sequences or more

  • We report the total column score (TC), which is the percentage of aligned columns in the true or reference alignment that appear in the estimated MSA

  • UPP algorithm design: We explored modifications of the UPP design in which we varied the backbone size, used a single hidden Markov model (HMM) instead of an ensemble, built ensembles based on clades within the backbone tree, built ensembles based on disjoint subsets of ten sequences each, used different MSA methods to compute the backbone alignment, used MAFFT instead of hmmalign to add sequences to the backbone alignment, and ran hmmbuild using different options to compute HMMs on each subset alignment
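
The total column score described in the highlights above can be illustrated with a minimal Python sketch. This is not the evaluation code used in the study. It assumes that both alignments are given as equal-length gapped strings listing the same sequences in the same order, that "-" is the gap character, and that an "aligned column" is a reference column containing at least two residues. The function names `column_signatures` and `total_column_score` are hypothetical.

```python
def column_signatures(alignment, gap="-"):
    """Describe each column by the residue index it holds in every sequence.

    A gap is recorded as None, so two columns are identical only if they
    align exactly the same residues of exactly the same sequences.
    """
    counters = [0] * len(alignment)  # next residue index per sequence
    columns = []
    for c in range(len(alignment[0])):
        signature = []
        for s, row in enumerate(alignment):
            if row[c] == gap:
                signature.append(None)
            else:
                signature.append(counters[s])
                counters[s] += 1
        columns.append(tuple(signature))
    return columns


def total_column_score(reference, estimated, gap="-"):
    """Percentage of aligned reference columns recovered in the estimate.

    Assumption: only reference columns with at least two residues count
    as "aligned" columns.
    """
    ref_cols = [
        col for col in column_signatures(reference, gap)
        if sum(x is not None for x in col) >= 2
    ]
    if not ref_cols:
        return 0.0
    est_cols = set(column_signatures(estimated, gap))
    hits = sum(1 for col in ref_cols if col in est_cols)
    return 100.0 * hits / len(ref_cols)


reference = ["ACGT", "ACGT"]
estimated = ["AC-GT", "A-CGT"]
print(total_column_score(reference, estimated))  # 75.0
```

In this toy case the estimate recovers three of the four reference columns; the column pairing the second residue of each sequence is lost, giving a TC of 75.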


Summary

Introduction

Multiple sequence alignments (MSAs) of large datasets, containing several thousand to many tens of thousands of sequences, are used for estimating the gene family tree for multi-copy genes (e.g., the p450 or 16S genes), estimating viral evolution, detecting remote homology, predicting the contact map between proteins [1], and inferring deep evolution [2]; most current MSA methods have poor accuracy on large datasets, especially for high rates of evolution [3, 4]. The difficulty in accurately estimating large MSAs is a major limiting factor in phylogenetic analyses of datasets containing several hundred sequences or more. Maximum likelihood (ML) phylogeny estimation on datasets containing thousands [8] to tens of thousands [9] of sequences is feasible, but the accuracy of ML trees depends on having [...]

Another challenge confronting MSA methods is the presence of fragmentary sequences in the input dataset (see Fig. 1 for examples of sequence length heterogeneity found in the biological datasets used in this study). Fragmentary sequences can result from a variety of causes, including the use of next-generation sequencing technologies, which can produce short reads that cannot be successfully assembled into full-length sequences.

