Ultra-large alignments using phylogeny-aware profiles.

Nam-Phuong D Nguyen, Siavash Mirarab, Tandy Warnow, Keerthana Kumar

doi:10.1186/s13059-015-0688-z


Open Access

https://doi.org/10.1186/s13059-015-0688-z


Abstract

Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, a multiple sequence alignment method that uses a new machine learning technique, the ensemble of hidden Markov models, which we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at https://github.com/smirarab/sepp.

Electronic supplementary material: The online version of this article (doi:10.1186/s13059-015-0688-z) contains supplementary material, which is available to authorized users.

Highlights

Multiple sequence alignments (MSAs) of large datasets, containing several thousand to many tens of thousands of sequences, are used for estimating the gene family tree for multi-copy genes, estimating viral evolution, detecting remote homology, predicting the contact map between proteins [1], and inferring deep evolution [2]; most current MSA methods have poor accuracy on large datasets, especially for high rates of evolution [3, 4]. The difficulty in accurately estimating large MSAs is a major limiting factor in phylogenetic analyses of datasets containing several hundred sequences or more.
We report the total column score (TC), which is the percentage of aligned columns in the true or reference alignment that appear in the estimated MSA.
UPP algorithm design: We explored modifications of the UPP design in which we varied the backbone size, used a single hidden Markov model (HMM) instead of an ensemble, built ensembles based on clades within the backbone tree, built ensembles based on disjoint subsets of ten sequences each, used different MSA methods to compute the backbone alignment, used MAFFT instead of hmmalign to add sequences to the backbone alignment, and ran hmmbuild using different options to compute HMMs on each subset alignment.
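The total column score defined in the highlights above can be illustrated with a minimal Python sketch. This is not the evaluation code used in the paper. The function names, the dictionary-of-strings input format, the use of "-" as the gap character, and the choice to count only reference columns with at least two residues as "aligned" (the `min_residues` parameter) are all assumptions made for illustration.

```python
def column_signatures(alignment):
    """Describe each column by the (sequence name, residue index) pairs it aligns.

    `alignment` maps sequence names to gapped rows of equal length,
    with "-" as the gap character (an assumption of this sketch).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all alignment rows must have the same length")
    next_index = {name: 0 for name in names}  # residues seen so far per row
    columns = []
    for pos in range(length):
        column = []
        for name in names:
            if alignment[name][pos] != "-":
                column.append((name, next_index[name]))
                next_index[name] += 1
        columns.append(tuple(column))
    return columns


def total_column_score(reference, estimated, min_residues=2):
    """Percentage of aligned reference columns reproduced exactly in `estimated`.

    A reference column counts as "aligned" if it holds at least
    `min_residues` residues (hypothetical threshold chosen for this sketch).
    """
    ref_columns = [c for c in column_signatures(reference) if len(c) >= min_residues]
    if not ref_columns:
        return 0.0
    est_columns = set(column_signatures(estimated))
    shared = sum(1 for column in ref_columns if column in est_columns)
    return 100.0 * shared / len(ref_columns)


# Toy example: three sequences, five columns.
reference = {"a": "AC-GT", "b": "ACTGT", "c": "A--GT"}
estimated = {"a": "ACG-T", "b": "ACTGT", "c": "A-G-T"}
print(total_column_score(reference, estimated))  # 75.0
```

In the toy example, four reference columns contain two or more residues, and three of them reappear unchanged in the estimated alignment, so the score is 75. A column counts only if every residue in it is placed together exactly as in the reference, which makes TC a stricter criterion than pairwise homology scores.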

Summary

Introduction

Multiple sequence alignments (MSAs) of large datasets, containing several thousand to many tens of thousands of sequences, are used for estimating the gene family tree for multi-copy genes (e.g., the p450 or 16S genes), estimating viral evolution, detecting remote homology, predicting the contact map between proteins [1], and inferring deep evolution [2]; most current MSA methods have poor accuracy on large datasets, especially for high rates of evolution [3, 4]. The difficulty in accurately estimating large MSAs is a major limiting factor in phylogenetic analyses of datasets containing several hundred sequences or more. ML phylogeny estimation on datasets containing thousands [8] to tens of thousands [9] of sequences is feasible, but the accuracy of ML trees depends on having [...]

Another challenge confronting MSA methods is the presence of fragmentary sequences in the input dataset (see Fig. 1 for examples of sequence length heterogeneity found in the biological datasets used in this study). Fragmentary sequences can result from a variety of causes, including the use of next-generation sequencing technologies, which can produce short reads that cannot be successfully assembled into full-length sequences.



Journal: Genome Biology	Publication Date: Jun 16, 2015
Citations: 154	License type: CC BY 4.0


Similar Papers

MSACompro: protein multiple sequence alignment using predicted secondary structure, solvent accessibility, and residue-residue contacts
Xin Deng ... Jianlin Cheng
BMC Bioinformatics | VOL. 12
01 Dec 2011

A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives.
Julie D Thompson ... Benjamin Linard
PLoS ONE | VOL. 6
31 Mar 2011

Fast multiple sequence alignment via multi-armed bandits.
Kayvon Mazooji ... Ilan Shomorony
Bioinformatics (Oxford, England) | VOL. 40
28 Jun 2024

MSACompro: Improving Multiple Protein Sequence Alignment by Predicted Structural Features
Xin Deng ... Jianlin Cheng
23 Aug 2013
