MAGUS+eHMMs: improved multiple sequence alignment accuracy for fragmentary sequences.

Paul Zaharias, Valentina Boeva, Tandy Warnow, Chengze Shen

doi:10.1093/bioinformatics/btab788


Open Access


Abstract

Summary: Multiple sequence alignment is an initial step in many bioinformatics pipelines, including phylogeny estimation, protein structure prediction and taxonomic identification of reads produced in amplicon or metagenomic datasets, etc. Yet, alignment estimation is challenging on datasets that exhibit substantial sequence length heterogeneity, and especially when the datasets have fragmentary sequences as a result of including reads or contigs generated by next-generation sequencing technologies. Here, we examine techniques that have been developed to improve alignment estimation when datasets contain substantial numbers of fragmentary sequences. We find that MAGUS, a recently developed MSA method, is fairly robust to fragmentary sequences under many conditions, and that using a two-stage approach where MAGUS is used to align selected ‘backbone sequences’ and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models further improves alignment accuracy. The combination of MAGUS with the ensemble of eHMMs (i.e. MAGUS+eHMMs) clearly improves on UPP, the previous leading method for aligning datasets with high levels of fragmentation.

Availability and implementation: UPP is available on https://github.com/smirarab/sepp, and MAGUS is available on https://github.com/vlasmirnov/MAGUS. MAGUS+eHMMs can be performed by running MAGUS to obtain the backbone alignment, and then using the backbone alignment as an input to UPP.

Supplementary information: Supplementary data are available at Bioinformatics online.



Introduction

Multiple sequence alignment (MSA) is a critical precursor for many downstream analyses, such as gene and species tree estimation (Heled and Drummond, 2010; Stamatakis, 2014), protein family classification (Nguyen et al., 2016) and phylogenetic placement (Matsen et al., 2010). To produce MSAs on datasets that contain both full-length and fragmentary sequences, techniques have been developed for adding fragmentary sequences into alignments of (generally) full-length sequences, including MAFFT --addfragments (Katoh and Frith, 2012) and techniques based on ensembles of profile HMMs (Mirarab et al., 2012; Nguyen et al., 2015). When provided with good alignments of the full-length sequences, these methods have been shown to be more accurate than alignment methods that do not explicitly take fragmentary sequences into account. UPP (Nguyen et al., 2015) constructs a ‘backbone alignment’ on a sample of the full-length sequences using PASTA (Mirarab et al., 2015) and then adds the remaining sequences into the backbone alignment using an ensemble of profile HMMs (eHMMs). It was shown to provide very good accuracy and scalability to large and ultra-large datasets (up to 1 000 000 sequences), even in the presence of high levels of fragmentary sequences.
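The first stage of such a two-stage approach, choosing which sequences form the backbone and which are added afterwards, can be illustrated with a minimal Python sketch. It is not the UPP or MAGUS implementation. The function name `split_backbone_and_queries`, the rule that a sequence counts as "full-length" when its length is within 25% of the median, and the default backbone size of 1000 are all assumptions made for illustration.

```python
import random
from statistics import median


def split_backbone_and_queries(seqs, backbone_size=1000, tolerance=0.25, seed=0):
    """Split unaligned sequences into backbone and query sets.

    seqs: dict mapping sequence name -> unaligned sequence string.
    A sequence is treated as "full-length" if its length lies within
    `tolerance` of the median length (an assumed rule, for illustration).
    Returns (backbone_names, query_names).
    """
    med = median(len(s) for s in seqs.values())
    lo, hi = (1 - tolerance) * med, (1 + tolerance) * med

    # Candidate backbone sequences: those close to the median length.
    full_length = sorted(n for n, s in seqs.items() if lo <= len(s) <= hi)

    # Sample at most `backbone_size` candidates, reproducibly.
    rng = random.Random(seed)
    k = min(backbone_size, len(full_length))
    backbone = set(rng.sample(full_length, k))

    # Everything else (fragments and unsampled sequences) is a query that
    # would later be added to the backbone alignment.
    queries = [n for n in seqs if n not in backbone]
    return sorted(backbone), queries
```

In a MAGUS+eHMMs-style workflow, the backbone names returned here would identify the sequences to align with MAGUS, and the query names would identify the sequences to add to that backbone alignment with UPP.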

Methods

Discussion

Conclusion