Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments.

Desmond G Higgins,Fabian Sievers,Gearóid Fox

doi:10.1093/bioinformatics/btv592

Open Access

https://doi.org/10.1093/bioinformatics/btv592

Journal: Bioinformatics (Oxford, England)	Publication Date: Nov 14, 2015
Citations: 20	License type: CC BY 4.0

Affiliation: University College Dublin

Abstract

Motivation: Multiple sequence alignments (MSAs) with large numbers of sequences are now commonplace. However, current multiple alignment benchmarks are ill-suited for testing these types of alignments, as test cases either contain a very small number of sequences or are based purely on simulation rather than empirical data.

Results: We take advantage of recent developments in protein structure prediction methods to create a benchmark (ContTest) for protein MSAs containing many thousands of sequences in each test case and which is based on empirical biological data. We rank popular MSA methods using this benchmark and verify a recent result showing that chained guide trees increase the accuracy of progressive alignment packages on datasets with thousands of proteins.

Availability and implementation: Benchmark data and scripts are available for download at http://www.bioinf.ucd.ie/download/ContTest.tar.gz.

Contact: des.higgins@ucd.ie

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

Making a multiple sequence alignment (MSA) of nucleotide or amino acid sequences is a crucial step needed in a wide variety of bioinformatics studies
We developed ContTest, a benchmark for large protein MSAs based on the accuracy of de novo contact map prediction
The Pfam database contains MSAs of all sequences in each protein family, and we used the benchmark to score the full alignments from Pfam 27
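
As a rough illustration of scoring an alignment by the accuracy of contacts predicted from it, the sketch below computes the fraction of the top-ranked predicted contacts that also appear in a reference contact set. The function name `contact_precision`, the `(i, j, score)` tuple format, and the `top_k` cutoff are assumptions made for this example; they are not taken from the ContTest scripts, whose exact scoring is not described on this page.

```python
from typing import Iterable, Tuple


def contact_precision(
    predicted: Iterable[Tuple[int, int, float]],
    reference: Iterable[Tuple[int, int]],
    top_k: int,
) -> float:
    """Fraction of the top_k highest-scoring predicted contacts found in reference.

    predicted: (residue_i, residue_j, score) triples (hypothetical format).
    reference: (residue_i, residue_j) pairs observed in a known structure.
    """
    if top_k <= 0:
        raise ValueError("top_k must be positive")

    # Keep only the top_k predictions, highest score first.
    ranked = sorted(predicted, key=lambda p: p[2], reverse=True)[:top_k]
    if not ranked:
        return 0.0

    # Contacts are unordered, so normalise each pair to (smaller, larger).
    ref = {(min(i, j), max(i, j)) for i, j in reference}
    hits = sum(1 for i, j, _ in ranked if (min(i, j), max(i, j)) in ref)
    return hits / len(ranked)
```

Under these assumptions, an alignment that yields a higher precision for the same protein family would be ranked as the better alignment.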

Summary

Introduction

Making a multiple sequence alignment (MSA) of nucleotide or amino acid sequences is a crucial step needed in a wide variety of bioinformatics studies. Structure- and phylogeny-based benchmarks, in which scores are based on structural superpositions and accurate inference of phylogenetic trees, respectively, are strongly grounded in empirical biological data, but they focus on alignments of small numbers of sequences and are difficult to scale to larger datasets. Simulation- and consistency-based benchmarks are based on simulations of protein evolution and simple agreement between different MSA methods, respectively, and can involve alignments of arbitrary size. It is unclear how well simulated sequences model actual biological sequences, while consistency measures only how similar the results of one heuristic method are to the results of other heuristic methods.
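
To make the idea of agreement between MSA methods concrete, here is a minimal sketch that compares two alignments of the same sequences by the residue pairs they place in the same column. The helper names `aligned_pairs` and `pair_agreement` and the use of `-` as the gap character are assumptions for this example, not a description of any particular published consistency benchmark.

```python
from itertools import combinations
from typing import List, Set, Tuple

Residue = Tuple[int, int]  # (sequence index, ungapped residue index)


def aligned_pairs(rows: List[str], gap: str = "-") -> Set[Tuple[Residue, Residue]]:
    """Return every pair of residues that share a column in the alignment."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all alignment rows must have the same length")

    pairs: Set[Tuple[Residue, Residue]] = set()
    counters = [0] * len(rows)  # next ungapped residue index per sequence
    width = len(rows[0]) if rows else 0
    for col in range(width):
        present = []
        for seq_idx, row in enumerate(rows):
            if row[col] != gap:
                present.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_agreement(test: List[str], other: List[str]) -> float:
    """Fraction of aligned residue pairs in `test` that also occur in `other`."""
    test_pairs = aligned_pairs(test)
    if not test_pairs:
        return 0.0
    return len(test_pairs & aligned_pairs(other)) / len(test_pairs)
```

A score of this kind only says how closely one heuristic alignment matches another; neither alignment is checked against empirical data, which is the limitation noted above.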
