DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment.

Erik S Wright

doi:10.1186/s12859-015-0749-z

Open Access

Abstract

Background: Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments.

Results: Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on large sequence sets.

Conclusions: Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins. Moreover, secondary structure predictions increase in accuracy as more sequences are used in the prediction. This enables the scalable generation of large sequence alignments that maintain high accuracy even on diverse sequence sets.

The DECIPHER R package and source code are freely available for download at DECIPHER.cee.wisc.edu and from the Bioconductor repository.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0749-z) contains supplementary material, which is available to authorized users.
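To make the idea of adding secondary-structure information to an alignment score concrete, the following is a minimal Python sketch. It is not DECIPHER's implementation (DECIPHER is an R package, and its scoring scheme is not given in this abstract). The function names `ss_agreement` and `align_score`, the three-state H/E/C probability profiles, and all numeric parameters (`match`, `mismatch`, `gap`, `ss_weight`) are assumptions chosen for illustration only. The sketch runs a standard Needleman-Wunsch global alignment with a linear gap cost, where each aligned pair of residues receives a bonus proportional to the probability that both positions share the same predicted structural state.

```python
from typing import Dict, List

# Hypothetical per-residue state probabilities: H (helix), E (strand), C (coil).
Profile = List[Dict[str, float]]


def ss_agreement(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Probability that two positions share the same secondary-structure state."""
    return sum(p[s] * q[s] for s in "HEC")


def align_score(a: str, b: str, ss_a: Profile, ss_b: Profile,
                match: float = 1.0, mismatch: float = -1.0,
                gap: float = -2.0, ss_weight: float = 2.0) -> float:
    """Global (Needleman-Wunsch) alignment score with a structure bonus.

    All parameter values are illustrative assumptions, not published settings.
    """
    n, m = len(a), len(b)
    # dp[i][j] holds the best score for aligning a[:i] with b[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Residue-level score: identity match or mismatch.
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Context-derived bonus: agreement of predicted structure states.
            bonus = ss_weight * ss_agreement(ss_a[i - 1], ss_b[j - 1])
            dp[i][j] = max(dp[i - 1][j - 1] + sub + bonus,  # align the pair
                           dp[i - 1][j] + gap,              # gap in b
                           dp[i][j - 1] + gap)              # gap in a
    return dp[n][m]


# Usage: two identical dipeptides, first with matching, then with
# conflicting, predicted structure.
helix = {"H": 1.0, "E": 0.0, "C": 0.0}
coil = {"H": 0.0, "E": 0.0, "C": 1.0}
same_structure = align_score("AK", "AK", [helix, helix], [helix, helix])
diff_structure = align_score("AK", "AK", [helix, helix], [coil, coil])
```

In this toy setting the sequences are identical in both calls, so the difference between `same_structure` and `diff_structure` comes entirely from the structure term. The second predictor named in the abstract, modulation of gap costs by surrounding residues, could be sketched in the same framework by making `gap` a function of the neighbouring residues rather than a constant.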

Highlights

Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality.
The accurate alignment of large numbers of sequences remains an unsolved challenge that is frequently encountered in modern datasets.
For sequences with less than 10% identity, PREFAB has 13.4% greater structural identity (p < 1e-15) than SABmark. These findings are in agreement with a previous study [53] that found PREFAB to be the best benchmark designed for comparing multiple sequence alignment (MSA) programs. PREFAB is known to contain errors [56].

Summary

Introduction

Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and their accuracy diminishes steadily as more sequences are added. A multiple sequence alignment may reveal many aspects of a gene: which regions are constrained, which sites undergo positive selection [5], and potentially the structure of its gene product [6]. Many of these applications depend on the correct alignment of thousands of diverse sequences. The accurate alignment of large numbers of sequences remains an unsolved challenge that is frequently encountered in modern datasets.

Journal: BMC Bioinformatics
Publication Date: Oct 6, 2015
Citations: 354
License type: cc-by
