Large Sequence Alignments Research Articles

Background: Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments.

Results: Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on large sequence sets.

Conclusions: Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins. Moreover, secondary structure predictions increase in accuracy as more sequences are used in the prediction. This enables the scalable generation of large sequence alignments that maintain high accuracy even on diverse sequence sets. The DECIPHER R package and source code are freely available for download at DECIPHER.cee.wisc.edu and from the Bioconductor repository.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0749-z) contains supplementary material, which is available to authorized users.

The addition of asparagine (N)-linked polysaccharide chains (i.e., glycans) to the gp120 and gp41 glycoproteins of human immunodeficiency virus type 1 (HIV-1) envelope is not only required for correct protein folding, but also may provide protection against neutralizing antibodies as a “glycan shield.” As a result, strong host-specific selection is frequently associated with codon positions where nonsynonymous substitutions can create or disrupt potential N-linked glycosylation sites (PNGSs). Moreover, empirical data suggest that the individual contribution of PNGSs to the neutralization sensitivity or infectivity of HIV-1 may be critically dependent on the presence or absence of other PNGSs in the envelope sequence. Here we evaluate how glycan–glycan interactions have shaped the evolution of HIV-1 envelope sequences by analyzing the distribution of PNGSs in a large-sequence alignment. Using a “covarion”-type phylogenetic model, we find that the rates at which individual PNGSs are gained or lost vary significantly over time, suggesting that the selective advantage of having a PNGS may depend on the presence or absence of other PNGSs in the sequence. Consequently, we identify specific interactions between PNGSs in the alignment using a new paired-character phylogenetic model of evolution, and a Bayesian graphical model. Despite the fundamental differences between these two methods, several interactions are jointly identified by both. Mapping these interactions onto a structural model of HIV-1 gp120 reveals that negative (exclusive) interactions occur significantly more often between colocalized glycans, while positive (inclusive) interactions are restricted to more distant glycans. Our results imply that the adaptive repertoire of alternative configurations in the HIV-1 glycan shield is limited by functional interactions between the N-linked glycans. 
This represents a potential vulnerability of rapidly evolving HIV-1 populations that may provide useful glycan-based targets for neutralizing antibodies.
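
As background for the analysis above, a PNGS is conventionally recognized by the sequon N-X-S/T, where X is any residue except proline. The abstract does not spell out this definition or describe any implementation, so the following is only a minimal sketch under that assumption. The function name `find_pngs` and the toy sequences are hypothetical and are not taken from the study.

```python
import re

# Assumed sequon for a potential N-linked glycosylation site (PNGS):
# N, then any residue except P, then S or T.
# The lookahead lets overlapping sequons (e.g. "NNTS") all be reported.
PNGS_PATTERN = re.compile(r"N(?=[^P][ST])")


def find_pngs(aligned_seq, gap="-"):
    """Return the 0-based alignment columns of the N in each sequon.

    Gap characters are skipped, so a sequon interrupted by gaps in the
    alignment is still detected and reported at the column of its N.
    """
    # Alignment columns that hold a residue rather than a gap.
    columns = [i for i, c in enumerate(aligned_seq) if c != gap]
    ungapped = "".join(aligned_seq[i] for i in columns).upper()
    return [columns[m.start()] for m in PNGS_PATTERN.finditer(ungapped)]


if __name__ == "__main__":
    # Hypothetical toy alignment rows, for illustration only.
    toy_alignment = {
        "seq1": "NGSANPTNNTS",
        "seq2": "A-N-ST-----",
    }
    for name, row in toy_alignment.items():
        print(name, find_pngs(row))
```

Collecting these columns for every row of an alignment gives a presence/absence pattern per site, which is the kind of input a study of gain, loss, and co-occurrence of PNGSs would start from.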

Articles published on Large Sequence Alignments

Rapid alignment updating with Extensiphy

Coevolutionary Analysis of Protein Subfamilies by Sequence Reweighting.

DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment.

K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets.

πBUSS: a parallel BEAST/BEAGLE utility for sequence simulation under complex evolutionary scenarios.

Integrated Hardware Architecture for Efficient Computation of the $n$-Best Bio-Sequence Local Alignments in Embedded Platforms

Statistical Potentials for Hairpin and Internal Loops Improve the Accuracy of the Predicted RNA Structure

Bayesian Estimation of Divergence Times from Large Sequence Alignments

Streptococcus agalactiae DNA polymerase I is an efficient reverse transcriptase

Evolutionary Interactions between N-Linked Glycosylation Sites in the HIV-1 Envelope

SEAVIEW and PHYLO_WIN: two graphic tools for sequence alignment and molecular phylogeny.

ALMA, an editor for large sequence alignments
