Large Multiple Sequence Alignments Research Articles

Motivation: Identification of functionally important residues in proteins plays a significant role in biological discovery. Here, we present INTREPID, an information-theoretic approach for functional site identification that exploits the information in large diverse multiple sequence alignments (MSAs). INTREPID uses a traversal of the phylogeny in combination with a positional conservation score, based on Jensen–Shannon divergence, to rank positions in an MSA. While knowledge of protein 3D structure can significantly improve the accuracy of functional site identification, since structural information is not available for a majority of proteins, INTREPID relies solely on sequence information. We evaluated INTREPID on two tasks: predicting catalytic residues and predicting specificity determinants.

Results: In catalytic residue prediction, INTREPID provides significant improvements over Evolutionary Trace, ConSurf as well as over a baseline global conservation method on a set of 100 manually curated enzymes from the Catalytic Site Atlas. In particular, INTREPID is able to better predict catalytic positions that are not globally conserved and hence, attains improved sensitivity at high values of specificity. We also investigated the performance of INTREPID as a function of the evolutionary divergence of the protein family. We found that INTREPID is better able to exploit the diversity in such families and that accuracy improves when homologs with very low sequence identity are included in an alignment. In specificity determinant prediction, when subtype information is known, INTREPID-SPEC, a variant of INTREPID, attains accuracies that are competitive with other approaches for this task.

Availability: INTREPID is available for 16919 families in the PhyloFacts resource (http://phylogenomics.berkeley.edu/phylofacts).

Contact: sriram_s@cs.berkeley.edu

Supplementary information: Relevant online supplementary material is available at http://phylogenomics.berkeley.edu/INTREPID.
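
To make the positional conservation score mentioned in the abstract concrete, the following is a minimal sketch of a Jensen–Shannon divergence score per alignment column. It is not the INTREPID implementation: it omits the phylogeny traversal entirely, and the uniform background distribution, the pseudocount, and the function names (`column_distribution`, `js_divergence`, `rank_columns`) are assumptions made for illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_distribution(column, pseudocount=1e-6):
    """Amino-acid frequencies in one MSA column (gaps and unknowns ignored)."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts.get(a, 0) + pseudocount) / total for a in AMINO_ACIDS]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def rank_columns(msa, background=None):
    """Score each column against a background and rank columns by score.

    Assumption: a uniform background over the 20 amino acids is used when
    none is given; a real analysis would likely use empirical frequencies.
    """
    if background is None:
        background = [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
    n_cols = len(msa[0])
    scores = []
    for j in range(n_cols):
        column = [seq[j] for seq in msa]
        scores.append(js_divergence(column_distribution(column), background))
    order = sorted(range(n_cols), key=lambda j: scores[j], reverse=True)
    return order, scores
```

Columns whose residue distribution departs most from the background score highest, so a fully conserved column ranks above a variable one. Scoring the whole alignment this way corresponds to a global conservation baseline; the abstract's point is that combining such a score with a tree traversal can recover positions that are not globally conserved.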

Background: Phylogenetic analysis of large, multiple-gene datasets, assembled from public sequence databases, is rapidly becoming a popular way to approach difficult phylogenetic problems. Supermatrices (concatenated multiple sequence alignments of multiple genes) can yield more phylogenetic signal than individual genes. However, manually assembling such datasets for a large taxonomic group is time-consuming and error-prone. Additionally, sequence curation, alignment and assessment of the results of phylogenetic analysis are made particularly difficult by the potential for a given gene in a given species to be unrepresented, or to be represented by multiple or partial sequences. We have developed a software package, TaxMan, that largely automates the processes of sequence acquisition, consensus building, alignment and taxon selection to facilitate this type of phylogenetic study.

Results: TaxMan uses freely available tools to allow rapid assembly, storage and analysis of large, aligned DNA and protein sequence datasets for user-defined sets of species and genes. The user provides GenBank format files and a list of gene names and synonyms for the loci to analyse. Sequences are extracted from the GenBank files on the basis of annotation and sequence similarity. Consensus sequences are built automatically. Alignment is carried out (where possible, at the protein level) and aligned sequences are stored in a database. TaxMan can automatically determine the best subset of taxa to examine phylogeny at a given taxonomic level. By using the stored aligned sequences, large concatenated multiple sequence alignments can be generated rapidly for a subset and output in analysis-ready file formats. Trees resulting from phylogenetic analysis can be stored and compared with a reference taxonomy.

Conclusion: TaxMan allows rapid automated assembly of multigene datasets of aligned sequences for large taxonomic groups. By extracting sequences on the basis of both annotation and BLAST similarity, it ensures that all available sequence data can be brought to bear on a phylogenetic problem, but remains fast enough to cope with many thousands of records. By automatically assisting in the selection of the best subset of taxa to address a particular phylogenetic problem, TaxMan greatly speeds up the process of generating multiple sequence alignments for phylogenetic analysis. Our results indicate that an automated phylogenetic workbench can be a useful tool when correctly guided by user knowledge.
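
The supermatrix idea described in the abstract (concatenating per-gene alignments, where a gene may be missing for some species) can be sketched as follows. This is an illustrative sketch only, not TaxMan's code: the function name `concatenate_alignments`, the dictionary-based input layout, and the use of `?` for missing data are all assumptions.

```python
def concatenate_alignments(gene_alignments, missing_char="?"):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments maps gene name -> {taxon: aligned sequence}. Within a
    gene, all sequences must have equal length. A taxon absent from a gene
    is padded with missing_char for that gene's columns (an assumption;
    other tools may pad with gaps instead).
    Returns (supermatrix, partitions), where partitions lists
    (gene, first_column, last_column) using 1-based inclusive coordinates.
    """
    # Union of all taxa seen in any gene, in a stable order.
    taxa = sorted({t for aln in gene_alignments.values() for t in aln})
    pieces = {t: [] for t in taxa}
    partitions = []
    start = 1
    for gene, aln in gene_alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences for gene {gene!r} are not aligned")
        length = lengths.pop()
        for t in taxa:
            pieces[t].append(aln.get(t, missing_char * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    supermatrix = {t: "".join(parts) for t, parts in pieces.items()}
    return supermatrix, partitions
```

Keeping the partition boundaries alongside the concatenated sequences lets downstream analyses apply a separate model to each gene, and the padding makes explicit which gene and taxon combinations are unrepresented.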

Articles published on Large Multiple Sequence Alignments

INTREPID—INformation-theoretic TREe traversal for Protein functional site IDentification

Rewiring Bacteria, Two Components at a Time

Using inferred residue contacts to distinguish between correct and incorrect protein models

On the design of high-performance algorithms for aligning multiple protein sequences on mesh-based multiprocessor architectures

TaxMan: a taxonomic database manager.

Evolutionarily Conserved Allosteric Network in the Cys Loop Family of Ligand-gated Ion Channels Revealed by Statistical Covariance Analyses

Large scale multiple sequence alignment with simultaneous phylogeny inference

The Jalview Java alignment editor.

Assessing functional divergence in EF-1alpha and its paralogs in eukaryotes and archaebacteria.
