Quality Of Multiple Sequence Alignment Research Articles

Protein secondary structure prediction (PSSP) is a fundamental and challenging problem in computational biology. Accurate PSSP relies on having enough homologous protein sequences to build a multiple sequence alignment (MSA). Unfortunately, many proteins lack homologous sequences, which results in low-quality MSAs and poor prediction performance. In this article, we propose the novel dynamic scoring matrix (DSM)-Distil to tackle this issue; it takes advantage of a pretrained BERT model and applies knowledge distillation to the newly designed DSM features. Specifically, we propose the DSM to replace the widely used profile and position-specific scoring matrix (PSSM) features. Starting from the original profile, the DSM automatically derives a suitable feature for each residue. As a result, DSM-Distil adapts to low homologous proteins while remaining compatible with high homologous ones. Thanks to this dynamic property, the DSM adapts to the input data much better and achieves higher performance. Moreover, to compensate for low-quality MSAs, we propose to generate a pseudo-DSM from a pretrained BERT model and aggregate it with the original DSM by adaptive residue-wise fusion, which helps to build richer and more complete input features. In addition, we propose to supervise the learning of low-quality DSM features using high-quality ones. To achieve this, a novel teacher-student model is designed to distill knowledge from proteins with many homologous sequences to proteins with few. Combining all the proposed methods, our model achieves new state-of-the-art performance for low homologous proteins. Compared with the previous state-of-the-art method 'Bagging', DSM-Distil achieves improvements of about 5% and 7.3% for proteins with MSA count ≤30 and for extremely low homologous cases, respectively. We also compare DSM-Distil with AlphaFold2, a state-of-the-art framework for protein structure prediction. DSM-Distil outperforms AlphaFold2 by 4.1% on extremely low-quality MSAs in 8-state secondary structure prediction. Moreover, we release BC40, a large-scale, up-to-date test dataset for evaluating structure prediction with low-quality MSAs. BC40 dataset: https://drive.google.com/drive/folders/15vwRoOjAkhhwfjDk6-YoKGf4JzZXIMC. HardCase dataset: https://drive.google.com/drive/folders/1BvduOr2b7cObUHy6GuEWk-aUkKJgzTUv. Code: https://github.com/qinwang-ai/DSM-Distil.
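For readers unfamiliar with the conventional profile feature that the abstract says DSM is built on and replaces, the following is a minimal sketch of a per-column amino-acid frequency profile computed from an MSA. It is not the DSM and not the authors' code; the function name `msa_profile`, the 20-letter alphabet ordering, and the choice to ignore gaps and non-standard residues are all assumptions made for illustration.

```python
import numpy as np

# Assumed ordering of the 20 standard amino acids (illustrative choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def msa_profile(msa):
    """Return an (alignment_length, 20) matrix of per-column residue frequencies.

    `msa` is a list of equal-length aligned sequences. Gaps and any
    character outside the 20 standard residues are ignored (assumption).
    """
    if not msa:
        raise ValueError("MSA must contain at least one sequence")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    counts = np.zeros((length, len(AMINO_ACIDS)))
    for seq in msa:
        for pos, ch in enumerate(seq.upper()):
            idx = AA_INDEX.get(ch)
            if idx is not None:
                counts[pos, idx] += 1

    # Normalise each column; columns with no standard residues stay all-zero.
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


# Toy alignment with an MSA count of 3 (hypothetical data).
toy_msa = ["ACD", "ACE", "A-D"]
toy_profile = msa_profile(toy_msa)
```

With only a handful of aligned sequences, each column's frequencies rest on very few observations, which is one way to see why a low MSA count gives a weak profile.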

Motivation: Accurate prediction of residue–residue distances is important for protein structure prediction. We developed several protein distance predictors based on a deep learning distance prediction method and blindly tested them in the 14th Critical Assessment of Protein Structure Prediction (CASP14). The prediction method uses deep residual neural networks with the channel-wise attention mechanism to classify the distance between every two residues into multiple distance intervals. The input features for the deep learning method include co-evolutionary features as well as other sequence-based features derived from multiple sequence alignments (MSAs). Three alignment methods are used with multiple protein sequence/profile databases to generate MSAs for input feature generation. Based on different configurations and training strategies of the deep learning method, five MULTICOM distance predictors were created to participate in the CASP14 experiment.

Results: Benchmarked on 37 hard CASP14 domains, the best performing MULTICOM predictor is ranked 5th out of 30 automated CASP14 distance prediction servers in terms of precision of top L/5 long-range contact predictions [i.e. classifying distances between two residues into two categories: in contact (<8 Angstrom) and not in contact otherwise] and performs better than the best CASP13 distance prediction method. The best performing MULTICOM predictor is also ranked 6th among automated server predictors in classifying inter-residue distances into 10 distance intervals defined by CASP14 according to the precision of distance classification. The results show that the quality and depth of MSAs depend on alignment methods and sequence databases and have a significant impact on the accuracy of distance prediction. Using larger training datasets and multiple complementary features improves prediction accuracy. However, the number of effective sequences in MSAs is only a weak indicator of the quality of MSAs and the accuracy of predicted distance maps. In contrast, there is a strong correlation between the accuracy of contact/distance predictions and the average probability of the predicted contacts, which can therefore be more effectively used to estimate the confidence of distance predictions and select predicted distance maps.

Availability and implementation: The software package, source code and data of DeepDist2 are freely available at https://github.com/multicom-toolbox/deepdist and https://zenodo.org/record/4712084#.YIIM13VKhQM.

Supplementary information: Supplementary data are available at Bioinformatics online.
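The "precision of top L/5 long-range contact predictions" used in the results above can be sketched as follows. This is an illustrative implementation, not the CASP14 evaluation code or DeepDist2: the function name is hypothetical, the 8 Angstrom contact threshold comes from the abstract, and the sequence-separation cutoff of 24 residues for "long-range" is the usual CASP convention, assumed here because the abstract does not state it.

```python
import numpy as np


def top_l5_long_range_precision(contact_prob, true_dist, min_sep=24, threshold=8.0):
    """Precision of the L/5 most confident long-range contact predictions.

    contact_prob: (L, L) predicted contact probabilities.
    true_dist:    (L, L) true inter-residue distances in Angstrom.
    min_sep:      minimum sequence separation for "long-range" (assumed 24).
    threshold:    a pair is in contact if its distance is below this value.
    """
    L = contact_prob.shape[0]
    # Upper-triangle pairs (i < j) with j - i >= min_sep.
    i, j = np.triu_indices(L, k=min_sep)
    if i.size == 0:
        raise ValueError("sequence too short to contain long-range pairs")

    # Rank long-range pairs by predicted probability, highest first.
    order = np.argsort(-contact_prob[i, j], kind="stable")
    top = order[: max(1, L // 5)]

    # Fraction of the selected pairs that are true contacts.
    hits = true_dist[i[top], j[top]] < threshold
    return float(hits.mean())


# Toy example (hypothetical data): L = 30, so the top L/5 = 6 pairs are scored.
L = 30
prob = np.zeros((L, L))
dist = np.full((L, L), 20.0)
for rank, j in enumerate(range(24, 30)):
    prob[0, j] = 0.9 - 0.1 * rank      # six confident long-range predictions
    if rank < 3:
        dist[0, j] = 5.0               # only the first three are true contacts
prob[0, 1], dist[0, 1] = 0.99, 4.0     # short-range pair, ignored by the metric
demo_precision = top_l5_long_range_precision(prob, dist)
```

In the toy example, three of the six selected pairs are true contacts, so the precision is 0.5; the confident short-range pair does not affect the score.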

Articles published on Quality Of Multiple Sequence Alignment

PC_ali: A tool for improved multiple alignments and evolutionary inference based on a hybrid protein sequence and structure similarity score.

Improved structure-related prediction for insufficient homologous proteins using MSA enhancement and pre-trained language model.

Prior knowledge facilitates low homologous protein secondary structure prediction with DSM distillation.

Improving protein tertiary structure prediction by deep learning and distance prediction in CASP14.

Improving deep learning-based protein distance prediction in CASP14.

Comparing different machine learning and mathematical regression models to evaluate multiple sequence alignments

Evolutionary Distances in the Twilight Zone—A Rational Kernel Approach

DNA^+Pro^: an Improved Progressive Multiple Sequence Alignment Algorithm for Evolutionary Analysis Using Combined DNA-Protein Sequences

BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments

Some remarks on evaluating the quality of the multiple sequence alignment based on the BAliBASE benchmark

No so HoT - heads or tails is not able to reliably compare multiple sequence alignments.

A statistical score for assessing the quality of multiple sequence alignments

Structural clues in the sequences of the aquaporins
