Abstract

X-ray diffraction is one of the most common methods of determining protein structures, yet only 2–10% of proteins can produce diffraction-quality crystals. Several computational methods have been proposed to predict protein crystallization. Nevertheless, the current state-of-the-art methods are limited by the scarcity of experimental data, so the prediction accuracy of existing models has not reached the ideal level. To address these problems, we propose a novel transfer-learning-based framework for protein crystallization prediction, named TLCrys. The framework proceeds in two steps: pre-training and fine-tuning. The pre-training step adopts an attention mechanism to extract both global and local information from protein sequences. The representation learned in the pre-training step is treated as knowledge to be transferred and fine-tuned to enhance the performance of crystallization prediction. During pre-training, TLCrys adopts a multi-task learning method, which not only improves the learning ability of the protein encoding but also enhances the robustness and generalization of the protein representation. The multi-head self-attention layer ensures that different levels of the protein representation can be extracted in the fine-tuning step. During transfer learning, the fine-tuning strategy used by TLCrys improves the task-specific learning ability of the network. Our method significantly outperforms all previous predictors in predicting five crystallization stages. Furthermore, the proposed methodology generalizes well to other protein sequence classification tasks.
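To make the multi-head self-attention step concrete, the following is a minimal sketch in plain Python, not the TLCrys implementation. It assumes a one-hot encoding over the 20 standard amino acids, four heads of dimension 8, and mean pooling; these choices, the random (untrained) projection weights, and all function names are illustrative assumptions that do not come from the paper.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)

def one_hot(seq):
    """Encode a protein sequence as a list of 20-dimensional one-hot rows."""
    return [[1.0 if aa == a else 0.0 for a in AMINO_ACIDS] for aa in seq]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Untrained stand-in for a learned projection matrix."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def attention_head(x, wq, wk, wv):
    """Scaled dot-product self-attention for a single head."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(wq[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q times k transposed
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def multi_head_self_attention(x, heads):
    """Run every head on the same input and concatenate per-residue outputs."""
    outputs = [attention_head(x, *h)[0] for h in heads]
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]

def encode(seq, heads):
    """Mean-pool the attended residues into one fixed-length vector."""
    attended = multi_head_self_attention(one_hot(seq), heads)
    return [sum(col) / len(attended) for col in zip(*attended)]

rng = random.Random(0)
NUM_HEADS, HEAD_DIM = 4, 8  # hypothetical sizes, chosen only for the example
heads = [tuple(random_matrix(20, HEAD_DIM, rng) for _ in range(3))
         for _ in range(NUM_HEADS)]
representation = encode("MKTAYIAKQR", heads)
print(len(representation))  # 32 = NUM_HEADS * HEAD_DIM
```

In a transfer-learning setup of the kind the abstract describes, a pooled vector like `representation` would come from an encoder trained on the pre-training tasks, and a task-specific classifier for crystallization would then be trained on top of it during fine-tuning.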

Highlights

Received: 13 December 2021

  • The functions of a protein are largely determined by its three-dimensional structure. Analyzing the three-dimensional structure of proteins is of great significance for understanding the molecular mechanisms of biological processes and studying the pathogenesis of diseases

  • At present, the methods used to determine the three-dimensional structure of proteins are electron microscopy [1], nuclear magnetic resonance (NMR) spectroscopy [2], and X-ray diffraction (XRD) crystallography [3]

  • To overcome the problems of insufficient labeled training data and inaccurate model predictions, and to explore the internal correlation between protein sequence modeling and crystallization propensity, we propose a novel transfer-learning-based method for protein crystallization prediction, called TLCrys

Summary

Introduction

Analyzing the three-dimensional structure of proteins is of great significance for understanding the molecular mechanisms of biological processes and studying the pathogenesis of diseases. It can provide key information for the development and design of drugs for human diseases. At present, the methods used to determine the three-dimensional structure of proteins are electron microscopy [1], nuclear magnetic resonance (NMR) spectroscopy [2], and X-ray diffraction (XRD) crystallography [3]. X-ray diffraction experiments on proteins that cannot be crystallized with current experimental techniques are costly and time-consuming.
