Abstract

Manually identifying and correcting errors in protein models can be a slow process, but improvements in validation tools and automated model-building software can contribute to reducing this burden. This article presents a new correctness score that is produced by combining multiple sources of information using a neural network. The residues in 639 automatically built models were marked as correct or incorrect by comparing them with the coordinates deposited in the PDB. A number of features were also calculated for each residue using Coot, including map-to-model correlation, density values, B factors, clashes, Ramachandran scores, rotamer scores and resolution. Two neural networks were created using these features as inputs: one to predict the correctness of main-chain atoms and the other for side chains. The 639 structures were split into 511 that were used to train the neural networks and 128 that were used to test performance. The predicted correctness scores could correctly categorize 92.3% of the main-chain atoms and 87.6% of the side chains. A Coot ML Correctness script was written to display the scores in a graphical user interface as well as for the automatic pruning of chains, residues and side chains with low scores. The automatic pruning function was added to the CCP4i2 Buccaneer automated model-building pipeline, leading to significant improvements, especially for high-resolution structures.

Highlights

  • Manual completion of a model is a very time-consuming step in macromolecular structure solution

  • The coefficient of determination (COD) for the trained neural network models is shown in Table 3 for both the training set and the test set

  • Overfitting is unlikely owing to the large number of residues and the small size of the neural network, but there could be some differences between the training and test sets depending on the random split of the 639 structures

Read more

Summary

Introduction

Manual completion of a model is a very time-consuming step in macromolecular structure solution. Initial models from homologues or from automated model-building programs will contain errors that must be identified and corrected. The primary method for identifying errors is visual examination of the model, the 2mFo À DFc map and the mFo À DFc map by the crystallographer, using a model-building program such as Coot (Emsley & Cowtan, 2004; Emsley et al, 2010). The work presented here aims to emulate this decision-making process by using machine learning to predict the correctness of protein residues.

Methods
Results
Conclusion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.