Abstract

Proteins are macromolecules that perform essential biological functions which depend on their three-dimensional structure. Determining this structure involves complex laboratory and computational work. For the computational work, multiple software pipelines have been developed to build models of the protein structure from crystallographic data. Each of these pipelines performs differently depending on the characteristics of the electron-density map received as input. Identifying the best pipeline to use for a protein structure is difficult, as the pipeline performance differs significantly from one protein structure to another. As such, researchers often select pipelines that do not produce the best possible protein models from the available data. Here, a software tool is introduced which predicts key quality measures of the protein structures that a range of pipelines would generate if supplied with a given crystallographic data set. These measures are crystallographic quality-of-fit indicators based on included and withheld observations, and structure completeness. Extensive experiments carried out using over 2500 data sets show that the tool yields accurate predictions for both experimental phasing data sets (at resolutions between 1.2 and 4.0 Å) and molecular-replacement data sets (at resolutions between 1.0 and 3.5 Å). The tool can therefore provide a recommendation to the user concerning the pipelines that should be run in order to proceed most efficiently to a depositable model.

Highlights

  • The first protein structures were determined in the 1950s using X-ray crystallography (Kendrew et al., 1958)

  • Mean absolute error (MAE) and root-mean-square error (RMSE) were calculated for the ML predictive model (P) and for the median predictor (M), which was used as a baseline (Zero-R) model

  • 0.26) for predicting the protein structure completeness are higher than the MAE and RMSE for the other measures
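The comparison named in the highlights, MAE and RMSE for a predictive model against a median (Zero-R) baseline, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names and all numeric values are hypothetical, chosen only to show how the two error measures and the baseline relate.

```python
import math
from statistics import median


def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error: square root of the mean squared difference."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def median_baseline(y_train, n):
    """Zero-R style baseline: predict the training median for every sample."""
    m = median(y_train)
    return [m] * n


# Hypothetical structure-completeness values (fractions), invented for illustration.
train_completeness = [0.90, 0.75, 0.60, 0.95, 0.80]
test_completeness = [0.85, 0.70, 0.92]
model_predictions = [0.80, 0.72, 0.90]  # made-up output of some predictive model (P)

# The median predictor (M) ignores the input features entirely.
baseline_predictions = median_baseline(train_completeness, len(test_completeness))

for name, preds in [("P", model_predictions), ("M", baseline_predictions)]:
    print(name, "MAE:", mae(test_completeness, preds),
          "RMSE:", rmse(test_completeness, preds))
```

A model is only useful if its errors are lower than those of the median predictor on the same held-out data, which is why the baseline is computed from the training values alone.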



Introduction

The first protein structures were determined in the 1950s using X-ray crystallography (Kendrew et al., 1958). By 2020, the number of solved protein structures deposited in the Protein Data Bank (PDB) exceeded 154 000 (Berman et al., 2000; https://www.rcsb.org/stats/summary). To enable this progress, researchers have automated the computational work of determining the protein structure from X-ray crystallographic data sets. The resolution of the experimental observations, the quality of experimental phasing or the similarity of the molecular-replacement model, and many other features such as ice rings may affect the quality of the data. Each of these factors impacts the performance of different model-building algorithms in different ways (Vollmar et al., 2020; Alharbi et al., 2019; Morris et al., 2004).

