Abstract

The protein structure field is experiencing a revolution, from the increased throughput of techniques to determine experimental structures, to developments such as cryo-EM that allow us to solve the structures of large protein complexes and, more recently, artificial intelligence tools such as AlphaFold that can predict with high accuracy the folding of proteins for which few homology templates are available. Here we quantify the effect of the recently released AlphaFold database of protein structural models on our knowledge of human proteins. Our results indicate that the current baseline for structural coverage of 48%, considering experimentally derived structures or template-based homology models, rises to 76% when AlphaFold predictions are included. At the same time, the fraction of the dark proteome is reduced from 26% to just 10% when AlphaFold models are considered. Furthermore, although the coverage of disease-associated genes and mutations was near complete before the AlphaFold release (69% of ClinVar pathogenic mutations and 88% of oncogenic mutations), AlphaFold models still provide an additional coverage of 3% to 13% of these critically important sets of biomedical genes and mutations. Finally, we show that the contribution of AlphaFold models to the structural coverage of non-human organisms, including important pathogenic bacteria, is significantly larger than for the human proteome. Overall, our results show that the sequence-structure gap of human proteins has almost disappeared, an outstanding success with direct consequences for our knowledge of the human genome and the medical applications derived from it.

Highlights

  • Ever since the first protein structure was published in 1958 [1] it was clear that structure information is essential to understand the biological functions of proteins

  • Over the last 25 years, the protein structure field has made significant advances. The Protein Data Bank (PDB) [2], the main database for protein structure coordinates, held 4,455 protein coordinate files in 1995, whereas by 2020 the number had increased to 177,806 (Fig 1A)

  • We considered only hits with an e-value below 1e-8 and a sequence identity of at least 20%, the limit thresholds for template-based homology modelling [3]
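
The hit-filtering step in the last highlight can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Hit` record, its field names, and the example values are hypothetical, and reading the 20% figure as an inclusive lower bound on sequence identity is an assumption. Only the two threshold values come from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text. Treating 20% as an inclusive lower
# bound on sequence identity is an assumption of this sketch.
MAX_EVALUE = 1e-8
MIN_IDENTITY = 20.0


@dataclass
class Hit:
    """One sequence-search hit (hypothetical record layout)."""
    query: str       # identifier of the query protein
    template: str    # identifier of the candidate template structure
    evalue: float    # e-value reported by the search tool
    identity: float  # sequence identity to the template, in percent


def usable_templates(hits):
    """Keep only hits that pass both the e-value and identity thresholds."""
    return [
        h for h in hits
        if h.evalue < MAX_EVALUE and h.identity >= MIN_IDENTITY
    ]


# Made-up example hits, for illustration only.
hits = [
    Hit("P1", "tmplA", 1e-30, 45.0),  # passes both thresholds
    Hit("P1", "tmplB", 1e-5, 60.0),   # rejected: e-value too high
    Hit("P2", "tmplC", 1e-12, 15.0),  # rejected: identity too low
]
kept = usable_templates(hits)
print([h.template for h in kept])
```

Both conditions must hold at once: a statistically significant hit with very low identity is rejected, and so is a high-identity hit whose e-value is too weak.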


Summary

Introduction

Ever since the first protein structure was published in 1958 [1], it has been clear that structural information is essential to understand the biological functions of proteins. Several groups in the late 1980s and early 1990s observed that protein structure is much more conserved than sequence [3], which led to the creation of the first computational tools to predict protein folding [4–8]. To systematically assess the performance of these tools and monitor advances in the protein folding prediction field, the Critical Assessment of protein Structure Prediction (CASP) experiments were established in 1994 [9]. These experiments have set a high standard in the field and in recent years have witnessed the massive progress that protein structure prediction has made thanks to, among other things, the use of artificial intelligence approaches [10].

