Abstract
Protein structure and function are determined by the arrangement of the linear sequence of amino acids in 3D space. We show that a deep graph neural network, ProteinSolver, can precisely design sequences that fold into a predetermined shape by phrasing this challenge as a constraint satisfaction problem (CSP), akin to Sudoku puzzles. We trained ProteinSolver on over 70,000,000 real protein sequences corresponding to over 80,000 structures. We show that our method rapidly designs new protein sequences, and we benchmark them in silico using energy-based scores, molecular dynamics, and structure prediction methods. As a proof-of-principle validation, we use ProteinSolver to generate sequences that match the structure of serum albumin, then synthesize the top-scoring design and validate it in vitro using circular dichroism. ProteinSolver is freely available at http://design.proteinsolver.org and https://gitlab.com/ostrokach/proteinsolver. A record of this paper's transparent peer review process is included in the Supplemental Information.
Highlights
Protein structure and function emerge from the specific geometric arrangement of the protein's linear sequence of amino acids, commonly referred to as a fold
Network architecture: As there had been little previous work on using neural networks to solve constraint satisfaction problems (CSPs) (Palm et al, 2017; Prates et al, 2018), we first had to devise a network architecture that would be well suited to this problem
To facilitate this search, we focused on designing a neural network capable of solving Sudoku puzzles, a well-defined CSP (25) for which predictions made by the network can be verified
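To illustrate why Sudoku predictions are easy to verify, the following minimal sketch checks a completed 9x9 grid against the row, column, and box constraints. It is an illustration only and not code from ProteinSolver; the function name `is_valid_sudoku_solution` and the example grid are assumptions introduced here.

```python
def is_valid_sudoku_solution(grid):
    """Return True if a completed 9x9 grid satisfies every Sudoku constraint.

    Hypothetical helper for illustration; not part of ProteinSolver.
    """
    # Reject anything that is not a 9x9 grid.
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False

    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    # Each of the 27 units must contain the digits 1-9 exactly once.
    return all(unit == digits for unit in rows + cols + boxes)


# A valid grid generated from a simple shifting pattern (illustrative data).
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

# Copy the grid and overwrite one cell to violate the constraints.
broken = [row[:] for row in solved]
broken[0][0] = broken[0][1]

print(is_valid_sudoku_solution(solved))
print(is_valid_sudoku_solution(broken))
```

A network's filled-in grid can be scored the same way, which gives an unambiguous correctness signal. Designed protein sequences have no comparably simple check.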
Summary
Protein structure and function emerge from the specific geometric arrangement of the protein's linear sequence of amino acids, commonly referred to as a fold. A sampling technique such as Markov-chain Monte Carlo is used to generate sequences optimized with respect to a force field or statistical potential (Chevalier et al, 2017; Shultis et al, 2019; Sun and Kim, 2017). Limitations of those methods include the relatively low accuracy of existing force fields (Khan and Vihinen, 2010; Kroncke et al, 2016) and the inability to sample more than a minuscule portion of the vast search space (the size of sequence space is $20^N$, where $N$ is the number of residues). While there have been successful approaches that screen many thousands of individual designs using in vitro selection techniques (Rocklin et al, 2017; Sun et al, 2016), those approaches remain reliant on labor-intensive experiments.
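To give a sense of the scale of the search space, the short sketch below computes the number of possible sequences for a given length, assuming 20 standard amino acids at each position. The function names and the example length of 100 residues are illustrative assumptions, not values taken from the paper.

```python
import math

NUM_AMINO_ACIDS = 20  # standard amino acid alphabet


def sequence_space_size(n_residues):
    """Exact number of distinct sequences of the given length (20^N)."""
    return NUM_AMINO_ACIDS ** n_residues


def log10_sequence_space(n_residues):
    """Base-10 logarithm of 20^N, convenient for very long sequences."""
    return n_residues * math.log10(NUM_AMINO_ACIDS)


# Example: a hypothetical 100-residue protein.
n = 100
print(f"log10(20^{n}) = {log10_sequence_space(n):.1f}")
```

For a 100-residue protein this gives roughly 10^130 candidate sequences, so even a sampler that evaluates billions of sequences covers a vanishingly small fraction of the space.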