Abstract
Graph representations are traditionally used to represent protein structures in sequence design protocols in which the protein backbone conformation is known. This infrequently extends to machine learning projects: existing graph convolution algorithms have shortcomings when representing protein environments. One reason for this is the lack of emphasis on edge attributes during message-passing operations. Another reason is the traditionally shallow nature of graph neural network architectures. Here we introduce an improved message-passing operation that is better equipped to model local kinematics problems such as protein design. Our approach, XENet, pays special attention to both incoming and outgoing edge attributes. We compare XENet against existing graph convolutions in an attempt to decrease rotamer sample counts in Rosetta's rotamer substitution protocol, used for protein side-chain optimization and sequence design. This use case is motivating because it both reduces the size of the search space for classical side-chain optimization algorithms and allows larger protein design problems to be solved with quantum algorithms on near-term quantum computers with limited qubit counts. XENet outperformed competing models while also displaying a greater tolerance for deeper architectures. We found that XENet was able to decrease rotamer counts by 40% without loss in quality. This decreased the memory consumption for classical pre-computation of rotamer energies in our use case by more than a factor of 3, the qubit consumption for an existing sequence design quantum algorithm by 40%, and the size of the solution space by a factor of 165. Additionally, XENet displayed an ability to handle deeper architectures than competing convolutions.
Highlights
Protein design involves astronomically large search problems beyond the capabilities of even the largest supercomputers. [1] This task traditionally involves assuming a static protein backbone and representing all candidate side-chain conformations and identities as discrete possibilities called “rotamers”. [2,3,4] A single sequence position on the protein can have hundreds of candidate rotamers when spanning all twenty native amino acids.
For problems with tens of variable positions and thousands to tens of thousands of total rotamers, it is necessary to use heuristic methods that do not offer guarantees of finding the global optimum, such as the simulated annealing approaches implemented in the Rosetta software suite. [7, 8] Because protein designers are often interested in diverse near-optimal solutions rather than in the single solution that optimizes the scoring function, they often sacrifice the guarantee of finding the global optimum in favor of a convenient means of rapidly sampling from the pool of near-optimal solutions.
As the baseline model for this experiment we considered Edge-Conditioned Convolutions (ECCs), since they are among the first and most widely used graph neural networks (GNNs) designed to process edge attributes, and we compared them against different configurations of CrystalConv and XENet to ensure a fair comparison.
Summary
Protein design involves astronomically large search problems beyond the capabilities of even the largest supercomputers. [1] This task traditionally involves assuming a static protein backbone and representing all candidate side-chain conformations and identities as discrete possibilities called “rotamers”. [2,3,4] A single sequence position on the protein can have hundreds of candidate rotamers when spanning all twenty native amino acids. For trivially small rotamer optimization problems, exhaustive enumeration is feasible, but it becomes infeasible for most real-world problems, since the number of possible solutions given N variable amino acid positions and D rotamers per position is $D^N$, which grows exponentially with N. For problems with tens of variable positions and thousands to tens of thousands of total rotamers, it is necessary to use heuristic methods that do not offer guarantees of finding the global optimum, such as the simulated annealing approaches implemented in the Rosetta software suite. Given the exponentially scaling solution space, even heuristic methods cease to be effective at sampling the low-energy solutions for rotamer optimization problems with hundreds of variable positions or hundreds of thousands of total rotamers.
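To make the exponential scaling described above concrete (D rotamers at each of N positions gives D to the power N candidate solutions), the following is a minimal Python sketch. The function names and the example values N = 10 and D = 100 are hypothetical and chosen only for illustration. The sketch also assumes, as a simplification not stated in the source, that a rotamer reduction is applied uniformly at every position.

```python
def solution_space_size(num_positions: int, rotamers_per_position: int) -> int:
    """Count rotamer assignments for N positions with D rotamers each: D ** N.

    Python integers are arbitrary precision, so the count is exact.
    """
    return rotamers_per_position ** num_positions


def reduction_factor(num_positions: int, fraction_kept: float) -> float:
    """Factor by which the solution space shrinks if every position keeps
    the same fraction of its rotamers (a uniform-reduction assumption)."""
    return (1.0 / fraction_kept) ** num_positions


if __name__ == "__main__":
    n, d = 10, 100  # hypothetical problem size, not taken from the source
    print(f"full space:    {solution_space_size(n, d):.3e} assignments")
    print(f"reduced space: {solution_space_size(n, int(d * 0.6)):.3e} assignments")
    print(f"shrink factor: {reduction_factor(n, 0.6):.1f}")
```

Under these assumptions, keeping 60% of the rotamers at each of ten positions shrinks the space by roughly a factor of 165, which is of the same size as the reduction quoted in the abstract. The source text here does not state the number of positions or whether the reduction was uniform, so this agreement should be read as illustrative only.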