Rosetta:MSF:NN: Boosting performance of multi-state computational protein design with a neural network.

Julian Nazet, Elmar Lang, Rainer Merkl, Yang Zhang

doi:10.1371/journal.pone.0256691

Abstract

Rational protein design aims at the targeted modification of existing proteins. To reach this goal, software suites like Rosetta propose sequences that introduce the desired properties. Challenging design problems necessitate the representation of a protein by means of a structural ensemble. Thus, Rosetta multi-state design (MSD) protocols have been developed wherein each state represents one protein conformation. Computational demands of MSD protocols are high, because for each of the candidate sequences a costly three-dimensional (3D) model has to be created and assessed for all states. Each of these scores contributes one data point to a complex, design-specific energy landscape. As neural networks (NNs) have proved well-suited to learning such solution spaces, we integrated one into the framework Rosetta:MSF in place of the genetic algorithm used so far, with the aim of reducing computational costs. Like its predecessor, Rosetta:MSF:NN administers a set of candidate sequences and their scores and scans sequence space iteratively. During each iteration, the union of all candidate sequences and their Rosetta scores is used to re-train NNs that possess a design-specific architecture. The enormous speed of the NNs allows an extensive assessment of alternative sequences, which are ranked by the scores predicted by the NN. Costly 3D models are computed only for a small fraction of best-scoring sequences; these and the corresponding 3D-based scores replace half of the candidate sequences during each iteration. The analysis of two sets of candidate sequences generated for a specific design problem by means of a genetic algorithm confirmed that the NN predicted 3D-based scores quite well; the Pearson correlation coefficient was at least 0.95. Applying Rosetta:MSF:NN:enzdes to a benchmark consisting of 16 ligand-binding problems showed that this protocol converges ten times faster than the genetic algorithm and finds sequences with comparable scores.
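The iteration described in the abstract can be illustrated with a minimal, hedged sketch. It is not the Rosetta:MSF:NN implementation: `expensive_score` is a toy stand-in for a costly 3D-model-based Rosetta score, `AdditiveSurrogate` is a simple per-position additive model swapped in for the paper's neural network, and all names, pool sizes, and the choice to replace the worst-scoring half of the pool are assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def expensive_score(seq, target):
    """Toy stand-in (assumption) for a costly 3D-based score; lower is better."""
    return float(sum(a != b for a, b in zip(seq, target)))


class AdditiveSurrogate:
    """Cheap per-position additive model; a stand-in for the NN, not an NN."""

    def fit(self, seqs, scores):
        self.mean = sum(scores) / len(scores)
        sums, counts = {}, {}
        for seq, score in zip(seqs, scores):
            for pos, aa in enumerate(seq):
                key = (pos, aa)
                sums[key] = sums.get(key, 0.0) + (score - self.mean)
                counts[key] = counts.get(key, 0) + 1
        # Average score deviation observed for each (position, residue) pair.
        self.effect = {key: sums[key] / counts[key] for key in sums}

    def predict(self, seq):
        return self.mean + sum(
            self.effect.get((pos, aa), 0.0) for pos, aa in enumerate(seq)
        )


def mutate(seq, rng):
    """Return a copy of seq with one randomly chosen position re-drawn."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]


def run(target, pool_size=20, iterations=10, n_proposals=500, seed=0):
    rng = random.Random(seed)
    pool = {}  # candidate sequence -> expensive score
    while len(pool) < pool_size:
        seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(len(target)))
        pool[seq] = expensive_score(seq, target)
    history = [min(pool.values())]

    for _ in range(iterations):
        # 1. Re-train the surrogate on all current candidates and their scores.
        surrogate = AdditiveSurrogate()
        surrogate.fit(list(pool), list(pool.values()))

        # 2. Cheaply assess many alternative sequences with the surrogate.
        parents = list(pool)
        proposals = {mutate(rng.choice(parents), rng) for _ in range(n_proposals)}
        proposals -= set(pool)
        ranked = sorted(sorted(proposals), key=surrogate.predict)

        # 3. Compute the expensive score only for the best-predicted few and
        #    let them replace the worst-scoring half of the pool (assumption).
        chosen = ranked[: pool_size // 2]
        survivors = sorted(pool, key=pool.get)[: pool_size - len(chosen)]
        new_pool = {seq: pool[seq] for seq in survivors}
        for seq in chosen:
            new_pool[seq] = expensive_score(seq, target)
        pool = new_pool
        history.append(min(pool.values()))

    return pool, history


pool, history = run("ACDEFGHIKL")
```

In this sketch, each iteration calls `expensive_score` at most `pool_size // 2` times, while the surrogate ranks up to `n_proposals` sequences; that asymmetry is the source of the savings the abstract describes.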

Highlights

Computational protein design has become an important tool in molecular biology [1].
In order to allow for a fair comparison with the genetic algorithm (GA)-based protocol Rosetta:MSF:GA, the novel neural network (NN)-based approach Rosetta:MSF:NN administers for each iteration $r$ a set $OPT_r$ of $s = 239$ sequences, whose $RS_j^{3DM}$ values are used for their ranking; see Fig 1B.
For each design $prot_k$, the reference set $OPT^{GA}_{k,\delta}$ was identified. This set represents, among the first $\delta = 1, \dots, 500$ GA iterations, the earliest generation $\delta$ whose score value was most similar to $RS^{NN}(OPT^{NN}_{k,100})$, which was generated by Rosetta:MSF:NN:enzdes during the last iteration of the NN-based protocol.

Summary

Introduction

Computational protein design has become an important tool in molecular biology [1]. Different approaches and protocols have proven their reliability for a broad range of applications. Even for design problems of moderate complexity, a hybrid method that does not require the computation of a costly 3Dopt model to score each of the candidate sequences might drastically reduce the computation time of Rosetta’s protein design protocols. This search for optimal sequences can be considered a problem of multi-dimensional regression, where every combination of amino acid residues yields one data point of the design-specific energy landscape. We found that Rosetta:MSF:NN converges 10 times faster than our previous protocol and samples alternative areas of sequence space.
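To make the regression view from the introduction concrete, the following hedged sketch shows one common way to turn a combination of residues at the designable positions into a numeric input vector. One-hot encoding is an assumption chosen for illustration; the encoding actually used by Rosetta:MSF:NN is not stated in this excerpt, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Encode residues at the designable positions as a flat 0/1 vector."""
    n = len(AMINO_ACIDS)
    vec = [0.0] * (len(seq) * n)
    for pos, aa in enumerate(seq):
        vec[pos * n + AA_INDEX[aa]] = 1.0
    return vec


def sequence_space_size(n_positions):
    """Number of distinct residue combinations, i.e. possible data points."""
    return len(AMINO_ACIDS) ** n_positions


x = one_hot("ACD")               # regression input for one candidate
size = sequence_space_size(3)    # 20 ** 3 combinations for three positions
```

Each encoded sequence, paired with its score, forms one (input, target) example of the regression problem; the rapid growth of `sequence_space_size` with the number of positions is why scoring every combination with a full 3D model is impractical.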

Materials and methods

Design and implementation of the NN

Results and discussion

Does the NN-based approach find sequences with better Rosetta scores?