Abstract
Computational protein design, or inverse protein folding, aims to generate amino acid sequences that fold into an a priori determined structural fold, with the goal of engineering novel or enhanced biochemistry. For this task, a function describing the fitness landscape of sequences is critical for identifying the correct ones that fold into the desired structure. A nonlinear kernel fitness function can be formulated by combining weighted Gaussian kernels centered around a set of native proteins and a set of non-protein decoys. This type of nonlinear fitness function has been shown to offer significant improvement over linear functions in computational blind tests of global sequence design. However, this formulation is demanding in both storage and computational time. We show that the nonlinear fitness function for protein design can be significantly improved by using a rectangle kernel and a finite Newton method. A blind test of a simplified version of sequence design is carried out to simultaneously discriminate 428 native sequences, none homologous to any training protein, from 11 million challenging protein-like decoys. The simplified fitness function correctly classifies 408 native sequences (20 misclassifications, a 95% correct rate), outperforming both statistical linear scoring functions and optimized linear functions. Its performance is also comparable to results obtained from a far more complex nonlinear fitness function with > 5000 terms. Our results further suggest that, for the task of global sequence design of the 428 selected proteins, the search space of protein shape and sequence can be effectively parametrized with a basis set of only about 3680 carefully chosen proteins and decoys. We show in addition that the overall landscape is not overly sensitive to the specific choice of this set.
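To make the Gaussian-kernel formulation mentioned in the abstract concrete, the following is a minimal sketch of a fitness function built from weighted Gaussian kernels centered on native and decoy examples. It is illustrative only: the two-dimensional feature vectors, the weights, the kernel width `sigma`, the function names, and the sign convention (positive means native-like) are all assumptions, not values or conventions taken from the paper. The rectangle kernel and the finite Newton training step are not shown.

```python
import numpy as np


def gaussian_kernel(x, centers, sigma):
    """Gaussian kernel between one vector x and each row of centers."""
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_fitness(x, native_centers, native_weights,
                   decoy_centers, decoy_weights, sigma=1.0):
    """Weighted sum of kernels on natives minus weighted sum on decoys.

    Assumed sign convention: positive score = native-like.
    """
    native_term = native_weights @ gaussian_kernel(x, native_centers, sigma)
    decoy_term = decoy_weights @ gaussian_kernel(x, decoy_centers, sigma)
    return native_term - decoy_term


# Hypothetical toy data: each row is a made-up descriptor of a
# sequence-structure pair; real descriptors would be far higher-dimensional.
native_centers = np.array([[1.0, 0.0], [0.9, 0.1]])
decoy_centers = np.array([[0.0, 1.0], [0.1, 0.9]])
native_weights = np.array([1.0, 1.0])  # assumed, not trained
decoy_weights = np.array([1.0, 1.0])   # assumed, not trained

score_native_like = kernel_fitness(
    np.array([1.0, 0.05]),
    native_centers, native_weights, decoy_centers, decoy_weights)
score_decoy_like = kernel_fitness(
    np.array([0.05, 1.0]),
    native_centers, native_weights, decoy_centers, decoy_weights)

print(score_native_like > 0, score_decoy_like < 0)
```

In this form, every kernel center contributes one term to the sum, which is why storage and evaluation time grow with the number of natives and decoys retained in the basis set.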