Abstract

We examine the problem of constructing a fitness landscape of proteins for generating amino acid sequences that fold into an a priori determined structural fold. Such a landscape would be useful for engineering proteins with novel or enhanced biochemistry. It should characterize the global fitness landscape of many proteins simultaneously and guide the search process toward the correct protein sequences. We introduce two geometric views and propose a formulation using a mixture of nonlinear Gaussian kernel functions. We aim to solve a simplified protein sequence design problem: our goal is to distinguish each native sequence, for a major portion of representative protein structures, from a large number of alternative decoy sequences, each a fragment from proteins of different folds. The nonlinear fitness function we developed perfectly discriminates a set of 440 native proteins from 14 million sequence decoys, whereas no linear fitness function can succeed in this task. In a blind test of unrelated proteins, the nonlinear fitness function misclassifies only 13 native proteins out of 194. This compares favorably with the roughly 3–4 times more misclassifications made when optimal linear functions are used. To significantly reduce the complexity of the nonlinear fitness function, we further constructed a simplified nonlinear fitness function using a rectangular kernel with a basis set of proteins and decoys chosen a priori. The full landscape for a large number of protein folds can be captured using only 480 native proteins and 3200 nonprotein decoys via a finite Newton method, compared to about 7000 proteins and decoys in the original nonlinear fitness function. A blind test of a simplified version of sequence design was carried out to discriminate simultaneously 428 native sequences, none with significant sequence identity to any training protein, from 11 million challenging protein-like decoys.
This simplified fitness function correctly classified 408 native sequences, with only 20 misclassifications (95% correct rate), outperforming several other statistical linear fitness functions and optimized linear functions. Our results further suggest that, for the task of global sequence design, the search space of protein shape and sequence can be effectively parameterized with a relatively small, carefully chosen basis set of proteins and decoys. For example, the task of designing 428 selected nonhomologous proteins can be achieved using a basis set of about 3680 proteins and decoys. In addition, we show that the overall landscape is not overly sensitive to the specific choice of proteins and decoys. The construction of fitness landscapes has broad implications for understanding molecular evolution, cellular epigenetic state, and protein structures. Our results can be generalized to construct other types of fitness landscapes.
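
The abstract does not spell out the functional form of the nonlinear fitness function. As a minimal sketch of what a mixture of Gaussian kernels over a basis set of native proteins and decoys might look like, the Python below scores a candidate by summing weighted kernels centered on native basis vectors and subtracting weighted kernels centered on decoy basis vectors. Everything here is an illustrative assumption rather than the authors' method: the names `gaussian_kernel` and `fitness`, the two-dimensional toy descriptor vectors, the unit weights, the kernel width `sigma`, and the sign convention (positive means native-like). In practice the weights would be fitted by an optimization procedure, which this sketch omits.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2)) between two vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def fitness(c, natives, decoys, w_native, w_decoy, bias=0.0, sigma=1.0):
    """Signed mixture of Gaussian kernels (assumed form, not from the source).

    Kernels centered on native basis vectors add to the score and kernels
    centered on decoy basis vectors subtract from it, so a positive score
    is read as native-like under this sketch's convention.
    """
    score = bias
    for weight, native in zip(w_native, natives):
        score += weight * gaussian_kernel(c, native, sigma)
    for weight, decoy in zip(w_decoy, decoys):
        score -= weight * gaussian_kernel(c, decoy, sigma)
    return score


# Hypothetical toy basis set: 2-D descriptor vectors with unit weights.
NATIVES = [(1.0, 1.0), (1.2, 0.8)]
DECOYS = [(-1.0, -1.0), (-0.8, -1.2)]
W_NATIVE = [1.0, 1.0]
W_DECOY = [1.0, 1.0]

native_like = fitness((0.9, 1.1), NATIVES, DECOYS, W_NATIVE, W_DECOY)
decoy_like = fitness((-1.0, -0.9), NATIVES, DECOYS, W_NATIVE, W_DECOY)
print("native-like candidate score:", native_like)
print("decoy-like candidate score:", decoy_like)
```

Because each kernel decays with distance from its center, the decision boundary can curve around clusters of natives and decoys, which is the kind of separation a single linear function cannot provide.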
