Guaranteed Diversity and Optimality in Cost Function Network Based Computational Protein Design Methods

Manon Ruffini, Simon de Givry, Thomas Schiex, Sophie Barbe, George Katsirelos, Jelena Vucinic

doi:10.3390/a14060168

Abstract

Proteins are the main active molecules of life. Although natural proteins play many roles, as enzymes or antibodies for example, there is a need to go beyond the repertoire of natural proteins to produce engineered proteins that precisely meet application requirements, in terms of function, stability, activity or other protein capacities. Computational Protein Design aims at designing new proteins from first principles, using full-atom molecular models. However, the size and complexity of proteins require approximations to make them amenable to energetic optimization queries. These approximations make the design process less reliable, and even a provably optimal solution may fail. In practice, expensive libraries of solutions are therefore generated and tested. In this paper, we explore, on a large set of realistic full protein redesign problems, the idea of generating libraries of provably diverse low-energy solutions by extending cost function network algorithms with dedicated automaton-based diversity constraints. We observe that it is possible to generate provably diverse libraries in reasonable time and that the produced libraries do enhance the Native Sequence Recovery, a traditional measure of the reliability of design methods.

Highlights

Proteins are complex molecules that govern much of how cells work, in humans, plants, and microbes. They are made of a succession of simple molecules called α-amino acids.
The sidechain defines the nature of the amino acid.
As function is closely related to three-dimensional (3D) structure [1], computational protein design (CPD) methods aim at finding a sequence that folds into a target 3D structure that corresponds to the desired properties and functions.

Summary

Introduction

Proteins are complex molecules that govern much of how cells work, in humans, plants, and microbes. With a rotamer library for all 20 natural amino acids typically containing a few hundred conformations, the discrete search space quickly becomes challenging to explore, and the problem has been shown to be NP-hard [2] (its decision version is NP-complete). It has naturally been approached with stochastic optimization techniques such as Monte Carlo simulated annealing [3], as in the commonly used Rosetta software [4]. Constraint programming-based algorithms for solving the weighted constraint satisfaction problem (WCSP) on cost function networks (CFN) have been proposed to tackle CPD instances [5,6]. These provable methods have shown unprecedented efficiency at optimizing decomposable force fields on genuine protein design instances [6], leading to successfully characterized new proteins [7]. Cost Function Networks are one example of a larger family of mathematical models, called graphical models, that aim at representing and analyzing decomposable functions [8,9].
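To make the notions of a decomposable energy and a diverse low-energy library concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the three-position problem, the cost tables, and the function names `energy`, `hamming` and `diverse_library` are invented for this example. It enumerates all assignments by brute force and greedily keeps the lowest-energy ones that lie at a minimum Hamming distance from those already kept. It does not use the automaton-based diversity constraints or the CFN solving algorithms studied in the paper, and brute-force enumeration is only feasible at toy sizes.

```python
from itertools import product

# Hypothetical toy instance: 3 positions, each with a small domain of values.
domains = [["A", "L"], ["A", "L", "K"], ["L", "K"]]

# Unary costs: one table per position (made-up numbers).
unary = [
    {"A": 1, "L": 0},
    {"A": 2, "L": 0, "K": 1},
    {"L": 1, "K": 0},
]

# Pairwise costs between positions; pairs of values not listed cost 0.
pairwise = {
    (0, 1): {("L", "L"): 3},
    (1, 2): {("K", "K"): 2, ("L", "K"): 1},
}


def energy(assignment):
    """Decomposable cost: sum of unary terms plus pairwise terms."""
    total = sum(unary[i][v] for i, v in enumerate(assignment))
    for (i, j), table in pairwise.items():
        total += table.get((assignment[i], assignment[j]), 0)
    return total


def hamming(a, b):
    """Number of positions at which two assignments differ."""
    return sum(x != y for x, y in zip(a, b))


def diverse_library(size, min_dist):
    """Greedily keep the lowest-energy assignments that are at Hamming
    distance >= min_dist from every assignment already kept.
    Ties in energy are broken lexicographically for determinism."""
    ranked = sorted(product(*domains), key=lambda a: (energy(a), a))
    library = []
    for candidate in ranked:
        if all(hamming(candidate, kept) >= min_dist for kept in library):
            library.append(candidate)
            if len(library) == size:
                break
    return library


if __name__ == "__main__":
    for solution in diverse_library(size=3, min_dist=2):
        print("".join(solution), energy(solution))
```

Because the enumeration is exhaustive, each kept assignment is guaranteed to be the lowest-energy one satisfying the distance requirement with respect to the earlier picks. That guarantee is what a stochastic search cannot offer, and what a provable solver has to deliver without enumerating the whole space.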

Methods

Results

Conclusion