Abstract

Computational Protein Design (CPD) has produced impressive results for engineering new proteins, resulting in a wide variety of applications. In the past few years, various efforts have aimed at replacing or improving existing design methods using Deep Learning technology to leverage the amount of publicly available protein data. Deep Learning (DL) is a very powerful tool to extract patterns from raw data, provided that data are formatted as mathematical objects and the architecture processing them is well suited to the targeted problem. In the case of protein data, specific representations are needed for both the amino acid sequence and the protein structure in order to capture 1D and 3D information, respectively. As no consensus has been reached about the most suitable representations, this review describes the representations used so far, discusses their strengths and weaknesses, and details their associated DL architectures for design and related tasks.

Highlights

  • The wide variety of natural proteins fulfills many different functions, from catalysis to specific recognition, transport, or regulation

  • The most common approach to Computational Protein Design (CPD) consists of choosing, or constructing de novo, a target backbone structure that could carry the function of interest and then identifying a sequence that will fold onto this backbone and present the expected properties

  • This formulation is convenient for developing algorithms, but it should be noted that it makes CPD an ill-posed problem: the sequence is optimized for the target structure, but this structure may not be optimal for the sequence, which may fold into a different structure [13,14]

Summary

Introduction

The wide variety of natural proteins fulfills many different functions, from catalysis to specific recognition, transport, or regulation. After some background on CPD and Deep Learning, we present the different types of representation that have been used for protein data, both sequences and structures, in design and related tasks. We discuss their strengths and weaknesses, and detail the neural architectures used to process them. Fixing a target backbone allows the design problem to be formulated as an optimization problem: given an input backbone, find a sequence that maximally stabilizes it (and fulfills the desired function) by minimizing a score function that usually combines the free energy of the resulting protein with other function-related criteria. This formulation is convenient for developing algorithms, but it should be noted that it makes CPD an ill-posed problem: the sequence is optimized for the target structure, but this structure may not be optimal for the sequence, which may fold into a different structure [13,14]. We focus on the pure sequence design task, aiming at producing a sequence that should either fold into a target backbone or, for some methods, present a desired function.
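
The fixed-backbone formulation above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the review: the backbone is reduced to a hypothetical per-position buried/exposed flag, `toy_score` is a crude stand-in for a real energy function (it only counts hydrophobicity mismatches), and the search is a simple stochastic hill climb rather than any published design algorithm.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Crude, assumed hydrophobic set; real score functions are far richer.
HYDROPHOBIC = set("AVILMFW")


def toy_score(sequence, burial):
    """Hypothetical score (lower is better).

    burial[i] is True when position i of the target backbone is buried.
    One penalty point is added for each buried polar residue and for
    each exposed hydrophobic residue.
    """
    penalty = 0
    for residue, buried in zip(sequence, burial):
        if buried != (residue in HYDROPHOBIC):
            penalty += 1
    return penalty


def design_sequence(burial, n_steps=2000, seed=0):
    """Minimize toy_score over sequences by single-point mutations."""
    rng = random.Random(seed)
    sequence = [rng.choice(AMINO_ACIDS) for _ in burial]
    best = toy_score(sequence, burial)
    for _ in range(n_steps):
        pos = rng.randrange(len(sequence))
        old = sequence[pos]
        sequence[pos] = rng.choice(AMINO_ACIDS)
        new = toy_score(sequence, burial)
        if new <= best:
            best = new  # keep mutations that do not worsen the score
        else:
            sequence[pos] = old  # revert mutations that worsen it
    return "".join(sequence), best


if __name__ == "__main__":
    # Hypothetical 8-residue backbone: True = buried, False = exposed.
    burial = [True, False, True, True, False, False, True, False]
    designed, score = design_sequence(burial)
    print(designed, score)
```

The sketch also shows why the problem is ill-posed: the search only checks that the sequence suits the given backbone, and nothing verifies that this backbone is the structure the designed sequence would actually prefer to fold into.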

Evaluation of Design Methods
Background on Deep Learning
Training
Recurrent Architectures
Attention Models
Generative Models
Representation of the Protein Sequence
One-Hot Encoding
Learned Embedding
Position-Specific Scoring Matrices
Representing the Protein Structure
Sequential and Hand-Crafted Representations
Voxel Representation
Distance Maps
Graphs
Conclusions