Abstract

Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabeled amino-acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach predicts the stability of natural and de novo designed proteins, and the quantitative function of molecularly diverse mutants, competitively with the state-of-the-art methods. UniRep further enables two orders of magnitude efficiency improvement in a protein engineering task. UniRep is a versatile summary of fundamental protein features that can be applied across protein engineering informatics.
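
As a rough illustration of what such a fixed-length representation looks like in code, the sketch below integer-encodes the amino acids, runs a toy recurrent network over the residues and averages the per-residue hidden states into a single vector, which is the general pattern behind UniRep's summary vector. The 1900-unit width follows the paper's largest mLSTM, but the weights, the plain tanh recurrence and the function names here are placeholders, not the trained model.

    import numpy as np

    # Vocabulary of the 20 standard amino acids; the index scheme is
    # illustrative, not the encoding used by the released model.
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    HIDDEN_SIZE = 1900   # width reported for the largest UniRep mLSTM
    EMBED_SIZE = 10      # per-residue input embedding size (placeholder)

    rng = np.random.default_rng(0)
    W_embed = rng.normal(scale=0.1, size=(len(AMINO_ACIDS), EMBED_SIZE))
    W_xh = rng.normal(scale=0.1, size=(EMBED_SIZE, HIDDEN_SIZE))
    W_hh = rng.normal(scale=0.1, size=(HIDDEN_SIZE, HIDDEN_SIZE))

    def represent(sequence: str) -> np.ndarray:
        # Run a toy recurrent network over the residues and average the hidden
        # states, giving one fixed-length vector regardless of sequence length.
        h = np.zeros(HIDDEN_SIZE)
        states = []
        for aa in sequence:
            x = W_embed[AA_TO_IDX[aa]]
            h = np.tanh(x @ W_xh + h @ W_hh)   # plain RNN step; UniRep uses an mLSTM
            states.append(h)
        return np.mean(states, axis=0)

    print(represent("MKTAYIAKQR").shape)   # (1900,)

Because the averaging step is length-independent, the same downstream model can be applied to proteins of any size.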

Highlights

  • We use a recurrent neural network (RNN) to learn statistical representations of proteins from ~24 million UniRef50 sequences (Fig. 1a); a minimal sketch of the training objective appears after this list

  • To assess how semantically related proteins are represented by the unified representation (UniRep), we examined its ability to partition structurally similar sequences that share little sequence identity and to enable unsupervised clustering of homologous sequences (a clustering sketch also follows this list)

  • We considered what sized region of sequence space would make the best training data for UniRep
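A minimal sketch of the training setup behind the first highlight: a single multiplicative LSTM (mLSTM) step in the style of Krause et al., paired with a next-amino-acid cross-entropy objective so that only unlabeled sequences are needed. The hidden size, the random weights and the one-hot input encoding are placeholders chosen for brevity; the released model is far larger and trained on ~24 million sequences.

    import numpy as np

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    V, H = len(AMINO_ACIDS), 64          # small hidden size for the sketch

    rng = np.random.default_rng(1)

    def p(*shape):
        return rng.normal(scale=0.1, size=shape)

    Wmx, Wmh = p(V, H), p(H, H)      # multiplicative (input-gated recurrent) pathway
    Wix, Wim = p(V, H), p(H, H)      # input gate
    Wfx, Wfm = p(V, H), p(H, H)      # forget gate
    Wox, Wom = p(V, H), p(H, H)      # output gate
    Wcx, Wcm = p(V, H), p(H, H)      # cell candidate
    Wout = p(H, V)                   # hidden state -> next-residue logits

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mlstm_step(x, h_prev, c_prev):
        # The defining mLSTM trick: the recurrent contribution is modulated
        # elementwise by the current input before entering the LSTM gates.
        m = (x @ Wmx) * (h_prev @ Wmh)
        i = sigmoid(x @ Wix + m @ Wim)
        f = sigmoid(x @ Wfx + m @ Wfm)
        o = sigmoid(x @ Wox + m @ Wom)
        c = f * c_prev + i * np.tanh(x @ Wcx + m @ Wcm)
        h = o * np.tanh(c)
        return h, c

    def next_aa_loss(sequence):
        # Average cross-entropy of predicting each residue from its prefix:
        # the unsupervised objective needs nothing but the raw sequence.
        h = np.zeros(H)
        c = np.zeros(H)
        total, count = 0.0, 0
        for cur, nxt in zip(sequence[:-1], sequence[1:]):
            x = np.eye(V)[AA_TO_IDX[cur]]          # one-hot input residue
            h, c = mlstm_step(x, h, c)
            logits = h @ Wout
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            total -= np.log(probs[AA_TO_IDX[nxt]])
            count += 1
        return total / count

    print(next_aa_loss("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))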
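The clustering check in the second highlight reduces to measuring distances between representation vectors and grouping them without labels. The sketch below uses cosine distances and average-linkage hierarchical clustering from SciPy on random stand-in vectors; in practice each row would be the representation of one real protein sequence.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    # Stand-in representation vectors: in practice each row would be the
    # UniRep vector of one protein sequence (e.g. 1,900 dimensions).
    rng = np.random.default_rng(2)
    family_a = rng.normal(loc=-1.0, scale=0.1, size=(5, 1900))   # simulated family 1
    family_b = rng.normal(loc=1.0, scale=0.1, size=(5, 1900))    # simulated family 2
    reps = np.vstack([family_a, family_b])

    # Unsupervised grouping: cosine distances between representation vectors,
    # followed by average-linkage hierarchical clustering into two groups.
    dists = pdist(reps, metric="cosine")
    tree = linkage(dists, method="average")
    labels = fcluster(tree, t=2, criterion="maxclust")
    print(labels)   # members of the same simulated family should share a label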


Results

An mLSTM learns semantically rich representations from a massive sequence dataset

Multiplicative long short-term memory (mLSTM) RNNs learn rich representations for natural language, enabling state-of-the-art performance on critical tasks[23].

On all nine DMS datasets, UniRep Fusion-based models achieved superior test-set performance, outperforming a comprehensive suite of baselines including a state-of-the-art Doc2Vec representation (Fig. 3e and Supplementary Tables 5 and 6). This is surprising given that these proteins share little sequence similarity (408 mutations apart on average), are derived from six different organisms, range in size (264–724 aa), vary from near-universal (hsp90) to organism-specific (Gb1) and take part in diverse biological processes (for example, catalysis, DNA binding, molecular sensing and protein chaperoning)[35].

We compared the Evotuned UniRep to the untuned, global UniRep as well as a randomly initialized UniRep architecture trained only on local evolutionary data (Evotuned Random; see Fig. 4a and Methods). Using these trained unsupervised models, we generated representations for the avGFP variant sequences from Sarkisyan et al.[37] and trained simple sparse linear regression top models on each to predict avGFP brightness. The best alternative approach examined here would be 100× more expensive (Supplementary Fig. 12).
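A minimal sketch of the "simple sparse linear regression top model" pattern described above, assuming representation vectors have already been computed for each variant: scikit-learn's cross-validated Lasso is one way to fit such a sparse model, with random stand-ins here in place of real representations and brightness measurements.

    import numpy as np
    from sklearn.linear_model import LassoCV
    from sklearn.model_selection import train_test_split

    # Stand-ins: each row of X would be the representation vector of one
    # avGFP variant, and y its measured brightness; here both are simulated.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 1900))
    true_w = np.zeros(1900)
    true_w[:20] = 1.0                 # only a few informative dimensions
    y = X @ true_w + rng.normal(scale=0.5, size=500)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Sparse linear top model: the L1 penalty (chosen by cross-validation)
    # keeps only a small subset of representation dimensions.
    model = LassoCV(cv=5).fit(X_train, y_train)
    print("test R^2:", round(model.score(X_test, y_test), 3))
    print("nonzero coefficients:", int(np.count_nonzero(model.coef_)))

Swapping the global representation for an evotuned one, or concatenating several representation variants, only changes how X is built; the simple top model stays the same.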


