Abstract

Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabeled amino-acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach predicts the stability of natural and de novo designed proteins, and the quantitative function of molecularly diverse mutants, competitively with the state-of-the-art methods. UniRep further enables two orders of magnitude efficiency improvement in a protein engineering task. UniRep is a versatile summary of fundamental protein features that can be applied across protein engineering informatics.
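
As a rough illustration of what such a fixed-length representation looks like in code, the sketch below integer-encodes the amino acids, runs a toy recurrent network over the residues and averages the per-residue hidden states into a single vector, which is the general pattern behind UniRep's summary vector. The 1900-unit width follows the paper's largest mLSTM, but the weights, the plain tanh recurrence and the function names here are placeholders, not the trained model.

    import numpy as np

    # Vocabulary of the 20 standard amino acids; the index scheme is
    # illustrative, not the encoding used by the released model.
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    HIDDEN_SIZE = 1900   # width reported for the largest UniRep mLSTM
    EMBED_SIZE = 10      # per-residue input embedding size (placeholder)

    rng = np.random.default_rng(0)
    W_embed = rng.normal(scale=0.1, size=(len(AMINO_ACIDS), EMBED_SIZE))
    W_xh = rng.normal(scale=0.1, size=(EMBED_SIZE, HIDDEN_SIZE))
    W_hh = rng.normal(scale=0.1, size=(HIDDEN_SIZE, HIDDEN_SIZE))

    def represent(sequence: str) -> np.ndarray:
        # Run a toy recurrent network over the residues and average the hidden
        # states, giving one fixed-length vector regardless of sequence length.
        h = np.zeros(HIDDEN_SIZE)
        states = []
        for aa in sequence:
            x = W_embed[AA_TO_IDX[aa]]
            h = np.tanh(x @ W_xh + h @ W_hh)   # plain RNN step; UniRep uses an mLSTM
            states.append(h)
        return np.mean(states, axis=0)

    print(represent("MKTAYIAKQR").shape)   # (1900,)

Because the averaging step is length-independent, the same downstream model can be applied to proteins of any size.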

Highlights

  • We use a recurrent neural network (RNN) to learn statistical representations of proteins from ~24 million UniRef50 sequences (Fig. 1a); a minimal sketch of the training objective appears after this list

  • To assess how semantically related proteins are represented by the unified representation (UniRep), we examined its ability to partition structurally similar sequences that share little sequence identity and to enable unsupervised clustering of homologous sequences (a clustering sketch also follows this list)

  • We considered what sized region of sequence space would make the best training data for UniRep
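A minimal sketch of the training setup behind the first highlight: a single multiplicative LSTM (mLSTM) step in the style of Krause et al., paired with a next-amino-acid cross-entropy objective so that only unlabeled sequences are needed. The hidden size, the random weights and the one-hot input encoding are placeholders chosen for brevity; the released model is far larger and trained on ~24 million sequences.

    import numpy as np

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    V, H = len(AMINO_ACIDS), 64          # small hidden size for the sketch

    rng = np.random.default_rng(1)

    def p(*shape):
        return rng.normal(scale=0.1, size=shape)

    Wmx, Wmh = p(V, H), p(H, H)      # multiplicative (input-gated recurrent) pathway
    Wix, Wim = p(V, H), p(H, H)      # input gate
    Wfx, Wfm = p(V, H), p(H, H)      # forget gate
    Wox, Wom = p(V, H), p(H, H)      # output gate
    Wcx, Wcm = p(V, H), p(H, H)      # cell candidate
    Wout = p(H, V)                   # hidden state -> next-residue logits

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mlstm_step(x, h_prev, c_prev):
        # The defining mLSTM trick: the recurrent contribution is modulated
        # elementwise by the current input before entering the LSTM gates.
        m = (x @ Wmx) * (h_prev @ Wmh)
        i = sigmoid(x @ Wix + m @ Wim)
        f = sigmoid(x @ Wfx + m @ Wfm)
        o = sigmoid(x @ Wox + m @ Wom)
        c = f * c_prev + i * np.tanh(x @ Wcx + m @ Wcm)
        h = o * np.tanh(c)
        return h, c

    def next_aa_loss(sequence):
        # Average cross-entropy of predicting each residue from its prefix:
        # the unsupervised objective needs nothing but the raw sequence.
        h = np.zeros(H)
        c = np.zeros(H)
        total, count = 0.0, 0
        for cur, nxt in zip(sequence[:-1], sequence[1:]):
            x = np.eye(V)[AA_TO_IDX[cur]]          # one-hot input residue
            h, c = mlstm_step(x, h, c)
            logits = h @ Wout
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            total -= np.log(probs[AA_TO_IDX[nxt]])
            count += 1
        return total / count

    print(next_aa_loss("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))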
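The clustering check in the second highlight reduces to measuring distances between representation vectors and grouping them without labels. The sketch below uses cosine distances and average-linkage hierarchical clustering from SciPy on random stand-in vectors; in practice each row would be the representation of one real protein sequence.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    # Stand-in representation vectors: in practice each row would be the
    # UniRep vector of one protein sequence (e.g. 1,900 dimensions).
    rng = np.random.default_rng(2)
    family_a = rng.normal(loc=-1.0, scale=0.1, size=(5, 1900))   # simulated family 1
    family_b = rng.normal(loc=1.0, scale=0.1, size=(5, 1900))    # simulated family 2
    reps = np.vstack([family_a, family_b])

    # Unsupervised grouping: cosine distances between representation vectors,
    # followed by average-linkage hierarchical clustering into two groups.
    dists = pdist(reps, metric="cosine")
    tree = linkage(dists, method="average")
    labels = fcluster(tree, t=2, criterion="maxclust")
    print(labels)   # members of the same simulated family should share a label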


Results

An mLSTM learns semantically rich representations from a massive sequence dataset

Multiplicative long short-term memory (mLSTM) RNNs learn rich representations for natural language, enabling state-of-the-art performance on critical tasks[23].

On all nine DMS datasets, UniRep Fusion-based models achieved superior test-set performance, outperforming a comprehensive suite of baselines including a state-of-the-art Doc2Vec representation (Fig. 3e and Supplementary Tables 5 and 6). This is surprising given that these proteins share little sequence similarity (408 mutations apart on average), are derived from six different organisms, range in size (264–724 aa), vary from near-universal (hsp90) to organism-specific (Gb1) and take part in diverse biological processes (for example, catalysis, DNA binding, molecular sensing and protein chaperoning)[35].

We compared the Evotuned UniRep to the untuned, global UniRep as well as a randomly initialized UniRep architecture trained only on local evolutionary data (Evotuned Random; see Fig. 4a and Methods). Using these trained unsupervised models, we generated representations for the avGFP variant sequences from Sarkisyan et al.[37] and trained simple sparse linear regression top models on each to predict avGFP brightness. The best alternative approach examined here would be 100× more expensive (Supplementary Fig. 12).
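A minimal sketch of the "simple sparse linear regression top model" pattern described above, assuming representation vectors have already been computed for each variant: scikit-learn's cross-validated Lasso is one way to fit such a sparse model, with random stand-ins here in place of real representations and brightness measurements.

    import numpy as np
    from sklearn.linear_model import LassoCV
    from sklearn.model_selection import train_test_split

    # Stand-ins: each row of X would be the representation vector of one
    # avGFP variant, and y its measured brightness; here both are simulated.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 1900))
    true_w = np.zeros(1900)
    true_w[:20] = 1.0                 # only a few informative dimensions
    y = X @ true_w + rng.normal(scale=0.5, size=500)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Sparse linear top model: the L1 penalty (chosen by cross-validation)
    # keeps only a small subset of representation dimensions.
    model = LassoCV(cv=5).fit(X_train, y_train)
    print("test R^2:", round(model.score(X_test, y_test), 3))
    print("nonzero coefficients:", int(np.count_nonzero(model.coef_)))

Swapping the global representation for an evotuned one, or concatenating several representation variants, only changes how X is built; the simple top model stays the same.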


