Abstract

Protein sequences contain rich information about protein evolution, fitness landscapes, and stability. Here we investigate how latent space models trained using variational auto-encoders can infer these properties from sequences. Using both simulated and real sequences, we show that the low-dimensional latent space representation of sequences, calculated using the encoder model, captures both evolutionary and ancestral relationships between sequences. Together with experimental fitness data and Gaussian process regression, the latent space representation also enables learning the protein fitness landscape in a continuous low-dimensional space. The model is also useful for predicting protein mutational stability landscapes and for quantifying the importance of stability in shaping protein evolution. Overall, we illustrate that latent space models learned using variational auto-encoders provide a mechanism for exploring the rich data contained in protein sequences regarding evolution, fitness, and stability, and hence are well suited to help guide protein engineering efforts.

Highlights

  • Protein sequences contain rich information about protein evolution, fitness landscapes, and stability

  • Thousands of sequences from different species are available, and these sequences can be aligned to construct multiple sequence alignments (MSAs)[2]

  • The naturally occurring, diverse protein sequences in a protein family’s MSA, which function in a diverse set of environments, are the result of mutation and selection occurring during the process of protein evolution

Summary

Introduction

Protein sequences contain rich information about protein evolution, fitness landscapes, and stability. The major task in phylogeny reconstruction is to infer the phylogenetic tree, using either maximum likelihood methods or Bayesian approaches[18,19]. Multiple algorithms for this purpose have been developed and are widely used in a number of applications[20,21,22,23,24]. Direct coupling analysis (DCA) methods, by contrast, model the distribution of sequences directly rather than assuming an underlying latent process that generates the sequences, as phylogeny reconstruction does. As a result, DCA methods cannot infer phylogenetic relationships between sequences. In addition, a DCA model with third-order epistasis would have too many parameters to fit given current sequence availability.
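
The parameter-count argument can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper: it assumes a Potts-style model with one field term per position and state, one coupling per pair of positions and pair of states, and one coupling per triple of positions and triple of states. The alphabet size q = 21 (20 amino acids plus a gap symbol) and the alignment length of 100 columns are hypothetical choices, and no gauge fixing is applied, so the numbers are upper bounds.

```python
from math import comb


def potts_parameter_counts(length: int, q: int = 21) -> dict:
    """Count parameters by interaction order for a Potts-style model.

    length: number of aligned positions (hypothetical value below).
    q: number of states per position (assumed 20 amino acids + gap).
    Counts are raw, without gauge fixing, so they are upper bounds.
    """
    return {
        # one field per (position, state)
        "fields": length * q,
        # one coupling per (pair of positions, pair of states)
        "pairwise": comb(length, 2) * q**2,
        # one coupling per (triple of positions, triple of states)
        "third_order": comb(length, 3) * q**3,
    }


counts = potts_parameter_counts(100)
for order, n in counts.items():
    print(f"{order:>12}: {n:,}")
```

Under these assumptions, a 100-column alignment already needs about 2.2 million pairwise couplings, while the third-order terms alone number roughly 1.5 billion, far more than the thousands of sequences typically available for a protein family.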

