Abstract

Motivation: Shedding light on the relationships between protein sequences and functions is a challenging task with many implications for protein evolution, the understanding of diseases, and protein design. The mapping of the protein sequence space to specific functions is, however, hard to comprehend because of its complexity. Generative models help to decipher complex systems thanks to their ability to learn and recreate the specific features of the data. Applied to proteins, they can capture the sequence patterns associated with functions and point out important relationships between sequence positions. By learning these dependencies between sequences and functions, they can ultimately be used to generate new sequences and navigate through uncharted areas of molecular evolution.

Results: This study presents an Adversarial Auto-Encoder (AAE) approach, an unsupervised generative model, to generate new protein sequences. AAEs are tested on three protein families known for their multiple functions: the sulfatase, the HUP and the TPP families. Clustering of the encoded sequences in the latent space computed by the AAEs displays a high level of homogeneity with respect to protein function. The study also reports and analyzes, for the first time, two sampling strategies based on latent space interpolation and latent space arithmetic. These strategies generate intermediate protein sequences that share sequence properties of the original sequences, properties that are linked to known functions and come from different families and functions. Sequences generated by interpolation between latent space data points demonstrate the ability of the AAE to generalize and produce biologically meaningful sequences from an evolutionarily uncharted area of the sequence space. Finally, 3D structure models computed by comparative modelling using generated sequences and templates of different sub-families point to the ability of latent space arithmetic to successfully transfer protein sequence properties linked to function between different sub-families. All in all, this study confirms the ability of deep learning frameworks to model biological complexity and brings new tools to explore amino acid sequence and functional spaces.

Highlights

  • Protein diversity, regarding sequence, structure, or function, is the result of a long evolutionary process

  • This study explored arithmetic operations with protein sequences encoded in their latent space to generate new protein sequences

  • Previous works based on Variational Autoencoders (VAE) have successfully reported the ability of this deep learning framework to model protein sequence and functional spaces (Sinai et al, 2017), predict amino acid fitness impact (Hopf et al, 2017; Riesselman, Ingraham & Marks, 2018), look into protein evolution (Ding, Zou & Brooks, 2019) or design new proteins (Greener, Moffat & Jones, 2018)
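
The two sampling strategies named above, latent space interpolation and latent space arithmetic, can be illustrated with a minimal sketch. It is not the study's implementation: the function names `interpolate` and `transfer`, the 8-dimensional latent size, and the centroid-difference rule are assumptions chosen for illustration, and random vectors stand in for sequences encoded by a trained AAE. In practice, each resulting latent point would be passed to the decoder to obtain an amino acid sequence.

```python
import numpy as np


def interpolate(z_start, z_end, steps):
    """Return `steps` latent points evenly spaced on the straight line
    from z_start to z_end, endpoints included."""
    weights = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - weights) * z_start + weights * z_end


def transfer(z_query, z_source_family, z_target_family):
    """Latent arithmetic (assumed rule): shift a query point by the
    difference between the target and source family centroids."""
    source_centroid = z_source_family.mean(axis=0)
    target_centroid = z_target_family.mean(axis=0)
    return z_query - source_centroid + target_centroid


# Hypothetical stand-ins for encoded sequences (latent dimension 8).
rng = np.random.default_rng(0)
z_a = rng.normal(size=8)
z_b = rng.normal(size=8)
family_a = rng.normal(loc=-1.0, size=(20, 8))
family_b = rng.normal(loc=1.0, size=(20, 8))

path = interpolate(z_a, z_b, steps=5)               # 5 points between z_a and z_b
shifted = transfer(family_a[0], family_a, family_b)  # one member moved toward family_b
print(path.shape, shifted.shape)
```

Linear interpolation is the simplest choice here; other paths through the latent space are possible, and which one suits a given prior distribution is a separate design decision.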


Summary

Introduction

Protein diversity, regarding sequence, structure, or function, is the result of a long evolutionary process. This study navigates the amino acid sequence space between functional proteins using a deep learning framework. The observed protein sequences are referred to as the amino acid sequence space. Classifying amino acid sequences into protein domain families makes it possible to organize the sequence space and reduce its complexity.

