Navigating the amino acid sequence space between functional proteins using a deep learning framework.

Tristan Bitard-Feildel

doi:10.7717/peerj-cs.684

Abstract

Motivation: Shedding light on the relationships between protein sequences and functions is a challenging task with many implications in protein evolution, disease understanding, and protein design. The mapping of the protein sequence space to specific functions is, however, hard to comprehend due to its complexity. Generative models help to decipher complex systems thanks to their ability to learn and recreate data specificity. Applied to proteins, they can capture the sequence patterns associated with functions and point out important relationships between sequence positions. By learning these dependencies between sequences and functions, they can ultimately be used to generate new sequences and navigate through uncharted areas of molecular evolution.
Results: This study presents an Adversarial Auto-Encoder (AAE) approach, an unsupervised generative model, to generate new protein sequences. AAEs are tested on three protein families known for their multiple functions: the sulfatase, HUP and TPP families. Clustering results on the encoded sequences from the latent space computed by AAEs display a high level of homogeneity regarding the protein sequence functions. The study also reports and analyzes for the first time two sampling strategies, based on latent space interpolation and latent space arithmetic, to generate intermediate protein sequences sharing sequential properties of original sequences linked to known functional properties issued from different families and functions. Sequences generated by interpolation between latent space data points demonstrate the ability of the AAE to generalize and produce meaningful biological sequences from an evolutionarily uncharted area of the biological sequence space. Finally, 3D structure models computed by comparative modelling using generated sequences and templates of different sub-families point to the ability of the latent space arithmetic to successfully transfer protein sequence properties linked to function between different sub-families. All in all, this study confirms the ability of deep learning frameworks to model biological complexity and to bring new tools to explore amino acid sequence and functional spaces.
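
The two sampling strategies named in the abstract, latent space interpolation and latent space arithmetic, can be illustrated with a minimal NumPy sketch. It is not the author's implementation: the latent vectors are random placeholders rather than AAE encodings, the function names `interpolate` and `transfer` are hypothetical, and the centroid-shift formula used for the arithmetic is an assumption, since the exact operations are not given in this text. Decoding the resulting latent points back into amino acid sequences would require the trained decoder and is omitted.

```python
import numpy as np


def interpolate(z_start, z_end, n_steps):
    """Return n_steps latent points evenly spaced on the straight line
    from z_start to z_end, both endpoints included."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]  # column of mixing weights
    return (1.0 - t) * z_start + t * z_end


def transfer(z_query, z_source_family, z_target_family):
    """Shift z_query by the difference between the target and source
    family centroids (assumed form of the latent space arithmetic)."""
    return z_query - z_source_family.mean(axis=0) + z_target_family.mean(axis=0)


# Placeholder latent encodings for two sub-families (random, for illustration only).
rng = np.random.default_rng(0)
latent_dim = 4
family_a = rng.normal(loc=-1.0, size=(5, latent_dim))
family_b = rng.normal(loc=1.0, size=(5, latent_dim))

# Five points on the path between one member of each sub-family.
path = interpolate(family_a[0], family_b[0], n_steps=5)

# One member of family A moved toward family B by the centroid difference.
shifted = transfer(family_a[0], family_a, family_b)
```

In such a setup, each row of `path` and the vector `shifted` would be passed to the decoder to obtain candidate sequences.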

Highlights

Protein diversity, regarding sequence, structure, or function, is the result of a long evolutionary process
This study explored arithmetic operations with protein sequences encoded in their latent space to generate new protein sequences
Previous works based on the Variational Autoencoder (VAE) have successfully reported the ability of this deep learning framework to model protein sequence and functional spaces (Sinai et al., 2017), predict amino acid fitness impact (Hopf et al., 2017; Riesselman, Ingraham & Marks, 2018), look into protein evolution (Ding, Zou & Brooks, 2019) or design new proteins (Greener, Moffat & Jones, 2018)

Summary

Introduction

Protein diversity, regarding sequence, structure, or function, is the result of a long evolutionary process. The observed sequences are referred to as the amino acid sequence space. The classification of amino acid sequences into protein domain families makes it possible to organize the sequence space and reduce its complexity.

Methods

Results

Discussion

Conclusion

Journal: PeerJ Computer Science	Publication Date: Sep 17, 2021
Citations: 3	License type: CC BY 4.0
