Expanding the Vocabulary of a Protein: Application of Subword Algorithms to Protein Sequence Modelling.

Mark Lennox, Neil Robertson, Barry Devereux

doi:10.1109/embc44109.2020.9176380

Abstract

Deep learning has proven to be a useful tool for modelling protein properties. However, given the variability in the length of proteins, it can be difficult to summarise the sequence of amino acids effectively. In many cases, as a result of using fixed-length representations, information about long proteins can be lost through truncation, or model training can be slow due to the use of excessive padding. In this work, we aim to overcome these problems by expanding upon the original vocabulary used to represent the protein sequence. To this end, we utilise two prominent subword algorithms that have been previously used to reach state-of-the-art results in various Natural Language Processing tasks. The algorithms are used to encode the original protein sequence into a set of subsequences before they are analysed by a Doc2Vec model. The pre-trained encodings produced by each algorithm are tested on a variety of downstream tasks: four protein property prediction tasks (plasma membrane localization, thermostability, peak absorption wavelength, enantioselectivity) as well as drug-target affinity prediction tasks over two datasets. Our results significantly improve on the state-of-the-art for these tasks, demonstrating the benefits of using subword compression algorithms for modelling proteins.
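As a rough illustration of the encoding step described in the abstract, the sketch below learns a small subword vocabulary from amino-acid strings and uses it to compress a sequence into fewer, longer tokens. The abstract does not name the two algorithms used, so byte-pair encoding (BPE) is assumed here as one well-known subword algorithm. The function names (`learn_bpe`, `merge_pair`, `encode`) and the toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merge rules, starting from single residues."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        if pair_counts[best] < 2:
            break  # nothing repeats, so further merges add no compression
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def encode(sequence, merges):
    """Apply the learned merges, in order, to a new amino-acid sequence."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Toy, made-up sequences (assumption: not real proteins from the paper's data).
toy_sequences = ["MKTAYIAKQR", "MKTAYIAKQL", "MKTWYIAKQR"]
merges = learn_bpe(toy_sequences, num_merges=5)
tokens = encode(toy_sequences[0], merges)
print(tokens)
```

Each merge shortens the token sequence, which is how a larger vocabulary reduces the need for truncation or padding. In a pipeline like the one the abstract outlines, token lists of this kind would play the role of the "words" in a document passed to a Doc2Vec model.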

Journal: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
Publication Date: Jul 1, 2020
Citations: 32
License type: cc-by-nd
