Abstract

In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework that allows for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or over a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning task in protein bioinformatics. In particular, here we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate it for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct, experimentally verified motif discovery problems, (2) a classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for the nuclear localization signal. DiMotif generally obtained high recall scores while having an F1 score comparable to other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short-lists for further experimental investigation of motifs. In the classification-based evaluation, the extracted motifs could reliably detect integrin, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend k-mer-based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new embedding method can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when combined with raw amino acid k-mer features.
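To make the described procedure concrete, the sketch below shows BPE-style merge learning over a protein corpus and its application to an unseen sequence. This is only an illustrative sketch: the function names (learn_ppe_merges, merge_pair, segment) are my own, and the prefix-sampling used to obtain multiple segmentations is a stand-in assumption for the paper's sampling framework, not the authors' implementation.

    from collections import Counter
    import random

    def learn_ppe_merges(sequences, num_merges):
        """Learn BPE-style merge operations over a corpus of protein sequences.

        Each sequence starts as a list of single amino-acid symbols; at each
        step the most frequent adjacent symbol pair is merged into a new,
        longer symbol (an assumed, minimal version of PPE training).
        """
        corpus = [list(seq) for seq in sequences]
        merges = []
        for _ in range(num_merges):
            pair_counts = Counter()
            for symbols in corpus:
                for a, b in zip(symbols, symbols[1:]):
                    pair_counts[(a, b)] += 1
            if not pair_counts:
                break
            best = pair_counts.most_common(1)[0][0]
            merges.append(best)
            corpus = [merge_pair(symbols, best) for symbols in corpus]
        return merges

    def merge_pair(symbols, pair):
        """Replace every occurrence of the adjacent pair with its concatenation."""
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        return out

    def segment(sequence, merges, sample=False, rng=random):
        """Segment an unseen sequence with the learned merge operations.

        With sample=True only a random prefix of the merge list is applied,
        yielding one of several possible segmentations; this is merely an
        assumed approximation of the paper's sampling framework.
        """
        ops = merges[: rng.randint(0, len(merges))] if sample else merges
        symbols = list(sequence)
        for pair in ops:
            symbols = merge_pair(symbols, pair)
        return symbols

In practice the merges would be learned on a large corpus such as Swiss-Prot or a domain-specific dataset and then reused on new sequences, e.g. segment("MKTAYIAKQR", merges) for a deterministic segmentation or segment("MKTAYIAKQR", merges, sample=True) to draw one of several possible segmentations.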

Highlights

  • Bioinformatics and natural language processing (NLP) are research areas that have greatly benefited from each other since their beginnings, and there have always been methodological exchanges between them

  • We took the idea of peptide-pair encoding (PPE) from the byte-pair encoding (BPE) algorithm, a text compression algorithm introduced in 1994[17] that has been used for compressed pattern matching in genomics[18]

  • In contrast to the use of BPE in NLP for vocabulary size reduction, we used this idea to expand the symbol set from the 20 amino acids to a large set of variable-length frequent sub-sequences, which are potentially meaningful in bioinformatics tasks



Introduction

Bioinformatics and natural language processing (NLP) are research areas that have greatly benefited from each other since their beginnings, and there have always been methodological exchanges between them. One of the apparent differences between biological sequences and many natural languages is that biological sequences (DNA, RNA, and proteins) often do not contain clear segmentation boundaries, unlike the tokenizable words of many natural languages. This uncertainty in the segmentation of sequences has made overlapping k-mers one of the most popular representations in machine learning for all areas of bioinformatics research, including proteomics[5,9], genomics[10,11], epigenomics[12,13], and metagenomics[14,15]. In contrast to the use of BPE in NLP for vocabulary size reduction, we used this idea to expand the symbol set from the 20 amino acids to a large set of variable-length frequent sub-sequences, which are potentially meaningful in bioinformatics tasks. Some examples of discriminative motif miners are DEME[28] (using a Bayesian framework over alignment columns), MotifHound[29] (a hypergeometric test on certain regular expressions in the input data), and DLocalMotif[30] (combining motif over-representation, entropy, and spatial confinement in motif scoring).
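For reference, the overlapping k-mer representation mentioned above can be produced in a few lines; this is a minimal illustrative sketch, and the function name and default k are my own choices rather than anything prescribed by the paper.

    def overlapping_kmers(sequence: str, k: int = 3) -> list[str]:
        """Return all overlapping k-mers (stride 1) of a protein sequence.

        Example: overlapping_kmers("MKTAYI", k=3) -> ["MKT", "KTA", "TAY", "AYI"]
        """
        return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

In contrast, the PPE segmentation sketched earlier yields non-overlapping, variable-length units whose boundaries are determined by the learned merge operations rather than by a fixed window size.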
