Abstract

The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years. This scheme requires handling proteins of different lengths, while deep learning models require same-shape input. To reconcile the two, zeros are usually appended to each sequence up to an established common length in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is still unknown. We propose and implement four novel ways of padding the amino acid sequences, and we then analyse the impact of the different padding strategies in a hierarchical Enzyme Commission number prediction problem. Results show that padding affects model performance even when convolutional layers are involved. In contrast to most deep learning works, which focus mainly on architectures, this study highlights the relevance of the deemed-of-low-importance process of padding and raises awareness of the need to refine it for better performance. The code of this analysis is publicly available at https://github.com/b2slab/padding_benchmark.
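The four novel padding strategies proposed by the authors are not detailed in this excerpt; as a point of reference, a minimal sketch of the two conventional strategies (post-padding, i.e. zero-padding, and pre-padding) on integer-encoded sequences might look as follows. The helper `pad_sequence` is hypothetical, not the repository's code.

```python
def pad_sequence(seq, length, mode="post", pad_value=0):
    """Pad (or truncate) an integer-encoded sequence to a fixed length.

    mode="post" appends the pad value after the sequence (the usual
    zero-padding); mode="pre" prepends it instead.
    """
    seq = list(seq)[:length]                      # truncate overly long sequences
    padding = [pad_value] * (length - len(seq))
    return seq + padding if mode == "post" else padding + seq

# Two integer-encoded proteins brought to a common length of 6
print(pad_sequence([4, 7, 1], 6))                # → [4, 7, 1, 0, 0, 0]
print(pad_sequence([4, 7, 1], 6, mode="pre"))    # → [0, 0, 0, 4, 7, 1]
```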

Highlights

  • The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years

  • The specific deep learning (DL) architectures able to leverage the inner structure of sequential biological data are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN)

  • We evaluate this effect on three different DL architectures: feed-forward neural networks alone, feed-forward neural networks coupled with a single convolutional layer (1_conv), and feed-forward neural networks coupled with a stack of convolutional layers
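A toy 1D convolution in plain Python (not the paper's code) illustrates why padding can still matter when convolutional layers are involved: the zeros enter the sliding window at the boundary, so pre- and post-padded versions of the same sequence yield different activations near the edges.

```python
def conv1d(xs, kernel):
    """Valid-mode 1D convolution (cross-correlation) of a sequence with a kernel."""
    k = len(kernel)
    return [sum(x * w for x, w in zip(xs[i:i + k], kernel))
            for i in range(len(xs) - k + 1)]

signal = [1, 2, 3]
kernel = [1, 1]
print(conv1d(signal + [0, 0], kernel))  # post-padding: [3, 5, 3, 0]
print(conv1d([0, 0] + signal, kernel))  # pre-padding:  [0, 1, 3, 5]
```

The interior responses (3 and 5) are identical in both cases, reflecting translational invariance, but the boundary positions differ, which is where the choice of padding strategy can influence what the network learns.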


Introduction

The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years. CNNs provide translational invariance[10] and can be used to find relevant patterns with biological meaning[5,8,11,12]. For their part, bidirectional RNNs (and the derived Long Short-Term Memory and Gated Recurrent Units) are appropriate for modelling biological sequences, since they are suited to data with a sequential but non-causal structure, variable length, and long-range dependencies[13,14,15,16]. A comprehensive review and assessment of different amino acid encoding methods[19] shows that the position-specific scoring matrix (PSSM), an evolution-based, position-dependent methodology, achieves the best performance on protein secondary structure prediction and protein fold recognition tasks. Under a one-hot encoding, a protein of length L is represented by an (n + 1) × L binary matrix, where n is the size of the amino acid alphabet and the extra row accounts for the padding symbol.
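A minimal sketch of such a one-hot encoding in plain Python (illustrative only; the alphabet of 20 standard residues and the helper `one_hot_encode` are assumptions, not the paper's implementation):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, so n = 20

def one_hot_encode(seq, length, alphabet=AMINO_ACIDS):
    """Encode a protein as an (n + 1) x L binary matrix.

    Rows 0..n-1 correspond to the amino acid alphabet; row n marks padding
    positions, so every column contains exactly one 1.
    """
    n = len(alphabet)
    index = {aa: i for i, aa in enumerate(alphabet)}
    matrix = [[0] * length for _ in range(n + 1)]
    for j in range(length):
        row = index[seq[j]] if j < len(seq) else n  # padding row for j >= len(seq)
        matrix[row][j] = 1
    return matrix

m = one_hot_encode("ACD", 5)
# m has 21 rows and 5 columns; columns 3 and 4 have their 1 in the padding row
```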

