Abstract

The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years. This scheme requires handling proteins of different lengths, while deep learning models require same-shape input. To reconcile the two, zeros are usually appended to each sequence up to an established common length in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is still unknown. We propose and implement four novel ways of padding the amino acid sequences, and we then analyse the impact of the different padding strategies in a hierarchical Enzyme Commission number prediction problem. Results show that padding affects model performance even when convolutional layers are involved. In contrast to most deep learning works, which focus mainly on architectures, this study highlights the relevance of the deemed-of-low-importance process of padding and raises awareness of the need to refine it for better performance. The code of this analysis is publicly available at https://github.com/b2slab/padding_benchmark.
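The four novel padding strategies proposed by the authors are not detailed in this excerpt; as a point of reference, a minimal sketch of the two conventional strategies (post-padding, i.e. zero-padding, and pre-padding) on integer-encoded sequences might look as follows. The helper `pad_sequence` is hypothetical, not the repository's code.

```python
def pad_sequence(seq, length, mode="post", pad_value=0):
    """Pad (or truncate) an integer-encoded sequence to a fixed length.

    mode="post" appends the pad value after the sequence (the usual
    zero-padding); mode="pre" prepends it instead.
    """
    seq = list(seq)[:length]                      # truncate overly long sequences
    padding = [pad_value] * (length - len(seq))
    return seq + padding if mode == "post" else padding + seq

# Two integer-encoded proteins brought to a common length of 6
print(pad_sequence([4, 7, 1], 6))                # → [4, 7, 1, 0, 0, 0]
print(pad_sequence([4, 7, 1], 6, mode="pre"))    # → [0, 0, 0, 4, 7, 1]
```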

Highlights

  • The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years

  • The specific deep learning (DL) architectures able to leverage the inner structure of sequential biological data are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN)

  • We evaluate this effect on three different DL architectures: feed-forward neural networks alone, feed-forward neural networks coupled with a single convolutional layer (1_conv), and feed-forward neural networks coupled with a stack of convolutional layers
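A toy 1D convolution in plain Python (not the paper's code) illustrates why padding can still matter when convolutional layers are involved: the zeros enter the sliding window at the boundary, so pre- and post-padded versions of the same sequence yield different activations near the edges.

```python
def conv1d(xs, kernel):
    """Valid-mode 1D convolution (cross-correlation) of a sequence with a kernel."""
    k = len(kernel)
    return [sum(x * w for x, w in zip(xs[i:i + k], kernel))
            for i in range(len(xs) - k + 1)]

signal = [1, 2, 3]
kernel = [1, 1]
print(conv1d(signal + [0, 0], kernel))  # post-padding: [3, 5, 3, 0]
print(conv1d([0, 0] + signal, kernel))  # pre-padding:  [0, 1, 3, 5]
```

The interior responses (3 and 5) are identical in both cases, reflecting translational invariance, but the boundary positions differ, which is where the choice of padding strategy can influence what the network learns.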


Introduction

The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years. CNNs provide translational invariance[10] and can be used to find relevant patterns with biological meaning[5,8,11,12]. For their part, bidirectional RNNs (and the derived Long Short-Term Memory and Gated Recurrent Units) are appropriate for modelling biological sequences, since they are suited to data with a sequential but non-causal structure, variable length, and long-range dependencies[13,14,15,16]. A comprehensive review and assessment of different amino acid encoding methods[19] shows that the position-specific scoring matrix (PSSM), an evolution-based, position-dependent methodology, achieves the best performance on protein secondary structure prediction and protein fold recognition tasks. Under a one-hot encoding, a protein of length L is represented by an (n + 1) × L binary matrix, where n is the size of the amino acid alphabet and the extra row accounts for the padding symbol.
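A minimal sketch of such a one-hot encoding in plain Python (illustrative only; the alphabet of 20 standard residues and the helper `one_hot_encode` are assumptions, not the paper's implementation):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, so n = 20

def one_hot_encode(seq, length, alphabet=AMINO_ACIDS):
    """Encode a protein as an (n + 1) x L binary matrix.

    Rows 0..n-1 correspond to the amino acid alphabet; row n marks padding
    positions, so every column contains exactly one 1.
    """
    n = len(alphabet)
    index = {aa: i for i, aa in enumerate(alphabet)}
    matrix = [[0] * length for _ in range(n + 1)]
    for j in range(length):
        row = index[seq[j]] if j < len(seq) else n  # padding row for j >= len(seq)
        matrix[row][j] = 1
    return matrix

m = one_hot_encode("ACD", 5)
# m has 21 rows and 5 columns; columns 3 and 4 have their 1 in the padding row
```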

