Abstract

Extracting features from complex sequential data for classification and prediction problems is an extremely difficult task. It becomes even more challenging when both the upstream and downstream information of a time series is needed to process the sequence at a specific time step. One typical problem in this category is Protein Secondary Structure Prediction (PSSP). Recurrent Neural Networks (RNNs) have been successful in handling sequential data, but they are demanding in terms of time and space efficiency. On the other hand, simple Feed-Forward Neural Networks (FFNNs) can be trained very quickly with the Backpropagation algorithm, but in practice they give poor results on this category of problems. The Hessian Free Optimization (HFO) algorithm is one of the latest developments in Artificial Neural Network (ANN) training algorithms and can converge faster and more accurately. In this paper, we present simple FFNNs trained with the powerful second-order HFO learning algorithm for the PSSP problem. In our approach, a single FFNN trained with the HFO learning algorithm achieves approximately 79.6% per-residue ( <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$Q_{3}$ </tex-math></inline-formula> ) accuracy on the PISCES dataset. Despite the simplicity of our method, the results are comparable to some of the state-of-the-art methods designed for this problem. A majority-voting ensemble method and filtering with Support Vector Machines have also been applied, increasing our results to 80.4% per-residue ( <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$Q_{3}$ </tex-math></inline-formula> ) accuracy. 
Finally, our method has been tested on the CASP13 independent dataset, achieving 78.14% per-residue ( <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$Q_{3}$ </tex-math></inline-formula> ) accuracy. Moreover, HFO does not require tuning of any parameters, which makes training much faster than other state-of-the-art methods; this is a very important feature with big datasets and facilitates fast training of FFNN ensembles.
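The figures above use the standard per-residue Q3 measure, and the ensemble result comes from majority voting over several trained networks. Below is a minimal sketch of both; the three-state labels (H = helix, E = strand, C = coil) are standard, but the string encoding and the tie-breaking rule are illustrative assumptions, not the paper's exact implementation.

```python
from collections import Counter

def q3(predicted, actual):
    # Per-residue Q3: fraction of residues whose predicted secondary-structure
    # state (H, E, or C) matches the observed state.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def majority_vote(member_predictions):
    # Combine per-residue predictions from several ensemble members by simple
    # majority voting; ties resolve to the label seen first at that position
    # (an assumption for this sketch).
    return "".join(
        Counter(states).most_common(1)[0][0]
        for states in zip(*member_predictions)
    )
```

For example, `majority_vote(["HEC", "HEE", "CEC"])` yields `"HEC"`, and `q3` compares that consensus string against the observed states residue by residue.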

Highlights

  • Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that gives a system the ability to learn from data without being explicitly programmed

  • The initial motivation for this work has been the need to build methods that are efficient in terms of accuracy, convergence time, and simplicity for sequential data where both the upstream and downstream information of a sequence is important at a specific time step, and in particular for the Protein Secondary Structure Prediction (PSSP) problem

  • We present a second-order methodology for training simple Feed-Forward Neural Networks (FFNNs) for the challenging PSSP problem, where both the upstream and downstream information of a sequence is important at a specific time step



Introduction

Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that gives a system the ability to learn from data without being explicitly programmed. ML methods consist of a parameterized learnable model and a learning algorithm [1], [2]. These methods are mostly used for classification and prediction on static and sequential data. Even though a number of theoretical ML algorithms have been designed to process and make predictions on sequential data, the mining of such data types is still an open field of research due to its complexity. The design of optimisation algorithms for specific ML techniques for sequential data must take into account how to (a) capture and exploit sequential correlations, (b) represent and incorporate loss functions, (c) identify long-distance dependencies, and (d) make the optimisation algorithm fast [3]. Successful ANN models designed to handle sequential data are the classes of Recurrent Neural Networks (RNNs) [4] and Deep Learning (DL) methods [5]. RNNs are universal approximators of dynamical systems [6] that can associate patterns located far away from each other on a sequence and create a kind of short and long range
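As a minimal illustration of how a plain FFNN can nonetheless see both upstream and downstream context, a common encoding slides a fixed window over per-residue profile features and feeds the flattened window to the network. The window width and the zero-padding at the chain ends below are assumptions for illustration, not the paper's exact encoding.

```python
import numpy as np

def window_features(profile, w=7):
    # Build FFNN inputs by sliding a window of w residues (w odd) over a
    # per-residue profile of shape (L, d). Positions beyond either end of
    # the chain are zero-padded, so every input vector carries both
    # upstream and downstream context for its centre residue.
    L, d = profile.shape
    half = w // 2
    padded = np.vstack([np.zeros((half, d)), profile, np.zeros((half, d))])
    return np.stack([padded[i:i + w].reshape(-1) for i in range(L)])
```

For a chain of L residues with d profile features each, this produces an (L, w*d) input matrix, one row per residue to classify.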

