Addressing One-Dimensional Protein Structure Prediction Problems with Machine Learning Techniques

Rhys Heffernan

doi:10.25904/1912/3298

Abstract

In this thesis we tackle the protein structure prediction subproblems listed previously by applying state-of-the-art deep learning techniques. Chapter 2 presents the method SPIDER, in which deep learning is applied iteratively to the task of predicting backbone torsion angles φ and ψ, and dihedral angles θ and τ, using evolutionary-derived sequence profiles and physicochemical properties of amino acid residues. This work is the first method for the sequence-based prediction of θ and τ angles. Chapter 3 presents the method SPIDER2, which takes the iterative deep learning applied in SPIDER and extends it to the prediction of three-state secondary structure, solvent accessible surface area, and φ, ψ, θ, and τ angles, achieving the best reported prediction accuracies for all of them (at the date of publication). Chapter 4 builds on the previous chapters by adding the prediction of half sphere exposure (both Cα and Cβ based) and contact numbers to SPIDER2, in a method called SPIDER2-HSE. In Chapter 5, Long Short-Term Memory Bidirectional Recurrent Neural Networks (LSTM-BRNNs) are applied to the prediction of three-state secondary structure, solvent accessible surface area, φ, ψ, θ, and τ angles, as well as half sphere exposure and contact numbers. Previous methods for these predictions (including SPIDER2) were typically window based. That is, the input data made available to the model for a given residue comprises information for only that residue and a number of residues on either side in the sequence (in the range of 10-20 residues on each side). The use of LSTM-BRNNs in this method, SPIDER3, allows it to better learn both long- and short-range interactions within proteins. This advancement again led to the best reported accuracies for all predicted structural properties.
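The window-based input described above can be illustrated with a minimal sketch. The function name `window_features`, the parameter `half_window`, and the zero-padding at the sequence termini are assumptions made for illustration, not details taken from SPIDER2 or any other specific method.

```python
def window_features(per_residue, half_window):
    """Build one input row per residue from that residue and its neighbours.

    per_residue: list of per-residue feature lists (all the same length).
    half_window: number of residues taken on each side (hypothetical name).
    Positions beyond either terminus are zero-padded (an assumption).
    """
    n_feat = len(per_residue[0])
    pad = [[0.0] * n_feat for _ in range(half_window)]
    padded = pad + [list(r) for r in per_residue] + pad
    width = 2 * half_window + 1

    rows = []
    for i in range(len(per_residue)):
        row = []
        # Concatenate the features of the residues inside the window.
        for residue in padded[i:i + width]:
            row.extend(residue)
        rows.append(row)
    return rows


# Toy protein: 6 residues with 2 features each.
profile = [[float(2 * i), float(2 * i + 1)] for i in range(6)]
X = window_features(profile, half_window=2)
```

Each row of `X` holds `(2 * half_window + 1) * n_feat` values, so a residue further than `half_window` positions away never contributes to a given row. This is the limitation that a recurrent model reading the whole sequence avoids.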
In Chapter 6, the LSTM-BRNN model used in SPIDER3 is applied to the prediction of the same structural properties, plus eight-state secondary structure, using only single-sequence inputs. That is, structural properties are predicted without using any evolutionary information. The result is a method that provides not only the best reported single-sequence secondary structure and solvent accessible surface area predictions, but also the first reported single-sequence-based prediction of half sphere exposure, contact numbers, and φ, ψ, θ, and τ angles. This study is important because most proteins have few homologous sequences, and their evolutionary profiles are inaccurate and time-consuming to calculate. This single-sequence-based technique allows for fast genome-scale screening analysis of protein one-dimensional structural properties.
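As a rough illustration of what a single-sequence input can look like, the sketch below encodes each residue as a one-hot vector over the 20 standard amino acids, with no evolutionary profile. The abstract does not state which encoding is used, so the one-hot scheme, the alphabet ordering, and the name `one_hot_encode` are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; ordering is arbitrary


def one_hot_encode(sequence):
    """Return one 20-element 0/1 vector per residue of the sequence.

    Residues outside the standard alphabet (e.g. 'X') map to an
    all-zero vector; this handling is an assumption.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for aa in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        if aa in index:
            row[index[aa]] = 1
        encoded.append(row)
    return encoded


features = one_hot_encode("MKV")
```

Because this encoding needs only the sequence itself, it can be computed immediately for every protein in a genome, with no homology search.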
