A multi-stage protein secondary structure prediction system using machine learning and information theory

Masood Zamani, Stefan C Kremer

doi:10.1109/bibm.2015.7359867

Abstract

In this paper, we evaluate the performance of a multi-stage protein secondary structure (PSS) prediction model. The proposed classifier uses statistical information and protein profiles. The statistical information is derived from protein sequences and structures using k-means clustering and information theory. In the first stage, a feed-forward artificial neural network maps a sequence fragment to a region of the Ramachandran plot (2D plot). A score vector is then constructed for the mapped region from the clustering and statistical information. The score vector represents the tendency of an identified region in the 2D plot to pair with each secondary structure for a residue. The score vectors, which serve as input to the second stage, have fewer dimensions than the input vectors commonly derived from protein sequences or profile information. In the second stage, a two-tier classifier based on an artificial neural network and a genetic programming (GP) method is employed. The GP method uses IF rules for three-state classification. The performance of the two-tier classifier is compared with that of two-tier artificial neural networks (ANNs) and support vector machines (SVMs). The prediction method is evaluated on a common protein dataset, RS126, and its performance is measured by the Q3 and segment overlap (SOV) scores. The proposed PSS prediction model improves the Q3 score by more than 3% and the SOV score by 2% compared with the two-tier ANN and SVM architectures.
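
The Q3 score mentioned above is conventionally the percentage of residues whose three-state label (helix, strand, coil) is predicted correctly. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `q3_score` and the single-letter H/E/C string encoding are assumptions made for illustration.

```python
def q3_score(observed: str, predicted: str) -> float:
    """Return the percentage of positions where the predicted state matches.

    Both arguments are strings of per-residue labels, assumed here to use
    H (helix), E (strand), and C (coil). This encoding is an assumption.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Count residues where the predicted state equals the observed state.
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Hypothetical 10-residue example: 8 of 10 states agree.
observed = "CHHHHEEECC"
predicted = "CHHHCEEEEC"
print(q3_score(observed, predicted))  # 80.0
```

Q3 treats every residue independently, which is why it is usually reported alongside a segment-level measure such as SOV.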

Full Text