Abstract

For the last few decades, the sequence arrangement of amino acids has been utilized for the prediction of protein secondary structure. Recent methods apply high-dimensional natural-language-based features in machine learning models. The performance of such models is significantly affected by data size and dimensionality, and it is a considerable challenge to develop a generic model that can be trained on both small and large datasets in a low-dimensional framework. In the present research, we propose a low-dimensional representation suitable for both small and large datasets. A hybrid space of Atchley's factors II, IV, and V, the electron-ion interaction potential, and SkipGram-based word2vec is employed for amino acid sequence representation. The Stockwell transform is then applied to this representation to preserve features in both the time and frequency domains. Finally, a deep gated recurrent network with dropout, categorical cross-entropy loss, and Adam optimization is used for classification. The proposed method yields better prediction accuracies on both small (204, 277, and 498) and large (PDB25, Protein 640, and FC699) benchmark datasets of low sequence similarity (25–40%). The obtained classification accuracies for the PDB25, 640, FC699, 498, 277, and 204 datasets are 84.2%, 94.31%, 93.1%, 95.9%, 94.5%, and 85.36%, respectively. The major contributions of this research are that, for the first time, we verify protein secondary structural class prediction in a very low-dimensional (18-D) feature space with a novel feature representation method, and, also for the first time, we verify the behaviour of deep networks on low-dimensional, small-sized datasets.
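The classifier stage described above (a gated recurrent network over 18-D per-residue feature vectors, ending in a softmax over structural classes) can be sketched as a minimal GRU forward pass. This is an illustrative sketch, not the authors' implementation: the 18-D input matches the paper's feature space, but the hidden size, the four structural classes (all-α, all-β, α/β, α+β), the omission of biases and dropout, and the random weights are all assumptions for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class GRUClassifier:
    """Minimal single-layer GRU followed by a softmax head.

    d_in=18 matches the paper's 18-D hybrid feature space; d_hid and
    n_classes=4 are illustrative assumptions. Biases, dropout, and
    training (categorical cross-entropy + Adam) are omitted for brevity.
    """

    def __init__(self, d_in=18, d_hid=32, n_classes=4, seed=0):
        rng = np.random.default_rng(seed)

        def W(rows, cols):
            return 0.1 * rng.standard_normal((rows, cols))

        # Input-to-hidden and hidden-to-hidden weights for each gate.
        self.Wz, self.Uz = W(d_in, d_hid), W(d_hid, d_hid)
        self.Wr, self.Ur = W(d_in, d_hid), W(d_hid, d_hid)
        self.Wh, self.Uh = W(d_in, d_hid), W(d_hid, d_hid)
        self.Wo = W(d_hid, n_classes)  # output projection
        self.d_hid = d_hid

    def forward(self, seq):
        # seq: (T, d_in) array, one feature vector per residue position.
        h = np.zeros(self.d_hid)
        for x in seq:
            z = sigmoid(x @ self.Wz + h @ self.Uz)        # update gate
            r = sigmoid(x @ self.Wr + h @ self.Ur)        # reset gate
            h_cand = np.tanh(x @ self.Wh + (r * h) @ self.Uh)
            h = (1.0 - z) * h + z * h_cand                # gated state update
        return softmax(h @ self.Wo)                       # class probabilities

# A toy sequence of 50 residues with random 18-D features.
model = GRUClassifier()
probs = model.forward(np.random.default_rng(1).standard_normal((50, 18)))
```

The final hidden state summarizes the whole sequence, so one softmax over it suffices for sequence-level (structural-class) prediction rather than per-residue labeling.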

