Abstract

Background: Hypothetical proteins (HPs) are those that are predicted to be expressed in an organism, but for which no evidence of existence is known. In the recent past, annotation and curation efforts have helped overcome the challenge of understanding their diverse functions. Researchers have developed techniques to decipher sequence-structure-function relationships, especially for functional modelling of HPs, but using these features as classifiers for HPs has not been attempted. With the rise in the number of annotation strategies, next-generation sequencing methods have furthered our understanding of the functions of HPs.

Results: In our previous work, we developed a six-point classification scoring schema with annotation pertaining to protein family scores, orthology, protein interaction/association studies, bidirectional best BLAST hits, sorting signals, and known databases and visualizers used to validate protein interactions. In this study, we introduced three more classifiers to our annotation system, viz. pseudogenes linked to HPs, homology modelling and non-coding RNAs associated with HPs. We discuss the challenges and performance of these classifiers using machine learning heuristics, with accuracy improving for Perceptron (81.08% to 97.67%), Naive Bayes (54.05% to 96.67%), Decision tree J48 (67.57% to 97.00%), and SMO_npolyk (59.46% to 96.67%).

Conclusion: With the introduction of three new classification features, the nine-point classification scoring schema functionally annotates HPs with improved accuracy.

Highlights

  • Hypothetical proteins (HPs) are those that are predicted to be expressed in an organism, but for which no evidence of existence is known

  • The additional features we employed indicate whether pseudogenes are linked to HPs, whether proteins are essentially structural ‘mers’ of the candidate proteins, and whether non-coding RNA signatures are present

  • When we analysed the data through 10-fold cross-validation using the Waikato Environment for Knowledge Analysis (WEKA) machine learning package, the decision tree (J48) yielded an accuracy of 97%; Support Vector Machines (SVM; SMO) scored 98%, 93% and 96% for the Poly, RBF and npolyk kernels respectively; the MLP reached 97.67% and Naive Bayes multinomial 98.33% (Table 2)
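
To make the evaluation protocol concrete, here is a minimal sketch of 10-fold cross-validation over nine binary classifier scores. It is not the authors' WEKA pipeline: it is plain Python, the feature names are illustrative labels for the nine classifiers, the data are synthetic, the toy labelling rule (five or more positive scores) is an assumption made only so the example runs, and a simple perceptron stands in for the WEKA learners.

```python
import random

# Illustrative names for the nine classifiers; the labels are assumptions.
FEATURES = [
    "protein_family", "orthology", "interaction", "best_blast_hit",
    "sorting_signal", "database_evidence",
    "pseudogene_link", "homology_model", "ncrna_association",
]


def toy_dataset(n, seed=0):
    """Synthetic binary score vectors with a made-up labelling rule."""
    rng = random.Random(seed)
    rows = [[rng.randint(0, 1) for _ in FEATURES] for _ in range(n)]
    # Assumption for illustration only: label 1 when at least 5 scores are 1.
    labels = [1 if sum(row) >= 5 else 0 for row in rows]
    return rows, labels


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_perceptron(rows, labels, epochs=20, lr=1.0):
    """Classic perceptron rule: update weights only on misclassified rows."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            if err:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def cross_validate(rows, labels, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, pool the accuracy."""
    correct = 0
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train = [i for i in range(len(rows)) if i not in held]
        model = train_perceptron([rows[i] for i in train],
                                 [labels[i] for i in train])
        correct += sum(predict(model, rows[i]) == labels[i] for i in held_out)
    return correct / len(rows)


if __name__ == "__main__":
    rows, labels = toy_dataset(200, seed=1)
    print(f"10-fold accuracy on toy data: {cross_validate(rows, labels):.2%}")
```

Each protein is scored exactly once while held out, so the pooled accuracy is measured on data the model did not see during training. The figure printed here describes only the toy data and says nothing about the accuracies reported in Table 2.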


Summary

Introduction

Hypothetical proteins (HPs) are those that are predicted to be expressed in an organism, but for which no evidence of existence is known. Genome-wide expression analysis, machine learning, data mining, deep learning and Markov random fields are other widely employed prediction methods [20, 21], whereas Support Vector Machines (SVM) [22], Neural Networks [23], Bayesian Networks [24, 25], Probabilistic Decision Trees [26], Rosetta Stone [14, 27], and Gene Clustering and Network Neighbourhood analyses [28] have been used to combine different biological data sources to interpret biological relationships. Although these have been shown to be successful in predicting protein function, annotation based on feature selection for inferring the function of HPs is wanting. There has been a steady increase in the use of machine learning and information-theoretic features to develop efficient frameworks for predicting interactions between proteins [28, 29, 30].

