Abstract

To counter the host RNA silencing defense mechanism, many plant viruses encode RNA silencing suppressor proteins. These groups of proteins share very low sequence and structural similarities among them, which consequently hamper their annotation using sequence similarity-based search methods. Alternatively the machine learning-based methods can become a suitable choice, but the optimal performance through machine learning-based methods is being affected by various factors such as class imbalance, incomplete learning, selection of inappropriate features, etc. In this paper, we have proposed a novel approach to deal with the class imbalance problem by finding the optimal class distribution for enhancing the prediction accuracy for the RNA silencing suppressors. The optimal class distribution was obtained using different resampling techniques with varying degrees of class distribution starting from natural distribution to ideal distribution, i.e., equal distribution. The experimental results support the fact that optimal class distribution plays an important role to achieve near perfect learning. The best prediction results are obtained with Sequential Minimal Optimization (SMO) learning algorithm. We could achieve a sensitivity of 98.5 %, specificity of 92.6 % with an overall accuracy of 95.3 % on a tenfold cross validation and is further validated using leave one out cross validation test. It was also observed that the machine learning models trained on oversampled training sets using synthetic minority oversampling technique (SMOTE) have relatively performed better than on both randomly undersampled and imbalanced training data sets. Further, we have characterized the important discriminatory sequence features of RNA-silencing suppressors which distinguish these groups of proteins from other protein families.Electronic supplementary materialThe online version of this article (doi:10.1007/s13205-016-0410-1) contains supplementary material, which is available to authorized users.

Highlights

  • RNA silencing is a common host defense mechanism in plants against many plant RNA/DNA viruses (Li et al 2014a; Perez-Canamas and Hernandez 2014; Valli et al 2001)

  • We have proposed a novel approach to deal with the class imbalance problem by finding the optimal class distribution for enhancing the prediction accuracy for the RNA silencing suppressors

  • We have experimented with four different machine learning algorithms, namely—(1) naive Bayes (NB), (2) Fischer linear discriminant function, (3) support vector machines with sequential minimization optimization (SMO) and (4) K nearest neighbor on the imbalanced dataset, randomly undersampled dataset and synthetic minority oversampling technique (SMOTE) oversampled dataset to find the optimal class distribution for each of these classifiers

Read more

Summary

Introduction

RNA silencing is a common host defense mechanism in plants against many plant RNA/DNA viruses (Li et al 2014a; Perez-Canamas and Hernandez 2014; Valli et al 2001). To counter the RNA silencing defense mechanism, these plant viruses encode RNA-silencing suppressors, which disturb the host RNA silencing pathway. The molecular basis for the mechanism of encoding RNA-silencing suppressors by these plant viruses is still largely unknown. Annotation of putative members of this family is hampered by the presence of high sequence diversity existing among these plant virus-encoded RNA-silencing suppressors (Qu and Morris 2005). In (Jagga and Gupta 2014) the shortcomings of sequence similarity-based search methods like PSI-BLAST in correctly annotating the members of this protein family are emphasized. Machine learning methods trained on mathematically represented suitable input feature vectors become a viable alternative to sequence similarity-based search methods. The true performance of machine learning methods is affected by various factors such as class imbalance (Nath and Subbiah 2015a), imperfect learning due to some missing example instances and selection of inappropriate input features

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.