Abstract

An ensemble classifier approach for microRNA precursor (pre-miRNA) classification was proposed based upon combining a set of heterogeneous algorithms including support vector machine (SVM), k-nearest neighbors (kNN) and random forest (RF), then aggregating their prediction through a voting system. Additionally, the proposed algorithm, the classification performance was also improved using discriminative features, self-containment and its derivatives, which have shown unique structural robustness characteristics of pre-miRNAs. These are applicable across different species. By applying preprocessing methods—both a correlation-based feature selection (CFS) with genetic algorithm (GA) search method and a modified-Synthetic Minority Oversampling Technique (SMOTE) bagging rebalancing method—improvement in the performance of this ensemble was observed. The overall prediction accuracies obtained via 10 runs of 5-fold cross validation (CV) was 96.54%, with sensitivity of 94.8% and specificity of 98.3%—this is better in trade-off sensitivity and specificity values than those of other state-of-the-art methods. The ensemble model was applied to animal, plant and virus pre-miRNA and achieved high accuracy, >93%. Exploiting the discriminative set of selected features also suggests that pre-miRNAs possess high intrinsic structural robustness as compared with other stem loops. Our heterogeneous ensemble method gave a relatively more reliable prediction than those using single classifiers. Our program is available at http://ncrna-pred.com/premiRNA.html.

Highlights

  • MicroRNAs are small endogenous non-coding RNAs (&19–25 nt)

  • The receiver operating characteristic (ROC) curve is a graphic visualization of the trade-off between the true positive and false positive rates for every possible cut off; we used an area under the ROC curve (AUC) to compare the performance of classifiers

  • The accuracy, commonly used measurement, is not an appropriate metric to evaluate the performance of a classifier in class-imbalanced data since the negative class in training data is much larger than the positive class

Read more

Summary

INTRODUCTION

MicroRNAs (miRNAs) are small endogenous non-coding RNAs (&19–25 nt). They play crucial roles in posttranscriptional regulation of gene expression of plants and animals [1]. Numerous de novo noncomparative methods for identifying pre-miRNA hairpins based on single ML algorithm have been proposed [8,9,10,11,12,13,14,15,16] For such methods, stem-loop structures are involved in prediction. Most ML-based methods rely on known pre-miRNA characteristics as features for training prediction models Among these specific features, hairpin secondary structure and minimum free energy (MFE) of stem-loop hairpins are considered as key features [4]. The problem of imbalanced data was solved by the modified-Synthetic Minority Oversampling Technique (SMOTE) bagging method This enhanced ensemble-based method would effectively differentiate pre-miRNA from non-miRNA sequences with higher accuracy and better balanced sensitivity and specificity score, across various organisms, making our model a useful tool for finding novel animal, plant and virus pre-miRNAs

MATERIALS AND METHODS
RESULTS AND DISCUSSION
Method
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call