Learning from positive examples when the negative class is undetermined - microRNA gene identification

Malik Yousef, Michael K Showe, Segun Jung, Louise C Showe

doi:10.1186/1748-7188-3-2


Open Access

https://doi.org/10.1186/1748-7188-3-2


Abstract

Background: The application of machine learning to classification problems that depend only on positive examples is gaining attention in the computational biology community. We and others have described the use of two-class machine learning to identify novel miRNAs. These methods require the generation of an artificial negative class. However, designating the negative class can be problematic and, if not done properly, can dramatically affect the performance of the classifier and/or yield a biased estimate of performance. We present a study using one-class machine learning for microRNA (miRNA) discovery and compare one-class to two-class approaches using naïve Bayes and Support Vector Machines. These results are compared to published two-class miRNA prediction approaches. We also examine the ability of the one-class and two-class techniques to identify miRNAs in newly sequenced species.

Results: Of all methods tested, we found that two-class naïve Bayes and Support Vector Machines gave the best accuracy using our selected features and optimally chosen negative examples. One-class methods showed average accuracies of 70–80% versus 90% for the two two-class methods on the same feature sets. However, some one-class methods outperform some recently published two-class approaches with different selected features. Using the EBV genome as an external validation of the method, we found one-class machine learning to work as well as or better than a two-class approach in identifying true miRNAs as well as predicting new miRNAs.

Conclusion: One-class and two-class methods can both give useful classification accuracies when the negative class is well characterized. The advantage of one-class methods is that they eliminate guessing at the optimal features for the negative class when these are not well defined. In these cases one-class methods can be superior to two-class methods when the features chosen as representative of the positive class are well defined.

Availability: The OneClassmiRNA program is available at: [1]

Highlights

The application of machine learning to classification problems that depend only on positive examples is gaining attention in the computational biology community.
The results of the one-class approaches show a slight superiority for OC-Gaussian and one-class K-nearest neighbor (OC-KNN) over the other one-class methods, based on the average Matthews Correlation Coefficient (MCC).
The one-class approach in machine learning has been receiving more attention for solving problems where the negative class is not well defined [22,23,24,25]; it has been successfully applied in various fields, including text mining [26], functional magnetic resonance imaging [27] and signature verification [28].

Summary

Introduction

The application of machine learning to classification problems that depend only on positive examples is gaining attention in the computational biology community. We present a study using one-class machine learning for microRNA (miRNA) discovery and compare one-class to two-class approaches using naïve Bayes and Support Vector Machines. Several computational approaches have been applied to miRNA gene prediction using methods based on sequence conservation and/or structural similarity [3,4,5,6,7]. All of these methods rely on binary classification, which requires artificially generating a non-miRNA class based on the absence of the features used to define the positive class. See [12] for a full review of miRNA discovery approaches.
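
To make the one-class idea concrete, here is a minimal sketch of a Gaussian one-class classifier trained on positive examples only. It is not the OneClassmiRNA implementation. The diagonal covariance, the `reject_fraction` threshold rule, and the synthetic two-dimensional features are all assumptions made for illustration.

```python
import math
import random


class OneClassGaussian:
    """Diagonal-covariance Gaussian density fitted to positives only."""

    def __init__(self, reject_fraction=0.1):
        # Fraction of training positives allowed to score below the
        # threshold (an assumed tuning knob, not a value from the paper).
        self.reject_fraction = reject_fraction

    def fit(self, positives):
        n = len(positives)
        dims = len(positives[0])
        self.mean = [sum(x[d] for x in positives) / n for d in range(dims)]
        # Floor the variance to avoid division by zero on constant features.
        self.var = [
            max(sum((x[d] - self.mean[d]) ** 2 for x in positives) / n, 1e-9)
            for d in range(dims)
        ]
        # The threshold is a low quantile of the training log-densities,
        # so no negative examples are needed to set it.
        scores = sorted(self.log_density(x) for x in positives)
        self.threshold = scores[min(int(self.reject_fraction * n), n - 1)]
        return self

    def log_density(self, x):
        return sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, self.mean, self.var)
        )

    def predict(self, x):
        """True if x looks like the positive class."""
        return self.log_density(x) >= self.threshold


# Synthetic stand-ins for feature vectors of known miRNA hairpins.
rng = random.Random(0)
positives = [[rng.gauss(0.0, 1.0), rng.gauss(5.0, 0.5)] for _ in range(200)]

model = OneClassGaussian(reject_fraction=0.1).fit(positives)
accepted = sum(model.predict(x) for x in positives)
far_point_accepted = model.predict([10.0, -10.0])
```

A two-class method would instead need a second training set of presumed non-miRNAs, and its decision boundary would depend on how that set was chosen. Here the boundary depends only on the positive class and the chosen rejection fraction.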

Methods

Results

Conclusion