The role of balanced training and testing data sets for binary classifiers in bioinformatics.

Qiong Wei, Roland L Dunbrack

doi:10.1371/journal.pone.0067863

Abstract

Training and testing of conventional machine learning models on binary classification problems depend on the proportions of the two outcomes in the relevant data sets. This may be especially important in practical terms when real-world applications of the classifier are either highly imbalanced or occur in unknown proportions. Intuitively, it may seem sensible to train machine learning models on data similar to the target data in terms of proportions of the two binary outcomes. However, we show that this is not the case using the example of prediction of deleterious and neutral phenotypes of human missense mutations in human genome data, for which the proportion of the binary outcome is unknown. Our results indicate that using balanced training data (50% neutral and 50% deleterious) results in the highest balanced accuracy (the average of True Positive Rate and True Negative Rate), Matthews correlation coefficient, and area under ROC curves, no matter what the proportions of the two phenotypes are in the testing data. Besides balancing the data by undersampling the majority class, other techniques in machine learning include oversampling the minority class, interpolating minority-class data points and various penalties for misclassifying the minority class. However, these techniques are not commonly used in either the missense phenotype prediction problem or in the prediction of disordered residues in proteins, where the imbalance problem is substantial. The appropriate approach depends on the amount of available data and the specific problem at hand.
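The metrics and the undersampling step named in the abstract can be written down in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes labels are coded 1 (deleterious, positive) and 0 (neutral, negative), and the function names `confusion_counts`, `balanced_accuracy`, `matthews_corrcoef`, and `undersample_majority` are illustrative choices rather than names from the paper.

```python
import math
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for labels coded 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(tp, tn, fp, fn):
    """Average of True Positive Rate and True Negative Rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def undersample_majority(examples, labels, seed=0):
    """Randomly drop majority-class items until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    keep = sorted(rng.sample(pos, n) + rng.sample(neg, n))
    return [examples[i] for i in keep], [labels[i] for i in keep]
```

Undersampling discards data, so it is most practical when the minority class is already large enough to train on; the other techniques listed in the abstract (oversampling, interpolation, misclassification penalties) address the same imbalance without discarding examples.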

Highlights

In several areas of bioinformatics, binary classifiers are common tools that have been developed for applications in the biological community.
Our results indicate that “balanced accuracy” is quite flat with respect to testing proportions, but is quite sensitive to balance in the training set, reaching a maximum for balanced training sets.
The Selection of Neutral Data Sets: From SwissVar, we obtained a set of human missense mutations associated with disease and a set of polymorphisms of unknown phenotype, often presumed to be neutral.
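One way to see why balanced accuracy can stay flat across testing proportions is simple arithmetic. The sketch below is an illustration under a stated assumption, not a reproduction of the paper's experiments: it assumes a classifier whose True Positive Rate and True Negative Rate do not change with the class mix of the test set, and the rates 0.8 and 0.7 are hypothetical values chosen for the example.

```python
def accuracy_from_rates(tpr, tnr, pos_fraction):
    """Plain accuracy on a test set with the given fraction of positives."""
    return pos_fraction * tpr + (1 - pos_fraction) * tnr


def balanced_accuracy_from_rates(tpr, tnr):
    """Balanced accuracy depends only on the per-class rates."""
    return (tpr + tnr) / 2


# Hypothetical per-class rates, assumed fixed across test sets.
TPR, TNR = 0.8, 0.7

for pos_fraction in (0.1, 0.5, 0.9):
    acc = accuracy_from_rates(TPR, TNR, pos_fraction)
    bal = balanced_accuracy_from_rates(TPR, TNR)
    print(f"positives={pos_fraction:.0%}  accuracy={acc:.3f}  balanced={bal:.3f}")
```

Under this assumption, plain accuracy drifts toward whichever per-class rate belongs to the more common class, while balanced accuracy does not depend on the test proportions at all.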

Summary

Introduction

In several areas of bioinformatics, binary classifiers are common tools that have been developed for applications in the biological community. Based on input or calculated feature data, the classifiers predict the probability of a positive (or negative) outcome, with P(+) = 1 – P(–). Examples of this kind of classifier in bioinformatics include the prediction of the phenotypes of missense mutations in the human genome [1,2,3,4,5,6,7,8], the prediction of disordered residues in proteins [9,10,11,12,13,14,15,16,17], and the prediction of the presence or absence of beta turns, regular secondary structures, and transmembrane helices in proteins [18,19,20,21]. Very large amounts of such data have become available, especially from cancer genome projects comparing tumor and non-tumor samples [26]. This led us to question the nature of our training and testing data sets, and how the proportions of positive and negative data points would affect our results. If we trained a classifier with balanced data sets (50% deleterious, 50% neutral), but genomic data have much lower rates of deleterious mutations, would we overpredict deleterious phenotypes? Or should we try to create training data that resemble the potential application data? Should we choose neutral data that closely resemble potential input, for example human missense mutations in SwissVar, or should we use more distinct data, for example from close orthologues of human sequences in other organisms, in particular primates?

Journal: PLoS ONE	Publication Date: Jul 9, 2013
Citations: 226	License type: CC BY 4.0

Similar Papers

Generating Balanced Classifier-Independent Training Samples from Unlabeled Data
Youngja Park ... Ian M Molloy
01 Jan 2012

PAKDD’12 best paper: generating balanced classifier-independent training samples from unlabeled data
Youngja Park ... Zijie Qi
Knowledge and Information Systems | VOL. 41
05 Sep 2013

BALANCED VS IMBALANCED TRAINING DATA: CLASSIFYING RAPIDEYE DATA WITH SUPPORT VECTOR MACHINES
M Ustuner ... S Abdikan
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences | VOL. XLI-B7
21 Jun 2016

BALANCED VS IMBALANCED TRAINING DATA: CLASSIFYING RAPIDEYE DATA WITH SUPPORT VECTOR MACHINES
M Ustuner ... F B Sanli
ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences | VOL. XLI-B7
21 Jun 2016
