Abstract

Supervised machine learning methods typically require splitting data into multiple chunks for training, validating, and finally testing classifiers. To find the best parameters of a classifier, training and validation are usually carried out with cross-validation. The classifier with optimized parameters is then applied to a separate test set to estimate its generalization performance. With limited data, this separation of test data creates a difficult trade-off between having more statistical power in estimating generalization performance and choosing better parameters and fitting a better model. We propose a novel approach, which we term “cross-validation and cross-testing”, that improves this trade-off by re-using test data without biasing classifier performance. The approach is validated using simulated data and electrophysiological recordings from humans and rodents. The results demonstrate that the approach has a higher probability of discovering significant results than the standard approach of cross-validation and testing, while maintaining the nominal alpha level. In contrast to nested cross-validation, which is maximally efficient in re-using data, the proposed approach additionally maintains the interpretability of individual parameters. Taken together, we suggest an addition to currently used machine learning approaches which may be particularly useful in cases where model weights do not require interpretation, but parameters do.

Highlights

  • The goal of supervised machine learning, in particular classification, is to find a model that accurately assigns data to separate predefined classes

  • When using machine learning algorithms for making predictions, improving performance of a classifier can be seen as a central goal

  • Since data are often scarce or expensive to acquire, efficient use of data is another important objective. These three goals—generalization performance, interpretability, and efficient use of data—often lead to a trade-off that is resolved depending on the focus of the researcher

Introduction

The goal of supervised machine learning, in particular classification, is to find a model that accurately assigns data to separate predefined classes. To test how well a learned model generalizes, the model is typically applied to independent test data, and the accuracy of its predictions informs the researcher about the quality of the classifier [1]. Finding a classifier that performs optimally according to the researcher’s objective requires a set of assumptions and a trade-off in model complexity: parameter settings that are too simple lead to under-fitting, i.e. the model cannot account for the complexity of the data, whereas parameter settings that are too complex lead to over-fitting, i.e. the model fits noise in the data.

PLOS ONE | DOI:10.1371/journal.pone.0161788 August 26, 2016
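To make the standard procedure concrete, the following is a minimal sketch of parameter selection by cross-validation followed by a single evaluation on held-out test data. It illustrates the conventional approach only, not the proposed cross-validation and cross-testing scheme. The synthetic one-dimensional data, the k-nearest-neighbour classifier, the candidate values of `k`, and all function names are assumptions chosen for illustration; here a small `k` plays the role of an overly complex setting and a large `k` that of an overly simple one.

```python
import random


def make_data(n, rng):
    """Draw n points from two overlapping 1-D Gaussian classes (synthetic)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(1.0 if label else -1.0, 1.0)
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (k is odd)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k):
    """Fraction of test points classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def cross_validate(data, k, n_folds=5):
    """Mean validation accuracy over n_folds interleaved folds."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, val in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(accuracy(train, val, k))
    return sum(scores) / n_folds


rng = random.Random(0)
data = make_data(200, rng)

# The test set is set aside and never used for parameter selection.
dev, test = data[:150], data[150:]

# Choose the complexity parameter k by cross-validation on the development set.
candidates = [1, 5, 15, 45]
cv_scores = {k: cross_validate(dev, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)

# Estimate generalization performance once, with the selected parameter.
test_acc = accuracy(dev, test, best_k)
print("selected k:", best_k, "test accuracy:", round(test_acc, 3))
```

In this sketch the 50 test points contribute nothing to choosing `k`, and the 150 development points contribute nothing to the final performance estimate, which is the trade-off described above when data are limited.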
