An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable.

Kristjan Korjus, Raul Vicente, Martin N Hebart, Chuhsing Kate Hsiao

doi:10.1371/journal.pone.0161788

Abstract

Supervised machine learning methods typically require splitting data into multiple chunks for training, validating, and finally testing classifiers. For finding the best parameters of a classifier, training and validation are usually carried out with cross-validation. This is followed by application of the classifier with optimized parameters to a separate test set for estimating the classifier’s generalization performance. With limited data, this separation of test data creates a difficult trade-off between having more statistical power in estimating generalization performance versus choosing better parameters and fitting a better model. We propose a novel approach that we term “Cross-validation and cross-testing” improving this trade-off by re-using test data without biasing classifier performance. The novel approach is validated using simulated data and electrophysiological recordings in humans and rodents. The results demonstrate that the approach has a higher probability of discovering significant results than the standard approach of cross-validation and testing, while maintaining the nominal alpha level. In contrast to nested cross-validation, which is maximally efficient in re-using data, the proposed approach additionally maintains the interpretability of individual parameters. Taken together, we suggest an addition to currently used machine learning approaches which may be particularly useful in cases where model weights do not require interpretation, but parameters do.
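The standard approach of cross-validation followed by testing, as described above, can be sketched as follows. This is a minimal illustration and not the authors' code: the simulated data, the k-nearest-neighbour classifier, the candidate values of `k`, and all function names are assumptions chosen to keep the example short and dependency-free.

```python
import random

def make_data(n, rng):
    """Simulate n labelled 2-D points: class 0 around (0, 0), class 1 around (1, 1)."""
    data = []
    for i in range(n):
        label = i % 2
        point = (rng.gauss(label, 1.0), rng.gauss(label, 1.0))
        data.append((point, label))
    rng.shuffle(data)
    return data

def knn_predict(train, point, k):
    """Majority vote among the k training points nearest to `point`."""
    def sq_dist(sample):
        (x, y), _ = sample
        return (x - point[0]) ** 2 + (y - point[1]) ** 2
    nearest = sorted(train, key=sq_dist)[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train, samples, k):
    """Fraction of `samples` classified correctly by k-NN fitted on `train`."""
    hits = sum(knn_predict(train, p, k) == y for p, y in samples)
    return hits / len(samples)

def cross_validate(pool, k, n_folds):
    """Mean validation accuracy over n_folds interleaved folds of `pool`."""
    scores = []
    for f in range(n_folds):
        val = pool[f::n_folds]
        train = [s for i, s in enumerate(pool) if i % n_folds != f]
        scores.append(accuracy(train, val, k))
    return sum(scores) / n_folds

def standard_approach(data, candidate_ks, n_test, n_folds=5):
    """Choose k by cross-validation on the pool, then test once on held-out data."""
    pool, test = data[:-n_test], data[-n_test:]
    cv_scores = {k: cross_validate(pool, k, n_folds) for k in candidate_ks}
    best_k = max(cv_scores, key=cv_scores.get)
    # The test set is touched only here, after the parameter has been fixed.
    test_acc = accuracy(pool, test, best_k)
    return best_k, cv_scores, test_acc

rng = random.Random(0)  # fixed seed so the run is reproducible
data = make_data(120, rng)
best_k, cv_scores, test_acc = standard_approach(data, [1, 3, 5, 9, 15], n_test=40)
print("chosen k:", best_k, "test accuracy:", round(test_acc, 3))
```

The sketch makes the trade-off visible: every sample assigned to `test` is unavailable for choosing `k` or for fitting the final model, and every sample kept in `pool` reduces the number of test predictions on which the generalization estimate rests.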

Highlights

The goal of supervised machine learning, in particular classification, is to find a model that accurately assigns data to separate predefined classes.
When using machine learning algorithms for making predictions, improving performance of a classifier can be seen as a central goal.
Since data are often scarce or expensive to acquire, efficient use of data is another important objective. These three goals (generalization performance, interpretability, and efficient use of data) often lead to a trade-off that is resolved depending on the focus of the researcher.

Summary

Introduction

The goal of supervised machine learning, in particular classification, is to find a model that accurately assigns data to separate predefined classes. To test the generality of a learned model, this model is typically applied to independent test data, and the accuracy of the prediction informs a researcher about the quality of the classifier [1]. Finding a classifier that performs optimally according to the researcher’s objective requires a set of assumptions and a trade-off in model complexity: parameters that are too simple lead to under-fitting, i.e. the model is not able to account for the complexity of the data. Parameters that are too complex, in turn, lead to over-fitting, i.e. the model fits noise in the data.
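The two extremes of this complexity trade-off can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not taken from the paper: it uses a 1-D k-nearest-neighbour classifier as a stand-in, where a small `k` gives a very flexible model and a large `k` a very rigid one, and the data, class means, and function names are all hypothetical.

```python
import random

def simulate(n_per_class, rng):
    """Return 1-D samples (x, label): class 0 ~ N(0, 1), class 1 ~ N(1.5, 1)."""
    return [(rng.gauss(1.5 * label, 1.0), label)
            for label in (0, 1) for _ in range(n_per_class)]

def knn_accuracy(train, samples, k):
    """Fraction of `samples` whose label matches the k-NN majority vote."""
    hits = 0
    for x, y in samples:
        nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        predicted = 1 if 2 * votes > k else 0  # ties fall back to class 0
        hits += predicted == y
    return hits / len(samples)

rng = random.Random(1)  # fixed seed so the run is reproducible
train = simulate(50, rng)      # 100 balanced training samples
held_out = simulate(50, rng)   # 100 balanced independent samples

# Map each k to (training accuracy, held-out accuracy).
results = {k: (knn_accuracy(train, train, k), knn_accuracy(train, held_out, k))
           for k in (1, 15, len(train))}
for k, (train_acc, held_out_acc) in results.items():
    print(f"k={k:3d}  train={train_acc:.2f}  held-out={held_out_acc:.2f}")
```

With `k=1` every training point is its own nearest neighbour, so training accuracy is 1.0 regardless of how well the model generalizes, which is the signature of over-fitting. With `k` equal to the training set size the vote always ties on this balanced set and the model predicts class 0 for every input, giving 0.5 on both sets, which is the under-fitting extreme.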

Journal: PLOS ONE
Publication Date: Aug 26, 2016
Citations: 34
License type: CC BY 4.0