A Model-Agnostic Algorithm for Bayes Error Determination in Binary Classification

Umberto Michelucci, Francesca Venturini, Marco A Deriu, Dario Piga, Michela Sperti

doi:10.3390/a14110301

Abstract

This paper presents the intrinsic limit determination algorithm (ILD Algorithm), a novel technique to determine the best possible performance, measured in terms of the AUC (area under the ROC curve) and accuracy, that can be obtained from a specific dataset in a binary classification problem with categorical features regardless of the model used. This limit, namely, the Bayes error, is completely independent of any model used and describes an intrinsic property of the dataset. The ILD algorithm thus provides important information regarding the prediction limits of any binary classification algorithm when applied to the considered dataset. In this paper, the algorithm is described in detail, its entire mathematical framework is presented and the pseudocode is given to facilitate its implementation. Finally, an example with a real dataset is given.

Highlights

Most machine learning projects follow the same pattern: many different model types are first trained on data to predict specific outcomes, then tested and compared to find the one that gives the best prediction performance on validation data.
This paper presents the intrinsic limit determination algorithm (ILD algorithm), a novel technique to determine the best possible performance, measured in terms of the area under the receiver operating characteristic (ROC) curve (AUC) and accuracy, that can be obtained from a specific dataset in a binary classification problem with categorical features, regardless of the model used.
The ILD algorithm computes the maximum performance in a binary classification problem, expressed both as the largest AUC and as the highest accuracy that can be achieved with any given dataset with categorical features.

Summary

Introduction

Most machine learning projects follow the same pattern: many different model types (such as decision trees, logistic regression, random forests, and neural networks) are first trained on data to predict specific outcomes, then tested and compared to find the one that gives the best prediction performance on validation data. In any supervised learning task, knowing the Bayes error (BE) linked to a given dataset would be extremely valuable. Such a value would help practitioners decide whether it is worthwhile to spend time and computing resources on improving the developed classifiers or acquiring additional training data. The ILD algorithm computes the maximum performance in a binary classification problem, expressed both as the largest area under the ROC curve (AUC) and as the highest accuracy that can be achieved with any given dataset with categorical features. This is by far the most significant contribution of this paper, as the ILD algorithm allows, for the first time, the BE of a given dataset to be evaluated exactly.
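To make the idea of a model-independent limit concrete, the following is a minimal sketch and not the authors' pseudocode. It assumes that records with identical categorical feature values are grouped into buckets, that any classifier must treat all records in a bucket identically, and that both classes occur in the data. Under those assumptions, the best attainable accuracy comes from predicting the majority class in each bucket, and the best attainable AUC comes from ranking buckets by their fraction of positives. The function names `aggregate_buckets`, `max_accuracy`, and `max_auc` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def aggregate_buckets(rows, labels):
    """Group identical feature tuples into buckets of [n_class0, n_class1]."""
    buckets = defaultdict(lambda: [0, 0])
    for features, y in zip(rows, labels):
        buckets[tuple(features)][y] += 1
    return dict(buckets)


def max_accuracy(buckets):
    """Upper bound on accuracy: predict the majority class in every bucket."""
    total = sum(n0 + n1 for n0, n1 in buckets.values())
    correct = sum(max(n0, n1) for n0, n1 in buckets.values())
    return correct / total


def max_auc(buckets):
    """Upper bound on AUC: rank buckets by their fraction of positives.

    Assumes at least one positive and one negative record overall.
    """
    ordered = sorted(
        buckets.values(),
        key=lambda c: c[1] / (c[0] + c[1]),
        reverse=True,
    )
    neg = sum(n0 for n0, _ in ordered)
    pos = sum(n1 for _, n1 in ordered)
    auc, tpr = 0.0, 0.0
    for n0, n1 in ordered:
        # Each bucket adds one ROC segment; integrate it with the trapezoid rule.
        auc += (n0 / neg) * (tpr + 0.5 * n1 / pos)
        tpr += n1 / pos
    return auc


# Toy data: one categorical feature, six records, two buckets.
rows = [("a",), ("a",), ("a",), ("b",), ("b",), ("b",)]
labels = [1, 1, 0, 0, 0, 1]
toy_buckets = aggregate_buckets(rows, labels)
print(max_accuracy(toy_buckets), max_auc(toy_buckets))
```

In the toy data, each bucket contains records of both classes, so no model that sees only this one feature can classify every record correctly. The two functions quantify that ceiling for the dataset itself rather than for any particular model.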

Mathematical Notation and Dataset Aggregation

Bucket 4

ILD Algorithm Mathematical Framework

Sensitivity and Specificity

Perfect Bucket and Perfect Dataset

Effect of One Single Flip

Handling Missing Values

Application of the ILD Algorithm to the Framingham Heart Study Dataset

Findings

Conclusions


Journal: Algorithms	Publication Date: Oct 20, 2021
Citations: 3	License type: CC BY 4.0


