Binary classification is a fundamental task in machine learning, with applications across many scientific domains. Whether conducting fundamental research or refining practical applications, scientists typically assess and rank classification techniques by performance metrics such as accuracy, sensitivity, and specificity. However, reported performance scores may not always provide a reliable basis for ranking, owing to undisclosed or unconventional cross-validation practices, typographical errors, and other factors.

In a given experimental setup with a specific number of positive and negative test items, performance scores can only take specific, interrelated values. Building on this observation, in this paper we introduce numerical techniques to assess the consistency of reported performance scores with the assumed experimental setup. Importantly, the proposed approaches do not rely on statistical inference; instead, they use numerical methods (interval computing and integer linear programming) to identify inconsistencies with certainty.

Through three applications in different fields of medicine, we demonstrate how the proposed tests can detect inconsistencies and thereby safeguard the integrity of research fields. Power analyses in these applications show that the tests achieve a power of at least 71% when performance scores are reported to four decimal places. In the investigated areas, the tests have so far identified inconsistencies in more than 100 scientific papers. To benefit the scientific community, we have made the consistency tests available in the open-source Python package mlscorecheck.
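To make the underlying observation concrete, the sketch below brute-forces whether any confusion matrix on a single test set can reproduce a reported accuracy, sensitivity, and specificity within rounding tolerance. This is a minimal illustration of the idea, not the algorithm used by mlscorecheck (which relies on interval computing and integer linear programming); the function name `consistent` and the numeric values are hypothetical.

```python
def consistent(p, n, acc, sens, spec, decimals=4):
    """Return True if some confusion matrix (tp, tn) on a test set with
    p positives and n negatives reproduces all three reported scores
    within the tolerance implied by rounding to `decimals` places."""
    eps = 10.0 ** (-decimals) / 2  # maximal error introduced by rounding
    for tp in range(p + 1):
        if abs(tp / p - sens) > eps:  # sensitivity = tp / p
            continue
        for tn in range(n + 1):
            if abs(tn / n - spec) > eps:  # specificity = tn / n
                continue
            if abs((tp + tn) / (p + n) - acc) <= eps:  # accuracy
                return True
    return False

# Hypothetical example: with 50 positives and 100 negatives, these scores
# are mutually attainable (tp = 42, tn = 86 yields acc = 128/150 = 0.8533)...
print(consistent(50, 100, acc=0.8533, sens=0.8400, spec=0.8600))  # True
# ...while this combination is not: no (tp, tn) yields all three at once.
print(consistent(50, 100, acc=0.9000, sens=0.8400, spec=0.8600))  # False
```

At four decimal places the tolerance interval around each score is narrow, so only a few confusion matrices (often exactly one) remain admissible per score, which is why such consistency tests can flag inconsistencies with certainty rather than probabilistically.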