Understanding metric-related pitfalls in image analysis validation.

A Emre Kavur, Abdel A Taha, Adrian Galdran, Aleksei Tiulpin, Alexandros Karargyris, Amin Madani, Anna Kreshuk, Anne L Martel, Annette Kopp-Schneider, Annika Reinke, Arriel Benis, Ben Glocker, Ben Van Calster, Bennett A Landman, Bernhard Kainz, Beth A Cimini, Bjoern Menze, Bram Van Ginneken, Brennan Nichyporuk, Carole H Sudre, Charles E Kahn, Clara I Sánchez, Dagmar Kainmueller, Daniel A Hashimoto, Doreen Heckmann-Nötzel, Erik Meijering, Evangelia Christodoulou, Fabian Isensee, Felix Nickel, Florian Buettner, Florian Kofler, Gaël Varoquaux, Geert Litjens, Henning Müller, Jens Kleesiek, Jens Petersen, Jianxu Chen, Julio Saez-Rodriguez, Karel G M Moons, Keyvan Farahani, Klaus Maier-Hein, Laura Acion, Lena Maier-Hein, Luciana Ferrer, M Jorge Cardoso, Matthias Eisenmann, Mauricio Reyes, Merel Huisman, Michael A Riegler, Michael Baumgartner, Michael M Hoffman, Michal Kozubek, Michela Antonelli, Minu D Tizabi, Nasir Rajpoot, Nicola Rieke, Patrick Godau, Paul F Jäger, Pierre Jannin, Ronald M Summers, Shravya Shetty, Sotirios A Tsaftaris, Spyridon Bakas, Susanne M Rafelski, Tahsin Kurc, Tal Arbel, Thijs Kooi, Tim Rädsch, Veronika Cheplygina, Ziv R Yaniv

doi:10.1038/s41592-023-02150-0

Abstract

Validation metrics are key for tracking scientific progress and bridging the current chasm between artificial intelligence research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately. Although taking into account the individual strengths, weaknesses and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multistage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides a reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Although focused on biomedical image analysis, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. The work serves to enhance global comprehension of a key topic in image analysis validation.

Full Text