Abstract

The development of deep learning (DL) algorithms is a three-step process: training, tuning, and testing. Studies are inconsistent in their use of the term "validation", with some using it to refer to tuning and others to testing, which hinders accurate communication and may inadvertently exaggerate the performance of DL algorithms. We investigated the extent of this inconsistency in studies on the accuracy of DL algorithms in providing diagnoses from medical imaging. We analyzed the full texts of research papers cited in two recent systematic reviews. The papers were categorized according to whether the term "validation" was used to refer to tuning alone, both tuning and testing, or testing alone. Using multivariable logistic regression analysis with generalized estimating equations, we analyzed whether paper characteristics (i.e., journal category, field of study, year of print publication, journal impact factor [JIF], and nature of test data) were associated with the usage of the term. Of 201 papers published in 125 journals, 118 (58.7%), 9 (4.5%), and 74 (36.8%) used the term to refer to tuning alone, both tuning and testing, and testing alone, respectively. A weak association was noted between higher JIF and using the term to refer to testing (i.e., testing alone or both tuning and testing) rather than tuning alone (vs. JIF <5; JIF 5 to 10: adjusted odds ratio 2.11, P = 0.042; JIF >10: adjusted odds ratio 2.41, P = 0.089). Journal category, field of study, year of print publication, and nature of test data were not significantly associated with the terminology usage. The existing literature shows a significant degree of inconsistency in using the term "validation" to refer to the steps of DL algorithm development. Efforts are needed to improve the accuracy and clarity of this terminology.

Highlights

  • Deep learning (DL), often used almost synonymously with artificial intelligence (AI), is the most dominant type of machine learning technique at present

  • Existing literature has a significant degree of inconsistency in using the term “validation” when referring to the steps in deep learning (DL) algorithm development

  • Such inconsistency in terminology, or the inaccurate use of “validation” to refer to testing, is likely due to the fact that in general communication, as well as in medicine, the term typically refers to testing the accuracy of a completed algorithm [6, 20], whereas the field of machine learning uses it as a very specific term for the tuning step [4,5,6, 12, 17, 19, 21]


Introduction

Deep learning (DL), often used almost synonymously with artificial intelligence (AI), is the most dominant type of machine learning technique at present. The real-world performance of a DL algorithm tested on adequate external datasets tends to be lower, often by large degrees, than that obtained with internal datasets during the tuning step [3, 6, 22,23,24]. Mixed usage of the term "validation" may therefore inadvertently exaggerate the performance of DL algorithms to researchers and the general public alike who are not familiar with machine learning.
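The three-step process discussed here (training, tuning, and testing) corresponds to a three-way partition of the available data, where the tuning subset is what machine learning conventionally calls the "validation" set. The sketch below illustrates such a split; the function name, split fractions, and seed are illustrative assumptions, not taken from the paper.

```python
import random

def three_way_split(data, train_frac=0.7, tune_frac=0.15, seed=0):
    """Partition data into training, tuning (the set machine learning
    calls "validation"), and testing subsets.

    The fractions are illustrative assumptions; the held-out test set
    is only used once, after tuning is complete.
    """
    rng = random.Random(seed)
    items = list(data)
    rng.shuffle(items)  # shuffle so each subset is a random sample
    n = len(items)
    n_train = int(n * train_frac)
    n_tune = int(n * tune_frac)
    train = items[:n_train]                    # used to fit model weights
    tune = items[n_train:n_train + n_tune]     # used to select hyperparameters
    test = items[n_train + n_tune:]            # used once, for final accuracy
    return train, tune, test

train, tune, test = three_way_split(range(100))
print(len(train), len(tune), len(test))  # → 70 15 15
```

Note that an internal split like this only estimates internal performance; as the text points out, accuracy on adequate external datasets is often substantially lower.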
