The Trials and Tribulations of Assembling Large Medical Imaging Datasets for Machine Learning Applications

Kirti Magudia, Michael H Rosenthal, Katherine P Andriole, Christopher P Bridge

doi:10.1007/s10278-021-00505-7


Abstract

With growing interest in machine learning, more investigators are proposing to assemble large imaging datasets for model development. We aim to delineate the roadblocks to exam retrieval that can arise and cause significant delays. This HIPAA-compliant, institutional review board–approved, retrospective clinical study required identification and retrieval of exams for all outpatient and emergency patients who underwent abdominal and pelvic computed tomography (CT) at three affiliated hospitals in the year 2012. If a patient had multiple abdominal CT exams, the first exam was selected for retrieval (n = 23,186). Our attempt to retrieve these 23,186 abdominal CT exams yielded 22,852 valid CT abdomen/pelvis exams and identified four major categories of challenges when retrieving large datasets: cohort selection and processing, retrieving DICOM exam files from PACS, data storage, and non-recoverable failures. The retrieval took 3 months of project time and at least 300 person-hours shared among the primary investigator (a radiologist), a data scientist, and a software engineer. Exam selection and retrieval may therefore take significantly longer than planned. We share our experience so that other investigators can anticipate and plan for these challenges. We also hope to help institutions better understand the demands that large-scale medical imaging machine learning projects may place on their infrastructure.
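The cohort rule stated in the abstract (keep only each patient's first exam) can be illustrated with a minimal sketch. This is not the authors' code. The record layout, the field names `mrn`, `accession`, and `exam_date`, and the sample values are all hypothetical and chosen only for illustration.

```python
from datetime import date


def first_exam_per_patient(exams):
    """Return one exam per patient: the one with the earliest exam date.

    `exams` is an iterable of dicts with the hypothetical keys
    'mrn', 'accession', and 'exam_date' (assumed, not from the paper).
    If two exams share the earliest date, the first one seen is kept.
    """
    earliest = {}
    for exam in exams:
        mrn = exam["mrn"]
        current = earliest.get(mrn)
        if current is None or exam["exam_date"] < current["exam_date"]:
            earliest[mrn] = exam
    return list(earliest.values())


# Made-up example records; identifiers are placeholders, not real data.
exams = [
    {"mrn": "A1", "accession": "X100", "exam_date": date(2012, 5, 2)},
    {"mrn": "A1", "accession": "X050", "exam_date": date(2012, 1, 9)},
    {"mrn": "B2", "accession": "X200", "exam_date": date(2012, 3, 14)},
]
selected = first_exam_per_patient(exams)
```

In practice a rule like this is only reliable once patient identifiers have been made consistent across sites, since the same patient recorded under two differently formatted identifiers would be counted twice.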

Highlights

Machine learning is a field focusing on how computers can learn from data and sits at the intersection between statistics and computer science
An initial attempt to retrieve our cohort revealed that a number of studies that we had identified for retrieval were mislabeled musculoskeletal and interventional computed tomography (CT) exams
Another major initial challenge for exam retrieval was inconsistent formatting of medical record numbers (MRNs) and accession numbers (ACCs) across different hospitals

Summary

Introduction

Machine learning is a field that focuses on how computers can learn from data and sits at the intersection of statistics and computer science. An increasingly popular approach to machine learning is to use deep neural networks, inspired by the structure and function of the human brain, to process complex image data [1]. A major bottleneck to progress in machine learning for radiology is the assembly of imaging datasets for model training [3]. Performance of these models generally improves with more data, so the largest possible dataset is desired [1]. Increasing numbers of investigators are proposing to assemble their own datasets for training,



Journal: Journal of Digital Imaging
Publication Date: Oct 4, 2021
Citations: 11
License type: open-access


