Abstract

Purpose: De-identification of cancer imaging data is vitally important for data sharing and the advancement of research; however, it is a time-consuming and complex process that limits access to new cancer data sets such as those shared through NCI’s Imaging Data Commons (IDC), built on the Google Cloud Platform (GCP). Our research demonstrates how this process can be automated using GCP-native services.

Methods: We configured the Medical Image De-Identification (MIDI) pipeline to automate de-identification of cancer imaging data. De-identification was performed using an alpha release of GCP’s Healthcare API, which was configured to scrub all Protected Health Information (PHI) from both Digital Imaging and Communications in Medicine (DICOM) headers and burnt-in text in pixel data. To test the de-identification algorithm, we prepared a dataset of 216 patients and 23,921 images by placing synthetic PHI in both DICOM headers and pixel data. The synthetic data matched real data seen during curation at The Cancer Imaging Archive (TCIA) and included data that is difficult for an algorithm to detect. Accuracy of the MIDI pipeline was measured against TCIA’s standard tools and procedures for de-identification. Measures included correct detection of all PHI and correct action taken (e.g., remove, encrypt, or otherwise obscure). Throughput was also measured.

Results: Throughput was measured at 22.0 images per second over 10 runs. The MIDI pipeline’s accuracy for DICOM headers was 98.7%; it correctly detected dates, addresses, phone numbers, unique identifiers, names, and other common PHI. The PHI most often missed consisted of special cases: uncommon names or names with symbols, dates in string data types that were mistaken for other IDs, patient IDs, and abbreviated institution names. Private Creator data elements were consistently not retained. These errors were due to options not currently available and to algorithms not trained on specific PHI, such as abbreviated institution names. UIDs were correctly replaced. PHI burnt into the pixel data was successfully detected and removed, with one false positive.

Conclusion: We demonstrate the current capability and performance of automated cancer image de-identification. Our results show that, while full automation is within reach, a semi-automated pipeline is already feasible, with a human expert in the loop for final verification. This will lead to a much-needed acceleration of image de-identification, to handle the rapidly growing volume of data and provide timely access in support of cancer research. Future work will focus on including pre- and post-processing tools to aid the human expert in the loop, such as identifying and flagging questionable images for manual review. These tools will also be used to catch the errors described in the Results.

Citation Format: Benjamin P. Kopchick, Laura K. Opsahl-Ong, Qinyan Pan, Michael W. Rutherford, Ulrike Wagner, Bhavani S. Singh, Scott Gustafson, Fred W. Prior, David A. Clunie, Juergen A. Klenk, Keyvan Farahani. Accelerating de-identification of images with cloud services to support data sharing in cancer research. [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2023; Part 1 (Regular and Invited Abstracts); 2023 Apr 14-19; Orlando, FL. Philadelphia (PA): AACR; Cancer Res 2023;83(7_Suppl):Abstract nr 6579.
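
The abstract does not state how the accuracy and throughput figures were computed. As an illustration only, the two kinds of measure named in the Methods (correct action on each PHI item, and throughput) could be scored roughly as in the sketch below. The names `PhiCheck`, `header_accuracy`, `missed_elements`, and `throughput`, the action labels, and the toy inputs are all hypothetical and are not taken from the MIDI pipeline or from TCIA tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhiCheck:
    """One synthetic PHI item and how it was handled (hypothetical record)."""
    element: str           # e.g. a DICOM attribute name
    expected_action: str   # action required by the reference procedure
    observed_action: str   # action the pipeline took; "none" if PHI was missed


def header_accuracy(checks):
    """Fraction of PHI items where the observed action equals the expected one."""
    if not checks:
        raise ValueError("no checks to score")
    correct = sum(1 for c in checks if c.observed_action == c.expected_action)
    return correct / len(checks)


def missed_elements(checks):
    """Elements whose handling differed from the reference, for manual review."""
    return [c.element for c in checks if c.observed_action != c.expected_action]


def throughput(image_count, elapsed_seconds):
    """Images processed per second for a single run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return image_count / elapsed_seconds


# Toy input, invented for illustration; not data from the study.
toy_checks = [
    PhiCheck("PatientName", "remove", "remove"),
    PhiCheck("StudyDate", "shift", "shift"),
    PhiCheck("InstitutionName", "remove", "none"),
    PhiCheck("SOPInstanceUID", "replace", "replace"),
]
print(header_accuracy(toy_checks))   # 0.75
print(missed_elements(toy_checks))   # ['InstitutionName']
```

A list such as the one returned by `missed_elements` is the kind of output that could feed the human-in-the-loop review step described in the Conclusion.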