Research Goal-Driven Data Model and Harmonization for De-Identifying Patient Data in Radiomics.

Surajit Kundu, Syamantak Das, Soumendranath Ray, Indranil Mallick, Santam Chakraborty, Jayanta Mukhopadhyay, Moses Arunsingh, Sanjoy Chatterjee, Rimpa Basu Achari, Tapesh Bhattacharyya, Partha Pratim Das

doi:10.1007/s10278-021-00476-9

Open Access

https://doi.org/10.1007/s10278-021-00476-9

Abstract

Various efforts exist to de-identify patients' radiation oncology data for use in advancing medical research. Although de-identification needs to be defined in the context of research goals and objectives, existing systems lack the flexibility to model data and normalize attribute names accordingly. In this work, we describe a de-identification process for radiation and clinical oncology data that is guided by a data model and a schema for dynamically capturing domain ontology and normalizing terminology, both defined in line with the research goals in this area. The radiological images are obtained in DICOM format and consist of diagnostic, radiation therapy (RT) treatment planning, RT verification, and RT response images. During DICOM de-identification, a few crucial pieces of information about the dataset are captured. The proposed model is generic: it organizes information modeling in sync with the de-identification of a patient's clinical information. The treatment and clinical data are provided in comma-separated values (CSV) format, following a predefined data structure. The de-identified data are harmonized throughout the entire process. We present four case studies on four different types of cancer, namely glioblastoma multiforme, head and neck, breast, and lung, along with experimental validation on a few patients' data in these four areas. Several aspects are handled during de-identification: preservation of longitudinal date changes (LDC), incremental de-identification, referential data integrity between the clinical and image data, harmonization of de-identified data, and transformation of the data to an underlying database schema.
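Two of the properties listed in the abstract, preservation of longitudinal date changes and referential integrity between clinical and image data, can be illustrated with a minimal sketch. It is not the authors' implementation: the keyed-hash pseudonym, the per-patient date shift, and all names (`SECRET_KEY`, `pseudonym`, `date_offset`, `deidentify_record`, the record fields) are assumptions made for illustration only.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical site secret; in practice this would be stored securely.
SECRET_KEY = b"replace-with-a-site-secret"


def pseudonym(patient_id: str) -> str:
    """Map a patient ID to a stable pseudonym (same input, same output)."""
    digest = hmac.new(SECRET_KEY, patient_id.encode("utf-8"), hashlib.sha256)
    return "ANON-" + digest.hexdigest()[:12]


def date_offset(patient_id: str, max_days: int = 365) -> timedelta:
    """Derive one fixed backward shift (1..max_days days) per patient."""
    digest = hmac.new(
        SECRET_KEY, b"date:" + patient_id.encode("utf-8"), hashlib.sha256
    ).digest()
    days = int.from_bytes(digest[:4], "big") % max_days + 1
    return timedelta(days=-days)


def deidentify_record(record: dict) -> dict:
    """Replace the patient ID and shift every date by the patient's offset."""
    pid = record["patient_id"]
    shift = date_offset(pid)
    out = {"patient_id": pseudonym(pid)}
    for key, value in record.items():
        if key == "patient_id":
            continue
        # Every date of one patient moves by the same amount,
        # so intervals between events are unchanged.
        out[key] = value + shift if isinstance(value, date) else value
    return out


# A clinical (CSV-like) row and an image-derived row for the same patient.
clinical = {"patient_id": "MRN001", "diagnosis_date": date(2020, 1, 10), "site": "lung"}
imaging = {"patient_id": "MRN001", "study_date": date(2020, 3, 1)}

clinical_anon = deidentify_record(clinical)
imaging_anon = deidentify_record(imaging)
```

Because both the pseudonym and the offset are derived deterministically from the original identifier, records de-identified in a later batch receive the same pseudonym and shift as earlier ones, which is one simple way to support incremental de-identification under these assumptions.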

Full Text