Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients

Tanmoy Paul, Abu Saleh Mohammad Mosa, Md Kamruz Zaman Rana, Teja Venkat Pavan Kotapati, Humayera Islam, Vasanthi Mandhadi, Preethi Aishwarya Tautam, Michael Barnes, Richard D Hammer, Vishakha Sharma, Yaswitha Jampani, Nitesh Singh

doi:10.3390/app12199976


Open Access



Abstract

The de-identification of clinical reports is essential to protect patient confidentiality. Named entity recognition (NER) models based on natural language processing (NLP) are widely used for automatic clinical de-identification. The performance of such a machine learning model depends largely on the proper selection of features. The objective of this study was to investigate the utility of various features in a conditional-random-field (CRF)-based NER model. NLP toolkits were used to annotate the protected health information (PHI) in a total of 10,239 radiology reports, which were divided into seven types. A total of 10 features were extracted by the toolkit, and NER models were built using these features and their combinations. The performance of the models was evaluated based on their precision, recall, and F1-score. The best-performing features were n-gram, prefix-suffix, word embedding, and word shape. These features outperformed the others across all types of reports. The dataset was large in volume and divided into multiple types of reports. This diversity ensured that the results did not depend on a small number of structured texts from which a machine learning model can easily learn the features. The manual de-identification of large-scale clinical reports is impractical. From the wide array of features mentioned in the literature, this study helps to identify the best-performing ones for building an NER model for automatic de-identification.
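To make the feature names above concrete, the following is a minimal Python sketch of how three of them (n-gram, prefix-suffix, and word shape) might be computed per token as a feature dictionary, the kind of input a CRF tagger typically consumes. It is an illustration only, not the study's implementation: the function names, the affix length of 3, the bigram window, the shape alphabet (`X`, `x`, `d`), and the example sentence are all assumptions, and word embeddings are omitted because they require a pretrained model.

```python
import re


def word_shape(token: str) -> str:
    """Collapse a token to its character classes (assumed scheme):
    uppercase -> 'X', lowercase -> 'x', digit -> 'd', others unchanged."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)


def token_features(tokens: list[str], i: int, affix_len: int = 3) -> dict[str, str]:
    """Build a feature dict for tokens[i] (hypothetical feature names)."""
    token = tokens[i]
    lower = token.lower()
    # Sentence-boundary markers stand in for missing neighbours.
    prev_tok = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_tok = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return {
        "word": lower,
        "prefix": lower[:affix_len],           # prefix-suffix feature
        "suffix": lower[-affix_len:],
        "shape": word_shape(token),            # word-shape feature
        "bigram_prev": f"{prev_tok} {lower}",  # n-gram features (n = 2 here)
        "bigram_next": f"{lower} {next_tok}",
    }


def sentence_features(tokens: list[str]) -> list[dict[str, str]]:
    """One feature dict per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    # Invented example sentence; not taken from the study's data.
    example = ["Seen", "by", "Dr.", "Smith", "on", "01/02/2020"]
    for tok, feats in zip(example, sentence_features(example)):
        print(tok, feats)
```

Under these assumptions, a surname and a date receive distinct shapes (`Xxxxx` versus `dd/dd/dddd`), which is one intuition for why surface features of this kind can help separate PHI tokens from ordinary report text.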
