Abstract

Although the bulk of patient-generated health data are increasing exponentially, their use is impeded because most data come in unstructured format, namely as free-text clinical reports. A variety of natural language processing (NLP) methods have emerged to automate the processing of free text ranging from statistical to deep learning-based models; however, the optimal approach for medical text analysis remains to be determined. The aim of this study was to provide a head-to-head comparison of novel NLP techniques and inform future studies about their utility for automated medical text analysis. Magnetic resonance imaging reports of patients with brain metastases treated in two tertiary centers were retrieved and manually annotated using a binary classification (single metastasis v two or more metastases). Multiple bag-of-words and sequence-based NLP models were developed and compared after randomly splitting the annotated reports into training and test sets in an 80:20 ratio. A total of 1,479 radiology reports of patients diagnosed with brain metastases were retrieved. The least absolute shrinkage and selection operator (LASSO) regression model demonstrated the best overall performance on the hold-out test set with an area under the receiver operating characteristic curve of 0.92 (95% CI, 0.89 to 0.94), accuracy of 83% (95% CI, 80% to 87%), calibration intercept of -0.06 (95% CI, -0.14 to 0.01), and calibration slope of 1.06 (95% CI, 0.95 to 1.17). Among various NLP techniques, the bag-of-words approach combined with a LASSO regression model demonstrated the best overall performance in extracting binary outcomes from free-text clinical reports. This study provides a framework for the development of machine learning-based NLP models as well as a clinical vignette of patients diagnosed with brain metastases.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call