Exploring Methods to Understand Cancer Disparities Using Natural Language Processing of Clinical Notes

A. Derton, A. Murray, D. Liu, R.H. Mak, T.A. Miller, G.K. Savova, D.S. Bitterman

doi:10.1016/j.ijrobp.2022.07.369

Abstract

<h3>Purpose/Objective(s)</h3> There is an unmet need to understand drivers of cancer disparities, such as social determinants of health (SDOH) and implicit bias. However, these data are largely locked as free text in the clinical narrative and are not directly analyzable, and there are no validated methods to extract them automatically. We explore natural language processing (NLP) methods to extract and identify factors that may underlie disparities.

<h3>Materials/Methods</h3> Our cohort consisted of 861 patients treated with radiotherapy for a thoracic malignancy from 2014 to 2020. Clinic notes created during patients' radiotherapy course were collected and preprocessed: punctuation, digits, stop words, and person names were removed (spaCy), and lists of lemmatized words were created to compare word distributions across patient race. Disease/chemical entities ("medical words") were identified using the scispaCy en_ner_bc5cdr_md model. Log odds ratios with a Dirichlet prior, using the entire corpus as background, were used to compare word distributions between White patients and each other race, separately for medical and non-medical words. For both the medical and non-medical vocabularies, the words most representative of text from each race (up to 100 words) were categorized by a physician with expertise in NLP.

<h3>Results</h3> Of the patients, 91.1% were White, 2.7% Black, 0.7% Hispanic, 2.2% Asian, 0.9% other, and 2.4% unknown. A total of 19,675 notes were collected (White: 17,742; Black: 651; Hispanic: 182; Asian: 571; other: 140; unknown: 389). Among medical words, words related to substance use were overrepresented in notes for non-White patients. The table shows overrepresented words in semantic categories of interest among non-medical words.

<h3>Conclusion</h3> NLP extracted factors traditionally associated with poorer healthcare access, as well as emotion/affect words that may reflect provider bias. To the best of our knowledge, this is the first effort exploring NLP methods to identify factors associated with disparities. Understanding differences in clinical text across race will be crucial to developing higher-level NLP models that address, instead of amplify, disparities.
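
As an illustration of the word-distribution comparison described in the methods, the following is a minimal sketch of a z-scored log odds ratio with a Dirichlet prior. It assumes the commonly used informative-prior formulation, in which word counts from a background corpus act as prior pseudo-counts; the abstract does not state which variant was used, so this should not be read as the authors' implementation. The function name `log_odds_dirichlet`, the `prior_scale` parameter, and the toy counts are all hypothetical.

```python
import math
from collections import Counter


def log_odds_dirichlet(counts_a, counts_b, background, prior_scale=1.0):
    """Return a z-scored log odds ratio (group A vs. group B) per word.

    counts_a, counts_b: Counter of word -> count in each group's notes.
    background: Counter of word -> count in the background corpus.
    prior_scale: hypothetical knob that rescales background counts
                 into Dirichlet pseudo-counts (an assumption of this sketch).
    """
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    alpha = {w: prior_scale * c for w, c in background.items()}
    alpha_0 = sum(alpha.values())

    scores = {}
    for w in set(counts_a) | set(counts_b):
        a_w = alpha.get(w, 0.0)
        if a_w <= 0:
            continue  # no prior mass for this word; skipped in this sketch
        y_a, y_b = counts_a[w], counts_b[w]
        # Log odds of the word in each group after adding prior pseudo-counts.
        log_odds_a = math.log((y_a + a_w) / (n_a + alpha_0 - y_a - a_w))
        log_odds_b = math.log((y_b + a_w) / (n_b + alpha_0 - y_b - a_w))
        delta = log_odds_a - log_odds_b
        # Approximate variance of the difference, used to standardize it.
        variance = 1.0 / (y_a + a_w) + 1.0 / (y_b + a_w)
        scores[w] = delta / math.sqrt(variance)
    return scores


# Toy counts, invented purely for illustration (not data from the study).
group_a = Counter({"cough": 10, "fatigue": 8, "interpreter": 1})
group_b = Counter({"cough": 9, "fatigue": 8, "interpreter": 6})
background = group_a + group_b  # whole corpus used as background

scores = log_odds_dirichlet(group_a, group_b, background)
for word, z in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{word}: {z:.2f}")
```

In this sketch, a positive score means the word is relatively more frequent in group A and a negative score means it is relatively more frequent in group B. Standardizing by the variance dampens scores for rare words, which is one reason this kind of statistic is often preferred over raw frequency ratios.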
