Predictive Model for Risk of 30-Day Rehospitalization Using a Natural Language Processing/Machine Learning Approach Among Medicare Patients with Heart Failure

Youjeong Kang, John Hurdle

doi:10.1016/j.cardfail.2020.09.023

Abstract

Introduction: Nearly 80% of all patients with heart failure (HF) are older adults (≥65 years of age). Prior studies have built predictive models that relied on structured data from electronic health records (EHRs) to predict the risk of 30-day rehospitalization for patients with HF. The structured data mostly comprised simple vocabularies, such as age and ethnicity. Prior studies have rarely included clinical narrative data in free-text format (i.e., unstructured data). No previous study has focused on using clinical narrative notes specifically for Medicare patients with HF in the acute-care setting.

Aim: To identify clinical notes for building a predictive model for risk of 30-day rehospitalization among Medicare patients with HF.

Methods: This study used free-text discharge summary notes and nursing care plans collected from June 1, 2015 to December 31, 2019, for 500 randomly selected Medicare patients with HF. Natural Language Processing (NLP): we iterated over standard text pre-processing steps, exploring the impact of n-gram length, term document-frequency, word stemming, and the added value of parts-of-speech. We chose two models: 1) the classification model called Bag-of-Words (BOW), where each document is represented by a vector based on the pre-processed text, and 2) Document Embedding, where document terms are mapped to a dimension-reducing layer (length 300). The latter runs exceptionally fast (on the order of tens of seconds for 2,000 documents). Machine Learning (ML): the outputs of the NLP BOW and Document Embedding models were fed to six different conventional machine learning systems (logistic regression, support vector machine, random forest, k-nearest neighbors, a three-layer neural network, and Naïve Bayes).

Results: The mean age was 77 ± 7.9 years, and the mean length of hospital stay was 4.9 ± 4.8 days. The best BOW model we found using discharge summaries (n = 387) produced an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.71 and an F1 score of 0.65. The best Document Embedding model yielded an AUC of 0.65 and an F1 score of 0.61. Using nursing care plans as the unit of analysis (n = 2,046), the NLP/ML approach performed far better. The best BOW model on care plans achieved an AUC of 0.85 and an F1 score of 0.77. The best Document Embedding model produced an AUC of 0.83 and an F1 score of 0.75. In all cases we held out 33% of the data set for validation, repeating a random draw 10 times and averaging the performance results.

Conclusions: We conclude that nursing care plans are a better predictor of 30-day rehospitalization risk than discharge summaries. Because nursing care plans are shorter than discharge summaries, they have the added advantage of faster processing. Since the faster Document Embedding model's performance is comparable to that of BOW, we suggest its use in future work in the area of 30-day rehospitalization risk in Medicare patients with HF.
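
To make the BOW pipeline and the evaluation protocol concrete, the following is a minimal sketch and not the authors' implementation. It builds n-gram bag-of-words counts, fits a small multinomial Naïve Bayes classifier (one of the six model families named in the Methods), and averages AUC and F1 over 10 random draws with 33% of the data held out each time. Everything specific here is an assumption made for illustration: the function names (ngrams, train_nb, score_nb, auc, f1, repeated_holdout), the whitespace tokenizer, the add-one smoothing, the stratified form of the random draw, and the synthetic toy notes, which are not study data.

```python
import math
import random
from collections import Counter

def ngrams(text, n_max=2):
    """Lowercase, split on whitespace, and return all 1..n_max-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def train_nb(docs, labels, min_df=1):
    """Fit multinomial Naive Bayes on n-gram counts (add-one smoothing)."""
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(ngrams(d)))
    vocab = {t for t, c in df.items() if c >= min_df}
    counts = {0: Counter(), 1: Counter()}
    for d, y in zip(docs, labels):
        counts[y].update(t for t in ngrams(d) if t in vocab)
    prior = {y: (labels.count(y) + 1) / (len(labels) + 2) for y in (0, 1)}
    total = {y: sum(counts[y].values()) for y in (0, 1)}
    return vocab, counts, prior, total

def score_nb(model, doc):
    """Log-odds of class 1 (rehospitalized) versus class 0."""
    vocab, counts, prior, total = model
    s = math.log(prior[1]) - math.log(prior[0])
    for t in ngrams(doc):
        if t in vocab:
            p1 = (counts[1][t] + 1) / (total[1] + len(vocab))
            p0 = (counts[0][t] + 1) / (total[0] + len(vocab))
            s += math.log(p1) - math.log(p0)
    return s

def auc(scores, labels):
    """Rank-based AUC: P(positive outscores negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1(preds, labels):
    """F1 score for the positive class."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def repeated_holdout(docs, labels, test_frac=0.33, repeats=10, seed=0):
    """Average AUC and F1 over repeated random hold-out splits."""
    rng = random.Random(seed)
    aucs, f1s = [], []
    for _ in range(repeats):
        train, test = [], []
        # Assumption: stratify so both classes appear in every test split.
        for y in (0, 1):
            idx = [i for i, lab in enumerate(labels) if lab == y]
            rng.shuffle(idx)
            cut = round(test_frac * len(idx))
            test += idx[:cut]
            train += idx[cut:]
        model = train_nb([docs[i] for i in train], [labels[i] for i in train])
        scores = [score_nb(model, docs[i]) for i in test]
        y_test = [labels[i] for i in test]
        aucs.append(auc(scores, y_test))
        f1s.append(f1([int(s > 0) for s in scores], y_test))
    return sum(aucs) / repeats, sum(f1s) / repeats

# Synthetic, deliberately trivial notes (hypothetical, not study data).
docs = ["worsening dyspnea and edema noted"] * 6 + \
       ["stable vitals and ambulating independently"] * 6
labels = [1] * 6 + [0] * 6
mean_auc, mean_f1 = repeated_holdout(docs, labels)
print(f"mean AUC={mean_auc:.2f}, mean F1={mean_f1:.2f}")
```

In practice, a library vectorizer and classifier would replace the hand-written pieces; the sketch only shows how n-gram length, a document-frequency cutoff (min_df), and the repeated 33% hold-out fit together.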

Full Text