Comparison of Neural Language Modeling Pipelines for Outcome Prediction From Unstructured Medical Text Notes

Cherubin Mugisha, Incheon Paik

doi:10.1109/access.2022.3148279

Abstract

Machine learning techniques and algorithm-based approaches are becoming increasingly vital in supporting clinical decision-making. In the medical domain, natural language processing (NLP) techniques have shown the ability to extract useful information from electronic health records. Two factors are key to a better representation of a document: the choice among statistical, semantic, and contextualized word embedding-based models, and the preprocessing approach. Using narratives from the Intensive Care Unit, we compared the most widely used methods and preprocessing approaches on an outcome prediction problem, with the aim of guiding researchers toward suitable NLP pipelines in the medical domain. We used real data from the Medical Information Mart for Intensive Care-III (MIMIC-III) and selected all notes related to patients with pneumonia. We conducted an in-depth analysis of text preprocessing tasks, producing three datasets: raw data with minor preprocessing, meticulous preprocessing, and extreme preprocessing that keeps only medical terminology identified by Named Entity Recognition algorithms. We then used these three datasets in five models, of which two are based on traditional noncontextual word embedding techniques and three use transformer-based contextualized word embeddings. We demonstrated that transformer-based models outperform the other word embedding models and that thorough preprocessing yielded an F1-score of 98.2. These results show that NLP predictive models are highly competitive with other models that use medical data. With an appropriate NLP pipeline, the information contained in medical narratives can be used to draw up a patient profile, and admission notes can help to ascertain the mortality risk of a patient admitted to the Intensive Care Unit.
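
To make the three preprocessing levels described above more concrete, the following is a minimal sketch in Python and not the authors' implementation. The function names, the regular expressions, and the tiny `MEDICAL_TERMS` vocabulary are all assumptions made for illustration; in particular, the vocabulary lookup is only a stand-in for the Named Entity Recognition step, and the bracketed placeholder pattern assumes MIMIC-style de-identification markers.

```python
import re

# Hypothetical stand-in for a Named Entity Recognition model:
# a tiny hand-made vocabulary of medical terms (illustration only).
MEDICAL_TERMS = {"pneumonia", "fever", "cough", "sepsis", "intubated"}


def minor_preprocess(note: str) -> str:
    """Level 1 (assumed): keep the raw text, only collapse whitespace."""
    return re.sub(r"\s+", " ", note).strip()


def meticulous_preprocess(note: str) -> str:
    """Level 2 (assumed): lowercase, drop placeholders, digits, punctuation."""
    text = note.lower()
    # Assumed de-identification placeholder format, e.g. [**Name**].
    text = re.sub(r"\[\*\*.*?\*\*\]", " ", text)
    # Replace everything that is not a lowercase letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def extreme_preprocess(note: str) -> str:
    """Level 3 (assumed): keep only tokens found in the medical vocabulary."""
    tokens = meticulous_preprocess(note).split()
    return " ".join(t for t in tokens if t in MEDICAL_TERMS)


def build_datasets(notes: list[str]) -> dict[str, list[str]]:
    """Produce the three dataset variants from the same list of notes."""
    return {
        "minor": [minor_preprocess(n) for n in notes],
        "meticulous": [meticulous_preprocess(n) for n in notes],
        "extreme": [extreme_preprocess(n) for n in notes],
    }
```

Each variant returned by `build_datasets` could then be fed to the same set of models, so that differences in the scores reflect the preprocessing level and not the data selection.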

Highlights

Pneumonia is an infectious disease of the lungs affecting alveoli and caused by bacteria, fungi, or viruses
The research question is which combination of natural language processing (NLP) models and preprocessing methods is appropriate for unlocking the information in medical narratives. The present study uses notes on pneumonia patients, written by a multidisciplinary team of care providers, to assess and compare the performance of dynamic word embeddings and static models on outcome prediction for an Intensive Care Unit (ICU) hospitalization
We present an in-depth comparison of neural language modeling pipelines for outcome prediction from medical text notes on pneumonia patients

Summary

Introduction

Pneumonia is an infectious disease of the lungs affecting alveoli and caused by bacteria, fungi, or viruses. Pneumonia can range in seriousness from mild to life-threatening. It remains the most common infective reason for admission to intensive care, as well as the most common secondary infection acquired while in the Intensive Care Unit (ICU) [1], [2]. Electronic health record (EHR) systems contain structured data such as demographics, vital signs, laboratory test results, medications, and procedures. They also hold unstructured medical or nonmedical data in a free format, such as imaging reports or care-provider notes [3]. It is common and practical to use all types of data to understand the status of a patient or to predict their outcome.

Objectives

Methods

Results

Discussion

Conclusion