Relevant Word Order Vectorization for Improved Natural Language Processing in Electronic Health Records

Jeffrey Thompson, Jinxiang Hu, Devin C Koestler, Lisa Neums, Matthew S Mayo, Byron Gajewski, Dinesh Pal Mudaranthakam, Michele Park, David Streeter, Roy Jensen

doi:10.1038/s41598-019-45705-y

Open Access

Abstract

Electronic health records (EHR) represent a rich resource for conducting observational studies, supporting clinical trials, and more. However, much of the data contains unstructured text, presenting an obstacle to automated extraction. Natural language processing (NLP) can structure and learn from text, but NLP algorithms were not designed for the unique characteristics of EHR. Here, we propose Relevant Word Order Vectorization (RWOV) to aid with structuring. RWOV is based on finding the positional relationship between the words most relevant to predicting the class of a text. This allows machine learning algorithms to use not just the presence of keywords but also the positional dependencies among them (e.g. a relevant word occurs 5 relevant words before some term of interest). As a proof-of-concept, we attempted to classify the hormone receptor status of breast cancer patients treated at the University of Kansas Medical Center, comparing RWOV to other methods using the F1 score and AUC. RWOV performed as well as or better than the other methods in all but one case. For F1 score, RWOV had a clear edge on most tasks. AUC tended to be closer, but for HER2, RWOV was significantly better for most comparisons. These results suggest RWOV should be further developed for EHR-related NLP.

Highlights

The biggest challenge in the use of electronic health records (EHR) comes from extensive reliance on unstructured data
We focus on the data structuring part of the natural language processing (NLP) problem and pair our vectorization-based approach with several machine learning algorithms to compare its performance to existing methods
Our goal is to identify the status of three important breast cancer biomarkers from the pathology report free text: estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2)

Summary

Introduction

The biggest challenge in the use of EHR comes from extensive reliance on unstructured data. Given the importance of clinical trials to drug development for a range of conditions, there is a critical need for methods that can automate and improve the efficiency of this process. This has led many researchers to propose applying natural language processing (NLP) techniques to these data. NLP is not a specific method but rather a collection of approaches that involve extracting information from text. As part of our own work to support research at the University of Kansas Cancer Center using the EHR [13], we are investigating methods for using NLP to extract information from free text fields. The approach proposed here, RWOV, differs from other methods by considering only the position of specific terms, which are assumed to be predictive, relative to a target term, with extreme flexibility in the true distance between those terms.
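
As a rough illustration of the relative-position idea, the following Python sketch encodes each relevant word by its signed offset from a target term, counting only relevant words so that intervening filler text does not change the result. It is not the authors' implementation. The function name rwov_features, the simple tokenizer, the use of 0 for absent words, the use of the first occurrence of the target, and the closest-occurrence rule are all assumptions made for this example.

```python
import re


def rwov_features(text, target, relevant_words):
    """Hypothetical encoding: signed offset of each relevant word from the
    target term, measured in relevant words only (0 means 'not found')."""
    # Assumed tokenizer: lowercase alphanumeric runs.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    vocab = set(relevant_words)

    # Drop every token that is neither relevant nor the target, so the
    # true distance between terms in the raw text no longer matters.
    kept = [t for t in tokens if t in vocab or t == target]

    features = {w: 0 for w in relevant_words}
    if target not in kept:
        return features

    # Assumption: use the first occurrence of the target term.
    target_idx = kept.index(target)
    for i, tok in enumerate(kept):
        if tok == target:
            continue
        offset = i - target_idx  # negative = before target, positive = after
        # Assumption: keep the occurrence closest to the target.
        if features[tok] == 0 or abs(offset) < abs(features[tok]):
            features[tok] = offset
    return features


relevant = ["positive", "negative"]
print(rwov_features("ER strongly positive. HER2 is negative by IHC.",
                    "her2", relevant))
# {'positive': -1, 'negative': 1}
print(rwov_features("HER2, assessed by immunohistochemistry on the core "
                    "biopsy, is negative.", "her2", relevant))
# {'positive': 0, 'negative': 1}
```

In the second call, "negative" still receives an offset of 1 even though many words separate it from "HER2", which is the kind of flexibility in true distance described above. The resulting dictionaries could be turned into numeric vectors and passed to any standard classifier. How relevant words are actually selected is not covered by this sketch.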

Objectives

Methods

Results

Conclusion