Extracting Family History of Patients From Clinical Narratives: Exploring an End-to-End Solution With Deep Learning Models.

Xi Yang, Yonghui Wu, Hansi Zhang, Jiang Bian, Xing He

doi:10.2196/22982

Abstract

Background: Patients’ family history (FH) is a critical risk factor associated with numerous diseases. However, FH information is not well captured in structured databases and is instead often documented in clinical narratives. Natural language processing (NLP) is the key technology for extracting patients’ FH from clinical narratives. In 2019, the National NLP Clinical Challenge (n2c2) organized shared tasks to solicit NLP methods for FH information extraction.

Objective: This study presents our end-to-end FH extraction system developed during the 2019 n2c2 open shared task as well as the new transformer-based models that we developed after the challenge. We seek to develop a machine learning–based solution for FH information extraction without task-specific rules created by hand.

Methods: We developed deep learning–based systems for FH concept extraction and relation identification. We explored deep learning models, including long short-term memory-conditional random fields and bidirectional encoder representations from transformers (BERT), and developed ensemble models using a majority voting strategy. To further optimize performance, we systematically compared 3 different strategies for using BERT output representations for relation identification.

Results: Our system was among the top-ranked systems (3 out of 21) in the challenge. Our best system achieved micro-averaged F1 scores of 0.7944 and 0.6544 for concept extraction and relation identification, respectively. After the challenge, we further explored new transformer-based models and improved the performance on the two subtasks to 0.8249 and 0.6775, respectively. For relation identification, our system achieved a performance comparable to that of the best system (0.6810) reported in the challenge.

Conclusions: This study demonstrated the feasibility of utilizing deep learning methods to extract FH information from clinical narratives.
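The majority voting strategy mentioned in the Methods can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each model emits one BIO-style label per token, that a label is accepted only when more than half of the models agree, and that the token otherwise falls back to "O". The function name `majority_vote` and the label names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: str = "O") -> List[str]:
    """Combine per-token labels from several models by strict majority vote.

    predictions[i][j] is the label that model i assigned to token j.
    A label is kept only if more than half of the models chose it;
    otherwise the token receives `fallback` (an assumption of this sketch).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    threshold = len(predictions) / 2
    voted = []
    for labels in zip(*predictions):  # labels = every model's vote for one token
        label, count = Counter(labels).most_common(1)[0]
        voted.append(label if count > threshold else fallback)
    return voted


# Hypothetical outputs of three NER models for the same four tokens.
model_a = ["B-FamilyMember", "O", "B-Observation", "I-Observation"]
model_b = ["B-FamilyMember", "O", "B-Observation", "O"]
model_c = ["O", "O", "B-Observation", "I-Observation"]

print(majority_vote([model_a, model_b, model_c]))
```

Voting token by token can produce label sequences that are not valid BIO spans (for example, an "I-" tag with no preceding "B-" tag), so a real system would likely need a repair step or would vote on whole entity spans instead.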

Highlights

Patients’ family history (FH) is a critical risk factor associated with numerous diseases [1,2,3] such as diabetes [4], coronary heart disease [5], and multiple types of cancers [6,7,8,9]
In the named entity recognition (NER) module, we explored state-of-the-art natural language processing (NLP) models, including long short-term memory-conditional random fields (LSTM-CRFs) and bidirectional encoder representations from transformers (BERT), to identify FH concepts
We further explored the BERT model for NER, and the combination of BERT-ner-EN, BERT-cls, and BERT-rel achieved better F1 scores of 0.8249 and 0.6775 for the 2 subtasks, respectively

Summary

Introduction

Patients’ family history (FH) is a critical risk factor associated with numerous diseases [1,2,3] such as diabetes [4], coronary heart disease [5], and multiple types of cancers [6,7,8,9]. FH information is not well captured in structured databases and is instead often documented in clinical narratives. Extracting patients’ FH information manually is a labor-intensive and time-consuming procedure that cannot be scaled up. Natural language processing (NLP) is the key technology for building automated computational models that extract patients’ FH from clinical narratives. In 2019, the National NLP Clinical Challenge (n2c2) organized shared tasks to solicit NLP methods for FH information extraction.

Methods

Results

Discussion

Conclusion