Abstract

Transformer-based neural language models have led to breakthroughs for a variety of natural language processing (NLP) tasks. However, most models are pretrained on general domain data. We propose a methodology to produce a model focused on the clinical domain: continued pretraining of a model with a broad representation of biomedical terminology (PubMedBERT) on a clinical corpus along with a novel entity-centric masking strategy to infuse domain knowledge in the learning process. We show that such a model achieves superior results on clinical extraction tasks by comparing our entity-centric masking strategy with classic random masking on three clinical NLP tasks: cross-domain negation detection, document time relation (DocTimeRel) classification, and temporal relation extraction. We also evaluate our models on the PubMedQA dataset to measure the models’ performance on a non-entity-centric task in the biomedical domain. The language addressed in this work is English.
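The paper's exact masking procedure is described in its methods; the sketch below is only a minimal illustration, assuming entity spans are supplied by an upstream clinical annotator, of how an entity-centric strategy differs from classic random masking: whole entity spans are masked first, and any remaining masking budget falls back to random tokens. The function name, the [MASK] id, and the 15% rate are illustrative assumptions, not the authors' settings.

```python
import random
from typing import List, Tuple

MASK_TOKEN_ID = 103  # [MASK] id in BERT-style vocabularies (assumption)

def entity_centric_mask(
    input_ids: List[int],
    entity_spans: List[Tuple[int, int]],
    mask_prob: float = 0.15,
) -> Tuple[List[int], List[int]]:
    """Mask whole entity spans first, then fall back to random tokens.

    `entity_spans` are (start, end) token-index pairs produced by an
    upstream clinical entity annotator (hypothetical input here).
    Returns (masked_ids, labels) with labels = -100 at unmasked positions,
    following the usual masked-language-modeling convention.
    The 80/10/10 mask/random/keep refinement of BERT is omitted for brevity.
    """
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    budget = int(mask_prob * len(input_ids))

    # 1) Prefer clinical entities: mask every token of a chosen span.
    for start, end in random.sample(entity_spans, len(entity_spans)):
        if budget <= 0:
            break
        for i in range(start, end):
            labels[i] = masked[i]
            masked[i] = MASK_TOKEN_ID
        budget -= (end - start)

    # 2) Spend any remaining budget on ordinary random token masking.
    candidates = [i for i, lab in enumerate(labels) if lab == -100]
    for i in random.sample(candidates, min(max(budget, 0), len(candidates))):
        labels[i] = masked[i]
        masked[i] = MASK_TOKEN_ID

    return masked, labels
```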

Highlights

  • Transformer-based neural language models, such as BERT (Devlin et al., 2018), have achieved state-of-the-art performance on a variety of natural language processing tasks.

  • Since most models are pre-trained on large general-domain corpora, many efforts have been made to continue pre-training general-domain language models on clinical/biomedical corpora to derive domain-specific models. However, the language of biomedical literature is different from the language of the clinical documents found in electronic medical records (EMRs).

  • We observed that PubMedBERT kept 30% more in-domain words in its vocabulary than BERT, so PubMedBERT appears to provide a vocabulary that is helpful to the clinical domain (a simple coverage-check sketch follows below).
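The 30% figure above refers to how many in-domain words survive as whole entries in the model vocabulary. As a rough illustration of how such a comparison can be made, the sketch below checks whether a handful of clinical terms exist as single whole-word entries in each tokenizer's vocabulary; the term list is a tiny placeholder rather than the word list used in the paper, and the checkpoint names are the public Hugging Face releases of PubMedBERT and BERT.

```python
from transformers import AutoTokenizer

# Checkpoint names are real Hugging Face releases; the term list below is a
# tiny placeholder, not the corpus-derived word list measured in the paper.
PUBMEDBERT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
BERT = "bert-base-uncased"
in_domain_terms = ["hyponatremia", "metastasis", "tachycardia", "stenosis"]

def vocab_coverage(model_name: str, terms: list) -> float:
    """Fraction of terms kept as single whole-word entries in the vocabulary."""
    vocab = AutoTokenizer.from_pretrained(model_name).get_vocab()
    return sum(t in vocab for t in terms) / len(terms)

print("PubMedBERT coverage:", vocab_coverage(PUBMEDBERT, in_domain_terms))
print("BERT coverage:      ", vocab_coverage(BERT, in_domain_terms))
```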


Summary

Methods

Clinical negation detection identifies whether clinical entities are negated (2001; Harkema et al., 2009; Mehrabi et al., 2015), clinical relation discovery extracts relations among clinical entities (Lv et al., 2016; Leeuwenberg and Moens, 2017), and so on. Besides transformer-based models, there are other efforts (Beam et al., 2019; Chen et al., 2020) to characterize biomedical/clinical entities at the word-embedding level; we do not include these in our discussion because the focus of this paper is the investigation of a novel entity-centric masking strategy in a transformer-based setting. We propose a methodology to produce a model focused on clinical entities: continued pretraining of a model with a broad representation of biomedical terminology (the PubMedBERT model) on a clinical corpus, along with a novel entity-centric masking strategy to infuse domain knowledge into the learning process. We first describe our clinical text datasets and related NLP tasks, the details of our entity-centric masking strategy, and the settings we used for both pretraining and fine-tuning.
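The pretraining and fine-tuning settings are detailed in the paper itself; the following is only a minimal sketch of what continued masked-language-model pretraining of PubMedBERT on a clinical corpus could look like with the Hugging Face Trainer. The corpus file name, batch size, epoch count, and masking probability are placeholders, and the stock random-masking collator shown here is what the paper's entity-centric strategy replaces (see the masking sketch after the abstract).

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

# Placeholder paths and hyperparameters -- not the settings reported in the paper.
checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# A local, de-identified clinical corpus (hypothetical file name).
corpus = load_dataset("text", data_files={"train": "clinical_notes.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# Stock collator applies classic random masking; the paper swaps this for an
# entity-centric masking strategy.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pubmedbert-clinical",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```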

Transformer models
Unlabeled Pre-training Data
Labeled Fine-tuning Data
Findings
Settings

