Abstract

Transformer-based neural language models have led to breakthroughs for a variety of natural language processing (NLP) tasks. However, most models are pretrained on general domain data. We propose a methodology to produce a model focused on the clinical domain: continued pretraining of a model with a broad representation of biomedical terminology (PubMedBERT) on a clinical corpus along with a novel entity-centric masking strategy to infuse domain knowledge in the learning process. We show that such a model achieves superior results on clinical extraction tasks by comparing our entity-centric masking strategy with classic random masking on three clinical NLP tasks: cross-domain negation detection, document time relation (DocTimeRel) classification, and temporal relation extraction. We also evaluate our models on the PubMedQA dataset to measure the models’ performance on a non-entity-centric task in the biomedical domain. The language addressed in this work is English.
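The paper's exact masking procedure is described in its methods; the sketch below is only a minimal illustration, assuming entity spans are supplied by an upstream clinical annotator, of how an entity-centric strategy differs from classic random masking: whole entity spans are masked first, and any remaining masking budget falls back to random tokens. The function name, the [MASK] id, and the 15% rate are illustrative assumptions, not the authors' settings.

```python
import random
from typing import List, Tuple

MASK_TOKEN_ID = 103  # [MASK] id in BERT-style vocabularies (assumption)

def entity_centric_mask(
    input_ids: List[int],
    entity_spans: List[Tuple[int, int]],
    mask_prob: float = 0.15,
) -> Tuple[List[int], List[int]]:
    """Mask whole entity spans first, then fall back to random tokens.

    `entity_spans` are (start, end) token-index pairs produced by an
    upstream clinical entity annotator (hypothetical input here).
    Returns (masked_ids, labels) with labels = -100 at unmasked positions,
    following the usual masked-language-modeling convention.
    The 80/10/10 mask/random/keep refinement of BERT is omitted for brevity.
    """
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    budget = int(mask_prob * len(input_ids))

    # 1) Prefer clinical entities: mask every token of a chosen span.
    for start, end in random.sample(entity_spans, len(entity_spans)):
        if budget <= 0:
            break
        for i in range(start, end):
            labels[i] = masked[i]
            masked[i] = MASK_TOKEN_ID
        budget -= (end - start)

    # 2) Spend any remaining budget on ordinary random token masking.
    candidates = [i for i, lab in enumerate(labels) if lab == -100]
    for i in random.sample(candidates, min(max(budget, 0), len(candidates))):
        labels[i] = masked[i]
        masked[i] = MASK_TOKEN_ID

    return masked, labels
```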

Highlights

  • Transformer-based neural language models, such as BERT (Devlin et al., 2018), have achieved state-of-the-art performance on a variety of natural language processing tasks.

  • Since most models are pre-trained on large general-domain corpora, many efforts have been made to continue pre-training general-domain language models on clinical/biomedical corpora to derive domain-specific models. However, the language of biomedical literature is different from the language of the clinical documents found in electronic medical records (EMRs).

  • We observed that PubMedBERT kept 30% more in-domain words in its vocabulary than BERT, so PubMedBERT appears to provide a vocabulary that is helpful to the clinical domain (a simple coverage-check sketch follows below).
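The 30% figure above refers to how many in-domain words survive as whole entries in the model vocabulary. As a rough illustration of how such a comparison can be made, the sketch below checks whether a handful of clinical terms exist as single whole-word entries in each tokenizer's vocabulary; the term list is a tiny placeholder rather than the word list used in the paper, and the checkpoint names are the public Hugging Face releases of PubMedBERT and BERT.

```python
from transformers import AutoTokenizer

# Checkpoint names are real Hugging Face releases; the term list below is a
# tiny placeholder, not the corpus-derived word list measured in the paper.
PUBMEDBERT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
BERT = "bert-base-uncased"
in_domain_terms = ["hyponatremia", "metastasis", "tachycardia", "stenosis"]

def vocab_coverage(model_name: str, terms: list) -> float:
    """Fraction of terms kept as single whole-word entries in the vocabulary."""
    vocab = AutoTokenizer.from_pretrained(model_name).get_vocab()
    return sum(t in vocab for t in terms) / len(terms)

print("PubMedBERT coverage:", vocab_coverage(PUBMEDBERT, in_domain_terms))
print("BERT coverage:      ", vocab_coverage(BERT, in_domain_terms))
```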


Summary

Methods

Clinical negation detection identifies whether clinical entities are negated (2001; Harkema et al., 2009; Mehrabi et al., 2015), clinical relation discovery extracts relations among clinical entities (Lv et al., 2016; Leeuwenberg and Moens, 2017), and so on. Besides transformer-based models, there are other efforts (Beam et al., 2019; Chen et al., 2020) to characterize biomedical/clinical entities at the word-embedding level; we do not include these in our discussion because the focus of this paper is the investigation of a novel entity-centric masking strategy in a transformer-based setting. We propose a methodology to produce a model focused on clinical entities: continued pretraining of a model with a broad representation of biomedical terminology (the PubMedBERT model) on a clinical corpus, along with a novel entity-centric masking strategy to infuse domain knowledge into the learning process. We first describe our clinical text datasets and related NLP tasks, the details of our entity-centric masking strategy, and the settings we used for both pretraining and fine-tuning.
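The pretraining and fine-tuning settings are detailed in the paper itself; the following is only a minimal sketch of what continued masked-language-model pretraining of PubMedBERT on a clinical corpus could look like with the Hugging Face Trainer. The corpus file name, batch size, epoch count, and masking probability are placeholders, and the stock random-masking collator shown here is what the paper's entity-centric strategy replaces (see the masking sketch after the abstract).

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

# Placeholder paths and hyperparameters -- not the settings reported in the paper.
checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# A local, de-identified clinical corpus (hypothetical file name).
corpus = load_dataset("text", data_files={"train": "clinical_notes.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# Stock collator applies classic random masking; the paper swaps this for an
# entity-centric masking strategy.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pubmedbert-clinical",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```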

Transformer models
Unlabeled Pre-training Data
Labeled Fine-tuning Data
Findings
Settings

