Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings

Zenan Zhai, Saber Akhondi, Trevor Cohn, Camilo Thorne, Karin Verspoor, Christian Druckenbrodt, Michelle Gregory, Dat Quoc Nguyen

doi:10.18653/v1/w19-5035

Abstract

Chemical patents are an important resource for chemical information. However, few chemical Named Entity Recognition (NER) systems have been evaluated on patent documents, due in part to their structural and linguistic complexity. In this paper, we explore the NER performance of a BiLSTM-CRF model utilising pre-trained word embeddings, character-level word representations and contextualized ELMo word representations for chemical patents. We compare word embeddings pre-trained on biomedical and chemical patent corpora. The effect of tokenizers optimized for the chemical domain on NER performance in chemical patents is also explored. The results on two patent corpora show that contextualized word representations generated from ELMo substantially improve chemical NER performance relative to the current state-of-the-art. We also show that domain-specific resources, such as word embeddings trained on chemical patents and chemical-specific tokenizers, have a positive impact on NER performance.

Highlights

Chemical patents are an important starting point for understanding chemical compound purpose, properties, and novelty
New chemical compounds are often initially disclosed in patent documents; it may take 1-3 years for these chemicals to be mentioned in chemical literature (Senger et al., 2015), suggesting that patents are a valuable but underutilized resource
The results show that contextualized word representations help improve chemical Named-Entity Recognition (NER) performance substantially

Summary

Introduction

Chemical patents are an important starting point for understanding chemical compound purpose, properties, and novelty. Authors of scientific papers strive to make their words as clear and straightforward as possible, whereas patent authors often seek to protect their knowledge from being fully disclosed (Valentinuzzi, 2017). In tension with this is the need to claim broad scope for intellectual property reasons, and patents typically contain more details and are more exhaustive than scientific papers (Lupu et al., 2011). The results show that word embeddings pre-trained on chemical patents outperform embeddings pre-trained on biomedical datasets, and that using tokenizers optimized for the chemical domain can improve NER performance in chemical patent corpora.
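
To make the point about tokenization concrete, the following is a minimal sketch of why a tokenizer tuned for chemistry can matter for NER. It is not one of the tokenizers evaluated in the paper: the example sentence, the function names general_tokenize and chem_aware_tokenize, and the splitting rules are all assumptions made for illustration only.

```python
import re

# Hypothetical example sentence; not taken from the paper's corpora.
TEXT = "The residue was dissolved in 2-(4-chlorophenyl)-1,3-thiazole (5 mg)."


def general_tokenize(text):
    """Generic rule: every punctuation character becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def chem_aware_tokenize(text):
    """Toy chemistry-aware rule (an assumption, not the paper's method):
    split on whitespace, then peel off only sentence punctuation and
    unbalanced brackets at the edges of each chunk."""
    tokens = []
    for chunk in text.split():
        lead, trail = [], []
        # Peel trailing sentence punctuation and unbalanced ")" characters.
        while chunk and (
            chunk[-1] in ".,;:"
            or (chunk[-1] == ")" and chunk.count("(") < chunk.count(")"))
        ):
            trail.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # Peel leading "(" characters that have no matching ")" in the chunk.
        while chunk and chunk[0] == "(" and chunk.count("(") > chunk.count(")"):
            lead.append("(")
            chunk = chunk[1:]
        tokens.extend(lead)
        if chunk:
            tokens.append(chunk)
        tokens.extend(trail)
    return tokens


if __name__ == "__main__":
    print(general_tokenize(TEXT))
    print(chem_aware_tokenize(TEXT))
```

Under the generic rule the compound name is broken into many fragments, so a sequence tagger has to label every fragment correctly to recover the entity. The toy chemistry-aware rule keeps the name as a single token while still separating the parentheses around the quantity.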

Related work

Our empirical methodology

Dataset

Tokenizers

Models

Pre-trained word embeddings

Character-level representation

Implementation details

Main Results

Error Analysis

Conclusions

Publication Date: Jan 1, 2019
Citations: 49
License type: cc-by

Similar Papers

Comparing general and specialized word embeddings for biomedical named entity recognition.
Rigo E Ramos-Vargas ... Sulema Torres-Ramos
PeerJ Computer Science | VOL. 7
18 Feb 2021

Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations.
Tsendsuren Munkhdalai ... Keun Ho Ryu
Journal of Cheminformatics | VOL. 7
19 Jan 2015

Shahmukhi named entity recognition by using contextualized word embeddings
Amina Tehseen ... Xiangjie Kong
Expert Systems with Applications | VOL. 229
01 Nov 2023

Multi-attention deep neural network fusing character and word embedding for clinical and biomedical concept extraction
Shengyu Fan ... Yaping Yang
Information Sciences | VOL. 608
03 Jul 2022
