A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature.

Shuo Xu, Xin An, Yunliang Zhang, Haodong Zhang, Lijun Zhu

doi:10.1186/1758-2946-7-s1-s11

Abstract

Background: To improve access to information on chemical compounds and drugs (chemical entities) described in text repositories, it is crucial to identify chemical entity mentions (CEMs) automatically within text. The CHEMDNER challenge in BioCreative IV was designed to promote the implementation of systems that detect mentions of chemical compounds and drugs. The challenge has two subtasks: CDI (Chemical Document Indexing) and CEM.

Results: Our processing pipeline consists of three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach), and post-processing (rule-based approach and format conversion). In our post-challenge system, the cost parameter of the CRF model was optimized by 10-fold cross-validation with grid search, and a word representation feature induced by the Brown clustering method was introduced. For the CEM subtask, our official runs were ranked in the top position, obtaining at most 88.79% precision, 69.08% recall and 77.70% balanced F-measure. Our post-challenge system improved these figures to 88.43% precision, 76.48% recall and 82.02% balanced F-measure.

Conclusions: Instead of extracting a CEM as a whole, our system treats CEM recognition as a sequence labeling problem. Though the current system has much room for improvement, it shows that performance in terms of balanced F-measure can be improved substantially by utilizing large amounts of relatively inexpensive un-annotated PubMed abstracts and by optimizing the cost parameter of the CRF model. In our experience, if one directly uses open-source natural language processing (NLP) toolkits such as OpenNLP or Stanford CoreNLP, the false positive (FP) rate may be very high. If one does not want to re-train the related models, it is better to develop additional rules to minimize the FP rate.
Our CEM recognition system is available at: http://www.SciTeMiner.org/XuShuo/Demo/CEM.
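
As a quick sanity check on the figures reported above, the balanced F-measure can be recomputed from precision and recall. The sketch below assumes that "balanced F-measure" means the usual harmonic mean of precision and recall (F1); the function name is illustrative and not taken from the authors' system.

```python
def balanced_f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed F1 definition).

    Inputs and output are percentages, as reported in the abstract.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract for the CEM subtask.
official = balanced_f_measure(88.79, 69.08)
post_challenge = balanced_f_measure(88.43, 76.48)

print(f"official runs:  {official:.2f}")        # 77.70
print(f"post-challenge: {post_challenge:.2f}")  # 82.02
```

Under this assumption both reported F-measures are reproduced. The gain comes almost entirely from recall (69.08% to 76.48%), while precision stays roughly constant.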

Highlights

To improve access to information on chemical compounds and drugs described in text repositories, it is crucial to identify chemical entity mentions (CEMs) automatically within text
Instead of extracting a CEM as a whole, our system treats CEM recognition as a sequence labeling problem
Though our current system has much room for improvement, it shows that performance in terms of balanced F-measure can be improved substantially by utilizing large amounts of relatively inexpensive un-annotated PubMed abstracts and by optimizing the cost parameter of the conditional random field (CRF) model
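
To illustrate what treating CEM recognition as sequence labeling can look like, here is a minimal sketch that converts between entity spans and per-token labels. It assumes a simple B/I/O tagging scheme and an invented example sentence; the source does not state which label set or tokenizer the authors used, and the function names are hypothetical.

```python
from typing import List, Tuple


def encode_bio(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Label each token B, I or O, given entity spans as [start, end) token indices."""
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B"           # first token of a mention
        for i in range(start + 1, end):
            labels[i] = "I"           # continuation tokens of the same mention
    return labels


def decode_bio(tokens: List[str], labels: List[str]) -> List[str]:
    """Recover mention strings from a B/I/O label sequence."""
    mentions: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open mention, closes any open mention.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


# Invented example sentence (not from the CHEMDNER corpus).
tokens = ["Treatment", "with", "acetylsalicylic", "acid",
          "reduced", "glucose", "levels", "."]
labels = encode_bio(tokens, [(2, 4), (5, 6)])
print(labels)                      # ['O', 'O', 'B', 'I', 'O', 'B', 'O', 'O']
print(decode_bio(tokens, labels))  # ['acetylsalicylic acid', 'glucose']
```

A sequence model such as a CRF would then predict one label per token, and the decoding step turns the predicted labels back into mentions.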

Summary

Introduction

There is increasing interest in improving access to information on chemical compounds and drugs (chemical entities) described in text repositories, including scientific articles, patents, health agency reports, or the Web [1]. To that end, it is crucial to identify chemical entity mentions (CEMs) automatically within text. The chemical compound and drug named entity recognition (CHEMDNER) challenge in BioCreative IV was specially designed to promote the implementation of systems that are able to detect mentions of chemical compounds and drugs, which has two subtasks, CDI

Journal: Journal of Cheminformatics
Publication Date: Jan 19, 2015
Citations: 31
License type: CC BY 4.0
