Chemical entity recognition in patents by combining dictionary-based and statistical approaches.

Saber A Akhondi, Benedikt F.H Becker, Kristina M Hettne, Ewoud Pons, Herman Van Haagen, Erik M Van Mulligen, Jan A Kors, Zubair Afzal

doi:10.1093/database/baw061

Abstract

We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small.

Database URL: http://biosemantics.org/chemdner-patents
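The idea of combining dictionary matches with the output of a statistical recognizer, and then scoring the result with an F-score, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the toy dictionary, the hard-coded "statistical" spans standing in for a CRF tagger, and the rule that a statistical span wins when it overlaps a dictionary span. The abstract does not state how the two outputs were actually merged, and this sketch uses neither Peregrine nor tmChem.

```python
import re
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def dictionary_spans(text: str, terms: Iterable[str]) -> Set[Span]:
    """Find case-insensitive, whole-word matches of dictionary terms."""
    spans: Set[Span] = set()
    for term in terms:
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.add((match.start(), match.end()))
    return spans


def combine_spans(dict_spans: Set[Span], stat_spans: Set[Span]) -> Set[Span]:
    """Union of both outputs; on overlap, keep the statistical span.

    The overlap rule is an illustrative assumption, not the paper's method.
    """
    combined = set(stat_spans)
    for start, end in dict_spans:
        overlaps = any(start < s_end and s_start < end
                       for s_start, s_end in stat_spans)
        if not overlaps:
            combined.add((start, end))
    return combined


def precision_recall_f1(predicted: Set[Span], gold: Set[Span]):
    """Exact-match precision, recall and F-score, F = 2PR / (P + R)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    text = "A mixture of aspirin and sodium chloride in ethanol."
    # Toy dictionary (hypothetical); it only knows the partial name "chloride".
    found = dictionary_spans(text, ["aspirin", "ethanol", "chloride"])
    # Stand-in for a statistical tagger that found the full mention.
    start = text.index("sodium chloride")
    statistical = {(start, start + len("sodium chloride"))}
    merged = combine_spans(found, statistical)
    print(sorted(merged))
    print(precision_recall_f1(merged, merged))
```

In this toy case the dictionary contributes mentions the statistical stand-in missed, while the longer statistical span replaces the partial dictionary match. A real ensemble would have to choose its own rule for resolving such conflicts.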

Highlights

Exploration of the chemical and biological space covered by patents is essential in the early stages of activities in the field of medicinal chemistry [1].
PubChem contains >90% of the identifiers in ChEMBL, DrugBank and the Therapeutic Target Database (TTD); the other databases are much less well covered by PubChem.
The majority of identifiers in DrugBank are covered by the NCGC Pharmaceutical Collection (NPC) and TTD, but the overlap between all other pairs of databases is relatively low.

Summary

Introduction

Exploration of the chemical and biological space covered by patents is essential in the early stages of activities in the field of medicinal chemistry [1]. Patent information is manually extracted [5]. This process is laborious and expensive because chemical patent texts are long, sometimes running to hundreds of pages, and complex (a mixture of scientific, technical and legal language, typographical errors, optical character recognition errors, etc.). These problems are aggravated by the sheer number of medicinal chemistry patents [1, 6]. One of the impediments is that very few large annotated gold-standard corpora for algorithm training and testing are available [9].



Journal: Database
Publication Date: Jan 1, 2016
Citations: 23
License type: cc-by


Similar Papers

A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature.
Buzhou Tang ... Min Jiang
19 Jan 2015
Journal of Cheminformatics | VOL. 7

Recognition of chemical entities: combining dictionary-based and grammar-based approaches.
Saber A Akhondi ... Jan A Kors
19 Jan 2015
Journal of Cheminformatics | VOL. 7

Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization.
Hong-Jie Dai ... Yung-Chun Chang
19 Jan 2015
Journal of Cheminformatics | VOL. 7

Chemical Entity Recognition and Resolution to ChEBI.
Tiago Grego ... Catia Pesquita
15 Feb 2012
ISRN bioinformatics | VOL. 2012
