Abstract

We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks with an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose, the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer that uses a conditional random field classifier. To improve the performance of tmChem, we added three features: part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system on its own were small.

Database URL: http://biosemantics.org/chemdner-patents
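As a rough illustration of the two ideas in the abstract (combining mentions from a dictionary-based and a statistical recognizer, then scoring the result with an F-score), here is a minimal sketch. The merge rule shown (keep every statistical mention, and add a dictionary mention only when it overlaps none of them) is an assumption made for illustration, not the combination strategy reported by the authors. The names `merge_mentions` and `f_score` are hypothetical, and scoring uses exact span matching.

```python
from typing import List, Set, Tuple

# A mention is a (start, end) character offset pair, end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_mentions(statistical: List[Span], dictionary: List[Span]) -> List[Span]:
    """Combine two sets of mentions (assumed rule, not the authors' method).

    All statistical mentions are kept; a dictionary mention is added only
    if it does not overlap any statistical mention.
    """
    merged = list(statistical)
    for span in dictionary:
        if not any(overlaps(span, kept) for kept in statistical):
            merged.append(span)
    return sorted(merged)


def f_score(predicted: Set[Span], gold: Set[Span]) -> float:
    """Balanced F-score (harmonic mean of precision and recall), exact match."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy offsets, invented purely for the example.
statistical_spans = [(0, 7), (20, 28)]
dictionary_spans = [(0, 5), (40, 47)]
gold_spans = {(0, 7), (40, 47), (60, 65)}

merged_spans = merge_mentions(statistical_spans, dictionary_spans)
score = f_score(set(merged_spans), gold_spans)
```

In this toy case the dictionary mention `(0, 5)` is dropped because it overlaps a statistical mention, while `(40, 47)` is added, which is one way a dictionary can contribute recall without overriding the statistical recognizer's boundaries.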

Highlights

  • Exploration of the chemical and biological space covered by patents is essential in the early stages of activities in the field of medicinal chemistry [1]

  • PubChem contains >90% of the identifiers in ChEMBL, DrugBank and the Therapeutic Target Database (TTD); the other databases are much less well covered by PubChem

  • The majority of identifiers in DrugBank are covered by the NCGC Pharmaceutical Collection (NPC) and TTD, but the overlap between all other pairs of databases is relatively low


Summary

Introduction

Exploration of the chemical and biological space covered by patents is essential in the early stages of activities in the field of medicinal chemistry [1]. Patent information is manually extracted [5]. This process is laborious and expensive because of the length of chemical patent texts, which may run to hundreds of pages, and their complexity (a mixture of scientific, technical and legal language, typographical errors, optical character recognition errors, etc.). These problems are aggravated by the sheer number of medicinal chemistry patents [1, 6]. One of the impediments to automating this extraction is that very few large annotated gold-standard corpora for algorithm training and testing are available [9].

