Recognition of chemical entities: combining dictionary-based and grammar-based approaches.

Saber A Akhondi, Kristina M Hettne, Eelke Van Der Horst, Jan A Kors, Erik M Van Mulligen

doi:10.1186/1758-2946-7-s1-s10

Open Access

https://doi.org/10.1186/1758-2946-7-s1-s10

Abstract

Background: The past decade has seen an upsurge in the number of publications in chemistry. The ever-swelling volume of available documents makes it increasingly hard to extract relevant new information from such unstructured texts. The BioCreative CHEMDNER challenge invites the development of systems for the automatic recognition of chemicals in text (CEM task) and for ranking the recognized compounds at the document level (CDI task). We investigated an ensemble approach where dictionary-based named entity recognition is used along with grammar-based recognizers to extract compounds from text. We assessed the performance of ten different commercial and publicly available lexical resources using an open source indexing system (Peregrine), in combination with three different chemical compound recognizers and a set of regular expressions to recognize chemical database identifiers. The effect of different stop-word lists, case-sensitivity matching, and use of chunking information was also investigated. We focused on lexical resources that provide chemical structure information. To rank the different compounds found in a text, we used a term confidence score based on the normalized ratio of the term frequencies in chemical and non-chemical journals.

Results: The use of stop-word lists greatly improved the performance of the dictionary-based recognition, but there was no additional benefit from using chunking information. A combination of ChEBI and HMDB as lexical resources, the LeadMine tool for grammar-based recognition, and the regular expressions outperformed any of the individual systems. On the test set, the F-scores were 77.8% (recall 71.2%, precision 85.8%) for the CEM task and 77.6% (recall 71.7%, precision 84.6%) for the CDI task. Missed terms were mainly due to tokenization issues, poor recognition of formulas, and term conjunctions.

Conclusions: We developed an ensemble system that combines dictionary-based and grammar-based approaches for chemical named entity recognition, outperforming any of the individual systems that we considered. The system is able to provide structure information for most of the compounds that are found. Improved tokenization and better recognition of specific entity types are likely to further improve system performance.
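
The reported F-scores can be checked against the quoted precision and recall. The sketch below assumes the balanced F-score (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall (assumed variant)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set figures quoted in the abstract, expressed as fractions.
cem = f_score(precision=0.858, recall=0.712)
cdi = f_score(precision=0.846, recall=0.717)

print(f"CEM F-score: {cem:.1%}")  # CEM F-score: 77.8%
print(f"CDI F-score: {cdi:.1%}")  # CDI F-score: 77.6%
```

Under that assumption, both values agree with the figures given for the CEM and CDI tasks.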

Highlights

The past decade has seen an upsurge in the number of publications in chemistry
We first concentrated on the CEM subtask where we carried out chemical entity mention recognition
Named entity recognition was performed with case sensitive matching
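
To make the role of case-sensitive matching and stop-word lists concrete, here is a minimal sketch of longest-match dictionary lookup. It is not Peregrine and not the authors' implementation: the lexicon, the identifiers (`EX:1` and so on), the stop-word list, and the function name are all hypothetical, chosen only to show how the two settings change what is recognized.

```python
import re

# Hypothetical toy lexicon mapping terms to made-up identifiers.
LEXICON = {
    "aspirin": "EX:1",
    "acetylsalicylic acid": "EX:1",
    "NO": "EX:2",    # nitric oxide, written in upper case
    "lead": "EX:3",  # the metal, but also a common English verb
}

# Hypothetical stop-word list: lexicon terms judged too ambiguous to match.
STOP_WORDS = {"lead"}


def find_mentions(text, case_sensitive=True, max_tokens=3):
    """Return (start, end, surface form, identifier) for each dictionary hit."""
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    key = (lambda s: s) if case_sensitive else str.lower
    lookup = {key(term): cid for term, cid in LEXICON.items()}
    blocked = {key(word) for word in STOP_WORDS}

    mentions, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            candidate = key(text[start:end])
            if candidate in lookup and candidate not in blocked:
                mentions.append((start, end, text[start:end], lookup[candidate]))
                i += n
                break
        else:
            i += 1  # no match starting at this token
    return mentions


text = ("NO donors and aspirin did not lead to toxicity; "
        "no acetylsalicylic acid was detected.")
strict = [m[2] for m in find_mentions(text, case_sensitive=True)]
loose = [m[2] for m in find_mentions(text, case_sensitive=False)]
print(strict)  # ['NO', 'aspirin', 'acetylsalicylic acid']
print(loose)   # ['NO', 'aspirin', 'no', 'acetylsalicylic acid']
```

In this toy example, case-insensitive matching picks up the ordinary word "no" as a false positive, and the stop-word list suppresses the verb "lead" in both settings.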

Summary

Introduction

The past decade has seen an upsurge in the number of publications in chemistry. The ever-swelling volume of available documents makes it increasingly hard to extract relevant new information from such unstructured texts. The drawback of a dictionary approach is that it is nearly impossible to include all systematic chemical identifiers, such as IUPAC names [4] or SMILES [5], which are algorithmically generated based on the structure of the chemical compound and follow a specific grammar [6]. These predefined grammars are sets of rules or guidelines developed to refer to a compound with a unique textual representation (systematic term or identifier). The drawback of machine learning approaches is the need for a sufficiently large annotated corpus for training the system.
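
Identifiers that follow a fixed format can be recognized by rule rather than by dictionary lookup. The abstract mentions regular expressions for chemical database identifiers but does not list them, so the following is only an illustration using one well-known format, the CAS Registry Number (two to seven digits, two digits, and a check digit, separated by hyphens). The pattern and function names are assumptions, and a rule this simple is far from a grammar for IUPAC names or SMILES.

```python
import re

# CAS Registry Number shape: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def cas_checksum_ok(match: re.Match) -> bool:
    """Validate the check digit: weight digits 1, 2, 3, ... from the right."""
    digits = match.group(1) + match.group(2)
    total = sum(weight * int(d)
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def find_cas_numbers(text: str) -> list[str]:
    """Return all CAS-shaped strings in the text whose check digit is valid."""
    return [m.group() for m in CAS_PATTERN.finditer(text) if cas_checksum_ok(m)]


sample = "Water (7732-18-5) and caffeine (58-08-2) were used; 1234-56-7 is not valid."
print(find_cas_numbers(sample))  # ['7732-18-5', '58-08-2']
```

The checksum step filters out strings that merely look like an identifier, which is one way a pattern-based recognizer can keep precision high without consulting a lexicon.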

Journal: Journal of Cheminformatics
Publication Date: Jan 19, 2015
Citations: 64
License type: CC BY 4.0
