Chemical Entity Recognition Research Articles

The automatic recognition of chemical names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The task is even more challenging when these entities must be identified in an article's full text and, furthermore, when candidate substances must be identified for that article's metadata [Medical Subject Heading (MeSH) article indexing]. The National Library of Medicine (NLM)-Chem track at BioCreative VII aimed to foster the development of algorithms that can accurately predict the chemical entities in the biomedical literature and further identify the chemical substances that are candidates for article indexing. As a result of this challenge, the NLM-Chem track produced two comprehensive, manually curated corpora annotated with chemical entities and indexed with chemical substances: the chemical identification corpus and the chemical indexing corpus. The NLM-Chem BioCreative VII (NLM-Chem-BC7) Chemical Identification corpus consists of 204 full-text PubMed Central (PMC) articles, fully annotated for chemical entities by 12 NLM indexers for both span (i.e. named entity recognition) and normalization (i.e. entity linking) using MeSH. This resource was used for the training and testing of the Chemical Identification task to evaluate the accuracy of algorithms in predicting chemicals mentioned in recently published full-text articles. The NLM-Chem-BC7 Chemical Indexing corpus consists of 1333 recently published PMC articles, with chemical substance indexing assigned manually by experts at the NLM. This resource was used for the evaluation of the Chemical Indexing task, which evaluated the accuracy of algorithms in predicting the chemicals that should be indexed, i.e. appear in the listing of MeSH terms for the document.
This set was further enriched after the challenge in two ways: (i) 11 NLM indexers manually verified each of the candidate terms appearing in the prediction results of the challenge participants but not in the MeSH indexing, as well as the chemical indexing terms appearing in the MeSH indexing list but not in the prediction results, and (ii) the challenge organizers algorithmically merged the chemical entity annotations in the full text for all predicted chemical entities and used a statistical approach to keep those with the highest degree of confidence. As a result, the NLM-Chem-BC7 Chemical Indexing corpus is a gold-standard corpus for chemical indexing of journal articles and a silver-standard corpus for chemical entity identification in full-text journal articles. Together, these resources are currently the most comprehensive resources for chemical entity recognition, and we demonstrate improvements in the chemical entity recognition algorithms. We detail the characteristics of these novel resources and make them available for the community. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/NLM-Chem-BC7-corpus/.
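
The abstract does not say how indexing predictions were scored. As a minimal sketch, assuming a set-based, micro-averaged precision/recall/F-score over the identifiers predicted for each document, the comparison could look like the following. The function name `micro_prf`, the document IDs, and the MeSH-style identifiers are illustrative placeholders, not values from the corpus.

```python
from typing import Dict, Set, Tuple


def micro_prf(
    gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-document identifier sets.

    Assumption: each document maps to a set of indexing identifiers, and a
    prediction counts as correct only if the same identifier is in the gold
    set for the same document.
    """
    tp = fp = fn = 0
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # identifiers both predicted and indexed
        fp += len(p - g)   # predicted but not indexed
        fn += len(g - p)   # indexed but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical documents and identifiers, for illustration only.
gold = {"doc1": {"D001241", "D007052"}, "doc2": {"D000082"}}
pred = {"doc1": {"D001241"}, "doc2": {"D000082", "D002110"}}
precision, recall, f1 = micro_prf(gold, pred)
```

In this toy case two predictions are correct, one is spurious and one gold term is missed, so precision, recall and F1 all equal 2/3.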

Background: The functions of chemical compounds and drugs that affect biological processes, and their particular effect on the onset and treatment of diseases, have attracted increasing interest with the advancement of research in the life sciences. To extract knowledge from the extensive literature on such compounds and drugs, the organizers of BioCreative IV administered the CHEMical Compound and Drug Named Entity Recognition (CHEMDNER) task to establish a standard dataset for evaluating state-of-the-art chemical entity recognition methods.

Methods: This study introduces the approach of our CHEMDNER system. Instead of emphasizing the development of novel feature sets for machine learning, this study investigates the effect of various tag schemes on the recognition of the names of chemicals and drugs by using conditional random fields. Experiments were conducted using combinations of different tokenization strategies and tag schemes to investigate the effects of tag set selection and tokenization method on the CHEMDNER task.

Results: This study reports CHEMDNER performance for three more representative tag schemes (IOBE, IOBES, and IOB12E) alongside the widely used IOB tag set, each combined with coarse- and fine-grained tokenization methods. The experimental results reveal that the fine-grained tokenization strategy performs best in terms of precision, recall and F-score when the IOBES tag set is used. The IOBES model with fine-grained tokenization yielded the best F-scores in the six chemical entity categories other than the "Multiple" entity category. Nonetheless, no significant improvement was observed when a more representative tag scheme was used with the coarse- or fine-grained tokenization rules. The best F-scores achieved using the developed system on the test dataset of the CHEMDNER task were 0.833 and 0.815 for the chemical document indexing and the chemical entity mention recognition tasks, respectively.

Conclusions: The results herein highlight the importance of tag set selection and the use of different tokenization strategies. Fine-grained tokenization combined with the IOBES tag set most effectively recognizes chemical and drug names. To the best of the authors' knowledge, this investigation is the first comprehensive study of various tag set schemes combined with different tokenization strategies for the recognition of chemical entities.
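
To make the tag-scheme comparison concrete, the sketch below converts a sequence labeled with the IOB tag set into IOBES, where `E-` marks the last token of a multi-token entity and `S-` marks a single-token entity. This is an illustration of the general schemes only, not the authors' implementation. The function name `iob_to_iobes`, the label `CHEM` and the example sentence are assumptions made for the example.

```python
from typing import List


def iob_to_iobes(tags: List[str]) -> List[str]:
    """Convert IOB tags (B-X, I-X, O) to IOBES tags (B-X, I-X, E-X, S-X, O).

    Assumption: the input is well-formed IOB, i.e. every I-X follows a B-X
    or I-X with the same label.
    """
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, _, label = tag.partition("-")
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"  # does the entity go on?
        if prefix == "B":
            # An entity start that does not continue is a single-token entity.
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:
            # An inside token that does not continue closes the entity.
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out


# Hypothetical sentence and labels, for illustration only.
tokens = ["Treated", "with", "acetylsalicylic", "acid", "and", "ibuprofen", "."]
iob = ["O", "O", "B-CHEM", "I-CHEM", "O", "B-CHEM", "O"]
iobes = iob_to_iobes(iob)
# iobes == ["O", "O", "B-CHEM", "E-CHEM", "O", "S-CHEM", "O"]
```

The richer tag set gives a sequence model such as a conditional random field explicit signals for entity boundaries, at the cost of more labels to learn per entity category.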

Articles published on Chemical Entity Recognition

Chemical identification and indexing in full-text articles: an overview of the NLM-Chem track at BioCreative VII.

NLM-Chem-BC7: manually annotated full-text resources for chemical entity annotation and indexing in biomedical articles.

Multi-modal chemical information reconstruction from images and texts for exploring the near-drug space.

NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature

Nanomaterial Synthesis Insights from Machine Learning of Scientific Articles by Extracting, Structuring, and Visualizing Knowledge.

PharmacoNER Tagger: a deep learning-based tool for automatically finding chemicals and drugs in Spanish medical texts.

Information Retrieval and Text Mining Technologies for Chemistry.

Recognition of Chemical Entities using Pattern Matching and Functional Group Classification

Chemical entity recognition in patents by combining dictionary-based and statistical approaches.

Recognition of chemical entities: combining dictionary-based and grammar-based approaches.

Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization.

A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature.

Improving chemical entity recognition through h-index based semantic similarity.

Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics.

Chemical named entities recognition: a review on approaches and applications.

Chemical Entity Recognition and Resolution to ChEBI.

Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications.
