Abstract
Background: Chemical and biomedical named entity recognition (NER) is an essential preprocessing task in natural language processing. The identification and extraction of named entities from scientific articles is also attracting increasing interest in many scientific disciplines. Locating chemical named entities in the literature is an essential step in chemical text mining pipelines for identifying chemical mentions, their properties, and relations as discussed in the literature. In this work, we describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of chemical named entities. For this purpose, we transform the task of NER into a sequence labeling problem. We present a series of sequence labeling systems that we used, adapted and optimized in our experiments for solving this task. To this end, we experiment with hyperparameter optimization. Finally, we present LSTMVoter, a two-stage application of recurrent neural networks that integrates the optimized sequence labelers from our study into a single ensemble classifier.
Results: We introduce LSTMVoter, a bidirectional long short-term memory (LSTM) tagger that utilizes a conditional random field layer in conjunction with attention-based feature modeling. Our approach explores information about features that is modeled by means of an attention mechanism. LSTMVoter outperforms each extractor integrated by it in a series of experiments. On the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus, LSTMVoter achieves an F1-score of 90.04%; on the BioCreative V.5 chemical entity mention in patents corpus, it achieves an F1-score of 89.01%.
Availability and implementation: Data and code are available at https://github.com/texttechnologylab/LSTMVoter.
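For reference, the F1-scores quoted above can be read against the usual definition of the measure. The abstract does not spell out how the scores are computed, so the following is a sketch of the standard formulation, assuming that TP, FP and FN denote the counts of correctly predicted, spuriously predicted and missed entity mentions:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \, P \, R}{P + R}
```

F1 is the harmonic mean of precision P and recall R, so it is high only when both are high.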
Highlights
To advance biological, chemical and biomedical research, it is important to stay at the cutting edge of the field.
This section presents the results of our experiments on chemical named entity recognition on the Chemical Entity Mention in Patents (CEMP) and CHEMDNER corpora.
The results listed are those obtained after the hyperparameter optimization described in the methods section; the models were trained, optimized and tested on the corpora described.
Summary
To advance biological, chemical and biomedical research, it is important to stay at the cutting edge of the field. To avoid repetition and to contribute at least at the level of current research, researchers rely on published information to inform themselves about the [...] associations with toxicological endpoints or the investigation of information on metabolic reactions can be carried out. For these reasons, NLP initiatives have been launched in recent years to address the challenges of identifying biological, chemical and biomedical entities. BioCreative is a “Challenge Evaluation” in which the participants are given defined text mining or information extraction tasks in the biomedical and chemical field. We describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of chemical named entities. For this purpose, we transform the task of NER into a sequence labeling problem. We present LSTMVoter, a two-stage application of recurrent neural networks that integrates the optimized sequence labelers from our study into a single ensemble classifier.
Hemati and Mehler J Cheminform (2019) 11:3
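To make the idea of treating NER as a sequence labeling problem concrete, the sketch below assigns one tag to every token of a sentence. It assumes a BIO tagging scheme (B = beginning of an entity, I = inside, O = outside); the text above does not state which scheme is used. The function name `to_bio`, the label `CHEM` and the example sentence are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def to_bio(num_tokens: int, spans: List[Tuple[int, int]], label: str = "CHEM") -> List[str]:
    """Turn entity spans given as token indices [start, end) into one BIO tag per token.

    Assumption: spans do not overlap and lie within the sentence.
    """
    tags = ["O"] * num_tokens          # every token starts as "outside"
    for start, end in spans:
        tags[start] = f"B-{label}"     # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"     # remaining tokens of the mention
    return tags


# Hypothetical example sentence with one two-token chemical mention.
tokens = ["Treatment", "with", "acetylsalicylic", "acid", "reduced", "fever", "."]
spans = [(2, 4)]  # "acetylsalicylic acid"

tags = to_bio(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Once the data is in this form, any sequence labeler can be trained to predict the tag column from the token column, and the predictions of several such labelers can serve as input to a second-stage classifier.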