Abstract

Background: The volume of biomedical literature and clinical data is growing at an exponential rate. Therefore, efficient access to data described in unstructured biomedical texts is a crucial task for the biomedical industry and research. Named Entity Recognition (NER) is the first step for information and knowledge acquisition when we deal with unstructured texts. Recent NER approaches use contextualized word representations as input for a downstream classification task. However, distributed word vectors (embeddings) are scarce for Spanish, and even more so for the biomedical domain.

Methods: In this work, we develop several biomedical Spanish word representations, and we introduce two deep learning approaches for the recognition of pharmaceutical, chemical, and other biomedical entities in Spanish clinical case texts and biomedical texts, one based on a Bi-LSTM-CRF model and the other on a BERT-based architecture.

Results: Several Spanish biomedical embeddings, together with the two deep learning models, were evaluated on the PharmaCoNER and CORD-19 datasets. The PharmaCoNER dataset is composed of a set of Spanish clinical cases annotated with drugs, chemical compounds, and pharmacological substances; our extended Bi-LSTM-CRF model obtains an F-score of 85.24% on entity identification and classification, and the BERT model obtains an F-score of 88.80%. For the entity normalization task, the extended Bi-LSTM-CRF model achieves an F-score of 72.85% and the BERT model achieves 79.97%. The CORD-19 dataset consists of scholarly articles written in English annotated with biomedical concepts such as disorder, species, chemical or drugs, gene and protein, enzyme, and anatomy. On the CORD-19 dataset, the Bi-LSTM-CRF and BERT models obtain F-scores of 78.23% and 78.86%, respectively, on entity identification and classification.

Conclusion: These results show that deep learning models with in-domain knowledge learned from large-scale datasets substantially improve named entity recognition performance. Moreover, contextualized representations help to capture the complexities and ambiguity inherent to biomedical texts. Embeddings based on words, concepts, senses, etc. for languages other than English are required to improve NER in those languages.

Highlights

  • The volume of biomedical literature and clinical data is growing at an exponential rate

  • In this work, we propose a system for detecting biomedical concepts such as chemical compounds, drugs, disorders, genes, and proteins in clinical narratives and biomedical texts written in Spanish and English

  • We show that, as in other natural language processing tasks, dense representations of words (word-level, character-level, and sense-level, both non-contextualized and contextualized) are helpful for Named Entity Recognition (NER)

Introduction

The volume of biomedical literature and clinical data is growing at an exponential rate. Efficient access to data described in unstructured biomedical texts is a crucial task for the biomedical industry and research. Efficient information extraction from biomedical data described in scientific articles, clinical narratives, or e-health reports is of growing interest in the biomedical industry, research, and so forth. In this context, improved identification of biomedical name mentions in biomedical texts is a crucial step for downstream tasks such as the extraction of drug and protein interactions, chemical compounds, and adverse drug reactions, among others. Existing biomedical NER methods include dictionary-based methods, which rely on existing domain knowledge dictionaries and are limited by dictionary size, spelling errors, the use of synonyms, and the constant growth of vocabulary. Deep learning approaches have emerged due to the availability of myriad data from different sources (scientific literature, social media, clinical texts, etc.).
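To make the limitations of dictionary-based methods concrete, the following is a minimal sketch of exact-match dictionary lookup. It is not the system described in this paper; the lexicon entries, the labels, the sample sentence, and the function name `dictionary_ner` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy lexicon (assumption): a real system would load a
# curated terminology resource instead of three hand-written entries.
LEXICON = {
    "paracetamol": "DRUG",
    "ácido acetilsalicílico": "DRUG",
    "glucosa": "CHEMICAL",
}


def dictionary_ner(text, lexicon):
    """Return (start, end, surface, label) for each exact lexicon match."""
    matches = []
    for term, label in lexicon.items():
        # Whole-word, case-insensitive match of the literal term.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), m.group(0), label))
    return sorted(matches)


# Illustrative sentence (assumption): one correct mention, one misspelling
# ("parcetamol"), and one synonym absent from the lexicon ("acetaminofén").
text = (
    "Se administró Paracetamol; el informe también menciona "
    "parcetamol y acetaminofén."
)
found = dictionary_ner(text, LEXICON)
print(found)
```

In this sketch only the correctly spelled mention is found. The misspelling and the synonym are both missed, which illustrates two of the limitations named above: spelling errors and synonyms that the dictionary does not cover.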