Abstract

Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most resources used in biomedical NLP tasks are in English. Little effort has been reported for other languages, including Romanian, and access to such language resources is therefore poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed across three medical subdomains (cardiology, diabetes and endocrinology), extracted from books, journals and blog posts. To annotate the corpus automatically with POS tags, we used a Romanian tag set of 715 labels. Four entity types (diseases, anatomy, procedures, and chemicals and drugs) were manually annotated for bioNER with a Cohen's kappa coefficient of 92.8%, revealing 1877 medical named entities. The automatic annotation of the corpus was manually checked. The corpus is publicly available and can be used to facilitate the development of NLP algorithms for the Romanian language.
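
As an illustration of the agreement measure reported above, the following is a minimal sketch of how Cohen's kappa can be computed for two annotators who label the same tokens. It is not the procedure used to build the corpus: the function name `cohen_kappa`, the tag names (`DISO`, `O`) and the toy label sequences are assumptions made for this example. A reported value of 92.8% corresponds to 0.928 on the usual scale, where 1 is perfect agreement and 0 is chance-level agreement.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences (illustrative sketch)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty token list")

    n = len(labels_a)

    # Observed agreement: share of tokens on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: "DISO" marks a disease token, "O" marks no entity.
annotator_a = ["DISO", "DISO", "O", "O"]
annotator_b = ["DISO", "O", "O", "O"]
kappa = cohen_kappa(annotator_a, annotator_b)
```

In this toy case the annotators agree on three of four tokens (observed agreement 0.75), chance agreement is 0.5, and kappa is therefore 0.5.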

Highlights

  • In the field of biomedical sciences, vast quantities of data are generated every year and are available as free text for natural language processing (NLP) tasks

  • When annotating biomedical named entities, we focused on four entity types defined on the basis of four semantic groups in the UMLS

  • The number of unique lemmas is relevant for calculating the average frequency of each lemma, which is 4.19 irrespective of the type of word; it is widely known that function words are more frequent than content words
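
The average lemma frequency mentioned in the last highlight is the number of lemma occurrences divided by the number of unique lemmas. A minimal sketch follows, assuming the corpus is available as a flat list of lemmas; the function name `average_lemma_frequency` and the toy lemma list are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def average_lemma_frequency(lemmas):
    """Return total lemma occurrences divided by the number of unique lemmas."""
    if not lemmas:
        raise ValueError("the lemma list must not be empty")
    counts = Counter(lemmas)
    return len(lemmas) / len(counts)


# Hypothetical toy input: the function word "de" occurs more often
# than the content words "inimă" and "diabet".
toy_lemmas = ["de", "inimă", "de", "diabet", "de", "inimă"]
avg = average_lemma_frequency(toy_lemmas)
most_frequent = Counter(toy_lemmas).most_common(1)[0]
```

Here six occurrences over three unique lemmas give an average of 2.0, and the most frequent lemma is the function word, in line with the observation that function words dominate frequency counts.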

Introduction

In the field of biomedical sciences, vast quantities of data are generated every year and are available as free text for natural language processing (NLP) tasks. To process the biomedical literature automatically, manually annotated biomedical corpora have been developed and used for the supervised training and evaluation of NLP systems. Gold standard corpora have been created for different types of tasks, such as part-of-speech tagging [1], named entity recognition [2], relation extraction [3] and event extraction [4]. A slowly growing number of resources specific to this field have been created for languages other than English. Boytcheva et al. [5] created a biomedical corpus of 6400 words, 2000 of which belong to Bulgarian medical terminology.
