Abstract

Manually annotating clinical notes is labor-intensive and requires domain expertise, which makes annotation expensive in both human and financial terms. Moreover, mistakes, missed tags, and inconsistency are common problems in manual annotation. The purpose of this research is to reduce the human annotation effort for clinical notes, improve consistency, and lower the cost of annotation. The aim is to annotate clinical texts in order to extract biomedical names and terms. In our work the Unified Medical Language System (UMLS) serves as the reference metathesaurus of names and terms used in the biomedical and clinical domains. We performed unsupervised and semi-supervised Named Entity Recognition (NER) through exact matching against UMLS. The data sets were provided by the SemEval 2015 (Task 14) natural language processing competition and comprise 199 clinical notes in the training set and 133 notes in the test set. The analysis carried out so far consists of two steps: mapping and learning. The first step maps all terms, not only unigrams but also n-grams (with n up to 5), into UMLS. To obtain the best exact-matching results, we extracted the UMLS terms for diseases and disorders based on their semantic groups and mapped each n-gram to that subset of UMLS. If there is a match, the span is assumed to be a disease or disorder. When there is no match for n-grams (n >= 2), to avoid low precision we required that unigrams be noun phrases in order to be nominated as a disease/disorder. With this method we obtained an F-score of 60%, and the training files for the next process (training CRFs) were generated. The second step uses Conditional Random Fields (CRFs): the results generated in the first step were used to train the CRF, which learns from the training data the general contexts in which named entities occur.
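The longest-match-first n-gram lookup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the UMLS disease/disorder vocabulary is represented by a toy set (in the real pipeline it would be extracted from the Metathesaurus by semantic group), and the noun-phrase restriction on unigrams is approximated by a simple predicate.

```python
MAX_N = 5  # longest span considered, as in the described pipeline

def exact_match_spans(tokens, umls_terms, is_noun):
    """Return (start, end) token spans whose text exactly matches a UMLS term.

    Longer n-grams are tried first; a unigram match is accepted only if
    the token passes the noun check, approximating the noun-phrase
    restriction used to avoid low precision.
    """
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(MAX_N, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in umls_terms and (n > 1 or is_noun(tokens[i])):
                spans.append((i, i + n))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans

# Toy example (vocabulary and POS check are stand-ins, not real UMLS data):
terms = {"type 2 diabetes", "hypertension"}
toks = "patient with type 2 diabetes and hypertension".split()
print(exact_match_spans(toks, terms, is_noun=lambda t: t == "hypertension"))
# → [(2, 5), (6, 7)]
```

Trying the longest span first prevents a unigram such as "diabetes" from shadowing the fuller mention "type 2 diabetes".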
Because the training files exhibit different levels of correctness, we modified them before using them to train the CRFs and to test on the test data. Level of correctness refers to the varying tagging accuracy across the data set: since exact matching is not very accurate, accuracy varies from note to note, very high in some and low in others, which makes the training files inconsistent. To address this, we divided the training files into ten groups. A CRF was trained on a single group and used to tag the other groups, and the exact-match and CRF results were combined (a logical OR between the two) to obtain the final annotations. This was repeated for every group and finally applied to the test data. These two steps together constitute unsupervised disease named entity recognition, and the results show a gap of 10.3 percentage points between the unsupervised and supervised approaches: supervised learning yields a 73% F-score, while the proposed unsupervised approach yields 62.7%. We also developed a semi-supervised disease named entity recognition approach that uses both the files annotated by the unsupervised method and the human-annotated gold standard. This method improved the 73% F-score of the supervised approach to 74.2%. In future work, several refinements and additional tasks are planned. To improve the results, we plan to use approximate matching through a process called normalization, i.e., mapping a term in a clinical note to a preferred term in UMLS. Such terms have no exact match, so normalization is the way to link them to UMLS. Moreover, we intend to perform exact/approximate matching over discontinuous mentions in clinical texts: mentions whose words are disconnected within a sentence but together form a named entity.
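The ten-group bootstrapping loop with the logical-OR combination can be sketched schematically as below. This is an assumed reconstruction of the control flow only: `train_crf` and `tag_with_crf` are placeholders for a real CRF toolkit, and per-token tags are simplified to booleans (the real system works on tagged note files).

```python
def bootstrap_groups(groups, exact_tags, train_crf, tag_with_crf):
    """For each group g: train a CRF on g, tag every other group,
    and OR the CRF tags with that group's exact-match tags.

    groups:     {group_id: notes}
    exact_tags: {group_id: [bool per token]} from the exact-match pass
    """
    # Start from the exact-match decisions for every group.
    combined = {i: list(tags) for i, tags in exact_tags.items()}
    for g, notes in groups.items():
        model = train_crf(notes, exact_tags[g])
        for h in groups:
            if h == g:
                continue
            crf_tags = tag_with_crf(model, groups[h])
            # Logical OR: keep a tag if either tagger produced it.
            combined[h] = [e or c for e, c in zip(combined[h], crf_tags)]
    return combined

# Toy stand-ins so the sketch runs end to end (not real CRF training):
groups = {0: ["note a"], 1: ["note b"]}
exact = {0: [True, False], 1: [False, False]}
train = lambda notes, tags: "model"
tag = lambda model, notes: [False, True]
print(bootstrap_groups(groups, exact, train, tag))
# → {0: [True, True], 1: [False, True]}
```

The OR combination means a token is accepted as a disease/disorder mention if either the dictionary pass or any of the cross-group CRFs flags it, which trades some precision for recall.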
This essential step will extract the mentions that the exact-match and normalization approaches cannot. Finally, we plan to extend the developed system to less supervised Biomedical Named Entity Recognition (BNER) covering all biomedical and clinical terms, applying it to other UMLS semantic groups such as Activities and Behaviors, Anatomy, Devices, and Phenomena. A less supervised annotation system for clinical notes could thus generate annotated notes at a lower manual-tagging cost while remaining more consistent and sufficiently accurate. With this approach it is feasible to extract tags for other semantic groups in UMLS, ultimately yielding an advanced system that tags all biomedical and clinical mentions according to the semantic groups in UMLS.
