Abstract

Background
In the past few years, neural word embeddings have been widely used in text mining. However, the vector representations of word embeddings mostly act as a black box in downstream applications, thereby limiting their interpretability. Even though word embeddings are able to capture semantic regularities in free-text documents, it is not clear how different kinds of semantic relations are represented by word embeddings and how semantically related terms can be retrieved from them.

Methods
To improve the transparency of word embeddings and the interpretability of the applications using them, we propose a novel approach for evaluating the semantic relations in word embeddings using external knowledge bases: Wikipedia, WordNet and the Unified Medical Language System (UMLS). We trained multiple word embeddings using health-related Wikipedia articles and then evaluated their performance on the analogy and semantic relation term retrieval tasks. We also assessed whether the evaluation results depend on the domain of the training corpus by comparing embeddings trained on health-related Wikipedia articles with those trained on general Wikipedia articles.

Results
Regarding the retrieval of semantic relations, we were able to retrieve semantically related terms from the word embeddings. The two popular word embedding approaches, Word2vec and GloVe, obtained comparable results on both the analogy retrieval task and the semantic relation retrieval task, while dependency-based word embeddings performed much worse on both tasks. We also found that word embeddings trained on health-related Wikipedia articles outperformed those trained on general Wikipedia articles in the health-related relation retrieval tasks.

Conclusion
This study shows that word embeddings can group terms with diverse semantic relations together. The domain of the training corpus does have an impact on the semantic relations represented by word embeddings. We therefore recommend using a domain-specific corpus to train word embeddings for domain-specific text mining tasks.
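The analogy term retrieval task described in the Methods can be illustrated with a short sketch. The snippet below is only a minimal, hypothetical example (gensim 4.x API, an assumed pre-tokenized corpus file, and illustrative hyperparameters), not the exact training setup used in the study: a Word2vec model is trained and then queried with the usual vector-offset formulation of an analogy.

    # Minimal, hypothetical sketch of the analogy term retrieval task.
    # Assumes gensim 4.x and a plain-text file with one tokenized sentence per line.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    corpus = LineSentence("health_wikipedia_sentences.txt")  # hypothetical corpus file

    # Train a skip-gram Word2vec model; hyperparameters are illustrative only.
    model = Word2Vec(corpus, vector_size=200, window=5, min_count=5, sg=1, workers=4)

    # Analogy retrieval via vector offset: rank candidates by cosine similarity
    # to vec("king") - vec("man") + vec("woman"); the top term should be "queen".
    print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))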

Highlights

  • In the past few years, neural word embeddings have been widely used in text mining

  • To answer RQ1, we evaluated these embeddings in two semantic relation retrieval tasks: 1) the analogy term retrieval task and 2) the relation term retrieval task (a minimal sketch of the latter follows this list)

  • In our recent conference paper [15], we explored this problem with only health-related Wikipedia articles as the training corpus and evaluated different word embeddings with 10 general semantic relations in the semantic relation retrieval tasks
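As a rough illustration of the relation term retrieval task referenced in the highlights above, the sketch below retrieves the nearest neighbours of a seed term from a trained model and checks them against a set of relation terms taken from an external knowledge base. The model path, seed term and reference set are hypothetical placeholders, not data from the study.

    # Hypothetical sketch of the relation term retrieval task.
    # The model file and the reference term set are placeholders for illustration.
    from gensim.models import Word2Vec

    model = Word2Vec.load("health_wiki_word2vec.model")  # assumed pre-trained model

    seed = "diabetes"
    # Terms linked to the seed by a knowledge-base relation (e.g. a UMLS
    # "may_treat"-style relation); this set is illustrative only.
    reference_terms = {"insulin", "metformin", "glipizide"}

    # Take the top-k nearest neighbours of the seed term in the embedding space.
    neighbours = [term for term, _ in model.wv.most_similar(seed, topn=20)]
    hits = [t for t in neighbours if t in reference_terms]

    # Recall-style score: how many knowledge-base relation terms appear among the neighbours.
    print(f"retrieved {len(hits)} of {len(reference_terms)} relation terms:", hits)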


Summary

Introduction

Mining useful but hidden knowledge from unstructured data is a fundamental goal of text mining. Towards this goal, the field of natural language processing (NLP) has been rapidly advancing, especially in the area of word representations. In the past few years, neural-network-based distributional representations of words, such as word embeddings, have been shown to be effective in capturing fine-grained semantic relations and syntactic regularities in large text corpora [1, 2, 3]. They have been widely used in deep learning models such as those for text classification and topic modeling [4, 5]. However, recent research has mostly focused on constructing neural networks rather than on the interpretation of word embeddings [1, 3, 9, 10], so the vector representations mostly act as a black box in the downstream applications that use them, thereby limiting their interpretability.
