Abstract

We consider the problem of disambiguating concept mentions appearing in documents and grounding them in multiple knowledge bases, where each knowledge base addresses some aspects of the domain. This problem poses several challenges beyond those addressed in the popular Wikification problem. Key among them is that most knowledge bases do not contain the rich textual and structural information Wikipedia does; consequently, the main supervision signal used to train Wikification rankers no longer exists. In this work we develop an algorithmic approach that, by carefully examining the relations between various related knowledge bases, generates an indirect supervision signal that it uses to train a ranking model that accurately chooses knowledge base entries for a given mention; moreover, it also induces prior knowledge that can be used to support a global, coherent mapping of all the concepts in a given document to the knowledge bases. Using the biomedical domain as our application, we show that our indirectly supervised ranking model outperforms other unsupervised baselines and that the quality of this indirect supervision scheme is very close to that of a supervised model. We also show that considering multiple knowledge bases together has an advantage over grounding concepts to each knowledge base individually.
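
To make the indirect-supervision idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the toy KB entries, the single character-trigram similarity feature, and the logistic-regression scorer all stand in for the paper's actual feature set and ranking model. The key point it demonstrates is that cross-KB equivalence links, not annotated documents, supply the training pairs.

```python
# Minimal sketch of the indirect-supervision idea (all names and data
# are hypothetical). KBs are dicts mapping entry IDs to name strings,
# `links` lists known cross-KB equivalences, and a character-trigram
# Jaccard feature stands in for the paper's richer feature set.
import random
from sklearn.linear_model import LogisticRegression

random.seed(0)  # make negative sampling reproducible

def jaccard(a, b):
    """Character-trigram Jaccard similarity between two strings."""
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)} or {s}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb)

def make_indirect_examples(kb_a, kb_b, links, n_neg=1):
    """Turn cross-KB equivalences into (features, label) pairs: a linked
    entry's name in KB A acts as a 'mention' whose correct grounding in
    KB B is known, so no annotated documents are needed."""
    X, y = [], []
    for id_a, id_b in links:
        mention = kb_a[id_a]
        X.append([jaccard(mention, kb_b[id_b])]); y.append(1)
        for neg in random.sample([i for i in kb_b if i != id_b], n_neg):
            X.append([jaccard(mention, kb_b[neg])]); y.append(0)
    return X, y

kb_a = {"A1": "myocardial infarction", "A2": "aspirin"}
kb_b = {"B1": "heart attack (myocardial infarction)",
        "B2": "acetylsalicylic acid (aspirin)", "B3": "hypertension"}
links = [("A1", "B1"), ("A2", "B2")]

X, y = make_indirect_examples(kb_a, kb_b, links)
ranker = LogisticRegression().fit(X, y)

# Rank KB-B candidates for an unseen mention by the learned score.
mention = "acute myocardial infarction"
scores = {i: ranker.predict_proba([[jaccard(mention, name)]])[0, 1]
          for i, name in kb_b.items()}
print(max(scores, key=scores.get))  # expected: "B1"
```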

Highlights

  • Grounding entities and concepts appearing in text documents to a knowledge base (KB) has become a popular method for contextually disambiguating them, and can be used for focused knowledge acquisition.

  • Without using any documents or annotated supervision in training, our approach achieves better ranking results than all previous approaches tried on this problem. We also explore another advantage of using multiple KBs: because concepts are represented in different ways in different KBs, there are natural consistency constraints between these representations (see the sketch after this list).

  • Because feature engineering is done via cross-validation on the indirect supervision examples, all documents can be used as the test set for all approaches.
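
The following sketch illustrates the cross-KB constraint idea from the second highlight. The candidate scores, link list, and additive bonus are hypothetical placeholders, not the paper's actual model; the point is only that a jointly scored, linked candidate pair can beat the per-KB winners chosen in isolation.

```python
# Minimal sketch of cross-KB consistency (all values hypothetical):
# when a mention is grounded to several KBs at once, candidate pairs
# joined by a known cross-KB equivalence are preferred over the top
# candidate of each KB taken in isolation.
from itertools import product

def joint_grounding(scores_a, scores_b, links, bonus=0.5):
    """Pick the (KB-A entry, KB-B entry) pair maximizing the summed
    per-KB scores, plus a bonus when the pair is a known equivalence."""
    linked = set(links)
    def joint_score(pair):
        ea, eb = pair
        return scores_a[ea] + scores_b[eb] + (bonus if pair in linked else 0.0)
    return max(product(scores_a, scores_b), key=joint_score)

# Per-KB ranker scores for one mention (hypothetical values).
scores_a = {"A1": 0.60, "A2": 0.55}
scores_b = {"B1": 0.40, "B2": 0.58}
links = [("A1", "B1")]

# Individually, A1 and B2 win; jointly, the linked pair (A1, B1) wins.
print(joint_grounding(scores_a, scores_b, links))  # -> ("A1", "B1")
```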


Introduction

Grounding entities and concepts appearing in text documents to a knowledge base (KB) has become a popular method for contextually disambiguating them and can be used for focused knowledge acquisition. It has been shown to be a valuable component of several natural language processing and information extraction tasks across different domains. While Wikipedia is an excellent general-purpose encyclopedic resource, when the text is domain specific it may not be the single ideal resource; the text could be better “covered” by multiple ontological or encyclopedic resources. This is clearly the case for scientific text, which is often covered by multiple ontologies, each addressing some aspects of the domain. The ontologies provide complementary information, but they also overlap and, where they do, use different vocabularies and provide different relevant information.

