Abstract

Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on Wikipedia resources have proven successful for relatively high-resource languages, but they do not extend well to low-resource languages with few, if any, Wikipedia pages. Recently, transfer learning methods have been shown to reduce the demand for resources in low-resource languages by utilizing resources in closely related languages, but performance still lags far behind that of high-resource counterparts. In this paper, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios. The methods are simple but effective: across seven XEL datasets, they yield an average gain of 16.9% in Top-30 gold candidate recall over state-of-the-art baselines. Our improved model also yields an average gain of 7.9% in the in-KB accuracy of end-to-end XEL.

Highlights

  • Entity linking (EL; Bunescu and Paşca (2006); Cucerzan (2007); Dredze et al. (2010); Hoffart et al. (2011)) associates entity mentions in a document with their entries in a Knowledge Base (KB)

  • Given a document and named entity mentions identified by a Named Entity Recognition (NER) model, an XEL system has two primary steps: (1) candidate generation, in which a model retrieves a short list of plausible KB entities for each mention, and (2) disambiguation, in which a model selects the most likely KB entity from the candidate list (see the pipeline sketch after this list)

  • We evaluate our proposed methods on four real world XEL datasets provided by DARPA Low Resource Languages for Emergent Incidents (LORELEI) (Strassel and Tracey, 2016), as well as three other datasets we create with Wikipedia anchor-text and inter-language links (§5)
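As a rough illustration of the two-step structure described above, the sketch below wires a candidate generator and a disambiguator into a single pipeline. The interfaces, function names, and the placement of the Top-30 cutoff are hypothetical assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces for the two XEL steps (assumed for this sketch).
CandidateGenerator = Callable[[str], List[str]]        # mention text -> ranked KB entity ids
Disambiguator = Callable[[str, str, List[str]], str]   # mention, context, candidates -> entity id

def link_entities(
    mentions: List[Tuple[str, str]],          # (mention text, surrounding context)
    generate_candidates: CandidateGenerator,  # step (1): candidate generation
    disambiguate: Disambiguator,              # step (2): disambiguation
    top_k: int = 30,
) -> Dict[Tuple[str, str], str]:
    """Link each mention by generating a Top-k candidate list, then disambiguating."""
    links = {}
    for mention, context in mentions:
        candidates = generate_candidates(mention)[:top_k]
        if candidates:  # a mention with an empty candidate list cannot be recovered downstream
            links[(mention, context)] = disambiguate(mention, context, candidates)
    return links
```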


Summary

Introduction

Entity linking (EL; Bunescu and Paşca (2006); Cucerzan (2007); Dredze et al. (2010); Hoffart et al. (2011)) associates entity mentions in a document with their entries in a Knowledge Base (KB). We focus on cross-lingual entity linking (XEL; McNamee et al. (2011); Ji et al. (2015)), where mentions are extracted from source-language documents and linked to a target-language KB. The first step of an XEL system is candidate generation, which retrieves a short list of plausible KB entities for each mention; the quality of these candidate lists bounds the performance of the end-to-end XEL system, as correct entities not included in a list cannot be recovered by the disambiguation model. Given a set of mentions M = {m1, m2, ..., mN} extracted from multiple documents in the source language, and an English KB KEN that contains millions of entities with unique names, the goal of a candidate generation model is to retrieve a list of possible candidate entities ei = {ei,1, ei,2, ..., ei,n} from KEN for each mi ∈ M. The performance of candidate generation is measured by the gold candidate recall, i.e., the proportion of retrieved candidate lists that contain the correct entity. Following Yamada et al. (2017) and Ganea and Hofmann (2017), we ignore mentions whose linked entity does not exist in the KB.
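To make the evaluation metric concrete, the sketch below computes Top-k gold candidate recall from retrieved candidate lists. The function name, data layout, and toy identifiers are illustrative assumptions, not code from the paper.

```python
from typing import Dict, List

def gold_candidate_recall(
    candidates: Dict[str, List[str]],  # mention id -> ranked KB entity ids (assumed layout)
    gold: Dict[str, str],              # mention id -> gold KB entity id (in-KB mentions only)
    k: int = 30,
) -> float:
    """Proportion of mentions whose Top-k candidate list contains the gold entity."""
    if not gold:
        return 0.0
    hits = sum(
        1 for mention_id, gold_entity in gold.items()
        if gold_entity in candidates.get(mention_id, [])[:k]
    )
    return hits / len(gold)

# Toy example: the gold entity for m1 is recovered within the Top-30 list, the one for m2 is not.
candidates = {"m1": ["e17", "e3", "e42"], "m2": ["e8", "e9"]}
gold = {"m1": "e3", "m2": "e25"}
print(gold_candidate_recall(candidates, gold, k=30))  # 0.5
```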

