Cross-Document Co-Reference Resolution using Sample-Based Clustering with Knowledge Enrichment

Sourav Dutta, Gerhard Weikum

doi:10.1162/tacl_a_00119

Abstract

Identifying and linking named entities across information sources is the basis of knowledge acquisition and at the heart of Web search, recommendations, and analytics. An important problem in this context is cross-document co-reference resolution (CCR): computing equivalence classes of textual mentions denoting the same entity, within and across documents. Prior methods employ ranking, clustering, or probabilistic graphical models using syntactic features and distant features from knowledge bases. However, these methods exhibit limitations regarding run-time and robustness. This paper presents the CROCS framework for unsupervised CCR, improving the state of the art in two ways. First, we extend the way knowledge bases are harnessed, by constructing a notion of semantic summaries for intra-document co-reference chains using co-occurring entity mentions belonging to different chains. Second, we reduce the computational cost by a new algorithm that embeds sample-based bisection, using spectral clustering or graph partitioning, in a hierarchical clustering process. This allows scaling up CCR to large corpora. Experiments with three datasets show significant gains in output quality, compared to the best prior methods, and the run-time efficiency of CROCS.

Highlights

We are witnessing another revolution in Web search, user recommendations, and data analytics: transitioning from documents and keywords to data, knowledge, and entities.
Examples of this megatrend are the Google Knowledge Graph and its applications, and the IBM Watson technology for deep question answering. These advances have been enabled by the construction of huge knowledge bases (KB’s) such as DBpedia, Yago, or Freebase; the latter forming the core of the Knowledge Graph.
We developed the CROCS (CROss-document Co-reference reSolution) framework with unsupervised hierarchical clustering by repeated bisection using spectral clustering or graph partitioning.
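
To make the idea of hierarchical clustering by repeated bisection concrete, the following is a minimal sketch and not the CROCS implementation. It assumes a precomputed symmetric similarity matrix over mentions, bisects each cluster by the sign of an approximate Fiedler vector (a basic form of spectral bisection), and stops splitting once the mean pairwise similarity inside a cluster reaches a threshold. The names `fiedler_split`, `cohesion`, `bisecting_clusters`, the `min_cohesion` parameter, and the toy matrix `SIM` are all hypothetical, and the stopping rule is an assumption made for illustration; sampling and knowledge enrichment are omitted.

```python
import random


def fiedler_split(sim, idx, iters=500, seed=0):
    """Bisect the items in idx by the sign of an approximate Fiedler vector."""
    n = len(idx)
    # Weighted adjacency of the sub-graph, without self-loops.
    w = [[sim[a][b] if a != b else 0.0 for b in idx] for a in idx]
    deg = [sum(row) for row in w]
    # 2 * max degree bounds the largest eigenvalue of L = D - W (Gershgorin).
    shift = 2.0 * max(deg) + 1e-9
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        # Project out the constant vector (eigenvalue 0 of L).
        mean = sum(v) / n
        v = [x - mean for x in v]
        # Power iteration with (shift * I - L), whose dominant remaining
        # eigenvector is the Fiedler vector of L.
        v = [(shift - deg[i]) * v[i] + sum(w[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in v]
    left = [idx[i] for i in range(n) if v[i] < 0]
    right = [idx[i] for i in range(n) if v[i] >= 0]
    return left, right


def cohesion(sim, idx):
    """Mean pairwise similarity inside a cluster (1.0 for singletons)."""
    pairs = [(a, b) for i, a in enumerate(idx) for b in idx[i + 1:]]
    return sum(sim[a][b] for a, b in pairs) / len(pairs) if pairs else 1.0


def bisecting_clusters(sim, min_cohesion=0.5):
    """Top-down clustering: keep bisecting until every cluster is cohesive."""
    pending = [list(range(len(sim)))]
    done = []
    while pending:
        idx = pending.pop()
        # Assumed stopping rule: accept singletons and cohesive clusters.
        if len(idx) < 2 or cohesion(sim, idx) >= min_cohesion:
            done.append(sorted(idx))
            continue
        left, right = fiedler_split(sim, idx)
        if not left or not right:
            done.append(sorted(idx))
            continue
        pending.extend([left, right])
    return sorted(done)


# Toy similarity matrix: mentions 0-2 resemble each other, as do mentions 3-4.
SIM = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]
clusters = bisecting_clusters(SIM)
print(clusters)
```

On this toy input the first bisection separates mentions 0-2 from mentions 3-4, and both halves are then cohesive enough to be kept as equivalence classes. A top-down scheme of this kind does not need the number of entities fixed in advance, which is one reason repeated bisection suits settings where that number is unknown.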

Summary

Introduction

1.1 Motivation and Problem Statement

We are witnessing another revolution in Web search, user recommendations, and data analytics: transitioning from documents and keywords to data, knowledge, and entities. Examples of this megatrend are the Google Knowledge Graph and its applications, and the IBM Watson technology for deep question answering. These advances have been enabled by the construction of huge knowledge bases (KB’s) such as DBpedia, Yago, or Freebase; the latter forming the core of the Knowledge Graph. Such semantic resources provide huge collections of entities: people, places, companies, celebrities, movies, etc., along with rich knowledge about their properties and relationships. Named Entity Disambiguation (NED) (see, e.g., Cucerzan, 2007; Milne & Witten, 2008; Cornolti et al., 2013) maps a mention string (e.g., a person name like “Bolt” or a noun phrase like “lightning bolt”) onto its proper entity if present in a KB (e.g., the sprinter Usain Bolt).
