Abstract

Definitional knowledge has proved to be essential in various Natural Language Processing tasks and applications, especially when information at the level of word senses is exploited. However, the few sense-annotated corpora of textual definitions available to date are of limited size: this is mainly due to the expensive and time-consuming process of annotating a wide variety of word senses and entity mentions at a reasonably high scale. In this paper we present SenseDefs, a large-scale, high-quality corpus of disambiguated definitions (or glosses) in multiple languages, comprising sense annotations of both concepts and named entities from a wide-coverage unified sense inventory. Our approach for the construction and disambiguation of this corpus builds upon the structure of a large multilingual semantic network and a state-of-the-art disambiguation system: first, we gather complementary information from equivalent definitions across different languages to provide context for disambiguation; then we refine the disambiguation output with a distributional approach based on semantic similarity. As a result, we obtain a multilingual corpus of textual definitions featuring over 38 million definitions in 263 languages, which we publicly release to the research community. We assess the quality of the sense annotations in SenseDefs both intrinsically and extrinsically, on Open Information Extraction and Sense Clustering tasks.
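
As a rough illustration of this two-stage strategy, the Python sketch below pools the equivalent definitions of one definiendum across languages into a shared context and then, for low-confidence annotations, re-picks the candidate sense whose vector is most similar to that context. Every callable here (disambiguate, sense_vector, context_vector) is a hypothetical stand-in: this is a minimal sketch of the idea, not the actual SenseDefs pipeline described in the paper.

    from typing import Callable, Dict, List, Sequence, Tuple

    Vector = Sequence[float]

    def cosine(u: Vector, v: Vector) -> float:
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0

    def disambiguate_definitions(
        definitions: Dict[str, str],             # language -> definition text
        disambiguate: Callable[[str], List[Tuple[str, str, List[str], float]]],
        sense_vector: Callable[[str], Vector],   # sense id -> vector
        context_vector: Callable[[str], Vector], # text -> vector
        threshold: float = 0.8,                  # illustrative confidence cutoff
    ) -> Dict[str, List[Tuple[str, str]]]:
        # Step 1: concatenate the equivalent definitions into one shared
        # context, so the disambiguator sees more evidence than one short gloss.
        pooled = " ".join(definitions.values())
        ctx = context_vector(pooled)

        annotations: Dict[str, List[Tuple[str, str]]] = {}
        for lang, text in definitions.items():
            refined = []
            # The (hypothetical) disambiguator returns, for each mention, the
            # chosen sense, the remaining candidate senses, and a confidence.
            for mention, sense, candidates, conf in disambiguate(text + " " + pooled):
                if conf < threshold and candidates:
                    # Step 2: distributional refinement - re-pick the candidate
                    # whose sense vector is closest to the pooled context.
                    sense = max(candidates, key=lambda s: cosine(sense_vector(s), ctx))
                refined.append((mention, sense))
            annotations[lang] = refined
        return annotations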

Highlights

  • In addition to lexicography, where their use is of paramount importance, textual definitions drawn from dictionaries or encyclopedias have been widely used in various Natural Language Processing (NLP) tasks and applications

  • We report the results of the original NASARI English lexical vectors (NASARI) and of the NASARI-based vectors obtained from the BabelNet semantic network enriched with SENSEDEFS (NASARI + SENSEDEFS)

  • In this paper we presented SENSEDEFS, a large-scale multilingual corpus of disambiguated textual definitions

Summary

Introduction

In addition to lexicography, where their use is of paramount importance, textual definitions drawn from dictionaries or encyclopedias have been widely used in various Natural Language Processing (NLP) tasks and applications. Textual definitions are today widely available in knowledge resources of various kinds, ranging from lexicons and dictionaries, such as WordNet (Miller et al. 1990) or Wiktionary, to encyclopedic knowledge bases, such as Wikidata.

Not only does BabelNet represent the largest sense inventory available for disambiguation and entity linking; its internal structure, based on inter-resource mappings, also enables us to collect all the definitional knowledge associated with a given definiendum across the various individual resources and in all available languages. This is a crucial step for context-rich disambiguation. In the following we describe the resources from which the definitions are extracted: WordNet (Sect. 2.1.1), Wikipedia (Sect. 2.1.2), Wikidata (Sect. 2.1.3), and Wiktionary and OmegaWiki (Sect. 2.1.4).
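
As a concrete (and hedged) example of collecting such definitional knowledge, the sketch below queries BabelNet's public HTTP API for all glosses attached to a single synset and groups them by language, keeping track of the source resource for each. The endpoint version segment and the response field names (glosses, language, gloss, source) follow recent releases of the API and may change; consult the official documentation at https://babelnet.org before relying on them.

    import requests

    def fetch_glosses(synset_id: str, api_key: str):
        # getSynset returns, among other data, the glosses attached to the
        # synset across all covered resources and languages.
        resp = requests.get(
            "https://babelnet.io/v9/getSynset",   # version path may differ
            params={"id": synset_id, "key": api_key},
            timeout=30,
        )
        resp.raise_for_status()
        by_language = {}
        for g in resp.json().get("glosses", []):
            # Each gloss records its language and the resource it comes from
            # (WordNet, Wikipedia, Wikidata, Wiktionary, OmegaWiki, ...).
            by_language.setdefault(g["language"], []).append(
                {"source": g.get("source"), "definition": g["gloss"]}
            )
        return by_language

    # Usage (with a valid API key and any BabelNet synset id):
    # print(fetch_glosses("bn:00000000n", "YOUR_API_KEY"))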
