Linking Entities from Text to Hundreds of RDF Datasets for Enabling Large Scale Entity Enrichment

Michalis Mountantonakis, Yannis Tzitzikas

doi:10.3390/knowledge2010001

Open Access

Abstract

A growing number of approaches take a text as input and perform named entity recognition (or extraction) to link the recognized entities of the given text to RDF Knowledge Bases (or datasets). This makes it feasible to retrieve more information for these entities, which can be of primary importance for several tasks, e.g., facilitating manual annotation, hyperlink creation, content enrichment and improving data veracity. However, current approaches link the extracted entities to one or a few knowledge bases; therefore, it is not feasible to retrieve the URIs and facts of each recognized entity from multiple datasets or to discover the most relevant datasets for one or more extracted entities. To enable this functionality, we introduce a research prototype, called LODsyndesisIE, which exploits three widely used Named Entity Recognition and Disambiguation tools (i.e., DBpedia Spotlight, WAT and Stanford CoreNLP) for recognizing the entities of a given text. Afterwards, it links these entities to the LODsyndesis knowledge base, which offers data enrichment and discovery services for millions of entities over hundreds of RDF datasets. We introduce all the steps of LODsyndesisIE, and we provide information on how to exploit its services through its online application and its REST API. For the evaluation, we use three collections of texts: (i) to compare the effectiveness of combining different Named Entity Recognition tools, (ii) to measure the gain in terms of enrichment from linking the extracted entities to LODsyndesis instead of using a single or a few RDF datasets and (iii) to evaluate the efficiency of LODsyndesisIE.
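The idea of combining the output of several recognition tools can be illustrated with a minimal sketch. It assumes, purely for illustration, that each tool returns a mapping from a mention to a URI and that disagreements are resolved by majority vote; this is not claimed to be the strategy that LODsyndesisIE implements. The function name `combine_annotations`, the tool labels and the sample annotations are hypothetical.

```python
from collections import Counter


def combine_annotations(tool_outputs):
    """Merge entity links produced by several tools.

    tool_outputs: dict mapping a tool label to a dict {mention: uri}.
    Returns a dict {mention: uri} where each mention keeps the URI
    proposed by most tools (assumed voting rule, for illustration only).
    """
    votes = {}
    for links in tool_outputs.values():
        for mention, uri in links.items():
            votes.setdefault(mention, Counter())[uri] += 1

    combined = {}
    for mention, counter in votes.items():
        # max() returns the first maximal key in insertion order,
        # so a tie goes to the tool that was listed first.
        combined[mention] = max(counter, key=counter.get)
    return combined


# Made-up annotations from three hypothetical tools.
sample_outputs = {
    "tool_a": {"Elijah Wood": "dbp:Elijah_Wood", "Frodo": "dbp:Frodo_Baggins"},
    "tool_b": {"Elijah Wood": "dbp:Elijah_Wood"},
    "tool_c": {"Frodo": "ex:Frodo_other", "Peter Jackson": "dbp:Peter_Jackson"},
}

combined = combine_annotations(sample_outputs)
for mention, uri in sorted(combined.items()):
    print(mention, "->", uri)
```

Taking the union of mentions keeps entities that only one tool recognizes, while the vote settles conflicting links; other rules (for example, trusting one tool for recognition and another for linking) would fit the same interface.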

Highlights

The target of Information Extraction (IE) [1,2] approaches is to extract information from either unstructured or semi-structured sources
It is feasible to enrich the contents for the recognized entities, which can be of primary importance for several tasks, e.g., facilitating manual annotation, enabling hyperlink creation, offering content enrichment and improving data veracity
Regarding Hypothesis 1 (H1), in Section 4.3, we evaluate the gain of combining different tools for Entity Recognition, whereas in Section 4.4, we report measurements related to Hypothesis 2 (H2), i.e., for evaluating the gain of using multiple datasets for the recognized entities for several tasks

Summary

Introduction

The target of Information Extraction (IE) [1,2] approaches is to extract information from either unstructured (e.g., plain text) or semi-structured (e.g., relational databases) sources. Due to the growth and popularity of RDF Knowledge Bases (or datasets) [3,4], a large number of Entity Recognition (ER) approaches link the recognized entities of a given source (e.g., text) to popular RDF datasets [5]. However, such tools typically link the entities to one or a few datasets. Consequently, with such tools it is not trivial to provide services for the recognized entities by combining information from multiple datasets, e.g., it is difficult to find all the related URIs (Uniform Resource Identifiers) of each entity, to collect all its triples (i.e., facts) and to verify facts that are included in the given text. The triple ⟨dbp:Elijah_Wood, dbp:occupation, dbp:Actor⟩ contains three
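To make the notions of triples, related URIs and fact verification more concrete, the following is a minimal sketch in Python. It assumes a toy in-memory setting: the dataset names, the `ex:` URIs, the equivalence table and the helper names (`collect_facts`, `verify_fact`) are all hypothetical and do not reflect the LODsyndesis data model or API.

```python
from collections import namedtuple

# A triple (fact) consists of a subject, a predicate and an object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical toy datasets: dataset name -> list of triples.
DATASETS = {
    "dataset_a": [Triple("dbp:Elijah_Wood", "dbp:occupation", "dbp:Actor")],
    "dataset_b": [Triple("ex:ElijahWood", "ex:birthYear", "1981")],
    "dataset_c": [Triple("ex:SomeoneElse", "ex:occupation", "ex:Director")],
}

# Hypothetical equivalence table: URIs that denote the same entity.
SAME_AS = {"dbp:Elijah_Wood": {"dbp:Elijah_Wood", "ex:ElijahWood"}}


def collect_facts(entity_uri, datasets, same_as):
    """Return {dataset name: [triples]} for all URIs equivalent to entity_uri."""
    equivalent = same_as.get(entity_uri, {entity_uri})
    facts = {}
    for name, triples in datasets.items():
        matching = [t for t in triples if t.subject in equivalent]
        if matching:
            facts[name] = matching
    return facts


def verify_fact(triple, datasets):
    """Return the names of the datasets that contain exactly this triple."""
    return [name for name, triples in datasets.items() if triple in triples]


facts = collect_facts("dbp:Elijah_Wood", DATASETS, SAME_AS)
sources = verify_fact(
    Triple("dbp:Elijah_Wood", "dbp:occupation", "dbp:Actor"), DATASETS
)
print(sorted(facts))
print(sources)
```

With a single dataset, only the facts stored under one URI are reachable; once equivalent URIs are known, facts from every dataset that mentions the entity can be gathered in the same pass.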

Results

Discussion

Conclusion