Abstract

This paper proposes a graph-based Named Entity Linking (NEL) algorithm named REDEN for the disambiguation of authors' names in French literary criticism texts and scientific essays from the 19th and early 20th centuries. The algorithm is described and evaluated according to the two phases of NEL reported in the current state of the art, namely candidate retrieval and candidate selection. REDEN leverages knowledge from different Linked Data sources in order to select candidates for each author mention, subsequently crawls data from other Linked Data sets using equivalence links (e.g., owl:sameAs), and finally fuses graphs of homologous individuals into a non-redundant graph well suited for graph centrality calculation; the resulting graph is used for choosing the best referent. The REDEN algorithm is distributed as open source and follows current standards in digital editions (TEI) and the Semantic Web (RDF), which makes its integration into the editorial workflow of digital editions in digital humanities and cultural heritage projects entirely plausible. Experiments are conducted, along with the corresponding error analysis, in order to test our approach and to study the weaknesses and strengths of our algorithm, thereby guiding further improvements of REDEN.
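To make the candidate-selection phase concrete, the Python sketch below illustrates the fuse-then-rank idea under simplifying assumptions: it is not the REDEN implementation (which is distributed separately as open source), the input structures are toy ones, owl:sameAs pairs are assumed to have been crawled already, and degree centrality stands in for whichever centrality measure is configured. All function and variable names are invented for illustration.

import networkx as nx

def fuse_and_select(candidates_per_mention, edges, same_as):
    """Pick one referent URI per mention.

    candidates_per_mention: {mention: [candidate URIs]} from candidate retrieval
    edges: (uri, uri) relations crawled from Linked Data sets
    same_as: (uri, uri) owl:sameAs equivalences between data sets
    """
    # Collapse homologous individuals: each owl:sameAs connected
    # component is represented by a single canonical node.
    sa = nx.Graph()
    sa.add_edges_from(same_as)
    canon = {}
    for component in nx.connected_components(sa):
        representative = min(component)  # deterministic choice
        for uri in component:
            canon[uri] = representative

    def c(uri):
        return canon.get(uri, uri)

    # Fuse everything into one non-redundant graph over canonical nodes.
    fused = nx.Graph()
    fused.add_edges_from((c(a), c(b)) for a, b in edges if c(a) != c(b))

    # Rank candidates by centrality; the most central one wins,
    # with None playing the role of NIL when no candidate exists.
    centrality = nx.degree_centrality(fused) if len(fused) else {}
    best = {}
    for mention, uris in candidates_per_mention.items():
        scored = [(centrality.get(c(u), 0.0), u) for u in uris]
        best[mention] = max(scored)[1] if scored else None
    return best

Collapsing owl:sameAs components before counting edges is the point of the fusion step: it prevents a candidate's centrality from being split across its equivalent nodes in different data sets.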

Highlights

  • To discover new information and to compare it to other sources of information are two important ‘scholarly primitives’, basic activities common to research across humanities disciplines [1], and especially to those which involve the study of textual sources

  • Mentions of persons in these texts were manually annotated by experts; Uniform Resource Identifiers (URIs) assigned to mentions are those from IDREF, or NIL when experts did not know to whom the mention refers or could not find an entry in IDREF

  • In previous experiments [10], we compared the correctness rates obtained by REDEN and a widely used Named Entity Linking (NEL) tool, DBSL

Introduction

To discover new information and to compare it to other sources of information are two important ‘scholarly primitives’, basic activities common to research across humanities disciplines [1], and especially to those which involve the study of textual sources. The XML-based Text Encoding Initiative (TEI) standard [2] for digital editions allows for the explicit encoding of information in texts, so that they become machine readable and searchable. XML-TEI enables the semantic enrichment of texts, namely, the annotation of portions of text with tags that connect them to other sources of information not present in the original text. If the target of the link contains structured, machine-readable information, semantically enriched texts can be processed and analysed in a non-linear and automatic way: connections between different (parts of) texts can be discovered, and data can be aggregated, compared and visualised. The production of quality digital editions is not an easy task and requires manual annotation and validation; Natural Language Processing (NLP) tools are therefore often used to speed up the process to a great extent.
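As a concrete illustration of such enrichment, the short Python sketch below builds a tiny TEI fragment in which a person mention carries a ref attribute pointing to an authority URI, then extracts the (surface form, referent) pairs. The TEI namespace is the real one; the IDREF-style URI and the sample sentence are invented for the example.

import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# A paragraph from a hypothetical digital edition: persName links the
# mention to an external authority record via @ref (the URI below is
# illustrative, not a verified IDREF entry).
fragment = f"""
<p xmlns="{TEI_NS}">As
  <persName ref="http://www.idref.fr/000000000">Victor Hugo</persName>
  observed, form and content are inseparable.</p>
"""

root = ET.fromstring(fragment)
# Once mentions carry URIs, a machine can follow each link to structured
# data instead of treating the name as an opaque string.
for el in root.iter(f"{{{TEI_NS}}}persName"):
    print(el.text, "->", el.get("ref"))
# Output: Victor Hugo -> http://www.idref.fr/000000000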
