MELHISSA: a multilingual entity linking architecture for historical press articles

Elvys Linhares Pontes, Mickaël Coustaty, Ahmed Hamdi, Nicolas Sidere, Antoine Doucet, Jose G Moreno, Luis Adrián Cabrera-Diego, Emanuela Boros

doi:10.1007/s00799-021-00319-6

Abstract

Digital libraries play a key role in cultural heritage: they provide access to our culture and history by indexing books and historical documents such as newspapers and letters. They use natural language processing (NLP) tools to process these documents and enrich them with meta-information, such as named entities. Despite recent advances in NLP models, most are built for specific languages and contemporary documents, and are not optimized for historical material, which may contain language variations and optical character recognition (OCR) errors. In this work, we focus on the entity linking (EL) task, which is fundamental to indexing documents in digital libraries. We developed a Multilingual Entity Linking architecture for HIstorical preSS Articles (MELHISSA), composed of multilingual analysis, OCR correction, and filter analysis, to alleviate the difficulties that historical documents pose for the EL task. The source code is publicly available. Experiments were carried out on two historical document corpora covering five European languages (English, Finnish, French, German, and Swedish). The results show that our system improved global performance for all languages and datasets, achieving an F-score@1 of up to 0.681 and an F-score@5 of up to 0.787.

Highlights

Historical documents are an essential resource for understanding our cultural heritage
Digitized historical documents have allowed the use of natural language processing (NLP) tools, such as named entity recognition (NER) [9,10,11] and entity linking (EL) [5,7], to enrich the documents automatically
To adjust to historical documents, we developed several modules to handle multilingualism and errors stemming from the output of optical character recognition (OCR) systems

Summary

Introduction

Historical documents are an essential resource for understanding our cultural heritage. Digitized historical documents have allowed the use of natural language processing (NLP) tools, such as named entity recognition (NER) [9,10,11] and entity linking (EL) [5,7], to enrich the documents automatically. This has attracted the attention of numerous digital humanities researchers, since it enables quantitative analysis, e.g. finding patterns in historical documents that reveal cultural changes, variations in gender bias across historical periods, emerging technological trends, or transitions to new political ideas [3,4].

Results

Discussion

Conclusion