Abstract

The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country’s history. Currently, most Portuguese archives have made their finding aids available to the public in digital format; however, these data carry no annotation, so their content is not always easy to analyze. In this work, we created Named Entity Recognition (NER) solutions that identify and classify several types of named entities in archival finding aids. These named entities carry crucial information about their context and, when recognized with high confidence, can be used for several purposes, for example the creation of smart browsing tools based on entity linking and record linking techniques. To achieve high scores, we annotated several corpora to train our own Machine Learning (ML) models in this domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made available to the public through a web platform developed for this purpose, NER@DI.

Highlights

  • The first known Portuguese certificate was issued by Torre do Tombo (TT), an institution over 600 years old that is still the largest

  • In order to do so, we propose the use of Named Entity Recognition (NER), using

  • In most cases, the BiLSTM-Conditional Random Field (CRF) model generated with TensorFlow appears to obtain the best results, with an F1-score between 86.32% and 100%, followed by spaCy with an F1-score between 70.09% and 100%, and OpenNLP with an F1-score between 62.67% and 100%
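
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how an entity-level F1-score can be computed for NER output. It assumes BIO-style tags and exact span matching; the source does not state which tagging scheme or matching criterion was used, and the labels PER and ORG, as well as the function names, are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Set, Tuple

# (start index, end index exclusive, entity type) -- illustrative representation
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    An "I-X" tag that does not continue an open "X" entity is ignored.
    """
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))  # close the entity that was open
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def f1_score(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: a prediction counts only if span and type both match."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the predicted ORG span is one token too short,
# so only the PER entity counts as correct.
gold = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O"]
score = f1_score(gold, pred)
```

Under exact matching, a partially recovered entity contributes nothing, which is one reason scores can vary widely between entity types and corpora.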


Summary

Introduction

J.C. NER in Archival Finding Aids: Throughout the history of Portugal, there has been a need for an archive in which information about the kingdom was recorded. The first known Portuguese certificate was issued by Torre do Tombo (TT), an institution over 600 years old that is still the largest Portuguese archive, storing a significant part of Portuguese historical and administrative records. The volume of information contained in national archives has increased considerably, and today there are hundreds of archives spread across the country. Most of them have information from the public administration containing records from the

