The Corpora They Are a-Changing: a Case Study in Italian Newspapers

Pierpaolo Basile,Rossella Varvara,Tommaso Caselli,Pierluigi Cassotti,Annalina Caputo

doi:10.18653/v1/2021.lchange-1.3

Abstract

Pierpaolo Basile, Annalina Caputo, Tommaso Caselli, Pierluigi Cassotti, Rossella Varvara. Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021. 2021.

Highlights

Natural languages are de facto living entities always subject to change and evolution
The Natural Language Processing (NLP) community has developed an interest in historical linguistics, and in particular in the study of lexical semantic change (LSC)
This work scrutinises the robustness of the LSCs detected by a common algorithm across different corpora

Summary

Introduction

Natural languages are de facto living entities always subject to change and evolution. Distributional models are powerful, yet they suffer from some limitations, namely: (i) they require large amounts of text; (ii) they are sensitive to the type of texts and the distribution (i.e., frequency) of the lexical items; and (iii) they tend to conflate different types of information and variables, such as semantic, social, and topical information.

This contribution investigates two strictly connected aspects: the reliability of LSC benchmark data, and the sensitivity of a state-of-the-art approach for LSC, grounded on the distributional hypothesis, when the source corpus changes. The results of our work will help to shed light on systems' robustness and stability by verifying whether methods tuned on one corpus can be directly applied to another.
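As a rough illustration of how a distributional approach can score semantic change, the sketch below compares the vector of one word in two time periods using cosine similarity. It is not the authors' system: it assumes the two vector spaces are already aligned, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def change_score(vec_t1, vec_t2):
    """Hypothetical change score: 1 - cosine similarity across periods.

    Assumes vec_t1 and vec_t2 represent the same word in two time
    periods and live in aligned (comparable) vector spaces.
    """
    return 1.0 - cosine_similarity(vec_t1, vec_t2)


# Toy vectors, invented for illustration only.
stable_t1, stable_t2 = [0.9, 0.1, 0.0], [0.8, 0.2, 0.0]
shifted_t1, shifted_t2 = [0.9, 0.1, 0.0], [0.0, 0.2, 0.9]

stable_score = change_score(stable_t1, stable_t2)
shifted_score = change_score(shifted_t1, shifted_t2)
```

Under these assumptions a word whose contexts barely move gets a score near 0, while a word whose contexts shift gets a score closer to 1. Because the vectors depend on the corpus they were trained on, the same word can receive different scores when the source corpus changes, which is the kind of sensitivity the paper examines.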

Methodology

Testing for Robustness and Independence

Models into the Wild

Conclusion and Future Work

Findings

B Cosine similarities