Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks

Ruan Chaves Rodrigues, Jéssica Rodrigues, Pedro Vitor Quinta De Castro, Nádia Felix Felipe Da Silva, Anderson Soares

doi:10.1007/978-3-030-41505-1_23

Abstract

Deep neural language models that have achieved state-of-the-art results on downstream natural language processing tasks have recently been trained for the Portuguese language. However, studies that systematically evaluate such models are still needed for several applications. In this paper, we evaluate the performance of deep neural language models against classical word embeddings on the semantic similarity tasks provided by the ASSIN dataset, for both Brazilian Portuguese and European Portuguese. Our experiments indicate that the ELMo language model achieved better accuracy than any other pretrained model that has been made publicly available for the Portuguese language, and that performing vocabulary reduction on the dataset before training improved not only the standalone performance of ELMo, but also its performance when combined with classical word embeddings. We also demonstrate that FastText skip-gram embeddings can perform significantly better on semantic similarity tasks than previous studies in this field indicated.
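
To make the evaluation setting concrete, the following is a minimal sketch of one common baseline for sentence similarity with word embeddings: average the word vectors of each sentence, take the cosine between the two averages, and compare the predicted scores with gold scores using the Pearson correlation. It is an illustration only and is not taken from the paper; the abstract does not describe the authors' pipeline. The function names, the toy vectors in `toy_embeddings`, and the gold scores are all hypothetical.

```python
import math


def average_vector(tokens, embeddings):
    """Return the mean vector of the tokens found in `embeddings`, or None."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


# Hypothetical 2-dimensional vectors; real embeddings have hundreds of dimensions.
toy_embeddings = {
    "gato": [1.0, 0.1],
    "felino": [0.9, 0.2],
    "carro": [0.1, 1.0],
    "dorme": [0.5, 0.5],
}

# Hypothetical sentence pairs with made-up gold similarity scores.
pairs = [
    (["gato", "dorme"], ["felino", "dorme"], 4.5),
    (["gato", "dorme"], ["carro", "dorme"], 2.0),
    (["gato"], ["carro"], 1.0),
]

predicted = [
    cosine(average_vector(a, toy_embeddings), average_vector(b, toy_embeddings))
    for a, b, _ in pairs
]
gold = [score for _, _, score in pairs]
print("Pearson correlation:", pearson(predicted, gold))
```

In practice, the cosine score would usually be one feature fed to a regressor trained on the dataset's training split rather than used directly as the prediction, and the vectors would come from pretrained models such as those compared in this paper.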

Full Text