Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages

Matej Ulčar, Marko Robnik-Šikonja

doi:10.5281/zenodo.5854584

Abstract

Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual models, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and the importance of a large training set, we compare the created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve on the results of existing models on all tested tasks in most situations.

Journal: Zenodo (CERN European Organization for Nuclear Research)
Publication Date: Dec 20, 2021
License type: cc-by
