Abstract
Pre-trained language models (PLMs) for Tagalog can be categorized into two kinds: monolingual models and multilingual models. However, existing monolingual models are only trained in small-scale Wikipedia corpus and multilingual models fail to deal with Tagalog-specific knowledge needed for various downstream tasks. We train three existing models on a much larger corpus: BERT-uncased-base, ELECTRA-uncased-base and RoBERTa-base. At the pre-training stage, we construct a large-scale news text corpus for pre-training in addition to the existing open-source corpora. Experimental results show that our pre-trained models achieve consistently competitive results in various Tagalog-specific natural language processing (NLP) tasks including part-of-speech (POS) tagging, hate speech classification, dengue classification and natural language inference (NLI). Among them, POS tagging dataset is a self-constructed dataset aiming to alleviate the insufficient labeled resource for Tagalog. We will release all pre-trained models and datasets to the community, hoping to facilitate the future development of Tagalog NLP applications.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have