Abstract

A sufficient amount of annotated data is usually required to fine-tune pre-trained language models for downstream tasks. Unfortunately, obtaining labeled data can be costly, especially for multiple language varieties and dialects. We propose to self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce varieties using only resources from data-rich ones. We demonstrate the utility of our approach in the context of Arabic sequence labeling by using a language model fine-tuned on Modern Standard Arabic (MSA) only to predict named entities (NE) and part-of-speech (POS) tags on several dialectal Arabic (DA) varieties. We show that self-training is indeed powerful, improving zero-shot MSA-to-DA transfer by as much as ~10% F1 (NER) and 2% accuracy (POS tagging). We achieve even better performance in few-shot scenarios with limited amounts of labeled data. We conduct an ablation study and show that the observed performance boost results directly from augmenting the training data with DA examples via self-training. This opens up opportunities for developing DA models that exploit only MSA resources. Our approach can also be extended to other languages and tasks.
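
In practice, the recipe amounts to pseudo-labeling unlabeled DA text with a tagger fine-tuned on MSA gold data, keeping confident predictions, and re-training on the augmented set. The sketch below illustrates one such round with XLM-R as the token classifier; the tag set, confidence threshold, and the fine_tune() helper named in the trailing comment are illustrative assumptions, not the authors' exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "xlm-roberta-base"            # assumption: base-size XLM-R
CONF_THRESHOLD = 0.9                       # assumption: keep only confident pseudo-labels
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # assumption: CoNLL-style NER tags
id2label = dict(enumerate(LABELS))

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS), id2label=id2label)

def pseudo_label(model, da_sentences):
    """Tag unlabeled DA sentences with the MSA-trained model and keep those
    whose average word-level confidence clears the threshold."""
    model.eval()
    selected = []
    for words in da_sentences:             # each item: a list of word tokens
        enc = tokenizer(words, is_split_into_words=True,
                        return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**enc).logits.softmax(dim=-1)[0]   # (seq_len, n_labels)
        tags, confs, seen = [], [], set()
        for pos, wid in enumerate(enc.word_ids(0)):
            if wid is None or wid in seen:  # skip special tokens and later sub-word pieces
                continue
            seen.add(wid)
            conf, label_id = probs[pos].max(dim=-1)
            tags.append(id2label[label_id.item()])
            confs.append(conf.item())
        if confs and sum(confs) / len(confs) >= CONF_THRESHOLD:
            selected.append((words, tags))
    return selected

# One self-training round (fine_tune() is a hypothetical helper wrapping
# standard token-classification training on the labeled examples):
#   model = fine_tune(model, msa_labeled)              # teacher: MSA gold data only
#   pseudo = pseudo_label(model, da_unlabeled)         # pseudo-labeled DA sentences
#   model = fine_tune(model, msa_labeled + pseudo)     # student: augmented training set
```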

Highlights

  • Neural language models (Xu and Rudnicky, 2000; Bengio et al., 2003) with vectorized word representations (Mikolov et al., 2013) are currently core to a very wide variety of NLP tasks

  • We show that models trained on Modern Standard Arabic (MSA) for named entity recognition (NER) and POS tagging generalize poorly to dialectal inputs when used in zero-shot settings

  • We show the results of standard fine-tuning of XLM-R for the two tasks in question (a minimal fine-tuning sketch follows this list)
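
As a reference point for the last highlight, the following is a minimal sketch of what standard fine-tuning of XLM-R for token classification (POS tagging here) looks like with the Hugging Face transformers library. The toy tag set, the romanized example sentence, and the hyperparameters are illustrative assumptions, not the paper's setup.

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForTokenClassification

POS_TAGS = ["NOUN", "VERB", "ADP", "PROPN", "PUNCT"]   # assumption: toy POS tag set
label2id = {tag: i for i, tag in enumerate(POS_TAGS)}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(POS_TAGS))

def encode(words, tags):
    """Tokenize pre-split words and align word-level tags to sub-word pieces;
    only the first piece of each word gets a label, the rest get -100
    (ignored by the cross-entropy loss)."""
    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True)
    labels, seen = [], set()
    for wid in enc.word_ids(0):
        if wid is None or wid in seen:
            labels.append(-100)
        else:
            seen.add(wid)
            labels.append(label2id[tags[wid]])
    enc["labels"] = torch.tensor([labels])
    return enc

# Toy MSA-style sentence (romanized placeholder) with word-level POS tags.
batch = encode(["zAr", "Alwazyr", "AlqAhirap", "."],
               ["VERB", "NOUN", "PROPN", "PUNCT"])

optimizer = AdamW(model.parameters(), lr=2e-5)         # assumption: typical fine-tuning LR
model.train()
loss = model(**batch).loss                             # cross-entropy over labeled pieces
loss.backward()
optimizer.step()
```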


Summary

Introduction

Neural language models (Xu and Rudnicky, 2000; Bengio et al., 2003) with vectorized word representations (Mikolov et al., 2013) are currently core to a very wide variety of NLP tasks. Our few-shot experiments reveal that self-training is always a useful strategy that consistently improves over mere fine-tuning, even when all dialect-specific gold data are used for fine-tuning. We discover that self-training helps the model most by correcting false positives (59.7%). These include DA tokens whose MSA orthographic counterparts (Shaalan, 2014) are either named entities or trigger words that frequently co-occur with named entities in MSA. Such out-of-MSA tokens occur in highly dialectal contexts (e.g., interjections and idiomatic expressions employed in interpersonal social media communication) or ones …
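
The error analysis summarized above can be pictured with a small, simplified tally: compare the baseline (MSA-only) and self-trained predictions against gold NER tags and count how many baseline false positives the self-trained model corrects. The token sequences below are invented for illustration, and this token-level count is far coarser than the authors' analysis.

```python
# Tally baseline false positives (gold "O" tokens the baseline tagged as an
# entity) and how many of them the self-trained model tags correctly again.
def count_corrected_false_positives(gold, baseline, self_trained):
    fp = corrected = 0
    for g, b, s in zip(gold, baseline, self_trained):
        if g == "O" and b != "O":
            fp += 1
            if s == "O":
                corrected += 1
    return fp, corrected

gold         = ["O",     "B-PER", "I-PER", "O", "O"]
baseline     = ["B-LOC", "B-PER", "I-PER", "O", "B-ORG"]   # two false positives
self_trained = ["O",     "B-PER", "I-PER", "O", "B-ORG"]   # one of them corrected

fp, fixed = count_corrected_false_positives(gold, baseline, self_trained)
print(f"corrected {fixed}/{fp} baseline false positives")   # -> corrected 1/2
```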
