Abstract

Active learning (AL) uses a data selection algorithm to choose useful training samples and thereby minimize annotation cost. It is now an essential tool for building low-resource syntactic analyzers such as part-of-speech (POS) taggers. Existing AL heuristics are generally designed on the principle of selecting uncertain yet representative training instances, on the assumption that annotating these instances will reduce a large number of errors. However, in an empirical study across six typologically diverse languages (German, Swedish, Galician, North Sami, Persian, and Ukrainian), we found the surprising result that even in an oracle scenario where we know the true uncertainty of predictions, these current heuristics are far from optimal. Based on this analysis, we pose the problem of AL as selecting instances that maximally reduce the confusion between particular pairs of output tags. Extensive experimentation on the aforementioned languages shows that our proposed AL strategy outperforms other AL strategies by a significant margin. We also present auxiliary results demonstrating the importance of proper calibration of models, which we ensure through cross-view training, and analysis demonstrating how our proposed strategy selects examples that more closely follow the oracle data distribution. The code is publicly released.
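The conventional heuristic the abstract refers to ranks unlabeled tokens by model uncertainty, for example by the entropy of each token's predicted tag distribution. The following is a minimal sketch of that baseline, not the paper's proposed method; the function names and the toy probabilities are illustrative assumptions.

```python
import numpy as np

def token_entropy(probs):
    """Predictive entropy of a token's tag distribution (higher = more uncertain)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return float(-(p * np.log(p)).sum())

def select_most_uncertain(token_probs, k):
    """Rank tokens by entropy and return the indices of the k most uncertain.

    token_probs: array of shape (num_tokens, num_tags) holding per-token
    tag posteriors from the current model.
    """
    scores = [token_entropy(p) for p in token_probs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Toy posteriors over three tags for three tokens:
probs = np.array([
    [0.50, 0.45, 0.05],   # near-tie between two tags: high entropy
    [0.98, 0.01, 0.01],   # confident prediction: low entropy
    [0.34, 0.33, 0.33],   # near-uniform: highest entropy
])
print(select_most_uncertain(probs, 2))  # -> [2, 0]
```

The paper's finding is that even with oracle-quality uncertainty estimates, this style of selection is far from optimal, motivating the confusion-reducing criterion instead.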

Highlights

  • Part-of-speech (POS) tagging is a crucial step for language understanding, both being used in automatic language understanding applications such as named entity recognition (NER; Ankita and Nazeer, 2018) and question answering (QA; Wang et al., 2018), and being used in manual lan-

  • With the help of a senior Griko linguist (Linguist3), we identified a few types of conjunctions that are always coordinating: variations of "and" and of "or" (e or i)

  • We have presented a novel active learning method for low-resource POS tagging that works by reducing confusion between output tags

Summary

Introduction

Part-of-speech (POS) tagging is a crucial step for language understanding, both being used in automatic language understanding applications such as named entity recognition (NER; Ankita and Nazeer, 2018) and question answering (QA; Wang et al., 2018), and being used in manual lan-. Because we would like to correct errors where tokens with true labels of DET are mislabeled by the model as PRO, asking the human annotator to tag an instance with a true label of PRO, even if it is uncertain, is not likely to be of much benefit. Inspired by this observation, we pose the problem of AL for POS tagging as selecting tokens that maximally reduce the confusion between the output tags. We collect 300 new token-level annotations, which will help further Griko NLP.
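The DET/PRO example above can be sketched as a selection rule: given a pair of tags the model frequently confuses, prefer tokens whose predicted probability mass is split between exactly those two tags, since their gold labels directly disambiguate the pair. This is an illustrative sketch of the idea, not the paper's exact criterion; the tagset, the scoring function, and the confusion pair are assumptions.

```python
import numpy as np

TAGS = ["DET", "PRO", "NOUN"]  # toy tagset for illustration

def confusion_pair_score(probs, i, j):
    """Score a token by how much probability mass is shared between tags i and j.

    A token whose posterior is concentrated on, and divided between, the two
    confused tags is a strong annotation candidate for disambiguating them.
    """
    return float(min(probs[i], probs[j]))

def select_for_pair(token_probs, pair, k):
    """Return indices of the k tokens most ambiguous between the given tag pair."""
    i, j = pair
    scores = [confusion_pair_score(p, i, j) for p in token_probs]
    return sorted(range(len(scores)), key=lambda s: -scores[s])[:k]

# Suppose a (hypothetical) dev-set confusion matrix shows DET is most often
# mistagged as PRO, so (DET, PRO) is the pair we want the annotator to resolve.
probs = np.array([
    [0.48, 0.47, 0.05],  # mass split between DET and PRO: prime candidate
    [0.05, 0.90, 0.05],  # confidently PRO: labeling it won't fix DET/PRO errors
    [0.10, 0.10, 0.80],  # mostly NOUN: irrelevant to the DET/PRO confusion
])
print(select_for_pair(probs, (0, 1), 1))  # -> [0]
```

Note how the confidently-PRO token scores low even though annotating it is cheap: as the text argues, labeling instances whose true tag is PRO does little to repair DET-to-PRO errors.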

Background
Confusion-Reducing Active Learning
Model Architecture
Cross-view Training Regimen
Cross-Lingual Transfer Learning
Simulation Experiments
Analysis
Oracle Results
Effect of Cross-View Training
Human Annotation Experiment
Results
Related Work
Conclusion