Abstract
Active learning (AL) uses a data selection algorithm to select useful training samples to minimize annotation cost. This is now an essential tool for building low-resource syntactic analyzers such as part-of-speech (POS) taggers. Existing AL heuristics are generally designed on the principle of selecting uncertain yet representative training instances, where annotating these instances may reduce a large number of errors. However, in an empirical study across six typologically diverse languages (German, Swedish, Galician, North Sami, Persian, and Ukrainian), we found the surprising result that even in an oracle scenario where we know the true uncertainty of predictions, these current heuristics are far from optimal. Based on this analysis, we pose the problem of AL as selecting instances that maximally reduce the confusion between particular pairs of output tags. Extensive experimentation on the aforementioned languages shows that our proposed AL strategy outperforms other AL strategies by a significant margin. We also present auxiliary results demonstrating the importance of proper calibration of models, which we ensure through cross-view training, and analysis demonstrating how our proposed strategy selects examples that more closely follow the oracle data distribution. The code is publicly released.
Highlights
Part-of-speech (POS) tagging is a crucial step for language understanding, both being used in automatic language understanding applications such as named entity recognition (NER; Ankita and Nazeer, 2018) and question answering (QA; Wang et al., 2018), and being used in manual lan-
With the help of a senior Griko linguist (Linguist3), we identified a few types of conjunctions that are always coordinating: variations of "and" and of "or" (e or i)
We have presented a novel active learning method for low-resource POS tagging that works by reducing confusion between output tags
Summary
Because we would like to correct errors where tokens with true labels of DET are mislabeled by the model as PRO, asking the human annotator to tag an instance whose true label is PRO, even if it is uncertain, is unlikely to be of much benefit. Inspired by this observation, we pose the problem of AL for POS tagging as selecting tokens that maximally reduce the confusion between the output tags. We also collect 300 new token-level annotations that will help further Griko NLP
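To make the DET/PRO example concrete, the idea of targeting a confused tag pair can be sketched as follows. This is a minimal illustration, not the paper's actual scoring function: it assumes a small annotated development set for estimating the confusion matrix, a hypothetical `predict` callable giving the model's predicted tag for a token, and a simple heuristic that prioritizes unlabeled tokens predicted as the "wrong" side of the most confused pair, since those are the tokens most likely to carry the confused gold tag.

```python
from collections import Counter

def most_confused_pair(gold_tags, pred_tags):
    """Find the (gold, predicted) tag pair that the model confuses
    most often, estimated on a small annotated development set."""
    conf = Counter((g, p) for g, p in zip(gold_tags, pred_tags) if g != p)
    return conf.most_common(1)[0][0] if conf else None

def select_tokens(unlabeled, predict, pair, budget):
    """Pick up to `budget` unlabeled tokens whose predicted tag is the
    'wrong' side of the most confused pair (hypothetical heuristic,
    standing in for the paper's confusion-reduction objective)."""
    gold, pred = pair
    chosen = [tok for tok in unlabeled if predict(tok) == pred]
    return chosen[:budget]

# Toy usage: DET is mislabeled as PRO twice, so (DET, PRO) is the
# most confused pair; annotation effort then goes to tokens the
# model currently tags as PRO.
gold = ["DET", "PRO", "DET", "NOUN"]
pred = ["PRO", "PRO", "PRO", "NOUN"]
pair = most_confused_pair(gold, pred)          # ("DET", "PRO")
model = {"the": "PRO", "it": "PRO", "cat": "NOUN"}
batch = select_tokens(["the", "it", "cat"], model.get, pair, budget=2)
```

In a real system the confusion estimate would come from the model's calibrated posteriors rather than hard predictions, but the selection principle is the same: spend the annotation budget where a specific tag-pair confusion can be resolved.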