Abstract
In an era of unstructured data abundance, you would think that we have solved our data requirements for building robust language processing systems. This is not the case, however, when we think on a global scale: of the more than 7,000 languages in the world, only a handful have digital resources. Systems that perform well at scale typically require annotated resources that cover genre and domain divides. Moreover, the scarcity of resources in many languages reflects the digital disparity across societies, leading to inadvertent biases in the systems we build. In this talk I will present solutions for low-resource scenarios, both across domains and genres and across languages.

I will first address data paucity from the angle of devising principled metrics for data selection. Summarizing data samples with quantitative measures has a long history, with descriptive statistics being a case in point. Yet as natural language processing methods flourish, there are still few characteristic metrics that describe a collection of texts in terms of the words, sentences, or paragraphs it comprises. We propose metrics of diversity, density, and homogeneity that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. A series of simulations verifies that each metric holds the desired properties and resonates with human intuition, and experiments on real-world datasets demonstrate that the proposed metrics are highly correlated with the text classification performance of a well-known model, BERT, which could inspire future applications. We look specifically at intent classification (IC) and sentiment analysis.

On the modeling side, for low-resource genre and domain scenarios, we investigate few-shot learning techniques for intent classification (IC) and sequence-labeling models for slot filling (SF), both core components of dialogue systems for task-oriented chatbots. Current IC/SF models perform poorly when the number of training examples per class is small. We propose a new few-shot learning task, few-shot IC/SF, to study and improve the performance of IC and SF models on classes not seen at training time in ultra-low-resource scenarios, and we establish a few-shot IC/SF benchmark. We show that two popular few-shot learning algorithms, model-agnostic meta-learning (MAML) and prototypical networks, outperform a fine-tuning baseline on this benchmark.

From a multilingual perspective, we bootstrap cross-lingual systems by inducing word- and sentence-level representations. Most existing methods for automatic bilingual dictionary induction rely on prior alignments between the source and target languages, such as parallel corpora or seed dictionaries, which are not readily available for many language pairs. We propose an unsupervised approach for learning a bilingual dictionary for a pair of languages given their independently learned monolingual word embeddings. The method exploits local and global structure in the monolingual vector spaces to align them so that similar words are mapped to each other.

Finally, I will show how we use projection for cross-lingual emotion detection and semantic role labeling. We leverage a multitask learning framework coupled with annotation projection from a rich-resource language to a low-resource language through parallel data, and train predictive models on the projected data.
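To make the characteristic-metrics idea above concrete, here is a minimal illustrative sketch of how embedding-based diversity, density, and homogeneity scores might be computed for a text collection. The specific formulas (mean pairwise cosine distance, texts per unit of spread, inverse variance of pairwise similarities) and the `corpus_metrics` helper are simplifying assumptions for illustration, not the exact definitions used in the talk.

```python
import numpy as np

def corpus_metrics(embeddings: np.ndarray) -> dict:
    """embeddings: (n_texts, dim) array of sentence or document vectors."""
    n = embeddings.shape[0]
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                          # pairwise cosine similarities
    off_diag = sims[~np.eye(n, dtype=bool)]

    # Diversity: average pairwise cosine distance -- how dispersed the collection is.
    diversity = float(1.0 - off_diag.mean())

    # Density: number of texts per unit of occupied spread in embedding space.
    spread = embeddings.std(axis=0).mean()
    density = float(n / (spread + 1e-8))

    # Homogeneity: how evenly the texts are spaced; low variance in pairwise
    # similarities is taken here as a sign of a uniform collection.
    homogeneity = float(1.0 / (1.0 + off_diag.std()))

    return {"diversity": diversity, "density": density, "homogeneity": homogeneity}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_vectors = rng.normal(size=(100, 768))        # e.g., BERT sentence vectors
    print(corpus_metrics(fake_vectors))
```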
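Prototypical networks, one of the few-shot learners mentioned above, can be summarized in a few lines: each class is represented by the mean embedding of its support examples, and queries are classified by their distance to these prototypes. The sketch below shows that episode loss for few-shot intent classification; the linear stand-in encoder and the toy episode are assumptions for illustration, not the benchmark setup from the talk.

```python
import torch
import torch.nn.functional as F

def prototypical_loss(encoder, support_x, support_y, query_x, query_y, n_classes):
    """Episode loss; support_y / query_y hold integer intent labels in [0, n_classes)."""
    support_emb = encoder(support_x)                  # (n_support, dim)
    query_emb = encoder(query_x)                      # (n_query, dim)

    # Prototype = mean embedding of each intent's support examples.
    prototypes = torch.stack(
        [support_emb[support_y == c].mean(dim=0) for c in range(n_classes)]
    )                                                 # (n_classes, dim)

    # Queries are scored by negative squared Euclidean distance to each prototype.
    dists = torch.cdist(query_emb, prototypes) ** 2   # (n_query, n_classes)
    log_probs = F.log_softmax(-dists, dim=1)
    return F.nll_loss(log_probs, query_y)

if __name__ == "__main__":
    encoder = torch.nn.Linear(32, 16)                 # stand-in sentence encoder
    support_x, query_x = torch.randn(15, 32), torch.randn(10, 32)
    support_y = torch.arange(5).repeat_interleave(3)  # 5 intents x 3 shots each
    query_y = torch.randint(0, 5, (10,))
    print(prototypical_loss(encoder, support_x, support_y, query_x, query_y, n_classes=5))
```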
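For the bilingual dictionary induction thread, the following sketch shows two standard building blocks often used in such pipelines: an orthogonal (Procrustes) mapping between two monolingual embedding spaces and nearest-neighbor retrieval of translations. The unsupervised initial alignment described in the talk, which exploits local and global structure, is abstracted away here; how the anchor correspondences for the Procrustes step are obtained is left open, so this is a sketch of the refinement and retrieval stages only.

```python
import numpy as np

def procrustes(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """Orthogonal map W minimizing ||src_vecs @ W - tgt_vecs||_F (closed form via SVD)."""
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt

def induce_dictionary(src_emb, tgt_emb, W, k=1):
    """Map every source word with W and retrieve its k nearest target words by cosine."""
    mapped = src_emb @ W
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = mapped @ tgt.T                             # (n_src, n_tgt) cosine similarities
    return np.argsort(-sims, axis=1)[:, :k]           # candidate translation indices
```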
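Finally, annotation projection through parallel data can be illustrated with a tiny sketch: token-level labels predicted or annotated on the rich-resource side are copied to the low-resource side along word alignments, and the projected data then trains the target-language model. The `project_labels` helper and the toy alignment are hypothetical; the talk's multitask setup is not shown.

```python
def project_labels(src_labels, alignments, tgt_len, default="O"):
    """Copy token-level labels from a source sentence to its translation.

    src_labels: labels on the rich-resource side (e.g., SRL or emotion tags).
    alignments: (src_index, tgt_index) pairs from a word aligner run on parallel data.
    tgt_len:    number of tokens in the target (low-resource) sentence.
    """
    tgt_labels = [default] * tgt_len
    for s, t in alignments:
        tgt_labels[t] = src_labels[s]
    return tgt_labels

# Example: project "I love it" -> "Ich liebe es" with a one-to-one alignment.
print(project_labels(["O", "B-POS", "O"], [(0, 0), (1, 1), (2, 2)], tgt_len=3))
```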