Improving Intent Classification Using Unlabeled Data from Large Corpora

Gabriel Bercaru, Costin-Gabriel Chiru, Ciprian-Octavian Truică, Traian Rebedea

doi:10.3390/math11030769

Open Access

Journal: Mathematics	Publication Date: Feb 3, 2023
Citations: 2	License type: CC BY 4.0

Affiliation: Polytechnic University of Bucharest

Abstract

Intent classification is a central component of a Natural Language Understanding (NLU) pipeline for conversational agents. The quality of such a component depends on the quality of the training data; however, for many conversational scenarios the data are scarce, and data augmentation techniques are used to compensate. General data augmentation methods that transfer to many datasets are therefore highly desirable. The work presented in this paper is centered around two main components. First, we explore the influence of various feature vectors on the task of intent classification using RASA’s text classification capabilities. Second, we propose a generic method for efficiently augmenting textual corpora using large datasets of unlabeled data. The proposed method efficiently mines examples similar to the ones already present in standard natural language corpora. The experimental results show that our corpus augmentation method increases text classification accuracy in few-shot settings. In particular, the gains in accuracy reach up to 16% when the number of labeled examples is very low (e.g., two examples). We believe that our method is important for any Natural Language Processing (NLP) or NLU task in which labeled training data are scarce or expensive to obtain. Lastly, we give some insights into future work, which aims at combining our proposed method with a semi-supervised learning approach.
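The general idea of mining an unlabeled corpus for sentences similar to a few labeled seeds can be illustrated with a minimal Python sketch. The abstract does not state which feature vectors or similarity measure the mining step uses, so the bag-of-words cosine similarity, the 0.5 threshold, and the names bag_of_words, cosine and mine_similar are all assumptions made for illustration and should not be read as the authors' implementation.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase whitespace tokenization into a term-count vector (assumed features)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mine_similar(seeds, unlabeled, threshold=0.5):
    """Give each unlabeled sentence the intent of its most similar seed.

    A sentence is kept only if that similarity reaches the threshold
    (hypothetical value; it would need tuning on real data).
    """
    seed_vectors = [(bag_of_words(text), intent) for text, intent in seeds]
    mined = []
    for sentence in unlabeled:
        vector = bag_of_words(sentence)
        score, intent = max(
            ((cosine(vector, sv), si) for sv, si in seed_vectors),
            key=lambda pair: pair[0],
        )
        if score >= threshold:
            mined.append((sentence, intent, score))
    return mined


# Few-shot setting: one labeled seed per intent (toy data, not from the paper).
seeds = [
    ("book a flight to paris", "book_flight"),
    ("what is the weather today", "get_weather"),
]
unlabeled = [
    "book a flight to london",
    "what is the weather tomorrow",
    "my cat sleeps all day",
]
augmented = mine_similar(seeds, unlabeled)
for sentence, intent, score in augmented:
    print(f"{intent}\t{score:.2f}\t{sentence}")
```

In this toy run the two on-topic sentences are added to the training set with the intent of their nearest seed, while the unrelated sentence falls below the threshold and is discarded. On a corpus of realistic size, the exhaustive comparison against every seed would likely be replaced by dense sentence embeddings and an approximate nearest-neighbour index.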

Full Text