Abstract

State-of-the-art text classification models are dominated by deep neural networks, but they still suffer from poor generalization when trained with the cross-entropy loss. One reason is that training samples are usually annotated with hard labels, which ignore the relationships among labels; as a result, the output distributions may violate label correlations. Although the widely used contrastive learning can learn highly expressive feature representations that generalize well from the training set to the test set, contrastive learning in the embedding space has difficulty modeling label correlations and may still produce unreasonable label distributions. In this paper, we propose performing contrastive learning on label distributions and present a novel label-level contrastive learning (LLCL) paradigm that constrains the model from outputting unreasonable label distributions. We hypothesize that the label distributions of instances within the same class are more similar to one another than to those of instances from other classes. We introduce two label-level contrastive learning losses, namely a supervised contrastive loss and a self-supervised contrastive loss. Adding the proposed losses to the cross-entropy loss as regularizers when training a text classification model, we obtain an average improvement of 0.74% over the strong RoBERTa-Large baseline on ten datasets. In particular, contrastive learning in the label space captures label correlations more effectively than contrastive learning in the embedding space, yielding a 6.2% improvement in top-2 accuracy on the SST-5 dataset. We also show that the proposed method is especially effective on text classification tasks with a large label space or limited labeled data. Last but not least, our model does not rely on any specialized architecture, data augmentation method, or additional unsupervised data.
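The abstract does not give the exact formulation of the label-level losses, but the core idea can be sketched as follows: treat each instance's predicted label distribution (rather than its hidden embedding) as the representation to contrast, pulling together distributions of same-class instances and pushing apart those of different classes. The function name, the temperature parameter, the use of cosine similarity over softmaxed distributions, and the weighting coefficient lam below are illustrative assumptions, not the paper's definitive implementation.

# Minimal PyTorch sketch of a supervised contrastive loss over label
# distributions (assumed formulation; details may differ from the paper).
import torch
import torch.nn.functional as F

def label_level_supcon_loss(logits, labels, temperature=0.1):
    # Predicted label distributions act as the "representations".
    dist = F.softmax(logits, dim=-1)            # (B, C) label distributions
    dist = F.normalize(dist, dim=-1)            # unit norm for cosine similarity
    sim = dist @ dist.t() / temperature         # (B, B) pairwise similarities

    # Exclude self-similarity on the diagonal.
    batch = labels.size(0)
    eye = torch.eye(batch, dtype=torch.bool, device=logits.device)
    sim = sim.masked_fill(eye, float('-inf'))

    # Positives: other in-batch instances sharing the same gold label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye

    # Standard InfoNCE-style log-probabilities over the batch.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count)

    # Average only over anchors that have at least one positive.
    has_pos = pos_mask.any(dim=1)
    return loss[has_pos].mean() if has_pos.any() else logits.new_zeros(())

# Training objective as described in the abstract: cross-entropy plus the
# contrastive regularizer, with a hypothetical weight lam.
# total_loss = F.cross_entropy(logits, labels) + lam * label_level_supcon_loss(logits, labels)

Because the contrast is computed directly on softmax outputs, the loss penalizes label distributions that are inconsistent within a class, which is the mechanism the abstract credits for better capturing label correlations than embedding-space contrastive learning.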
