Abstract

Traditional text classification requires thousands of annotated examples or an additional Neural Machine Translation (NMT) system, both of which are expensive to obtain in real applications. This paper presents a Contrast-Enhanced Semi-supervised Text Classification (CEST) framework for label-limited settings that does not rely on any NMT system. We propose a certainty-driven sample selection method and a contrast-enhanced similarity graph to use data more efficiently in self-training, alleviating the annotation-scarcity problem. The graph imposes a smoothness constraint on the unlabeled data to improve the coherence and accuracy of pseudo-labels. Moreover, CEST formulates training as a "learning from noisy labels" problem and performs the optimization accordingly. A salient feature of this formulation is its explicit suppression of the severe error-propagation problem in conventional semi-supervised learning. With only 30 labeled examples per class for both the training and validation sets, CEST outperforms previous state-of-the-art algorithms by 2.11% accuracy and falls only 3.04% accuracy short of fully supervised fine-tuning of a pre-trained language model on thousands of labeled examples.
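
To make the certainty-driven selection and graph-smoothness ideas concrete, the sketch below shows one plausible way such a filter could work: keep an unlabeled sample only if its predicted distribution has low entropy and its pseudo-label agrees with most of its nearest neighbours on a cosine-similarity graph over sentence embeddings. This is an illustrative assumption, not the paper's actual algorithm; the function name, the thresholds, and the use of predictive entropy and k-nearest-neighbour agreement are all hypothetical stand-ins for CEST's selection and smoothness machinery.

import numpy as np

def certainty_driven_selection(probs, embeddings, k=5,
                               entropy_threshold=0.5,
                               smoothness_threshold=0.6):
    # probs:      (N, C) predicted class probabilities for unlabeled samples
    # embeddings: (N, D) sentence embeddings used to build the similarity graph
    pseudo_labels = probs.argmax(axis=1)

    # Certainty term: low predictive entropy means a confident prediction.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)

    # Smoothness term: fraction of the k nearest neighbours (by cosine
    # similarity) that share the sample's pseudo-label.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-similarity
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    agreement = (pseudo_labels[neighbours] == pseudo_labels[:, None]).mean(axis=1)

    keep = (entropy < entropy_threshold) & (agreement >= smoothness_threshold)
    return np.where(keep)[0], pseudo_labels[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(4), size=100)      # placeholder model outputs
    embeddings = rng.normal(size=(100, 32))          # placeholder embeddings
    idx, labels = certainty_driven_selection(probs, embeddings)
    print(f"selected {len(idx)} of 100 unlabeled samples for self-training")

In a self-training loop, the selected samples and their pseudo-labels would be added to the labeled pool before the next round of fine-tuning; CEST additionally treats these pseudo-labels as noisy and optimizes with a noise-robust objective to curb error propagation.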
