Abstract
Text classification is a widely studied problem with broad applications. In many real-world problems, the number of texts available for training classification models is limited, which makes these models prone to overfitting. To address this problem, we propose SSL-Reg, a data-dependent regularization approach based on self-supervised learning (SSL). SSL (Devlin et al., 2019a) is an unsupervised learning approach that defines auxiliary tasks on input data without using any human-provided labels and learns data representations by solving these auxiliary tasks. In SSL-Reg, a supervised classification task and an unsupervised SSL task are performed simultaneously. The SSL task is defined purely on the input texts, without using any human-provided labels. Training a model with an SSL task can prevent the model from overfitting to the limited number of class labels in the classification task. Experiments on 17 text classification datasets demonstrate the effectiveness of our proposed method. Code is available at https://github.com/UCSD-AI4H/SSReg.
Highlights
Text classification (Korde and Mahender, 2012; Lai et al., 2015; Wang et al., 2017; Howard and Ruder, 2018) is a widely studied problem in natural language processing and finds broad applications.
To address overfitting problems in text classification, we propose a data-dependent regularizer called SSL-Reg based on self-supervised learning (SSL) (Devlin et al., 2019a; He et al., 2019; Chen et al., 2020) and use it to regularize the training of text classification models, where a supervised classification task and an unsupervised SSL task are performed simultaneously.
We propose to use self-supervised learning to alleviate overfitting in text classification problems.
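The idea of performing a supervised classification task and an unsupervised SSL task simultaneously can be illustrated with a single training objective that adds the two losses. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes the SSL task is masked-token prediction, that the two losses are combined as a weighted sum, and that the weight `lam` and the function names are hypothetical choices made for this example.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def ssl_reg_loss(cls_logits, cls_label, mlm_logits, mlm_targets, lam=0.1):
    """Supervised loss plus a weighted self-supervised loss for one text.

    cls_logits:  class scores for the text (uses the human-provided label)
    mlm_logits:  one list of vocabulary scores per masked position
    mlm_targets: original token id at each masked position (no human labels)
    lam:         hypothetical trade-off weight, not a value from the paper
    """
    supervised = cross_entropy(cls_logits, cls_label)
    if mlm_targets:
        # Assumed SSL task: predict the original token at each masked position.
        ssl = sum(
            cross_entropy(scores, tok)
            for scores, tok in zip(mlm_logits, mlm_targets)
        ) / len(mlm_targets)
    else:
        ssl = 0.0
    # The SSL term depends only on the input text, so it acts as a
    # data-dependent regularizer on the shared model.
    return supervised + lam * ssl
```

With `lam=0` the objective reduces to ordinary supervised classification; larger values give the label-free task more influence on the shared representation.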
Summary
Text classification (Korde and Mahender, 2012; Lai et al., 2015; Wang et al., 2017; Howard and Ruder, 2018) is a widely studied problem in natural language processing and finds broad applications. For example, given the clinical notes of a patient, judge whether this patient has heart disease. In many real-world text classification problems, the texts available for training are often limited. For instance, it is difficult to obtain a large number of clinical notes from hospitals because of patient privacy concerns. It is well known that when training
More From: Transactions of the Association for Computational Linguistics