Combining Self-supervised Learning and Active Learning for Disfluency Detection
Spoken language is fundamentally different from written language in that it contains frequent disfluencies, or parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for use in downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle the training data bottleneck, in this work we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) sentence classification to distinguish original sentences from grammatically incorrect ones. We then combine these two tasks to jointly pre-train a neural network. The pre-trained network is subsequently fine-tuned using human-annotated disfluency detection training data. The self-supervised learning method captures task-specific knowledge for disfluency detection and achieves better performance than other supervised methods when fine-tuned on a small annotated dataset. However, because the pseudo training data are generated with simple heuristics and cannot fully cover all disfluency patterns, a performance gap remains compared to supervised models trained on the full training dataset. We further explore how to bridge this gap by integrating active learning into the fine-tuning process. Active learning reduces annotation costs by choosing the most informative examples to label, and can thus address the weakness of self-supervised learning when only a small annotated dataset is available. We show that by combining self-supervised learning with active learning, our model matches state-of-the-art performance with only about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
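As a concrete illustration of the active-learning step described above, the sketch below selects the most uncertain unlabeled sentences for annotation using mean per-token entropy from the current disfluency tagger. The abstract does not specify the acquisition function, so the entropy criterion and all function names here are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def sentence_uncertainty(token_probs):
    """Mean token-level entropy for one sentence.

    token_probs: array of shape (num_tokens, num_labels) with per-token
    label probabilities from the current disfluency tagger.
    """
    eps = 1e-12
    entropy = -np.sum(token_probs * np.log(token_probs + eps), axis=-1)
    return float(entropy.mean())

def select_for_annotation(pool_probs, budget):
    """Pick the `budget` most uncertain sentences from the unlabeled pool.

    pool_probs: list of (num_tokens, num_labels) arrays, one per sentence.
    Returns indices into the pool, most uncertain first.
    """
    scores = [sentence_uncertainty(p) for p in pool_probs]
    return [int(i) for i in np.argsort(scores)[::-1][:budget]]

# Example: 3 pool sentences, 2 labels (fluent / disfluent).
pool = [
    np.array([[0.9, 0.1], [0.8, 0.2]]),      # fairly confident
    np.array([[0.55, 0.45], [0.5, 0.5]]),    # uncertain
    np.array([[0.99, 0.01], [0.97, 0.03]]),  # very confident
]
print(select_for_annotation(pool, budget=1))  # -> [1]
```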
- Research Article
49
- 10.1609/aaai.v34i05.6456
- Apr 3, 2020
- Proceedings of the AAAI Conference on Artificial Intelligence
Most existing approaches to disfluency detection rely heavily on human-annotated data, which is expensive to obtain in practice. To tackle the training data bottleneck, we investigate methods for combining multiple self-supervised tasks, i.e., supervised tasks for which data can be collected without manual labeling. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled news data, and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) sentence classification to distinguish original sentences from grammatically incorrect ones. We then combine these two tasks to jointly train a network. The pre-trained network is then fine-tuned using human-annotated disfluency detection training data. Experimental results on the commonly used English Switchboard test set show that our approach achieves performance competitive with previous systems (trained on the full dataset) while using less than 1% (1,000 sentences) of the training data. Our method trained on the full dataset significantly outperforms previous methods, reducing the error by 21% on English Switchboard.
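The pseudo training data construction described above can be sketched as follows: clean sentences from unlabeled text are corrupted by random word insertions and deletions, and inserted words are tagged as the targets of the tagging pre-training task. The corruption probabilities, the source of noise words, and the tag names are assumptions for illustration, not the paper's exact recipe.

```python
import random

def make_pseudo_example(tokens, p_add=0.15, p_del=0.1, vocab=None):
    """Corrupt a clean sentence by randomly inserting and deleting words.

    Returns (noisy_tokens, tags) where tag 'ADD' marks inserted noise words
    (the tagging pre-training target) and 'O' marks original words.
    Deleted words simply disappear, which also yields grammatically broken
    sentences for the sentence-classification task.
    """
    vocab = vocab or tokens              # source of noise words is an assumption
    noisy, tags = [], []
    for tok in tokens:
        if random.random() < p_add:      # insert a random word before tok
            noisy.append(random.choice(vocab))
            tags.append("ADD")
        if random.random() < p_del:      # drop tok itself
            continue
        noisy.append(tok)
        tags.append("O")
    return noisy, tags

random.seed(0)
sent = "the cat sat on the mat".split()
print(make_pseudo_example(sent))
```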
- Research Article
14
- 10.34133/plantphenomics.0037
- Jan 1, 2023
- Plant Phenomics
The rise of self-supervised learning (SSL) methods in recent years presents an opportunity to leverage unlabeled and domain-specific datasets generated by image-based plant phenotyping platforms to accelerate plant breeding programs. Despite the surge of research on SSL, there has been a scarcity of research exploring the applications of SSL to image-based plant phenotyping tasks, particularly detection and counting tasks. We address this gap by benchmarking the performance of 2 SSL methods—momentum contrast (MoCo) v2 and dense contrastive learning (DenseCL)—against the conventional supervised learning method when transferring learned representations to 4 downstream (target) image-based plant phenotyping tasks: wheat head detection, plant instance detection, wheat spikelet counting, and leaf counting. We studied the effects of the domain of the pretraining (source) dataset on the downstream performance and the influence of redundancy in the pretraining dataset on the quality of learned representations. We also analyzed the similarity of the internal representations learned via the different pretraining methods. We find that supervised pretraining generally outperforms self-supervised pretraining and show that MoCo v2 and DenseCL learn different high-level representations compared to the supervised method. We also find that using a diverse source dataset in the same domain as or a similar domain to the target dataset maximizes performance in the downstream task. Finally, our results show that SSL methods may be more sensitive to redundancy in the pretraining dataset than the supervised pretraining method. We hope that this benchmark/evaluation study will guide practitioners in developing better SSL methods for image-based plant phenotyping.
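A minimal sketch of the transfer protocol this kind of benchmark relies on: a ResNet-50 backbone (which would be initialized from a MoCo v2 or DenseCL checkpoint in practice) has its classification head replaced by a regression head and is fine-tuned on a counting task. The checkpoint path, head design, and hyperparameters are assumptions, not the study's actual configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Backbone: a ResNet-50 whose weights would be loaded from a MoCo v2 /
# DenseCL checkpoint in practice (the checkpoint path below is hypothetical).
backbone = models.resnet50(weights=None)
# state = torch.load("moco_v2_resnet50.pth")       # hypothetical checkpoint
# backbone.load_state_dict(state, strict=False)

# Replace the classification head with a single-output regression head,
# e.g. for wheat spikelet or leaf counting.
backbone.fc = nn.Linear(backbone.fc.in_features, 1)

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(backbone.parameters(), lr=1e-3, momentum=0.9)

images = torch.randn(4, 3, 224, 224)               # dummy image batch
counts = torch.tensor([[3.0], [7.0], [5.0], [2.0]])
loss = criterion(backbone(images), counts)
loss.backward()
optimizer.step()
```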
- Research Article
46
- 10.1016/j.media.2022.102539
- Oct 1, 2022
- Medical Image Analysis
CS-CO: A Hybrid Self-Supervised Visual Representation Learning Method for H&E-stained Histopathological Images.
- Conference Article
2
- 10.1109/aiid51893.2021.9456562
- May 28, 2021
- Control Theory & Applications
Self-supervised learning can be adopted to mine deep semantic information from visual data without large amounts of human-annotated supervision by using a pretext task to pre-train a model. In this study, we propose a novel self-supervised learning paradigm, namely multi-task self-supervised (MTSS) representation learning. Unlike existing self-supervised learning methods, which pre-train neural networks on the pretext task and then fine-tune their parameters on the downstream task, in our scheme the downstream and pretext tasks are treated as primary and auxiliary tasks, respectively, and are trained simultaneously. Our method maximizes the similarity of two augmented views of an image as an auxiliary task and uses a multi-task network to train the primary task alongside the auxiliary task. We evaluated the proposed method on standard datasets and backbones through a rigorous experimental procedure. Experimental results reveal that the proposed MTSS achieves better performance and robustness than other self-supervised learning methods on multiple image classification datasets without using negative sample pairs or large batches. This simple yet effective method may prompt a rethinking of self-supervised learning.
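A minimal sketch of the joint primary/auxiliary training scheme described above: a shared encoder feeds both a classification head (primary task) and a projection head whose outputs from two augmented views are pushed to agree (auxiliary task), with the two losses summed. The toy encoder, loss weighting, and use of negative cosine similarity are assumptions, not MTSS's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNet(nn.Module):
    """Shared encoder with a classification head (primary task) and a
    projection head (auxiliary view-similarity task). Architecture details
    are illustrative, not the paper's exact configuration."""
    def __init__(self, num_classes=10, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
        self.classifier = nn.Linear(256, num_classes)
        self.projector = nn.Linear(256, feat_dim)

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h), F.normalize(self.projector(h), dim=-1)

def mtss_step(model, view1, view2, labels, alpha=0.5):
    logits, z1 = model(view1)
    _, z2 = model(view2)
    primary = F.cross_entropy(logits, labels)
    auxiliary = -(z1 * z2).sum(dim=-1).mean()   # maximize agreement of the two views
    return primary + alpha * auxiliary

model = MultiTaskNet()
v1, v2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)  # two augmented views
labels = torch.randint(0, 10, (8,))
loss = mtss_step(model, v1, v2, labels)
loss.backward()
```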
- Research Article
18
- 10.1109/jbhi.2023.3331626
- Feb 1, 2024
- IEEE Journal of Biomedical and Health Informatics
This paper presents a systematic investigation into the effectiveness of Self-Supervised Learning (SSL) methods for Electrocardiogram (ECG) arrhythmia detection. We begin by conducting a novel analysis of the data distributions of three popular ECG-based arrhythmia datasets: PTB-XL, Chapman, and Ribeiro. To the best of our knowledge, our study is the first to quantitatively explore and characterize these distributions in the area. We then perform a comprehensive set of experiments using different augmentations and parameters to evaluate the effectiveness of various SSL methods, namely SimCLR, BYOL, and SwAV, for ECG representation learning, where we observe the best performance achieved by SwAV. Furthermore, our analysis shows that SSL methods achieve results highly competitive with those of supervised state-of-the-art methods. To further assess the performance of these methods on both In-Distribution (ID) and Out-of-Distribution (OOD) ECG data, we conduct cross-dataset training and testing experiments. Our comprehensive experiments show almost identical results when comparing ID and OOD schemes, indicating that SSL techniques can learn highly effective representations that generalize well across different OOD datasets. This finding can have major implications for ECG-based arrhythmia detection. Lastly, to further analyze our results, we perform detailed per-disease studies on the performance of the SSL methods on the three datasets.
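For reference, the sketch below shows the NT-Xent objective used by SimCLR, applied to projected embeddings of two augmented views of each ECG segment. The temperature and embedding dimension are placeholders; this illustrates the general contrastive objective rather than the paper's exact training setup.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """NT-Xent (SimCLR) loss for two batches of embeddings of augmented
    ECG windows. z1, z2: (batch, dim) projections of the two views."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, d)
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    n = z.size(0)
    sim.fill_diagonal_(float("-inf"))                    # mask self-pairs
    targets = torch.arange(n, device=z.device)
    targets = (targets + n // 2) % n                     # positive = the other view
    return F.cross_entropy(sim, targets)

# Example: encoder outputs for 16 ECG segments, two augmentations each.
z1, z2 = torch.randn(16, 64), torch.randn(16, 64)
print(nt_xent(z1, z2).item())
```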
- Book Chapter
- 10.1007/978-3-030-32233-5_28
- Jan 1, 2019
In recent years, neural machine translation (NMT) has made great progress. Different models, such as neural networks using recurrence, convolution, and self-attention, have been proposed, and various online translation systems are available. Choosing the best translation among different systems has become a significant challenge. In this paper, we attempt to tackle this task, which can intuitively be framed as a Quality Estimation (QE) problem that requires sufficient human-annotated data in which each translation hypothesis is scored by a human. In practice, however, rich data with high-quality human annotations are not available. To solve this problem, we resort to bilingual training data and propose a new method of mixed MT metrics to automatically score the translation hypotheses from different systems against their references, so as to construct pseudo human-annotated data. Based on the pseudo training data, we further design a novel QE model based on Multi-BERT and Bi-RNN with a joint-encoding strategy. Extensive experiments demonstrate that our proposed method achieves promising results on the task of selecting the best translation from various systems.
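A minimal sketch of how pseudo QE labels can be built from bilingual data by scoring each system's hypothesis against the reference with a mix of automatic metrics. The specific metrics (sentence-level BLEU and chrF via sacrebleu) and the mixing weights are assumptions; the paper's mixed-metric recipe may differ.

```python
# Requires `pip install sacrebleu`.
import sacrebleu

def pseudo_qe_score(hypothesis, reference, w_bleu=0.5, w_chrf=0.5):
    """Mix two sentence-level metrics into a single 0-1 pseudo QE label."""
    bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score   # 0-100
    chrf = sacrebleu.sentence_chrf(hypothesis, [reference]).score   # 0-100
    return (w_bleu * bleu + w_chrf * chrf) / 100.0

reference = "the cat sat on the mat"
systems = {
    "sys_a": "the cat sat on the mat",
    "sys_b": "a cat is sitting on a mat",
}
pseudo_labels = {name: pseudo_qe_score(hyp, reference) for name, hyp in systems.items()}
best = max(pseudo_labels, key=pseudo_labels.get)
print(pseudo_labels, best)
```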
- Conference Article
18
- 10.1109/iccvw54120.2021.00123
- Oct 1, 2021
Many self-supervised learning (SSL) methods have been successful in learning semantically meaningful visual representations by solving pretext tasks. However, prior work in SSL focuses on tasks like object recognition or detection, which aim to learn object shapes and assume that the features should be invariant to concepts like colors and textures. Thus, these SSL methods perform poorly on downstream tasks where these concepts provide critical information. In this paper, we present an SSL framework that enables us to learn color- and texture-aware features without requiring any labels during training. Our approach consists of three self-supervised tasks, designed to capture concepts neglected in prior work, that we can select from depending on the needs of our downstream tasks. Our tasks include learning to predict color histograms and discriminating shapeless local patches and textures from each instance. We evaluate our approach on fashion compatibility using Polyvore Outfits and In-Shop Clothing Retrieval using DeepFashion, improving upon prior SSL methods by 9.5-16%, and even outperforming some supervised approaches on Polyvore Outfits despite using no labels. We also show that our approach can be used for transfer learning, demonstrating that we can train on one dataset while achieving high performance on a different dataset.
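As an illustration of the color-prediction pretext task mentioned above, the sketch below builds a normalized per-channel RGB histogram that an encoder could be trained to regress. The bin count and normalization are assumptions, not the paper's exact target definition.

```python
import numpy as np

def color_histogram_target(image, bins=8):
    """Normalized per-channel RGB histogram used as the regression target
    for the color-prediction pretext task. `image` is an HxWx3 uint8 array."""
    hists = []
    for c in range(3):
        h, _ = np.histogram(image[..., c], bins=bins, range=(0, 256))
        hists.append(h / h.sum())
    return np.concatenate(hists)             # shape (3 * bins,)

image = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
target = color_histogram_target(image)
print(target.shape, target.sum())             # (24,) and ~3.0 (one per channel)
```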
- Research Article
5
- 10.1016/j.procs.2023.08.184
- Jan 1, 2023
- Procedia Computer Science
Self-Supervised Learning with Atom Replacement for Catalyst Energy Prediction by Graph Neural Networks
- Research Article
46
- 10.1016/j.jpowsour.2021.230584
- Dec 1, 2021
- Journal of Power Sources
Self-supervised reinforcement learning-based energy management for a hybrid electric vehicle
- Research Article
- 10.15323/mint.2022.8.2.3.18
- Aug 31, 2022
- Moving Image & Technology (MINT)
In the field of video representation, self-supervised learning has been applied efficiently to pre-training for downstream tasks using large amounts of unlabeled data. The basic approaches are pretext-task methods and contrastive learning methods. First, in a pretext-task method, a user defines a new problem and uses it as a proxy objective for self-supervised learning. Second, contrastive learning predicts the relationship between instances under the assumption that features extracted by a model carry similar information for related instances. With the recent popularity of unsupervised learning, a variety of self-supervised methods beyond these are also used in video representation learning. Effective video representation learning is achieved by combining the multimodal nature of video, in particular audio-visual information, with various deep learning techniques. In this paper, recent representative methods for self-supervised video representation learning are summarized and described. Additionally, we provide a brief overview of how to utilize multimodal (audio-visual) information, which is a key strength of video.
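A toy sketch of one common way to exploit audio-visual multimodality in self-supervised video representation learning: an audio-visual correspondence pretext task that predicts whether a clip embedding and an audio embedding come from the same video. The placeholder encoders and dimensions are assumptions and do not correspond to any specific surveyed method.

```python
import torch
import torch.nn as nn

class AVCorrespondence(nn.Module):
    """Audio-visual correspondence pretext task: given a video-clip embedding
    and an audio embedding, predict whether they come from the same clip.
    Encoders here are placeholders, not any surveyed model."""
    def __init__(self, dim=128):
        super().__init__()
        self.video_enc = nn.Linear(512, dim)   # stand-in for a video backbone
        self.audio_enc = nn.Linear(128, dim)   # stand-in for an audio backbone
        self.head = nn.Linear(2 * dim, 2)      # match / mismatch classifier

    def forward(self, video_feat, audio_feat):
        v, a = self.video_enc(video_feat), self.audio_enc(audio_feat)
        return self.head(torch.cat([v, a], dim=-1))

model = AVCorrespondence()
video = torch.randn(8, 512)
audio = torch.randn(8, 128)
labels = torch.randint(0, 2, (8,))             # 1 = aligned pair, 0 = shuffled pair
loss = nn.functional.cross_entropy(model(video, audio), labels)
loss.backward()
```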
- Research Article
3
- 10.1016/j.eswa.2017.07.016
- Jul 13, 2017
- Expert Systems with Applications
AnnoFin–A hybrid algorithm to annotate financial text
- Research Article
21
- 10.1038/s42003-023-05310-2
- Sep 11, 2023
- Communications Biology
Deep learning in bioinformatics is often limited to problems where extensive amounts of labeled data are available for supervised classification. By exploiting unlabeled data, self-supervised learning techniques can improve the performance of machine learning models in the presence of limited labeled data. Although many self-supervised learning methods have been suggested before, they have failed to exploit the unique characteristics of genomic data. Therefore, we introduce Self-GenomeNet, a self-supervised learning technique that is custom-tailored for genomic data. Self-GenomeNet leverages reverse-complement sequences and effectively learns short- and long-term dependencies by predicting targets of different lengths. Self-GenomeNet performs better than other self-supervised methods in data-scarce genomic tasks and outperforms standard supervised training with ~10 times fewer labeled training data. Furthermore, the learned representations generalize well to new datasets and tasks. These findings suggest that Self-GenomeNet is well suited for large-scale, unlabeled genomic datasets and could substantially improve the performance of genomic models.
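A minimal sketch of the two data-side ideas described above: computing the reverse-complement of a nucleotide sequence and splitting a sequence into a context and a prediction target of configurable length. Function names are illustrative and not Self-GenomeNet's actual API.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}

def reverse_complement(seq):
    """Reverse-complement of a DNA sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def context_and_target(seq, target_len):
    """Use the first part of the sequence as context and the last
    `target_len` bases as the prediction target."""
    return seq[:-target_len], seq[-target_len:]

seq = "ATGCGTAACGT"
print(reverse_complement(seq))        # ACGTTACGCAT
print(context_and_target(seq, 4))     # ('ATGCGTA', 'ACGT')
```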
- Research Article
35
- 10.1016/j.compag.2023.107967
- Jun 9, 2023
- Computers and Electronics in Agriculture
CLA: A self-supervised contrastive learning method for leaf disease identification with domain adaptation
- Research Article
3
- 10.1007/s10815-024-03080-2
- Mar 12, 2024
- Journal of Assisted Reproduction and Genetics
We study the effectiveness of whole-scenario embryo identification using a self-supervised learning encoder (WISE) in in vitro fertilization (IVF) across time-lapse, cross-device, and cryo-thawed scenarios. WISE is based on the vision transformer (ViT) architecture and masked autoencoders (MAE), a self-supervised learning (SSL) method. To train WISE, we prepared three datasets: an SSL pre-training dataset, a time-lapse identification dataset, and a cross-device identification dataset. To identify whether pairs of images come from the same embryo in the different downstream identification scenarios, embryo images (time-lapse and microscope images) are first pre-processed through object detection, cropping, padding, and resizing, and then fed into WISE to obtain predictions. WISE accurately identifies embryos in all three scenarios: accuracy is 99.89% on the time-lapse identification dataset and 83.55% on the cross-device identification dataset. In addition, we subdivided a cryo-thawed evaluation set from the cross-device test set to better estimate how WISE performs in the real world, where it reached an accuracy of 82.22%. Applying the SSL method yielded improvements of approximately 10% on the cross-device and cryo-thawed identification tasks. WISE also demonstrated accuracy improvements of 9.5%, 12%, and 18% over embryologists in the three scenarios. SSL methods can improve embryo identification accuracy even when dealing with cross-device and cryo-thawed paired images. This study is the first to apply SSL to embryo identification, and the results show the promise of WISE for future application in embryo witnessing.
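A minimal sketch of the crop/pad/resize pre-processing step described above, applied to a detected embryo region before it is fed to the encoder. The box format, padding color, and output size are assumptions rather than the study's exact settings.

```python
from PIL import Image

def preprocess_embryo(image, box, size=224):
    """Crop the detected embryo region, pad it to a square, and resize.

    `box` is an (left, top, right, bottom) detection box from the object
    detection step; padding is zero (black) and centered."""
    crop = image.crop(box)
    w, h = crop.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), (0, 0, 0))
    canvas.paste(crop, ((side - w) // 2, (side - h) // 2))
    return canvas.resize((size, size))

img = Image.new("RGB", (800, 600), (40, 40, 40))          # dummy microscope frame
out = preprocess_embryo(img, box=(300, 200, 500, 420))
print(out.size)                                            # (224, 224)
```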
- Research Article
7
- 10.1016/j.ins.2022.11.022
- Nov 14, 2022
- Information Sciences
Many self-supervised representation learning methods have achieved high performance in image classification tasks. However, these methods have limited performance on localization tasks such as object detection or semantic segmentation. Most self-supervised representation learning methods are optimized with only a single global representation, which pays little attention to the spatial information in an image. We propose a simple and effective method that exploits the positional relationships between the entities in an image by shuffling the convolution kernels. Our method extends current self-supervised learning and calculates the pixel-wise (dis)similarities between the output of the standard convolution kernels and that of the randomly shuffled convolution kernels. Our proposed method achieves higher performance on object detection, instance segmentation, and semantic segmentation when attached to recent self-supervised learning methods.
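One plausible reading of the kernel-shuffling idea described above is sketched below: the spatial positions within each convolution kernel are permuted, and the pixel-wise cosine similarity between the responses of the original and shuffled kernels is computed. The exact shuffling scheme and how this similarity enters the training loss are assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def shuffled_kernel_similarity(feat, weight):
    """Pixel-wise cosine similarity between the responses of the original
    convolution kernels and spatially shuffled copies of those kernels.

    feat:   (batch, in_c, H, W) feature map
    weight: (out_c, in_c, k, k) convolution kernels
    """
    out_c, in_c, k, _ = weight.shape
    perm = torch.randperm(k * k)
    shuffled = weight.reshape(out_c, in_c, k * k)[..., perm].reshape(out_c, in_c, k, k)
    y_std = F.conv2d(feat, weight, padding=k // 2)
    y_shuf = F.conv2d(feat, shuffled, padding=k // 2)
    return F.cosine_similarity(y_std, y_shuf, dim=1)   # (batch, H, W)

feat = torch.randn(2, 16, 32, 32)                      # input feature map
weight = torch.randn(32, 16, 3, 3)                     # conv kernels
sim = shuffled_kernel_similarity(feat, weight)
print(sim.shape)                                        # torch.Size([2, 32, 32])
```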