Is self-supervised learning enough to fill in the gap? A study on speech inpainting
- Research Article
- 10.12688/wellcomeopenres.17148.2
- Feb 1, 2023
- Wellcome Open Research
Background: The success of many machine learning applications depends on knowledge about the relationship between the input data and the task of interest (output), a requirement that hinders the application of machine learning to novel tasks. End-to-end deep learning, which does not require intermediate feature engineering, has been recommended to overcome this challenge, but end-to-end deep learning models require large labelled training data sets that are often unavailable in many medical applications. In this study, we trained self-supervised learning (SSL) models for automatic feature extraction from raw photoplethysmography (PPG) signals obtained using a pulse oximeter, with the aim of predicting paediatric hospitalization. Methods: We compared logistic regression models fitted using features extracted with SSL against models trained using both clinical and SSL features. In addition, we compared end-to-end deep learning models initialized randomly or using weights from the SSL models. We also compared the performance of SSL models trained on labelled data alone (n=1,031) with SSL models trained using both labelled and unlabelled signals (n=7,578). Results: Logistic regression models were more predictive of hospitalization when trained on features extracted by the SSL model trained on labelled PPG signals only than on features from the SSL model trained on both labelled and unlabelled signals (AUC 0.83 vs 0.80). However, features extracted by the SSL model trained on both labelled and unlabelled PPG signals were more predictive of hospitalization when concatenated with clinical features (AUC 0.89 vs 0.87). The end-to-end deep learning model had an AUC of 0.80 when initialized using the SSL model trained on all PPG signals, 0.77 when initialized using the SSL model trained on labelled data only, and 0.73 when initialized randomly. Conclusions: This study shows that SSL can extract features from PPG signals that are predictive of hospitalization, or can be used to initialize end-to-end deep learning models. Furthermore, SSL can leverage larger unlabelled data sets to improve the performance of models fitted using small labelled data sets.
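A minimal sketch of the modelling step described above: a logistic regression fitted on SSL-extracted PPG features, alone and concatenated with clinical features. The arrays, dimensions, and random data below are placeholders for illustration only, not the authors' code or data.

```python
# Hedged sketch: logistic regression on hypothetical SSL-extracted PPG features,
# with and without concatenated clinical features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1031                                     # labelled PPG signals in the study
ssl_features = rng.normal(size=(n, 128))     # placeholder SSL embeddings
clinical_features = rng.normal(size=(n, 8))  # placeholder clinical variables
hospitalized = rng.integers(0, 2, size=n)    # placeholder binary outcome

for name, X in {
    "SSL only": ssl_features,
    "SSL + clinical": np.hstack([ssl_features, clinical_features]),
}.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, hospitalized, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.2f}")
```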
- Research Article
- 10.12688/wellcomeopenres.17148.1
- Sep 28, 2021
- Wellcome Open Research
Background: The success of many machine learning applications depends on knowledge about the relationship between the input data and the task of interest (output), a requirement that hinders the application of machine learning to novel tasks. End-to-end deep learning, which does not require intermediate feature engineering, has been recommended to overcome this challenge, but end-to-end deep learning models require large labelled training data sets that are often unavailable in many medical applications. In this study, we trained machine learning models to predict paediatric hospitalization from raw photoplethysmography (PPG) signals obtained from a pulse oximeter. We trained self-supervised learning (SSL) models for automatic feature extraction from PPG signals and assessed the utility of SSL in initializing end-to-end deep learning models trained on a small labelled data set, with the aim of predicting paediatric hospitalization. Methods: We compared logistic regression models fitted using features extracted with SSL against end-to-end deep learning models initialized either randomly or using weights from the SSL model. We also compared the performance of SSL models trained on labelled data alone (n=1,031) with SSL models trained using both labelled and unlabelled signals (n=7,578). Results: The SSL model trained on both labelled and unlabelled PPG signals produced features that were more predictive of hospitalization than the SSL model trained on labelled PPG only (AUC of the logistic regression model: 0.78 vs 0.74). The end-to-end deep learning model had an AUC of 0.80 when initialized using the SSL model trained on all PPG signals, 0.77 when initialized using the SSL model trained on labelled data only, and 0.73 when initialized randomly. Conclusions: This study shows that SSL can improve the classification of PPG signals, either by extracting the features required by logistic regression models or by initializing end-to-end deep learning models. Furthermore, SSL can leverage larger unlabelled data sets to improve the performance of models fitted using small labelled data sets.
- Research Article
10
- 10.1016/j.neucom.2022.10.076
- Nov 4, 2022
- Neurocomputing
Considering three elements of aesthetics: Multi-task self-supervised feature learning for image style classification
- Conference Article
7
- 10.1109/icct52962.2021.9657922
- Oct 13, 2021
Electroencephalography (EEG) is widely used for emotion recognition because of its exceptional stability and high detection accuracy. However, large amounts of labeled EEG data are difficult to obtain. Self-supervised representation learning with multi-transformation tasks is presented as a solution for emotion recognition. The approach consists of two stages: self-supervised representation learning and emotion recognition. Self-supervised learning is applied to learn high-level EEG representations from unlabeled data. Representation learning uses six different transformations to learn these representations comprehensively: noising, scaling, negating, horizontal flipping, permuting, and time-warping. The self-supervised network learns to recognize which transformation was applied; the weights of its convolutional layers are then frozen and transferred to the emotion recognition network, transferring the ability to discriminate EEG signals along with them. To the best of our knowledge, this is the first work in which self-supervised learning has been used for emotion recognition from EEG signals. The achieved accuracy of 98.64% is higher than that of all known fully supervised methods and is state-of-the-art to date, and self-supervised learning saves a considerable amount of data-labeling time. Our experiments show that applying self-supervised learning to EEG-based emotion recognition is feasible and effective.
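An illustrative sketch of the six signal transformations used as pretext labels. The shapes, parameter values, and function names are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch: six EEG transformations (noising, scaling, negating, horizontal
# flipping, permuting, time-warping) applied to a 1-D channel segment `x`.
import numpy as np

def noising(x, sigma=0.05):  return x + np.random.normal(0, sigma, x.shape)
def scaling(x, factor=1.5):  return x * factor
def negating(x):             return -x
def horizontal_flip(x):      return x[::-1]

def permuting(x, n_segments=4):
    segments = np.array_split(x, n_segments)
    np.random.shuffle(segments)           # reorder the segments randomly
    return np.concatenate(segments)

def time_warping(x, stretch=1.2):
    # Resample to a stretched time axis, then back to the original length.
    warped_len = int(len(x) * stretch)
    warped = np.interp(np.linspace(0, len(x) - 1, warped_len), np.arange(len(x)), x)
    return np.interp(np.linspace(0, warped_len - 1, len(x)), np.arange(warped_len), warped)

# Pretext task: the network is trained to predict which transformation (0-5)
# produced each augmented signal.
transforms = [noising, scaling, negating, horizontal_flip, permuting, time_warping]
x = np.random.randn(512)                  # one EEG channel segment
pretext_data = [(t(x), label) for label, t in enumerate(transforms)]
```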
- Conference Article
2
- 10.1109/aiid51893.2021.9456562
- May 28, 2021
- Control theory & applications
Self-supervised learning can be adopted to mine deep semantic information from visual data without large amounts of human-annotated supervision by using a pretext task to pretrain a model. In this study, we propose a novel self-supervised learning paradigm, namely multi-task self-supervised (MTSS) representation learning. Unlike existing self-supervised learning methods, which pretrain neural networks on the pretext task and then fine-tune the parameters on the downstream task, in our scheme the downstream and pretext tasks are treated as primary and auxiliary tasks, respectively, and are trained simultaneously. Our method maximizes the similarity of two augmented views of an image as an auxiliary task and uses a multi-task network to train the primary task alongside it. We evaluated the proposed method on standard datasets and backbones through a rigorous experimental procedure. Experimental results reveal that the proposed MTSS achieves better performance and robustness than other self-supervised learning methods on multiple image classification data sets without using negative sample pairs or large batches. This simple yet effective method may prompt a rethinking of self-supervised learning.
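A hedged sketch of the multi-task idea described above, not the authors' code: one backbone is trained jointly on the primary classification loss and an auxiliary loss that pulls together two augmented views of the same image. The module and head names are hypothetical.

```python
# Hedged sketch: joint primary (classification) and auxiliary (view similarity)
# training with a shared backbone, in the spirit of MTSS.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTSSNet(nn.Module):
    def __init__(self, backbone, feat_dim, n_classes):
        super().__init__()
        self.backbone = backbone                   # any feature extractor
        self.classifier = nn.Linear(feat_dim, n_classes)
        self.projector = nn.Linear(feat_dim, 128)  # head for the auxiliary task

    def forward(self, view1, view2, labels, aux_weight=0.5):
        f1, f2 = self.backbone(view1), self.backbone(view2)
        primary_loss = F.cross_entropy(self.classifier(f1), labels)
        z1, z2 = self.projector(f1), self.projector(f2)
        aux_loss = -F.cosine_similarity(z1, z2, dim=-1).mean()  # pull views together
        return primary_loss + aux_weight * aux_loss
```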
- Research Article
2
- 10.26689/jera.v5i3.2320
- Aug 17, 2021
- Journal of Electronic Research and Application
In recent years, self-supervised learning, which does not require large numbers of manual labels, has generated supervisory signals from the data itself to learn representations of samples. Self-supervised learning addresses the problem of learning semantic features from unlabeled data and enables pre-training of models on large data sets. Its significant advantages have been studied extensively in recent years. Self-supervised learning methods are usually of three types: generative, contrastive, and generative-contrastive. Contrastive learning models are relatively simple, and their performance on current downstream tasks is comparable to that of supervised methods. We therefore propose a conceptual analysis framework covering the data augmentation pipeline, architectures, pretext tasks, comparison methods, and semi-supervised fine-tuning. Based on this framework, we qualitatively analyze existing contrastive self-supervised learning methods for computer vision, further analyze their performance at different stages, and finally summarize the state of research on self-supervised contrastive learning methods in other fields.
- Book Chapter
8
- 10.5772/intechopen.104785
- Dec 21, 2022
Although its origins date back a few decades, contrastive learning has recently gained popularity due to its achievements in self-supervised learning, especially in computer vision. Supervised learning usually requires a decent amount of labeled data, which is not easy to obtain for many applications. With self-supervised learning, we can use inexpensive unlabeled data and train on a pretext task; such training helps us learn powerful representations. In most cases, the self-supervised model is then fine-tuned for a downstream task with the available labeled data. In this study, we review common pretext and downstream tasks in computer vision and present the latest self-supervised contrastive learning techniques, which are implemented as Siamese neural networks. Lastly, we present a case study in which self-supervised contrastive learning was applied to learn representations of semantic masks of images. Performance was evaluated on an image retrieval task, and the results reveal that, in accordance with findings in the literature, fine-tuning the self-supervised model yielded the best performance.
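For concreteness, below is a minimal NT-Xent (normalized temperature-scaled cross-entropy) loss, a contrastive objective commonly used in the Siamese setups the chapter discusses. This is an illustrative sketch under assumed tensor shapes, not the chapter's implementation.

```python
# Hedged sketch: NT-Xent contrastive loss over projections of two augmented views.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: (N, D) projections of two augmented views of the same N images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2N, D)
    sim = z @ z.t() / temperature                            # scaled cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))                    # drop self-similarity
    # For sample i, its positive is the other view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```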
- Preprint Article
- 10.20944/preprints202502.1894.v1
- Feb 24, 2025
Self-supervised learning (SSL) has emerged as a transformative paradigm in machine learning, enabling models to learn meaningful representations from vast amounts of unlabeled data. By leveraging pretext tasks that generate supervisory signals intrinsically from data, SSL has significantly reduced the need for costly human annotations and has demonstrated remarkable performance across diverse domains, including computer vision, natural language processing, speech processing, robotics, and healthcare. This survey provides a comprehensive overview of self-supervised learning, covering its fundamental principles, major methodological approaches, and real-world applications. We categorize SSL into four primary paradigms: contrastive learning, clustering-based learning, generative modeling, and predictive learning. We discuss the theoretical underpinnings of these approaches, highlight their strengths and limitations, and analyze their impact on downstream tasks. Additionally, we explore the integration of SSL with deep learning architectures and its role in improving model generalization, robustness, and efficiency. Despite its successes, SSL faces several challenges, including the computational cost of large-scale training, sensitivity to domain shifts, difficulties in designing optimal pretext tasks, and a lack of theoretical understanding. We outline open research questions and promising future directions, such as multimodal SSL, efficient pretraining techniques, self-supervised reinforcement learning, and fairness-aware SSL. As self-supervised learning continues to evolve, it holds the potential to redefine machine learning by enabling more scalable, efficient, and generalizable models. This survey aims to provide researchers and practitioners with a comprehensive understanding of SSL, facilitating further advancements in this rapidly growing field.
- Conference Article
27
- 10.21437/interspeech.2021-556
- Aug 30, 2021
Self-Supervised Learning (SSL) using huge amounts of unlabeled data has been successfully explored for image and natural language processing. Recent works have also investigated SSL from speech, notably succeeding in improving performance on downstream tasks such as automatic speech recognition (ASR). While these works suggest it is possible to reduce dependence on labeled data for building efficient speech systems, their evaluation was mostly made on ASR and under multiple, heterogeneous experimental settings (most of them for English). This calls into question the objective comparison of SSL approaches and the evaluation of their impact on building speech systems. In this paper, we propose LeBenchmark: a reproducible framework for assessing SSL from speech. It includes not only ASR (high- and low-resource) tasks but also spoken language understanding, speech translation, and emotion recognition. We also focus on speech technologies in a language other than English: French. SSL models of different sizes are trained from carefully sourced and documented datasets. Experiments show that SSL is beneficial for most but not all tasks, which confirms the need for exhaustive and reliable benchmarks to evaluate its real impact. LeBenchmark is shared with the scientific community for reproducible research in SSL from speech.
- Book Chapter
2
- 10.1007/978-3-031-00126-0_29
- Jan 1, 2022
Self-supervised representation learning of Multivariate Time Series (MTS) is a challenging task and has attracted increasing research interest in recent years. Many previous works focus on the pretext task of self-supervised learning and usually neglect the complex problem of MTS encoding, leading to unpromising results. In this paper, we tackle this challenge from two aspects, the encoder and the pretext task, and propose a unified channel-aware self-supervised learning framework, CaSS. Specifically, we first design a new Transformer-based encoder, the Channel-aware Transformer (CaT), to capture the complex relationships between different time channels of MTS. Second, we combine two novel pretext tasks, Next Trend Prediction (NTP) and Contextual Similarity (CS), for self-supervised representation learning with our proposed encoder. Extensive experiments are conducted on several commonly used benchmark datasets. The experimental results show that our framework achieves a new state of the art compared with previous self-supervised MTS representation learning methods (up to +7.70% improvement on the LSST dataset) and can be applied well to downstream MTS classification.
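A hedged sketch of one plausible form of the Next Trend Prediction (NTP) pretext task: for each channel of a multivariate window, the label records whether the series moves up or down immediately after the window. The exact definition in the paper may differ; this only illustrates the idea, and all names and shapes are assumptions.

```python
# Hedged sketch: per-channel up/down labels for a Next-Trend-Prediction-style
# pretext task on a multivariate time series.
import numpy as np

def ntp_labels(mts, window):
    """mts: (T, C) multivariate series; returns the (window, C) input and (C,) 0/1 labels."""
    x = mts[:window]                            # observed window
    future = mts[window]                        # first step after the window
    labels = (future > x[-1]).astype(np.int64)  # 1 = next value trends upward
    return x, labels

mts = np.random.randn(200, 6)                   # 200 steps, 6 channels (placeholder data)
x, y = ntp_labels(mts, window=128)
```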
- Research Article
38
- 10.3390/e26030252
- Mar 12, 2024
- Entropy
Deep neural networks excel in supervised learning tasks but are constrained by the need for extensive labeled data. Self-supervised learning emerges as a promising alternative, allowing models to learn without explicit labels. Information theory, particularly the information bottleneck principle, has shaped deep neural networks. This principle optimizes the trade-off between compression and preserving relevant information, providing a foundation for efficient network design in supervised contexts. However, its precise role and adaptation in self-supervised learning remain unclear. In this work, we scrutinize various self-supervised learning approaches from an information-theoretic perspective, introducing a unified framework that encapsulates the self-supervised information-theoretic learning problem. This framework includes multiple encoders and decoders, suggesting that all existing work on self-supervised learning can be seen as specific instances. We aim to unify these approaches to understand their underlying principles better and address the main challenge: many works present different frameworks with differing theories that may seem contradictory. By weaving existing research into a cohesive narrative, we delve into contemporary self-supervised methodologies, spotlight potential research areas, and highlight inherent challenges. Moreover, we discuss how to estimate information-theoretic quantities and their associated empirical problems. Overall, this paper provides a comprehensive review of the intersection of information theory, self-supervised learning, and deep neural networks, aiming for a better understanding through our proposed unified approach.
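The information bottleneck principle referenced above is usually written as a trade-off between compressing the input X into a representation Z and preserving information about the target Y. A standard form of the objective (notation assumed here, not taken from the paper) is:

```latex
% Information bottleneck Lagrangian: compress X into Z while keeping
% information about Y; \beta controls the compression/prediction trade-off.
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}} = I(X; Z) \;-\; \beta \, I(Z; Y)
```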
- Research Article
1511
- 10.1109/tpami.2020.2992393
- May 4, 2020
- IEEE Transactions on Pattern Analysis and Machine Intelligence
Large-scale labeled data are generally required to train deep neural networks in order to obtain better performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, have been proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminology of this field are described. Then the common deep neural network architectures used for self-supervised learning are summarized. Next, the schema and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used datasets for images, videos, audio, and 3D data, as well as the existing self-supervised visual feature learning methods. Finally, quantitative performance comparisons of the reviewed methods on benchmark datasets are summarized and discussed for both image and video feature learning, and the paper concludes with a set of promising future directions for self-supervised visual feature learning.
- Research Article
847
- 10.1109/tkde.2021.3090866
- Jan 1, 2021
- IEEE Transactions on Knowledge and Data Engineering
Deep supervised learning has achieved great success in the last decade. However, its dependence on manual labels and its vulnerability to attacks have driven people to explore better solutions. As an alternative, self-supervised learning has attracted many researchers for its soaring performance on representation learning over the last several years. Self-supervised representation learning leverages the input data itself as supervision and benefits almost all types of downstream tasks. In this survey, we take a look at new self-supervised learning methods for representation in computer vision, natural language processing, and graph learning. We comprehensively review the existing empirical methods and summarize them into three main categories according to their objectives: generative, contrastive, and generative-contrastive (adversarial). We further investigate related theoretical analysis work to provide deeper thoughts on how self-supervised learning works. Finally, we briefly discuss open problems and future directions for self-supervised learning. An outline slide for the survey is provided.
- Research Article
- 10.1007/s00521-025-11236-z
- May 14, 2025
- Neural Computing and Applications
Self-supervised learning has emerged as a powerful paradigm for leveraging unlabeled data to learn rich feature representations. However, the efficacy of self-supervised models is often limited by the degree and complexity of the augmentations used during training. In this work, we propose a novel framework that enhances self-supervised learning by incorporating a generative network designed to produce adversarial examples that challenge the learning process. By integrating adversarially generated data, our method extends three well-known self-supervised architectures---SimCLR, BYOL, and SimSiam---and improves their generalization and robustness. We evaluate our approach on CIFAR-10, CIFAR-100, and Tiny ImageNet datasets, demonstrating consistent improvements in classification accuracy over baseline models. Notably, our proposed method outperforms standard self-supervised learning techniques, achieving significant gains in top-1 accuracy across all datasets and training epochs. This substantiates our hypothesis that adversarial examples can significantly contribute to the feature learning capabilities of self-supervised models. Furthermore, our findings suggest that the integration of generative networks can serve as a catalyst for the development of more advanced self-supervised learning algorithms. This study lays the groundwork for future research exploring the potential of adversarial training in self-supervised learning and its applications across diverse domains.
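The paper uses a generative network to produce adversarial examples; as a much simpler stand-in, the sketch below creates a challenging view with a single FGSM step against the SSL objective and is intended only to illustrate how adversarial views could enter contrastive training. All names and the FGSM substitution are assumptions, not the paper's method.

```python
# Hedged sketch: one FGSM step produces an adversarial view that increases the
# SSL loss between two views; the adversarial view is then used for training.
import torch

def adversarial_view(model, ssl_loss_fn, x1, x2, epsilon=8 / 255):
    """Perturb view x1 so that the SSL loss against view x2 increases."""
    x_adv = x1.clone().detach().requires_grad_(True)
    loss = ssl_loss_fn(model(x_adv), model(x2))
    loss.backward()                      # note: also populates parameter grads;
                                         # zero them before the real training step
    with torch.no_grad():
        x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1)
    return x_adv.detach()
```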
- Research Article
13
- 10.3390/app11073043
- Mar 29, 2021
- Applied Sciences
We propose transfer learning from natural images to audio-based images using self-supervised learning schemes. Through self-supervised learning, convolutional neural networks (CNNs) can learn general representations of natural images without labels. In this study, a convolutional neural network was pre-trained on natural images (ImageNet) via self-supervised learning and subsequently fine-tuned on the target audio samples. Pre-training with the self-supervised learning scheme significantly improved sound classification performance when validated on the following benchmarks: ESC-50, UrbanSound8k, and GTZAN. The network pre-trained via self-supervised learning achieved a similar level of accuracy to networks pre-trained using a supervised method that requires labels. We therefore demonstrate that transfer learning from natural images contributes to improvements in audio-related tasks, and that self-supervised learning with natural images is an adequate pre-training scheme in terms of simplicity and effectiveness.
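A sketch of the fine-tuning stage under assumed settings, not the authors' code: a CNN pre-trained on natural images is adapted to audio by treating spectrograms as 3-channel images and replacing the classification head. The backbone choice, class count, and hyperparameters are placeholders.

```python
# Hedged sketch: fine-tuning an image-pretrained CNN on spectrogram "images".
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=None)             # load SSL-pretrained weights here
model.fc = nn.Linear(model.fc.in_features, 50)    # e.g. 50 classes for ESC-50
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def finetune_step(spectrograms, labels):
    """spectrograms: (B, 3, H, W) log-mel images; labels: (B,) class indices."""
    optimizer.zero_grad()
    loss = criterion(model(spectrograms), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```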