Self-supervised learning using unlabeled speech with multiple types of speech disorder for disordered speech recognition

Abstract

This paper investigates a training method for an automatic speech recognition (ASR) model for people with speech disorders. Because the characteristics of their speech differ significantly from those of typical speech, the system must be trained on a user's speech in advance in order to recognize it. However, recording speech places a large burden on people with disorders, so it is difficult to collect a sufficient amount of speech for training. To address this issue, this study investigates the use of two types of speech as training data. The first is unlabeled speech, which can be collected easily but lacks text labels (e.g., spontaneous speech in daily life); to exploit it for training an ASR model, a self-supervised learning approach is employed. The second is speech data from individuals with other speech disorders: besides the user's own speech, our system incorporates speech from individuals with the same type of disorder and even with different types of disorders. Experimental results demonstrated that using unlabeled speech and speech from multiple types of disorders reduced recognition error rates.
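
To make the pipeline concrete, below is a minimal sketch of the supervised fine-tuning stage on the small labeled disordered-speech set, using a Hugging Face transformers wav2vec 2.0 CTC model as a stand-in for the paper's SSL model. The checkpoint name, learning rate, and training loop are illustrative assumptions, not the authors' setup.

```python
# A minimal sketch, not the authors' recipe: fine-tune a pretrained SSL
# model with CTC on a small labeled disordered-speech set.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.freeze_feature_encoder()  # keep the CNN front-end fixed

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def ctc_step(waveform, transcript):
    """One fine-tuning step on a single (waveform, transcript) pair.
    The transcript must match the model's vocabulary (uppercase characters
    for this particular checkpoint)."""
    inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```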

Similar Papers
  • Conference Article
  • Cited by: 57
  • 10.21437/interspeech.2021-556
LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech
  • Aug 30, 2021
  • Solène Evain + 17 more

Self-Supervised Learning (SSL) using huge amounts of unlabeled data has been successfully explored for image and natural language processing. Recent works also investigated SSL from speech. They were notably successful in improving performance on downstream tasks such as automatic speech recognition (ASR). While these works suggest it is possible to reduce dependence on labeled data for building efficient speech systems, their evaluation was mostly made on ASR and using multiple and heterogeneous experimental settings (most of them for English). This calls into question the objective comparison of SSL approaches and the evaluation of their impact on building speech systems. In this paper, we propose LeBenchmark: a reproducible framework for assessing SSL from speech. It not only includes ASR (high and low resource) tasks but also spoken language understanding, speech translation and emotion recognition. We also focus on speech technologies in a language other than English: French. SSL models of different sizes are trained from carefully sourced and documented datasets. Experiments show that SSL is beneficial for most but not all tasks, which confirms the need for exhaustive and reliable benchmarks to evaluate its real impact. LeBenchmark is shared with the scientific community for reproducible research in SSL from speech.

  • Conference Article
  • Cited by: 46
  • 10.21437/interspeech.2020-1835
Investigating Self-Supervised Pre-Training for End-to-End Speech Translation
  • Oct 25, 2020
  • Ha Nguyen + 4 more

Self-supervised learning from raw speech has been proven beneficial for improving automatic speech recognition (ASR). We investigate here its impact on end-to-end automatic speech translation (AST) performance. We use a contrastive predictive coding (CPC) model pre-trained from unlabeled speech as a feature extractor for a downstream AST task. We show that self-supervised pre-training is particularly efficient in low resource settings and that fine-tuning CPC models on the AST training data further improves performance. Even in higher resource settings, ensembling AST models trained with filter-bank and CPC representations leads to near state-of-the-art models without using any ASR pre-training. This might be particularly beneficial when one needs to develop a system that translates from speech in a language with poorly standardized orthography or even from speech in an unwritten language. Index Terms: self-supervised learning from speech, automatic speech translation, end-to-end models, low resource settings.
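
As background for how such a CPC feature extractor is trained, here is a compact sketch of the InfoNCE objective at the core of contrastive predictive coding. The shapes, the linear prediction head, and the use of in-batch negatives are assumptions for illustration, not the authors' implementation.

```python
# Illustrative InfoNCE loss for CPC: a context vector c_t predicts the
# encoding z_{t+k}, scored against negatives drawn from the rest of the
# batch; the positive pair sits on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

def cpc_infonce(context, future, W):
    """context: (B, D) c_t; future: (B, D) z_{t+k}; W: (D, D) projection."""
    pred = context @ W                       # predicted future encodings
    logits = pred @ future.t()               # (B, B) similarity matrix
    targets = torch.arange(context.size(0))  # positives on the diagonal
    return F.cross_entropy(logits, targets)

B, D = 32, 256
W = torch.randn(D, D, requires_grad=True)
loss = cpc_infonce(torch.randn(B, D), torch.randn(B, D), W)
loss.backward()
```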

  • Conference Article
  • Cited by: 4
  • 10.21437/interspeech.2021-1027
Conditional Independence for Pretext Task Selection in Self-Supervised Speech Representation Learning
  • Aug 30, 2021
  • Salah Zaiem + 2 more

Through solving pretext tasks, self-supervised learning (SSL) leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. A common pretext task consists in pretraining a SSL model on pseudo-labels derived from the original signal. This technique is particularly relevant for speech data where various meaningful signal processing features may serve as pseudo-labels. However, the process of selecting pseudo-labels, for speech or other types of data, remains mostly unexplored and currently relies on observing the results on the final downstream task. Nevertheless, this methodology is not sustainable at scale due to substantial computational (hence carbon) costs. Thus, this paper introduces a practical and theoretical framework to select relevant pseudo-labels with respect to a given downstream task. More precisely, we propose a functional estimator of the pseudo-label utility grounded in the conditional independence theory, which does not require any training. The experiments conducted on speaker recognition and automatic speech recognition validate our estimator, showing a significant correlation between the performance observed on the downstream task and the utility estimates obtained with our approach, facilitating the prospection of relevant pseudo-labels for self-supervised speech representation learning.
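
The paper's estimator is grounded in conditional-independence theory; as a much simpler illustration of the general idea of scoring the dependence between a candidate pseudo-label and downstream labels without any training, here is a plain (unconditional) HSIC score in NumPy. This is not the paper's estimator, and the data below are placeholders.

```python
# Biased HSIC estimate between two sets of samples; larger values suggest
# stronger statistical dependence, hence a potentially more useful
# pseudo-label for the downstream task.
import numpy as np

def rbf_gram(x, sigma=1.0):
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    n = x.shape[0]
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

pseudo = np.random.randn(100, 1)             # e.g., per-frame pitch estimates
labels = np.random.randn(100, 1)             # placeholder downstream targets
print(hsic(pseudo, labels))
```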

  • Conference Article
  • Cited by: 3
  • 10.1109/icpr48806.2021.9413295
Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning
  • Jan 10, 2021
  • Mani Kumar Tellamekala + 3 more

Self-supervised learning has emerged as a candidate approach to learn semantic visual features from unlabeled video data. In self-supervised learning, intrinsic correspondences between data points are used to define a proxy task that forces the model to learn semantic representations. Most existing proxy tasks applied to video data exploit either intra-modal (e.g. temporal) or cross-modal (e.g. audio-visual) correspondences, but not both. In theory, jointly learning both correspondences may result in richer visual features, but, as we show in this work, doing so is non-trivial in practice. To address this problem, we introduce ‘Audio-Visual Permutative Predictive Coding’ (AV-PPC), a multi-task learning framework designed to fully leverage the temporal and cross-modal correspondences as natural supervision signals. In AV-PPC, the model is trained to simultaneously learn multiple intra- and cross-modal predictive coding sub-tasks. By using visual speech recognition (lip-reading) as the downstream evaluation task, we show that our proposed proxy task can learn higher quality visual features than existing proxy tasks. We also show that AV-PPC visual features are highly data-efficient. Without further finetuning, the AV-PPC visual encoder achieves an 80.30% spoken word classification rate on the LRW dataset, performing on par with directly supervised visual encoders that are learned from large amounts of labeled data.

  • Conference Article
  • Cited by: 7
  • 10.21437/interspeech.2022-11338
Joint Encoder-Decoder Self-Supervised Pre-training for ASR
  • Sep 18, 2022
  • A Arunkumar + 1 more

Self-supervised learning (SSL) has shown tremendous success in various speech-related downstream tasks, including Automatic Speech Recognition (ASR). The output embeddings of the SSL model are treated as powerful short-time representations of the speech signal. However, in the ASR task, the main objective is to get the correct sequence of acoustic units, characters, or byte-pair encodings (BPEs). Usually, encoder-decoder architecture works exceptionally well for a sequence-to-sequence task like ASR. Therefore, in this paper, we propose a new paradigm that exploits the power of a decoder during self-supervised learning. We use the Hidden Unit BERT (HuBERT) SSL framework to compute the conventional masked prediction loss for the encoder. In addition, we introduce a decoder in the SSL framework and propose a target preparation strategy for the decoder. Finally, we use a multitask SSL setup wherein we jointly optimize both the encoder and decoder losses. We hypothesize that the presence of a decoder in the SSL model helps it learn an acoustic unit-based language model, which might improve the performance of an ASR downstream task. We compare our proposed SSL model with HuBERT and show up to 25% relative improvement in performance on ASR by finetuning on various LibriSpeech subsets.
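
A schematic of such a joint objective might look as follows; the tensor shapes, the separate decoder head, and the equal 0.5 weighting are assumptions for illustration, not the authors' code.

```python
# Sketch of a joint encoder-decoder SSL loss: HuBERT-style masked prediction
# on the encoder plus a sequence cross-entropy on the added decoder.
import torch
import torch.nn.functional as F

def joint_ssl_loss(enc_logits, unit_targets, mask,
                   dec_logits, dec_targets, alpha=0.5):
    """enc_logits: (B, T, V) over pseudo acoustic units; mask: (B, T) bool
    marking masked frames. dec_logits: (B, U, V) decoder predictions of the
    prepared target unit sequence dec_targets: (B, U)."""
    enc_loss = F.cross_entropy(enc_logits[mask], unit_targets[mask])
    dec_loss = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets)
    return alpha * enc_loss + (1 - alpha) * dec_loss

B, T, U, V = 4, 100, 20, 500
loss = joint_ssl_loss(torch.randn(B, T, V), torch.randint(V, (B, T)),
                      torch.rand(B, T) > 0.5,
                      torch.randn(B, U, V), torch.randint(V, (B, U)))
```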

  • Conference Article
  • Cited by: 16
  • 10.21437/interspeech.2022-10796
Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation
  • Sep 18, 2022
  • Dan Berrebbi + 5 more

Self-Supervised Learning (SSL) models have been successfully applied in various deep learning-based speech tasks, particularly those with a limited amount of data. However, the quality of SSL representations depends highly on the relatedness between the SSL training domain(s) and the target data domain. In contrast, spectral feature (SF) extractors such as log Mel-filterbanks are hand-crafted non-learnable components, and could be more robust to domain shifts. The present work examines the assumption that combining non-learnable SF extractors with SSL models is an effective approach to low resource speech tasks. We propose a learnable and interpretable framework to combine SF and SSL representations. The proposed framework significantly outperforms both baseline and SSL models on Automatic Speech Recognition (ASR) and Speech Translation (ST) tasks on three low resource datasets. We additionally design a mixture of experts based combination model. This last model reveals that the relative contribution of SSL models over conventional SF extractors is very small in the case of domain mismatch between the SSL training set and the target language data.
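
One simple form such a learnable, interpretable combination could take is sketched below; the feature dimensions and the single scalar gate are illustrative assumptions rather than the paper's exact architecture. Inspecting the learned gate after training exposes the relative contribution of each feature type, mirroring the mixture-of-experts analysis in the abstract.

```python
# Sketch: project filterbank (SF) and SSL features to a shared size, then
# mix them with one learned, interpretable weight in (0, 1).
import torch
import torch.nn as nn

class FeatureCombiner(nn.Module):
    def __init__(self, fbank_dim=80, ssl_dim=768, out_dim=256):
        super().__init__()
        self.proj_fbank = nn.Linear(fbank_dim, out_dim)
        self.proj_ssl = nn.Linear(ssl_dim, out_dim)
        self.logit = nn.Parameter(torch.zeros(1))  # mix weight, pre-sigmoid

    def forward(self, fbank, ssl):
        w = torch.sigmoid(self.logit)
        return w * self.proj_fbank(fbank) + (1 - w) * self.proj_ssl(ssl)

combiner = FeatureCombiner()
out = combiner(torch.randn(4, 100, 80), torch.randn(4, 100, 768))
```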

  • Conference Article
  • Cited by: 23
  • 10.21437/interspeech.2022-11128
DRAFT: A Novel Framework to Reduce Domain Shifting in Self-supervised Learning and Its Application to Children’s ASR
  • Sep 18, 2022
  • Ruchao Fan + 1 more

Self-supervised learning (SSL) in the pretraining stage using un-annotated speech data has been successful in low-resource automatic speech recognition (ASR) tasks. However, models trained through SSL are biased toward the pretraining data, which is usually different from the data used in finetuning tasks, causing a domain shifting problem and thus resulting in limited knowledge transfer. We propose a novel framework, domain responsible adaptation and finetuning (DRAFT), to reduce domain shifting in pretrained speech models through an additional adaptation stage. In DRAFT, residual adapters (RAs) are inserted in the pretrained model to learn domain-related information with the same SSL loss as the pretraining stage. Only RA parameters are updated during the adaptation stage. DRAFT is agnostic to the type of SSL method used and is evaluated with three widely used approaches: APC, Wav2vec2.0, and HuBERT. On two child ASR tasks (OGI and MyST databases), using SSL models trained with un-annotated adult speech data (Librispeech), relative WER improvements of up to 19.7% are observed when compared to the pretrained models without adaptation. Additional experiments examined the potential of cross knowledge transfer between the two datasets and the results are promising, showing a broader usage of the proposed DRAFT framework.
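
A residual adapter in the usual bottleneck form, with everything except the adapters frozen during adaptation, might be sketched like this; the dimensions, activation, and naming convention are assumptions, not the authors' code.

```python
# Sketch of a bottleneck residual adapter (RA) and a freezing helper that
# leaves only adapter parameters trainable during the adaptation stage.
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual connection

def freeze_except_adapters(model):
    """Assumes adapters are registered under names containing 'adapter'."""
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name
```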

  • Research Article
  • 10.30574/wjaets.2023.10.1.0279
Self-Supervised Learning in AI: Transforming data efficiency and model generalization in machine learning
  • Oct 30, 2023
  • World Journal of Advanced Engineering Technology and Sciences
  • Sharmin Nahar + 5 more

Self-supervised learning (SSL) is an AI paradigm that lets machines acquire meaningful data representations directly from unlabeled data. SSL uses contrastive learning, masked data modeling, and predictive learning to improve data efficiency and model generalization across domains. This paper evaluates the core concepts of SSL, its advantages over supervised and unsupervised learning, and its usage in fields such as NLP, computer vision, speech recognition, healthcare, finance, and robotics. It analyzes essential techniques and architectures, including SimCLR, MoCo, BERT, MAE, and BYOL, as well as approaches combining SSL with reinforcement learning and weak supervision. The paper examines SSL's current challenges, including computational expense, representation degeneration, and evaluation difficulties, and proposes future uses in mixed-data learning, minimal-resource contexts, and artificial general intelligence (AGI). The adoption of SSL in real-world AI applications depends on addressing ethical matters such as bias, fairness, and responsible AI practices.

  • Conference Article
  • Cited by: 2
  • 10.1109/icassp49357.2023.10096318
Exploration of Language Dependency for Japanese Self-Supervised Speech Representation Models
  • Jun 4, 2023
  • Takanori Ashihara + 3 more

Self-supervised learning (SSL) has been dramatically successful not only in monolingual but also in cross-lingual settings. However, since the two settings have been studied individually in general, there has been little research focusing on how effective a cross-lingual model is in comparison with a monolingual model. In this paper, we investigate this fundamental question empirically with Japanese automatic speech recognition (ASR) tasks. First, we begin by comparing the ASR performance of cross-lingual and monolingual models for two different language tasks while keeping the acoustic domain as identical as possible. Then, we examine how much unlabeled data collected in Japanese is needed to achieve performance comparable to a cross-lingual model pre-trained with tens of thousands of hours of English and/or multilingual data. Finally, we extensively investigate the effectiveness of SSL in Japanese and demonstrate state-of-the-art performance on multiple ASR tasks. Since there is no comprehensive SSL study for Japanese, we hope this study will guide Japanese SSL research.

  • Conference Article
  • Cited by: 8
  • 10.1109/wispnet54241.2022.9767118
End-to-End Speech Recognition for Low Resource Language Sanskrit using Self-Supervised Learning
  • Mar 24, 2022
  • S Shashank Holla + 4 more

We present work on building a speaker-independent, continuous speech recognition system for Samskruta (also called Sanskrit) using self-supervised learning. We use a pre-trained model from the Vakyansh team, trained on 10,000 hours of data covering 23 Indic languages, and fine-tune it on a dataset containing nearly 78 hours of Samskruta audio with transcriptions, taken from Vaksancaya, a Sanskrit speech corpus from IIT Bombay. Acoustic representations are learned in an end-to-end deep learning approach using the wav2vec 2.0 architecture from Fairseq. On top of this acoustic model, a language model is used to increase overall performance. Our system achieves a word error rate (WER) of 5.1% on test data and 2.4% on training data. We also built a graphical user interface in the form of a web page using the Flask framework, which provides an interactive platform for the user to record audio and see the transcription in real time. To the best of our knowledge, our approach using self-supervised learning gives better performance compared to state-of-the-art methods.
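
For the language-model step, one common approach is shallow fusion during CTC beam search; the sketch below uses the pyctcdecode library with a hypothetical KenLM model path, a placeholder vocabulary, and random scores, since the authors' exact decoding setup is not specified here.

```python
# Sketch of CTC beam-search decoding with an n-gram LM via pyctcdecode.
import numpy as np
from pyctcdecode import build_ctcdecoder

vocab = ["", " ", "a", "b", "c"]           # placeholder; index 0 = CTC blank
decoder = build_ctcdecoder(
    vocab,
    kenlm_model_path="sanskrit_lm.arpa",   # hypothetical KenLM n-gram LM
    alpha=0.5,                             # LM weight
    beta=1.0,                              # word insertion bonus
)
logits = np.random.randn(200, len(vocab))  # (time, vocab) frame log-scores
print(decoder.decode(logits))
```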

  • Conference Article
  • Cited by: 6
  • 10.1109/icassp49357.2023.10096308
HuBERT-AGG: Aggregated Representation Distillation of Hidden-Unit Bert for Robust Speech Recognition
  • Jun 4, 2023
  • Wei Wang + 1 more

Self-supervised learning (SSL) has attracted widespread research interest since many successful SSL approaches such as wav2vec 2.0 and Hidden-unit BERT (HuBERT) have achieved promising results on speech-related tasks such as automatic speech recognition (ASR). However, few works have been conducted to improve the noise robustness of SSL models. In this paper, we propose HuBERT-AGG, a novel method that learns noise-invariant SSL representations for robust speech recognition by distilling aggregated layer-wise representations. Specifically, we learn an aggregator that computes the weighted sum of all hidden states of a pretrained vanilla HuBERT by fine-tuning it on a small portion of labeled data. Then a noise-robust HuBERT is trained on the simulated noisy speech by distilling from the aggregated representations and layer-wise hidden states produced by a pretrained vanilla HuBERT with parallel original speech as input. Experiments on LibriSpeech simulated noisy test sets show 13.1%-17.0% relative word error rate (WER) reduction with very slight degradation on the original test sets. On CHiME-4 1-channel real speech test sets, we have surpassed the best results achieved by all published fully supervised ASR models as well as other SSL approaches adopting the same data usage as ours.
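
The aggregator can be pictured as a softmax-weighted sum over layer-wise hidden states; the sketch below shows that idea with an assumed layer count and hidden size (13 states and 768 dimensions, as in a base-sized HuBERT with its input embedding).

```python
# Sketch of a learnable layer aggregator whose output can serve as a
# distillation target for a noise-robust student model.
import torch
import torch.nn as nn

class LayerAggregator(nn.Module):
    def __init__(self, num_layers=13):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        """hidden_states: list of (B, T, D) tensors, one per layer."""
        w = torch.softmax(self.weights, dim=0)
        stacked = torch.stack(hidden_states, dim=0)   # (L, B, T, D)
        return (w.view(-1, 1, 1, 1) * stacked).sum(0)

agg = LayerAggregator()
layers = [torch.randn(2, 50, 768) for _ in range(13)]
target = agg(layers)  # aggregated representation for distillation
```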

  • Research Article
  • 10.11834/jig.211182
Multi-layer adaptive aggregation for self-supervised few-shot image classification
  • Jan 1, 2023
  • Journal of Image and Graphics
  • Lyu Jia + 1 more

Objective: In few-shot image classification, knowledge learned from a large-scale dataset is used to handle downstream classification tasks that contain only a few labeled training samples. Because meta-training constructs many tasks by randomly sampling classes from the training set, and because of the domain gap between training and test classes, training is slow and the model may overfit the training set, so the meta-knowledge fails to transfer to the test set and generalization suffers. To address these problems, we propose a multi-layer adaptive aggregation self-supervised few-shot image classification model. Method: First, group convolution is used to improve the residual blocks, reducing the number of network parameters, lowering training difficulty, and shortening training time. Second, a multi-layer adaptive aggregation module improves the backbone by refining and aggregating the semantic information of each network layer and adaptively assigning per-layer weights; the aggregated feature maps serve as the basis for subsequent classification. Finally, self-supervised contrastive learning is combined with supervised learning to mine the latent information of the samples themselves and strengthen feature representations. Result: Compared with the baseline on mini-ImageNet, accuracy improves by 6.31% on 5-way 1-shot and 6.04% on 5-way 5-shot, reaching 63.13% and 78.14%, respectively; on CUB (Caltech-UCSD Birds-200-2011), accuracy improves by 8.95% and 8.77%, reaching 75.93% and 87.56%. Against the original prototype network, the gains are 13.71% and 9.94% on mini-ImageNet and 24.48% and 13.05% on CUB. Ablation studies, heat-map comparisons, and t-SNE visualizations further show that the model attends less to image backgrounds and separates categories more cleanly in feature space. Conclusion: The proposed model shortens training time, strengthens feature representations, optimizes the data distribution, and alleviates the domain gap, improving generalization and classification performance.

  • Conference Article
  • Cited by: 35
  • 10.21437/interspeech.2022-10043
Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition
  • Sep 18, 2022
  • Lester Phillip Violeta + 2 more

We investigate the performance of self-supervised pretraining frameworks on pathological speech datasets used for automatic speech recognition (ASR). Modern end-to-end models require thousands of hours of data to train well, but only a small number of pathological speech datasets are publicly available. A proven solution to this problem is to first pretrain the model on large amounts of healthy speech and then fine-tune it on the pathological speech datasets. One new pretraining framework called self-supervised learning (SSL) trains a network using only speech data, providing more flexibility in training data requirements and allowing more speech data to be used in pretraining. We investigate SSL frameworks such as the wav2vec 2.0 and WavLM models using different setups and compare their performance with different supervised pretraining setups, using two types of pathological speech, namely, Japanese electrolaryngeal and English dysarthric. Our results show that although SSL has shown success with minimally resourced healthy speech, we do not find this to be the case with pathological speech. The best supervised setup outperforms the best SSL setup by 13.9% character error rate in electrolaryngeal speech and 16.8% word error rate in dysarthric speech.

  • Conference Article
  • Cited by: 3
  • 10.1109/icassp49357.2023.10094787
Self-Supervised Learning with Bi-Label Masked Speech Prediction for Streaming Multi-Talker Speech Recognition
  • Jun 4, 2023
  • Zili Huang + 8 more

Self-supervised learning (SSL), which utilizes the input data itself for representation learning, has achieved state-of-the-art results for various downstream speech tasks. However, most of the previous studies focused on offline single-talker applications, with limited investigations in multi-talker cases, especially for streaming scenarios. In this paper, we investigate SSL for streaming multi-talker speech recognition, which generates transcriptions of overlapping speakers in a streaming fashion. Firstly, we observe that conventional SSL techniques do not work well on this task due to the poor representation of overlapping speech. We then propose a novel SSL training objective, referred to as bi-label masked speech prediction, which explicitly preserves representations of all speakers in overlapping speech. We investigate various aspects of the proposed system, including data configuration and quantizer selection. The proposed SSL setup achieves substantially better word error rates on the LibriSpeechMix dataset.
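
A rough sketch of a bi-label masked-prediction loss is given below; the two prediction heads and the permutation-invariant assignment are assumptions about details the abstract does not spell out, not the authors' exact objective.

```python
# Sketch: at masked frames, two heads predict pseudo-labels for both
# overlapping speakers; the better of the two speaker-to-head assignments
# is used, in the spirit of permutation-invariant training.
import torch
import torch.nn.functional as F

def bi_label_loss(logits_a, logits_b, labels_1, labels_2, mask):
    """logits_*: (B, T, V) from two prediction heads; labels_*: (B, T)
    pseudo-label streams; mask: (B, T) bool marking masked frames."""
    def pair_loss(la, lb):
        return (F.cross_entropy(logits_a[mask], la[mask]) +
                F.cross_entropy(logits_b[mask], lb[mask]))
    return torch.min(pair_loss(labels_1, labels_2),
                     pair_loss(labels_2, labels_1))
```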

  • Conference Article
  • Cited by: 71
  • 10.1109/icassp43922.2022.9747077
UniSpeech-SAT: Universal Speech Representation Learning With Speaker Aware Pre-Training
  • May 23, 2022
  • Sanyuan Chen + 10 more

Self-supervised learning (SSL) is a long-standing goal for speech processing, since it utilizes large-scale unlabeled data and avoids extensive human labeling. Recent years have witnessed great successes in applying self-supervised learning in speech recognition, while limited exploration was attempted in applying SSL for modeling speaker characteristics. In this paper, we aim to improve the existing SSL framework for speaker representation learning. Two methods are introduced for enhancing the unsupervised speaker information extraction. First, we apply multi-task learning to the current SSL framework, where we integrate utterance-wise contrastive loss with the SSL objective function. Second, for better speaker discrimination, we propose an utterance mixing strategy for data augmentation, where additional overlapped utterances are created without supervision and incorporated during training. We integrate the proposed methods into the HuBERT framework. Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance in universal representation learning, especially for speaker identification oriented tasks. An ablation study is performed verifying the efficacy of each proposed method. Finally, we scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement in all SUPERB tasks.
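
The utterance mixing augmentation can be approximated by the sketch below; the mixing gain, region length, and sampling scheme are illustrative assumptions rather than the paper's exact procedure.

```python
# Sketch: add a random low-energy chunk of another utterance in the batch
# to create synthetic overlapped speech for training.
import torch

def utterance_mix(batch, max_len_ratio=0.5, gain=0.1):
    """batch: (B, T) waveforms; returns a copy with synthetic overlap."""
    B, T = batch.shape
    mixed = batch.clone()
    for i in range(B):
        j = torch.randint(B, (1,)).item()   # pick another utterance
        if j == i:
            continue
        L = int(T * max_len_ratio * torch.rand(1).item())
        if L == 0:
            continue
        start = torch.randint(T - L + 1, (1,)).item()
        mixed[i, start:start + L] += gain * batch[j, start:start + L]
    return mixed

overlapped = utterance_mix(torch.randn(8, 16000))
```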
