Cluster-Based Pairwise Contrastive Loss for Noise-Robust Speech Recognition.

Geon Woo Lee, Hong Kook Kim

doi:10.3390/s24082573

Abstract

This paper presents a joint training approach for a pipeline comprising speech enhancement (SE) and automatic speech recognition (ASR) models, in which an acoustic tokenizer is added to the pipeline to transfer linguistic information from the ASR model to the SE model. The acoustic tokenizer takes the outputs of the ASR encoder and produces pseudo-labels through K-means clustering. To transfer the linguistic information represented by these pseudo-labels from the acoustic tokenizer to the SE model, a cluster-based pairwise contrastive (CBPC) loss function is proposed. The CBPC loss is a self-supervised contrastive loss function and is combined with an information noise contrastive estimation (infoNCE) loss function. The combined loss function prevents the SE model from overfitting to outlier samples and represents the pronunciation variability among samples with the same pseudo-label. The effectiveness of the proposed CBPC loss function is evaluated on a noisy LibriSpeech dataset by measuring both speech quality scores and the word error rate (WER). The experimental results show that the proposed joint training approach using the CBPC loss function achieves a lower WER than conventional joint training approaches. In addition, the speech quality scores of the SE model trained using the proposed approach are higher than those of the standalone SE model and of SE models trained using conventional joint training approaches. An ablation study investigates the effects of different combinations of loss functions on the speech quality scores and WER; it shows that the proposed CBPC loss function combined with infoNCE contributes to a reduced WER and an increase in most of the speech quality scores.
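
The abstract names two generic building blocks, K-means pseudo-labelling and an infoNCE loss, without giving the CBPC formula, so the CBPC loss itself is not reproduced here. The following is only a minimal sketch of those two generic pieces, assuming NumPy and using random vectors as stand-ins for ASR encoder outputs and SE embeddings. The function names `kmeans_pseudo_labels` and `info_nce`, the farthest-first initialisation, and the temperature of 0.1 are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def _nearest(features, centroids):
    """Index of the closest centroid (squared Euclidean) for each row."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def kmeans_pseudo_labels(features, k, n_iter=20, seed=0):
    """Cluster feature vectors with Lloyd's algorithm; cluster ids act as pseudo-labels."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Farthest-first initialisation (an illustrative choice, not from the paper).
    centroids = features[[rng.integers(len(features))]].copy()
    while len(centroids) < k:
        gap = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
        centroids = np.vstack([centroids, features[gap.argmax()]])
    for _ in range(n_iter):
        labels = _nearest(features, centroids)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return _nearest(features, centroids), centroids


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE: positives[i] pairs with anchors[i]; other rows act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # cosine similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Synthetic stand-ins (assumptions): two well-separated groups of "encoder outputs".
rng = np.random.default_rng(0)
encoder_out = np.vstack([rng.normal(0.0, 0.1, (50, 8)), rng.normal(10.0, 0.1, (50, 8))])
pseudo_labels, _ = kmeans_pseudo_labels(encoder_out, k=2)

# Hypothetical "SE embeddings": slightly perturbed copies of the encoder outputs.
se_embed = encoder_out + rng.normal(0.0, 0.01, encoder_out.shape)
print("pseudo-label counts:", np.bincount(pseudo_labels))
print("InfoNCE on matched pairs:", info_nce(se_embed[:16], encoder_out[:16]))
```

In this sketch the contrastive term pairs each embedding with its own counterpart. How the paper uses the pseudo-labels to form pairs inside the CBPC loss is not stated in the abstract and is left open here.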

Journal: Sensors
Publication Date: Apr 17, 2024
License type: cc-by

Similar Papers

Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement
Sefik Emre Eskimez ... Hemin Yang
30 Aug 2021

Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition.
Geon Woo Lee ... Hong Kook Kim
Sensors (Basel, Switzerland) | VOL. 22
19 Jul 2022

Adversarial Examples Protect Your Privacy on Speech Enhancement System
Mingyu Dong ... Rangding Wang
Computer Systems Science and Engineering | VOL. 46
01 Jan 2023

Efficient Audio-Visual Speech Enhancement Using Deep U-Net With Early Fusion of Audio and Video Information and RNN Attention Blocks
Jung-Wook Hwang ... Hyung-Min Park
IEEE Access | VOL. 9
01 Jan 2020
