Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications

Bastiaan Tamm,Rik Vandenberghe,Helena Balabin,Hugo Van Hamme

doi:10.21437/interspeech.2022-10147

Abstract

Speech quality in online conferencing applications is typically assessed through human judgements in the form of the mean opinion score (MOS) metric. Since such a labor-intensive approach is not feasible for large-scale speech quality assessments in most settings, the focus has shifted towards automated MOS prediction through end-to-end training of deep neural networks (DNN). Instead of training a network from scratch, we propose to leverage the speech representations from the pre-trained wav2vec-based XLS-R model. However, the number of parameters of such a model exceeds task-specific DNNs by several orders of magnitude, which poses a challenge for resulting fine-tuning procedures on smaller datasets. Therefore, we opt to use pre-trained speech representations from XLS-R in a feature extraction rather than a fine-tuning setting, thereby significantly reducing the number of trainable model parameters. We compare our proposed XLS-R-based feature extractor to a Mel-frequency cepstral coefficient (MFCC)-based one, and experiment with various combinations of bidirectional long short term memory (Bi-LSTM) and attention pooling feedforward (AttPoolFF) networks trained on the output of the feature extractors. We demonstrate the increased performance of pre-trained XLS-R embeddings in terms a reduced root mean squared error (RMSE) on the ConferencingSpeech 2022 MOS prediction task.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

R Discovery Prime

R Discovery Prime

Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications

Abstract

Talk to us

Similar Papers

Lead the way for us

Similar Papers

Non-intrusive objective speech quality assessment using a combination of MFCC, PLP and LSF features
Rajesh Kumar Dubey ... Arun Kumar
-
Rajesh Kumar Dubey, et. al.Rajesh Kumar Dubey ... Arun Kumar
01 Dec 2013
01 Dec 2013

Neuroevolution in Deep Neural Networks: Current Trends and Future Challenges
Edgar Galvan ... Peter Mooney
IEEE Transactions on Artificial Intelligence | VOL. 2
Edgar Galvan, et. al.Edgar Galvan ... Peter Mooney
04 May 2021
IEEE Transactions on Artificial Intelligence | VOL. 2

Non-Uniform MCE Training of Deep Long Short-Term Memory Recurrent Neural Networks for Keyword Spotting
Zhong Meng ... Biing-Hwang Juang
-
Zhong Meng, et. al.Zhong Meng ... Biing-Hwang Juang
20 Aug 2017
20 Aug 2017

Non‐intrusive speech quality assessment using multi‐resolution auditory model features for degraded narrowband speech
Rajesh Kumar Dubey ... Arun Kumar
IET Signal Processing | VOL. 9
Rajesh Kumar Dubey, et. al.Rajesh Kumar Dubey ... Arun Kumar
01 Dec 2015
IET Signal Processing | VOL. 9

Editage

Paperpal

R Discovery

Mind the Graph

R Discovery Prime

R Discovery Prime

Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications

Abstract

Talk to us

Similar Papers