Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features

Ryandhimas E Zezario, Yu Tsao, Chiou-Shann Fuh, Fei Chen, Hsin-Min Wang, Szu-Wei Fu

doi:10.1109/taslp.2022.3205757

Abstract

This study proposes a cross-domain multi-objective speech assessment model, called MOSA-Net, which can simultaneously estimate the speech quality, intelligibility, and distortion assessment scores of an input speech signal. MOSA-Net comprises a convolutional neural network and a bidirectional long short-term memory architecture for representation extraction, followed by a multiplicative attention layer and a fully connected layer for each assessment metric prediction. Additionally, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learned (SSL) models are used as inputs, combining rich acoustic information to obtain more accurate assessments. Experimental results show that in both seen and unseen noise environments, MOSA-Net can improve the linear correlation coefficient (LCC) scores in perceptual evaluation of speech quality (PESQ) prediction compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC scores in short-time objective intelligibility (STOI) prediction compared to STOI-Net, an existing single-task model for STOI prediction. Moreover, MOSA-Net can be used as a pre-trained model and effectively adapted into an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net can improve LCC scores in mean opinion score (MOS) prediction compared to MOS-SSL, a strong single-task model for MOS prediction. We further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach. Experimental results show that QIA-SE achieves higher PESQ scores than a baseline SE model in both seen and unseen noise environments.
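The abstract reports its comparisons in terms of the linear correlation coefficient (LCC) between predicted and reference scores. As a minimal sketch of that metric, assuming LCC here means the standard Pearson correlation, the function below computes it for two equal-length score lists. The function name `lcc` and the sample scores are illustrative assumptions and do not come from the paper.

```python
import math


def lcc(predicted, target):
    """Pearson linear correlation coefficient between two score sequences.

    Assumption: LCC is the standard Pearson correlation; the paper's own
    evaluation code is not shown in the abstract.
    """
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("at least two score pairs are required")

    mean_p = sum(predicted) / n
    mean_t = sum(target) / n

    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)

    if var_p == 0 or var_t == 0:
        raise ValueError("LCC is undefined when either sequence is constant")
    return cov / math.sqrt(var_p * var_t)


# Made-up scores for illustration only (not results from the paper).
reference_pesq = [1.2, 2.5, 3.1, 3.8]
predicted_pesq = [1.4, 2.3, 3.3, 3.6]
print(f"LCC = {lcc(predicted_pesq, reference_pesq):.3f}")
```

A value close to 1 means the predicted scores rise and fall linearly with the reference scores, which is the sense in which a higher LCC indicates a better assessment model.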

Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publication Date: Jan 1, 2023
Citations: 26
License type: CC BY 4.0

Similar Papers

New research on monaural speech segregation based on quality assessment
Xiaoping Xie ... Fei Ding
Computer Speech & Language | VOL. 85
05 Dec 2023

Two-Stage Deep Learning Approach for Speech Enhancement and Reconstruction in The Frequency and Time Domains
Soha A Nossier ... Julie Wall
18 Jul 2022

Multiresolution Speech Enhancement Based on Proposed Circular Nested Microphone Array in Combination with Sub-Band Affine Projection Algorithm
Ali Dehghan Firoozabadi ... Hugo Durney
Applied Sciences | VOL. 10
06 Jun 2020

A Conditional Generative Model for Speech Enhancement
Zeng-Xi Li ... Li-Rong Dai
Circuits, Systems, and Signal Processing | VOL. 37
13 Mar 2018
