Abstract
In this paper, a multi-task learning U-shaped neural network (MTU-Net) is proposed and applied to single-channel speech enhancement (SE). The proposed MTU-Net-based SE method estimates an ideal binary mask (IBM) or an ideal ratio mask (IRM) by extending the decoding network of a conventional U-Net to model the speech and noise spectra simultaneously as targets. The effectiveness of the proposed SE method was evaluated under both matched and mismatched noise conditions between training and testing by measuring the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI). The proposed SE method with IRM achieved average PESQ scores 0.17, 0.52, and 0.40 higher than those of other state-of-the-art deep-learning-based methods, namely the deep recurrent neural network (DRNN), the SE generative adversarial network (SEGAN), and the conventional U-Net, respectively. In addition, the STOI scores of the proposed SE method were 0.07, 0.05, and 0.05 higher than those of the DRNN, SEGAN, and U-Net, respectively. Next, a voice activity detection (VAD) method is also proposed that uses the IRM estimated by the proposed MTU-Net-based SE method; it is fundamentally an unsupervised method requiring no model training. The performance of the proposed VAD method was then compared with that of supervised learning-based methods using a deep neural network (DNN), a boosted DNN, and a long short-term memory (LSTM) network. The proposed VAD method shows slightly better performance than the three neural-network-based methods under mismatched noise conditions.
Highlights
Speech enhancement (SE) has been widely used as a preprocessing step in speech-related tasks, such as automatic speech recognition, speaker recognition, hearing aids, and enhanced mobile communication.
The clean speech spectrum is estimated by multiplying the estimated ideal ratio mask (IRM), H(k), by the noisy input speech spectrum, and the enhanced speech is then reconstructed in the time domain by applying an inverse FFT (IFFT) to Ŝ(k) = H(k)|Y(k)| exp(j∠Y(k)).
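The masking-and-reconstruction step above can be sketched in a few lines of NumPy. This is a minimal single-frame illustration, not the paper's implementation: the mask here is an oracle IRM computed from known clean and noise magnitudes purely for demonstration, whereas the paper estimates it with the MTU-Net.

```python
import numpy as np

def apply_irm_frame(noisy_frame: np.ndarray, irm: np.ndarray) -> np.ndarray:
    """Enhance one frame: mask the noisy magnitude spectrum, reuse the
    noisy phase, and invert with an IFFT, i.e.
    S_hat(k) = H(k) |Y(k)| exp(j angle(Y(k)))."""
    Y = np.fft.rfft(noisy_frame)      # noisy spectrum Y(k)
    magnitude = np.abs(Y)             # |Y(k)|
    phase = np.angle(Y)               # angle(Y(k))
    S_hat = irm * magnitude * np.exp(1j * phase)
    return np.fft.irfft(S_hat, n=len(noisy_frame))

# Toy usage: a sinusoid in white noise, with an oracle IRM for illustration.
rng = np.random.default_rng(0)
n = 512
t = np.arange(n)
clean = np.sin(2 * np.pi * 8 * t / n)
noise = 0.5 * rng.standard_normal(n)
noisy = clean + noise
S = np.abs(np.fft.rfft(clean))        # clean magnitude (oracle)
N = np.abs(np.fft.rfft(noise))        # noise magnitude (oracle)
irm = S / (S + N + 1e-12)             # IRM definition: S / (S + N)
enhanced = apply_irm_frame(noisy, irm)
```

Because the noisy phase is reused unchanged, residual phase distortion remains even with a perfect mask, which is the intelligibility limitation the summary mentions.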
This section first evaluates the performance of the proposed multi-task learning U-shaped neural network (MTU-Net)-based SE method and compares it with those of several conventional SE methods based on sparse NMF (SNMF) [3], the SE generative adversarial network (SEGAN) [8], the deep recurrent neural network (DRNN) [5], and U-Net [15].
Summary
Speech enhancement (SE) has been widely used as a preprocessing step in speech-related tasks, such as automatic speech recognition, speaker recognition, hearing aids, and enhanced mobile communication. Deep-learning-based SE methods include the deep denoising autoencoder [4], the deep recurrent NN (DRNN) [5], and the convolutional NN (CNN) [6]. These methods provide a high signal-to-noise ratio (SNR) because their estimated magnitude spectra closely match those of clean speech, but the intelligibility of the estimated clean speech is somewhat degraded when the noisy input phase is used for clean speech estimation. A single-channel SE method is proposed based on a multi-task learning U-Net (MTU-Net) architecture to provide a better estimate of the IRM and to simultaneously perform VAD.