Abstract

Numerous studies have investigated the effectiveness of audio-visual multimodal learning for speech enhancement (AVSE) tasks, seeking a solution that uses visual data as auxiliary and complementary input to reduce noise in noisy speech signals. Recently, we proposed a lite audio-visual speech enhancement (LAVSE) algorithm for a car-driving scenario. Compared to conventional AVSE systems, LAVSE requires less online computation and partially mitigates the user privacy concerns associated with facial data. In this study, we extend LAVSE to address three practical issues often encountered in implementing AVSE systems, namely, the additional cost of processing visual data, audio-visual asynchronization, and low-quality visual data. The proposed system, termed improved LAVSE (iLAVSE), uses a convolutional recurrent neural network architecture as the core AVSE model. We evaluate iLAVSE on the Taiwan Mandarin speech with video dataset. Experimental results confirm that, compared to conventional AVSE systems, iLAVSE effectively overcomes the aforementioned three practical issues and improves enhancement performance. The results also confirm that iLAVSE is suitable for real-world scenarios, where high-quality audio-visual sensors may not always be available.

Highlights

  • Speech is the most natural and convenient means for human-human and human-machine communication

  • We extend the lite audio-visual speech enhancement (LAVSE) system to an improved LAVSE system, which is built on a multimodal convolutional recurrent neural network (CRNN) architecture in which the recurrent part is realized by a long short-term memory (LSTM) layer

  • We propose the improved LAVSE (iLAVSE) system, which aims to address three issues that may be encountered when developing practical audio-visual multimodal learning for speech enhancement (AVSE) systems, namely, the additional cost of processing visual data, audio-visual asynchronization, and low-quality visual data


Summary

INTRODUCTION

Speech is the most natural and convenient means for human-human and human-machine communication. Various deep learning (DL)-based model structures, including deep denoising autoencoders [51], [52], fully connected neural networks [53], [54], [55], convolutional neural networks (CNNs) [56], [57], recurrent neural networks (RNNs), and long short-term memory (LSTM) networks [58], [59], [60], [61], [62], [63], have been used as the core model of an SE system and have been proven to provide better performance than traditional statistical and machine-learning methods. Another well-known advantage of DL models is that they can flexibly fuse data from different domains [64], [65]. Based on the special design of its model architecture and data augmentation, iLAVSE can effectively overcome the above three issues and provide more robust SE performance than LAVSE and several related SE methods.
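The multimodal CRNN design described above can be sketched as follows. This is a minimal illustration, not the paper's exact configuration: the layer sizes, feature dimensions, and the assumption that lip-region visual features arrive as precomputed per-frame vectors are all hypothetical choices for demonstration. Convolutional encoders extract per-frame audio and visual embeddings, the fused sequence passes through an LSTM (the recurrent part of the CRNN), and a linear layer predicts the enhanced spectral frame.

```python
# Hypothetical sketch of a multimodal CRNN for audio-visual speech
# enhancement; dimensions are illustrative, not the iLAVSE configuration.
import torch
import torch.nn as nn

class AVCRNN(nn.Module):
    def __init__(self, n_freq=257, n_visual=64, hidden=128):
        super().__init__()
        # Audio branch: 1-D convolution over the frequency bins of each
        # noisy spectral frame, pooled to a fixed-size embedding.
        self.audio_enc = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(32), nn.Flatten(),  # -> 8 * 32 = 256 dims
        )
        # Visual branch: assumes precomputed lip-region feature vectors.
        self.visual_enc = nn.Sequential(nn.Linear(n_visual, 64), nn.ReLU())
        # Recurrent part of the CRNN: LSTM over the fused frame sequence.
        self.lstm = nn.LSTM(256 + 64, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)  # enhanced magnitude frame

    def forward(self, audio, visual):
        # audio: (batch, time, n_freq); visual: (batch, time, n_visual)
        b, t, f = audio.shape
        a = self.audio_enc(audio.reshape(b * t, 1, f)).reshape(b, t, -1)
        v = self.visual_enc(visual)
        fused, _ = self.lstm(torch.cat([a, v], dim=-1))
        return self.out(fused)

model = AVCRNN()
enhanced = model(torch.randn(2, 100, 257), torch.randn(2, 100, 64))
print(tuple(enhanced.shape))  # (2, 100, 257)
```

The late-fusion layout (separate encoders joined before the recurrent layer) is one common way to let the network weight the visual stream as an auxiliary input; the actual fusion point and quantization details in iLAVSE differ and are described in the paper.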

RELATED WORK
PROPOSED ILAVSE SYSTEM
Data Quantization
Three Practical Issues and Proposed Solutions
EXPERIMENTS
Experimental Setup
Experimental Results
CONCLUSION
