HPCNet: Hybrid Pixel and Contour Network for Audio-Visual Speech Enhancement With Low-Quality Video

Similar Papers
  • Conference Article
  • Cited by 7
  • 10.1109/icassp43922.2022.9746866
Time-Domain Audio-Visual Speech Separation on Low Quality Videos
  • May 23, 2022
  • Yifei Wu + 4 more

Incorporating visual information is a promising approach to improving the performance of speech separation. Many related works have been conducted and provide inspiring results. However, low-quality videos appear commonly in real scenarios and may significantly degrade the performance of a standard audio-visual speech separation system. In this paper, we propose a new structure to fuse the audio and visual features, which uses the audio feature to select relevant visual features via an attention mechanism. A Conv-TasNet-based model is combined with the proposed attention-based multi-modal fusion, trained with proper data augmentation, and evaluated on three categories of low-quality videos. The experimental results show that our system outperforms a baseline that simply concatenates the audio and visual features when training with normal or low-quality data, and is robust to low-quality video inputs at inference time.
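
The audio-queried attention fusion described above can be pictured with a small sketch. The PyTorch snippet below only illustrates the general idea (audio features act as queries over visual keys and values); it is not the paper's exact architecture, and the module name AudioGuidedFusion and all dimensions are assumptions for the example.

    # Hedged sketch of attention-based audio-visual fusion: the audio stream
    # queries the visual stream so that only relevant visual frames are mixed in.
    # Dimensions and module names are illustrative, not taken from the paper.
    import torch
    import torch.nn as nn

    class AudioGuidedFusion(nn.Module):
        def __init__(self, d_audio=256, d_visual=512, d_model=256):
            super().__init__()
            self.q = nn.Linear(d_audio, d_model)   # audio -> query
            self.k = nn.Linear(d_visual, d_model)  # visual -> key
            self.v = nn.Linear(d_visual, d_model)  # visual -> value
            self.out = nn.Linear(d_audio + d_model, d_model)

        def forward(self, audio, visual):
            # audio: (batch, T_a, d_audio), visual: (batch, T_v, d_visual)
            q, k, v = self.q(audio), self.k(visual), self.v(visual)
            attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
            selected = attn @ v                      # (batch, T_a, d_model)
            return self.out(torch.cat([audio, selected], dim=-1))

    fusion = AudioGuidedFusion()
    fused = fusion(torch.randn(2, 100, 256), torch.randn(2, 25, 512))
    print(fused.shape)  # torch.Size([2, 100, 256])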

  • Research Article
  • Cited by 51
  • 10.1109/tetci.2019.2917039
Lip-Reading Driven Deep Learning Approach for Speech Enhancement
  • Jun 1, 2021
  • IEEE Transactions on Emerging Topics in Computational Intelligence
  • Ahsan Adeel + 3 more

This paper proposes a novel lip-reading driven deep learning framework for speech enhancement. The approach leverages the complementary strengths of both deep learning and analytical acoustic modeling (filtering-based approach) as compared to benchmark approaches that rely only on deep learning. The proposed audio-visual (AV) speech enhancement framework operates at two levels. In the first level, a novel deep learning based lip-reading regression model is employed. In the second level, lip-reading approximated clean-audio features are exploited, using an enhanced, visually-derived Wiener filter (EVWF), for estimating the clean audio power spectrum. Specifically, a stacked long short-term memory (LSTM) based lip-reading regression model is designed for estimating the clean audio features using only temporal visual features (i.e., lip reading), by considering a range of prior visual frames. For clean speech spectrum estimation, a new filterbank-domain EVWF is formulated, which exploits the estimated speech features. The EVWF is compared with conventional spectral subtraction and log-minimum mean-square error methods using both ideal AV mapping and LSTM driven AV mapping approaches. The potential of the proposed AV speech enhancement framework is evaluated under four different dynamic real-world scenarios (cafe, street junction, public transport, and pedestrian area) at different SNR levels (ranging from low to high SNRs) using the benchmark Grid and CHiME-3 corpora. For objective testing, perceptual evaluation of speech quality is used to evaluate the quality of restored speech. For subjective testing, the standard mean-opinion-score method is used with inferential statistics. Comparative simulation results demonstrate significant lip-reading and speech enhancement improvements in terms of both speech quality and speech intelligibility. Ongoing work is aimed at enhancing the accuracy and generalization capability of the deep learning driven lip-reading model, using contextual integration of AV cues, leading to context-aware, autonomous AV speech enhancement.
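
The Wiener-filtering step can be made concrete with a small numerical sketch. The snippet below is a generic spectral-domain Wiener gain driven by a visually estimated clean power spectrum; it does not reproduce the paper's filterbank-domain EVWF, and the function name and array shapes are assumptions for illustration.

    # Minimal numpy sketch of a visually-derived Wiener filter: given a clean-speech
    # power spectrum estimated from visual (lip-reading) features, build a Wiener
    # gain and apply it to the noisy STFT. This illustrates the filtering step only;
    # the filterbank-domain details of the paper's EVWF are not reproduced.
    import numpy as np

    def wiener_enhance(noisy_stft, est_clean_power, eps=1e-8):
        # noisy_stft: complex STFT (freq, frames); est_clean_power: same shape, real
        noisy_power = np.abs(noisy_stft) ** 2
        est_noise_power = np.maximum(noisy_power - est_clean_power, eps)
        gain = est_clean_power / (est_clean_power + est_noise_power + eps)
        return gain * noisy_stft  # enhanced complex STFT, ready for inverse STFT

    rng = np.random.default_rng(0)
    noisy = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
    clean_power_hat = np.abs(rng.standard_normal((257, 100))) ** 2
    enhanced = wiener_enhance(noisy, clean_power_hat)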

  • Conference Article
  • Cited by 4
  • 10.1109/icassp49357.2023.10096507
Incorporating Visual Information Reconstruction into Progressive Learning for Optimizing audio-visual Speech Enhancement
  • Jun 4, 2023
  • Chen-Yue Zhang + 5 more

Video information has been widely introduced into speech enhancement because of its contribution at low signal-to-noise ratios (SNRs). Conventional audio-visual speech enhancement networks take noisy speech and video as input and learn features of clean speech directly. To reduce the large SNR gap between the learning target and the input noisy speech, we propose a novel mask-based audio-visual progressive learning speech enhancement (AVPL) framework with visual information reconstruction (VIR) that increases the SNR gradually. Each stage of AVPL takes a concatenation of a pre-trained visual embedding and the previous representation as input and predicts a mask from the intermediate representation of the current stage. To extract more visual information and deal with the performance distortion, the AVPL-VIR model reconstructs the visual embedding that is fed in at each stage. Experiments on the TCD-TIMIT dataset show that the progressive learning method significantly outperforms direct learning for both audio-only and audio-visual models. Moreover, by reconstructing video information, the VIR module provides a more accurate and comprehensive representation of the data, which in turn improves the performance of both the direct-learning (AVDL) and progressive-learning (AVPL) models.
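
One stage of the mask-based progressive scheme with visual reconstruction might look roughly like the sketch below. It is an assumption-laden illustration (module names, layer sizes and the three-stage loop are invented for the example), not the authors' implementation.

    # Rough PyTorch sketch of one mask-based progressive-learning stage with visual
    # information reconstruction (VIR): each stage consumes the previous audio
    # representation concatenated with a visual embedding, predicts a mask for an
    # intermediate (higher-SNR) target, and also reconstructs the visual embedding
    # it was given. Layer sizes and names are illustrative only.
    import torch
    import torch.nn as nn

    class ProgressiveStage(nn.Module):
        def __init__(self, d_audio=257, d_visual=128, d_hidden=512):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Linear(d_audio + d_visual, d_hidden), nn.ReLU(),
                nn.Linear(d_hidden, d_hidden), nn.ReLU())
            self.mask_head = nn.Sequential(nn.Linear(d_hidden, d_audio), nn.Sigmoid())
            self.vir_head = nn.Linear(d_hidden, d_visual)  # reconstructs visual input

        def forward(self, audio_repr, visual_emb):
            h = self.backbone(torch.cat([audio_repr, visual_emb], dim=-1))
            mask = self.mask_head(h)
            return mask * audio_repr, self.vir_head(h)  # next repr, reconstructed visual

    stages = nn.ModuleList([ProgressiveStage() for _ in range(3)])  # e.g. one stage per SNR step
    x, v = torch.randn(4, 100, 257), torch.randn(4, 100, 128)
    for stage in stages:
        x, v_hat = stage(x, v)  # a VIR loss would compare v_hat with v at each stage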

  • Research Article
  • Cited by 5
  • 10.1109/tai.2024.3366141
Robust Real-Time Audio–Visual Speech Enhancement Based on DNN and GAN
  • Nov 1, 2025
  • IEEE Transactions on Artificial Intelligence
  • Mandar Gogate + 2 more

The human auditory cortex contextually integrates audio-visual (AV) cues to better understand speech in a cocktail party situation. Recent studies have shown that AV speech enhancement (SE) models can significantly improve speech quality and intelligibility in low signal-to-noise ratio (SNR < −5 dB) environments compared to audio-only (A-only) SE models. However, despite substantial research in the area of AV SE, the development of real-time processing models that can generalise across various types of visual and acoustic noises remains a formidable technical challenge. This paper introduces a novel framework for low-latency, speaker-independent AV SE. The proposed framework is designed to generalise to visual and acoustic noises encountered in real-world settings. In particular, a generative adversarial network (GAN) is proposed to address the issue of visual speech noise, including poor lighting, in real noisy environments. In addition, a novel real-time AV SE model based on a deep neural network is proposed. The model leverages the enhanced visual speech from the GAN to deliver robust SE. The effectiveness of the proposed framework is evaluated on synthetic AV datasets using objective speech quality and intelligibility metrics. Furthermore, subjective listening tests are conducted using real noisy AV corpora. The results demonstrate that the proposed real-time AV SE framework improves the mean opinion score by 20% as compared to state-of-the-art SE approaches, including recent DNN-based AV SE models.

  • Research Article
  • Cited by 75
  • 10.1016/j.inffus.2019.08.008
Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments
  • Aug 19, 2019
  • Information Fusion
  • Ahsan Adeel + 2 more

  • Research Article
  • Cited by 25
  • 10.1016/j.specom.2019.10.006
Deep-learning-based audio-visual speech enhancement in presence of Lombard effect
  • Oct 30, 2019
  • Speech Communication
  • Daniel Michelsanti + 3 more

  • Conference Article
  • Cited by 11
  • 10.1109/slt54892.2023.10023284
AVSE Challenge: Audio-Visual Speech Enhancement Challenge
  • Jan 9, 2023
  • Andrea Lorena Aldana Blanco + 6 more

Audio-visual speech enhancement is the task of improving the quality of a speech signal when video of the speaker is available. It opens up the opportunity of improving speech intelligibility in adverse listening scenarios that are currently too challenging for audio-only speech enhancement models. The Audio-Visual Speech Enhancement (AVSE) challenge aims to set the first benchmark in this area. We provide participants with datasets and scripts to test their audio-visual speech enhancement models under a common framework for both training and evaluation. The data is derived from real-world videos and comprises noisy mixes in which audio from the target speaker is mixed with either a competing speaker or a noise signal. The submitted systems are evaluated by conducting AV intelligibility tests involving human participants. We expect this challenge to be a platform for advancing the field of audio-visual speech enhancement and to provide further insight into the scope and limitations of current AV speech enhancement approaches.

  • Conference Article
  • Cited by 25
  • 10.1109/cvpr52688.2022.00805
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis
  • Jun 1, 2022
  • Karren Yang + 4 more

Since facial actions such as lip movements contain significant information about speech content, it is not surprising that audio-visual speech enhancement methods are more accurate than their audio-only counterparts. Yet, state-of-the-art approaches still struggle to generate clean, realistic speech without noise artifacts and unnatural distortions in challenging acoustic environments. In this paper, we propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR. Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals. Given the importance of speaker-specific cues in speech, we focus on developing personalized models that work well for individual speakers. We demonstrate the efficacy of our approach on a new audio-visual speech dataset collected in an unconstrained, large vocabulary setting, as well as existing audio-visual datasets, outperforming speech enhancement baselines on both quantitative metrics and human evaluation studies. Please see the supplemental video for qualitative results: https://github.com/facebookresearch/facestar/releases/download/paper_materials/video.mp4.
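
The re-synthesis idea (predict codec codes from fused audio-visual features, then decode them back to speech) can be sketched as follows. The codec here is a stand-in (a learned codebook plus a GRU decoder); CodePredictor, the codebook size and all dimensions are hypothetical rather than the paper's actual neural codec.

    # Hedged sketch of enhancement by re-synthesis: instead of masking the noisy
    # spectrogram, a network predicts discrete codes of a (hypothetical) neural
    # speech codec from fused audio-visual features, and a decoder re-synthesizes
    # clean speech from those codes. Codebook size, dimensions and names are
    # illustrative, not the paper's actual codec.
    import torch
    import torch.nn as nn

    class CodePredictor(nn.Module):
        def __init__(self, d_av=512, n_codes=1024, d_code=256):
            super().__init__()
            self.to_logits = nn.Linear(d_av, n_codes)      # one code index per frame
            self.codebook = nn.Embedding(n_codes, d_code)  # learned codec codebook
            self.decoder = nn.GRU(d_code, d_code, batch_first=True)  # stand-in decoder

        def forward(self, av_features):
            logits = self.to_logits(av_features)           # (batch, frames, n_codes)
            codes = logits.argmax(dim=-1)                  # discrete code indices
            decoded, _ = self.decoder(self.codebook(codes))
            return logits, codes, decoded                  # logits for CE training

    model = CodePredictor()
    logits, codes, decoded = model(torch.randn(2, 50, 512))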

  • Research Article
  • Cited by 204
  • 10.1109/taslp.2021.3066303
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
  • Jan 1, 2021
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • Daniel Michelsanti + 6 more

Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.

  • Conference Article
  • Cited by 11
  • 10.1109/icassp39728.2021.9414133
Audio-Visual Speech Enhancement Method Conditioned in the Lip Motion and Speaker-Discriminative Embeddings
  • Jun 6, 2021
  • Koichiro Ito + 2 more

We propose an audio-visual speech enhancement (AVSE) method conditioned both on the speaker’s lip motion and on speaker-discriminative embeddings. We particularly explore a method of extracting the embeddings directly from noisy audio in the AVSE setting without an enrollment procedure. We aim to improve speech-enhancement performance by conditioning the model with the embedding. To achieve this goal, we devise an AV voice activity detection (AV-VAD) module and a speaker identification module for the AVSE model. The AV-VAD module assesses reliable frames from which the identification module can extract a robust embedding for achieving an enhancement with the lip motion. To effectively train our modules, we propose multi-task learning between the AVSE, speaker identification, and VAD. Experimental results show that (1) our method directly extracted robust speaker embeddings from the noisy audio without an enrollment procedure and (2) improved the enhancement performance compared with the conventional AVSE methods.
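
The multi-task setup (enhancement, speaker identification and voice activity detection sharing one trunk) can be illustrated with the brief sketch below; the class name MultiTaskAVSE, the layer sizes and the loss weights in the final comment are assumptions, not values from the paper.

    # Minimal sketch of the multi-task idea: one shared trunk, with separate heads
    # for enhancement (mask), speaker identification and voice activity detection,
    # trained with a weighted sum of the three losses. Sizes are illustrative.
    import torch
    import torch.nn as nn

    class MultiTaskAVSE(nn.Module):
        def __init__(self, d_in=512, d_spec=257, n_speakers=100):
            super().__init__()
            self.trunk = nn.GRU(d_in, 256, batch_first=True)
            self.mask_head = nn.Sequential(nn.Linear(256, d_spec), nn.Sigmoid())
            self.spk_head = nn.Linear(256, n_speakers)
            self.vad_head = nn.Linear(256, 1)

        def forward(self, av_feats):
            h, _ = self.trunk(av_feats)
            return self.mask_head(h), self.spk_head(h.mean(dim=1)), self.vad_head(h)

    model = MultiTaskAVSE()
    mask, spk_logits, vad_logits = model(torch.randn(2, 100, 512))
    # Example joint objective (weights are assumptions):
    # loss = mse(mask * noisy_mag, clean_mag) + 0.1 * ce(spk_logits, spk_id) \
    #        + 0.1 * bce_with_logits(vad_logits, vad_labels)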

  • Research Article
  • 10.1109/tbme.2025.3610284
Leveraging Self-Supervised Audio-Visual Pretrained Models to Improve Vocoded Speech Intelligibility in Cochlear Implant Simulation.
  • Jan 1, 2025
  • IEEE transactions on bio-medical engineering
  • Richard Lee Lai + 9 more

Individuals with hearing impairments face challenges in their ability to comprehend speech, particularly in noisy environments. This study explores the effectiveness of audio-visual speech enhancement (AVSE) in improving the intelligibility of vocoded speech in cochlear implant (CI) simulations. We propose a speech enhancement framework called Self-Supervised Learning-based AVSE (SSL-AVSE), which uses visual cues such as lip and mouth movements along with corresponding speech. Features are extracted using the AV-HuBERT model and refined through a bidirectional LSTM. Experiments were conducted using the Taiwan Mandarin speech with video (TMSV) dataset. Objective evaluations showed improvements in PESQ from 1.43 to 1.67 and in STOI from 0.70 to 0.74. NCM scores increased by up to 87.2% over the noisy baseline. Subjective listening tests further demonstrated maximum gains of 45.2% in speech quality and 51.9% in word intelligibility. SSL-AVSE consistently outperforms AOSE and conventional AVSE baselines. Listening tests with statistically significant results confirm its effectiveness. In addition to its strong performance, SSL-AVSE demonstrates cross-lingual generalization: although it was pretrained on English data, it performs effectively on Mandarin speech. This finding highlights the robustness of the features extracted by a pretrained foundation model and their applicability across languages. To the best of our knowledge, no prior work has explored the application of AVSE to CI simulations. This study provides the first evidence that incorporating visual information can significantly improve the intelligibility of vocoded speech in CI scenarios.
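
A rough picture of the SSL-AVSE pipeline is a frozen self-supervised feature extractor followed by a bidirectional LSTM mask estimator. In the sketch below the AV-HuBERT features are replaced by a random placeholder tensor, and SSLAVSEHead with its dimensions is an assumed illustration rather than the published model.

    # Rough sketch of the SSL-AVSE idea: frozen self-supervised audio-visual
    # features (e.g. AV-HuBERT) are refined by a bidirectional LSTM that predicts
    # a mask for the noisy magnitude spectrogram. The feature tensor below is a
    # placeholder standing in for the real pretrained model; sizes are assumptions.
    import torch
    import torch.nn as nn

    class SSLAVSEHead(nn.Module):
        def __init__(self, d_ssl=768, d_spec=257, d_hidden=256):
            super().__init__()
            self.blstm = nn.LSTM(d_ssl, d_hidden, batch_first=True, bidirectional=True)
            self.mask = nn.Sequential(nn.Linear(2 * d_hidden, d_spec), nn.Sigmoid())

        def forward(self, ssl_features, noisy_mag):
            h, _ = self.blstm(ssl_features)
            return self.mask(h) * noisy_mag  # enhanced magnitude spectrogram

    # Placeholder for frozen AV-HuBERT-style features: (batch, frames, 768)
    ssl_feats = torch.randn(2, 100, 768)
    noisy_mag = torch.rand(2, 100, 257)
    enhanced = SSLAVSEHead()(ssl_feats, noisy_mag)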

  • Research Article
  • Cited by 4
  • 10.1016/j.neucom.2023.126432
Lip landmark-based audio-visual speech enhancement with multimodal feature fusion network
  • Jun 10, 2023
  • Neurocomputing
  • Yangke Li + 1 more

  • Research Article
  • Cited by 32
  • 10.1097/aud.0000000000000830
Audiovisual Enhancement of Speech Perception in Noise by School-Age Children Who Are Hard of Hearing.
  • Jul 1, 2020
  • Ear & Hearing
  • Kaylah Lalonde + 1 more

The purpose of this study was to examine age- and hearing-related differences in school-age children's benefit from visual speech cues. The study addressed three questions: (1) Do age and hearing loss affect degree of audiovisual (AV) speech enhancement in school-age children? (2) Are there age- and hearing-related differences in the mechanisms underlying AV speech enhancement in school-age children? (3) What cognitive and linguistic variables predict individual differences in AV benefit among school-age children? Forty-eight children between 6 and 13 years of age (19 with mild to severe sensorineural hearing loss; 29 with normal hearing) and 14 adults with normal hearing completed measures of auditory and AV syllable detection and/or sentence recognition in a two-talker masker type and a spectrally matched noise. Children also completed standardized behavioral measures of receptive vocabulary, visuospatial working memory, and executive attention. Mixed linear modeling was used to examine effects of modality, listener group, and masker on sentence recognition accuracy and syllable detection thresholds. Pearson correlations were used to examine the relationship between individual differences in children's AV enhancement (AV-auditory-only) and age, vocabulary, working memory, executive attention, and degree of hearing loss. Significant AV enhancement was observed across all tasks, masker types, and listener groups. AV enhancement of sentence recognition was similar across maskers, but children with normal hearing exhibited less AV enhancement of sentence recognition than adults with normal hearing and children with hearing loss. AV enhancement of syllable detection was greater in the two-talker masker than the noise masker, but did not vary significantly across listener groups. Degree of hearing loss positively correlated with individual differences in AV benefit on the sentence recognition task in noise, but not on the detection task. None of the cognitive and linguistic variables correlated with individual differences in AV enhancement of syllable detection or sentence recognition. Although AV benefit to syllable detection results from the use of visual speech to increase temporal expectancy, AV benefit to sentence recognition requires that an observer extracts phonetic information from the visual speech signal. The findings from this study suggest that all listener groups were equally good at using temporal cues in visual speech to detect auditory speech, but that adults with normal hearing and children with hearing loss were better than children with normal hearing at extracting phonetic information from the visual signal and/or using visual speech information to access phonetic/lexical representations in long-term memory. These results suggest that standard, auditory-only clinical speech recognition measures likely underestimate real-world speech recognition skills of children with mild to severe hearing loss.

  • Conference Article
  • Cited by 30
  • 10.1109/apsipa.2016.7820732
Audio-visual speech enhancement using deep neural networks
  • Dec 1, 2016
  • Jen-Cheng Hou + 6 more

This paper proposes a novel framework that integrates audio and visual information for speech enhancement. Most speech enhancement approaches consider audio features only to design filters or transfer functions to convert noisy speech signals to clean ones. Visual data, which provide useful complementary information to audio data, have been integrated with audio data in many speech-related approaches to attain more effective speech processing performance. This paper presents our investigation into the use of visual features of lip motion as additional information to improve the speech enhancement performance of deep neural networks (DNNs). The experimental results show that the performance of the DNN with audio-visual inputs exceeds that of the DNN with audio inputs only on four standardized objective evaluations, thereby confirming the effectiveness of including visual information in an audio-only speech enhancement framework.
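
The early DNN-based approach amounts to regressing clean speech features from concatenated noisy audio and lip features. The minimal sketch below assumes illustrative feature sizes and an MSE training objective rather than the paper's exact configuration.

    # Simple illustrative DNN that maps concatenated noisy audio features and lip
    # (visual) features to clean speech features, in the spirit of early DNN-based
    # audio-visual enhancement; layer sizes are assumptions for illustration.
    import torch
    import torch.nn as nn

    d_audio, d_visual, d_clean = 257, 64, 257
    dnn = nn.Sequential(
        nn.Linear(d_audio + d_visual, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Linear(1024, d_clean))  # regresses clean log-magnitude features

    noisy_audio = torch.randn(32, d_audio)   # one frame per row
    lip_feats = torch.randn(32, d_visual)
    clean_hat = dnn(torch.cat([noisy_audio, lip_feats], dim=-1))
    # Training would minimise e.g. nn.MSELoss()(clean_hat, clean_target)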

  • Research Article
  • Cited by 1
  • 10.1121/10.0008500
Children’s age matters, but not for audiovisual speech enhancement
  • Oct 1, 2021
  • The Journal of the Acoustical Society of America
  • Liesbeth Gijbels + 3 more

Articulation movements help us identify speech in noisy environments. While this has been observed at almost all ages, the size of the perceived benefit and its relationship to development in children are less understood. Here, we focus on exploring audiovisual speech benefit in typically developing children (N = 160) across a wide age range (4–15 years) by measuring performance on an online audiovisual speech task that is low in cognitive and linguistic demands. Specifically, we investigated how audiovisual speech benefit develops with age and the impact of some potentially important intrinsic (e.g., gender, phonological skills) and extrinsic (e.g., choice of stimuli) experimental factors. Our results show increased performance in the individual modalities (audio-only, audiovisual, visual-only) as a function of age, but no difference in the size of audiovisual speech enhancement. Furthermore, older children showed a significant impact of visually distracting stimuli (e.g., mismatched video), whereas this had no additional impact on the performance of the youngest children. No phonological or gender differences were found, given the low cognitive and linguistic demands of this task.
