Noisy Audio Research Articles

Because noise degrades conventional audio speech recognition (ASR), audio-visual speech recognition (AVSR) has been proposed, which incorporates both audio and visual signals. Although existing methods have demonstrated that aligned visual input of lip movements can make AVSR systems more robust to noise, paired videos are not always available during inference. This missing-visual-modality problem restricts the practicality of such systems in real-world scenarios. To tackle it, we propose a Discrete Feature based Visual Generative Model (DFVGM), which exploits semantic correspondences between the audio and visual modalities during training and generates visual hallucinations in lieu of real videos during inference. The primary challenge is to generate the visual hallucination from noisy audio while preserving semantic correspondences with the clean speech. We start by training the audio encoder in the Audio-Only (AO) setting, which yields continuous semantic features closely associated with the linguistic information. Simultaneously, the visual encoder is trained in the Visual-Only (VO) setting, producing visual features that are phonetically related. Next, we employ K-means to discretize the continuous audio and visual feature spaces. The discretization step allows DFVGM to capture high-level semantic structures that are more resilient to noise and to generate high-quality visual hallucinations. To evaluate the effectiveness and robustness of our approach, we conduct extensive experiments on two publicly available datasets. The results demonstrate that our method achieves a 53% relative reduction (30.5% → 12.9%) in Word Error Rate (WER) on average compared to the current state-of-the-art AO baselines, while maintaining comparable results (< 5% difference) to the Audio-Visual (AV) setting even without video as input.
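The K-means discretization step described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the feature dimension, the codebook size `k=8`, the random stand-in features, and the helper names `fit_codebook` and `quantize` are all assumptions made for illustration. It only shows the general idea that mapping continuous frames to their nearest centroid yields discrete tokens that tend to survive small perturbations of the input.

```python
import numpy as np

def quantize(features, centroids):
    """Map each frame to the index of its nearest centroid (its discrete token)."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def fit_codebook(features, k, iters=25, seed=0):
    """Plain Lloyd's k-means; returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids from k distinct frames.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = quantize(features, centroids)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

# Hypothetical stand-in for encoder outputs: 500 frames of 16-dim features.
rng = np.random.default_rng(1)
clean = rng.normal(size=(500, 16))
codebook = fit_codebook(clean, k=8)
clean_tokens = quantize(clean, codebook)

# A mildly perturbed copy of the same frames, mimicking noisy input.
noisy = clean + 0.05 * rng.normal(size=clean.shape)
noisy_tokens = quantize(noisy, codebook)
print("fraction of tokens unchanged:", (clean_tokens == noisy_tokens).mean())
```

In a real system the frames would come from the trained audio and visual encoders rather than a random generator, and the codebook size would be tuned on held-out data.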

Next-generation audio-visual (AV) hearing aids stand as a major enabler to realize more intelligible audio. However, high data rate, low latency, low computational complexity, and privacy are some of the major bottlenecks to the successful deployment of such advanced hearing aids. To address these challenges, we propose an integration of 5G Cloud-Radio Access Network (C-RAN), Internet of Things (IoT), and strong privacy algorithms to fully benefit from the possibilities these technologies have to offer. Existing audio-only hearing aids are known to perform poorly in situations with overwhelming noise. Current devices make the signal more audible but remain deficient in restoring intelligibility. Thus, there is a need for hearing aids that can selectively amplify the attended talker or filter out acoustic clutter. The proposed 5G IoT-enabled AV hearing-aid framework transmits encrypted, compressed AV information and receives encrypted, enhanced, reconstructed speech in real time, to counter cybersecurity threats such as location-privacy breaches and eavesdropping. For security implementation, a real-time lightweight AV encryption is proposed, based on a piece-wise linear chaotic map (PWLCM), Chebyshev map, and a secure hash and S-Box algorithm. For speech enhancement, the received secure AV (including lip-reading) information in the cloud is used to filter noisy audio using both deep learning and analytical acoustic modelling. To offload the computational complexity and real-time optimization issues, the framework runs deep learning and big data optimization processes in the background, on the cloud. The effectiveness and security of the proposed 5G-IoT-enabled AV hearing-aid framework are extensively evaluated using widely known security metrics. Our newly reported, deep learning-driven lip-reading approach for speech enhancement is evaluated under four different dynamic real-world scenarios (cafe, street, public transport, pedestrian area) using the benchmark Grid and CHiME-3 corpora. Comparative critical analysis in terms of both speech enhancement and AV encryption demonstrates the potential of the envisioned technology to deliver high-quality speech reconstruction and secure mobile AV hearing aid communication. We believe our proposed 5G IoT-enabled AV hearing aid framework is an effective and feasible solution and represents a step change in the development of next-generation multimodal digital hearing aids. Ongoing and future work includes more extensive evaluation and comparison with benchmark lightweight encryption algorithms, and a hardware prototype implementation.
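To make the chaotic-map ingredient of the encryption scheme more concrete, the sketch below iterates a piece-wise linear chaotic map to produce a keystream and XORs it with a byte string. It is a toy illustration only, not the scheme proposed in the abstract, which also combines a Chebyshev map, a secure hash, and an S-Box. The key values `x0` and `p`, the byte-extraction rule, and the function names are assumptions, and a floating-point keystream like this should not be treated as secure.

```python
def pwlcm(x, p):
    """One step of the piece-wise linear chaotic map on [0, 1], with 0 < p < 0.5."""
    if x > 0.5:  # the map is symmetric about x = 0.5
        x = 1.0 - x
    if x < p:
        return x / p
    return (x - p) / (0.5 - p)

def keystream(n, x0, p):
    """Generate n keystream bytes by iterating the map from the secret state x0."""
    out = bytearray()
    x = x0
    for _ in range(n):
        x = pwlcm(x, p)
        out.append(int(x * 256) % 256)  # assumed rule: scale the state to one byte
    return bytes(out)

def xor_cipher(data, x0, p):
    """XOR data with the keystream; applying it twice restores the input."""
    ks = keystream(len(data), x0, p)
    return bytes(b ^ k for b, k in zip(data, ks))

# Hypothetical key (initial state and control parameter) and payload.
frame = b"compressed AV frame"
ciphertext = xor_cipher(frame, x0=0.3141, p=0.27)
recovered = xor_cipher(ciphertext, x0=0.3141, p=0.27)
print(recovered == frame)
```

Because XOR is its own inverse, the same function serves for encryption and decryption as long as both sides share `x0` and `p`.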

Articles published on Noisy Audio

Wiener Filter with Convolutional Neural Network for Noise Removal in API-Based AI Models

Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition

Visual Hallucination Elevates Speech Recognition

Audio Spectrogram Transformer (AST): Advantages over Traditional Algorithms in Speech-to-Text (STT)

Development of acoustic denoising learning network for communication enhancement in construction sites

Self-supervised speech denoising using only noisy audio signals

TTS - VLSP 2021: Development of Smartcall Vietnamese Text-to-Speech

A Novel Human-Vehicle Interaction Assistive Device for Arab Drivers Using Speech Recognition

Application of Fusion of Various Spontaneous Speech Analytics Methods for Improving Far-Field Neural-Based Diarization

Mixture of Inference Networks for VAE-Based Audio-Visual Speech Enhancement

GRACE: Generating Summary Reports Automatically for Cognitive Assistance in Emergency Response

Robust Audio Content Classification Using Hybrid-Based SMD and Entropy-Based VAD

A Novel Real-Time, Lightweight Chaotic-Encryption Scheme for Next-Generation Audio-Visual Hearing Aids

A Fuzzy Approach to Mute Sensitive Information in Noisy Audio Conversations

Windowed Adaptive Filtering for Reducing Noise in Audio Signals during Transmission to Remote Locations

MISNA - A musical instrument segregation system from noisy audio with LPCC-S features and extreme learning

Building a Noisy Audio Dataset to Evaluate Machine Learning Approaches for Automatic Speech Recognition Systems

Innovative Method for Unsupervised Voice Activity Detection and Classification of Audio Segments

Using Paired Distances of Signal Peaks in Stereo Channels as Fingerprints for Copy Identification
