Advances in generative artificial intelligence have made it increasingly easy to manipulate auditory and visual content, highlighting the critical need for robust audio-visual deepfake detection methods. In this paper, we propose ART-AVDF, an articulatory representation-based audio-visual deepfake detection framework. First, we devise an audio encoder that extracts articulatory features capturing the physical dynamics of articulatory movement, and integrate it with a lip encoder to learn audio-visual articulatory correspondences in a self-supervised manner. We then design a multimodal joint fusion module that exploits the articulatory embeddings to further capture inherent audio-visual consistency. Extensive experiments on the DFDC, FakeAVCeleb, and DefakeAVMiT datasets demonstrate that ART-AVDF significantly outperforms many existing deepfake detection models.
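To make the two-stage pipeline concrete, the following is a minimal sketch of the design described above: an audio-side articulatory encoder and a lip encoder aligned with a self-supervised contrastive objective, followed by a fusion module trained for real/fake classification. All module names (ArticulatoryEncoder, LipEncoder, JointFusion), the backbone choices, and the use of InfoNCE alignment and cross-attention fusion are illustrative assumptions, since the abstract does not specify architectural details; this is not the authors' implementation.

```python
# Hypothetical sketch of a two-stage articulatory audio-visual pipeline.
# Assumptions (not from the paper): conv backbones, InfoNCE alignment,
# cross-attention fusion, and all dimensions/names below.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ArticulatoryEncoder(nn.Module):
    """Maps audio-derived articulatory features (B, T, F) to embeddings."""
    def __init__(self, in_dim=24, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, dim, kernel_size=5, padding=2),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2),
        )

    def forward(self, x):  # x: (B, T, F)
        return self.net(x.transpose(1, 2)).transpose(1, 2)  # (B, T, dim)


class LipEncoder(nn.Module):
    """Encodes lip-region frames (B, T, 3, H, W) frame-by-frame for simplicity."""
    def __init__(self, dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, v):  # v: (B, T, 3, H, W)
        b, t = v.shape[:2]
        f = self.cnn(v.flatten(0, 1)).flatten(1)  # (B*T, 64)
        return self.proj(f).view(b, t, -1)        # (B, T, dim)


def info_nce(a, v, temp=0.07):
    """Symmetric InfoNCE over clip-level embeddings: matched audio/lip
    pairs in a batch are positives, all other pairings are negatives."""
    a = F.normalize(a.mean(dim=1), dim=-1)  # (B, dim)
    v = F.normalize(v.mean(dim=1), dim=-1)
    logits = a @ v.t() / temp               # (B, B)
    target = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))


class JointFusion(nn.Module):
    """Cross-attends lip embeddings to articulatory embeddings,
    then pools to a single real/fake logit."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, a, v):  # both (B, T, dim)
        fused, _ = self.attn(query=v, key=a, value=a)
        return self.head(fused.mean(dim=1)).squeeze(-1)  # (B,)


if __name__ == "__main__":
    B, T = 4, 16
    audio = torch.randn(B, T, 24)          # articulatory-domain audio features
    video = torch.randn(B, T, 3, 64, 64)   # lip-region crops
    labels = torch.randint(0, 2, (B,)).float()

    a_enc, v_enc, fusion = ArticulatoryEncoder(), LipEncoder(), JointFusion()
    a, v = a_enc(audio), v_enc(video)
    # Stage 1: self-supervised audio-visual articulatory alignment.
    align_loss = info_nce(a, v)
    # Stage 2: supervised deepfake classification on fused embeddings.
    cls_loss = F.binary_cross_entropy_with_logits(fusion(a, v), labels)
    print(align_loss.item(), cls_loss.item())
```

In this reading, the contrastive stage encourages embeddings that agree only when lip motion and the articulatory signal come from the same genuine utterance, so manipulated audio or video breaks the learned correspondence and the fusion classifier can exploit the mismatch.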