Audio-visual Fusion Research Articles

PurposeAlthough numerous signal modalities are available for emotion recognition, audio and visual modalities are the most common and predominant forms for human beings to express their emotional states in daily communication. Therefore, how to achieve automatic and accurate audiovisual emotion recognition is significantly important for developing engaging and empathetic human–computer interaction environment. However, two major challenges exist in the field of audiovisual emotion recognition: (1) how to effectively capture representations of each single modality and eliminate redundant features and (2) how to efficiently integrate information from these two modalities to generate discriminative representations.Design/methodology/approachA novel key-frame extraction-based attention fusion network (KE-AFN) is proposed for audiovisual emotion recognition. KE-AFN attempts to integrate key-frame extraction with multimodal interaction and fusion to enhance audiovisual representations and reduce redundant computation, filling the research gaps of existing approaches. Specifically, the local maximum–based content analysis is designed to extract key-frames from videos for the purpose of eliminating data redundancy. Two modules, including “Multi-head Attention-based Intra-modality Interaction Module” and “Multi-head Attention-based Cross-modality Interaction Module”, are proposed to mine and capture intra- and cross-modality interactions for further reducing data redundancy and producing more powerful multimodal representations.FindingsExtensive experiments on two benchmark datasets (i.e. RAVDESS and CMU-MOSEI) demonstrate the effectiveness and rationality of KE-AFN. Specifically, (1) KE-AFN is superior to state-of-the-art baselines for audiovisual emotion recognition. (2) Exploring the supplementary and complementary information of different modalities can provide more emotional clues for better emotion recognition. (3) The proposed key-frame extraction strategy can enhance the performance by more than 2.79 per cent on accuracy. (4) Both exploring intra- and cross-modality interactions and employing attention-based audiovisual fusion can lead to better prediction performance.Originality/valueThe proposed KE-AFN can support the development of engaging and empathetic human–computer interaction environment.

Read full abstract

With the rise in piano teaching in recent years, many people have joined the ranks of piano learners. However, the high cost of traditional manual instruction and the exclusive one-on-one teaching model have made learning the piano an extravagant endeavor. Most existing approaches, based on the audio modality, aim to evaluate piano players’ skills. Unfortunately, these methods overlook the information contained in videos, resulting in a one-sided and simplistic evaluation of the piano player’s skills. More recently, multimodal-based methods have been proposed to assess the skill level of piano players by using both video and audio information. However, existing multimodal approaches use shallow networks to extract video and audio features, which limits their ability to extract complex spatio-temporal and time-frequency characteristics from piano performances. Furthermore, the fingering and pitch-rhythm information of the piano performance is embedded within the spatio-temporal and time-frequency features, respectively. Therefore, we propose a ResNet-based audio-visual fusion model that is able to extract both the visual features of the player’s finger movement track and the auditory features, including pitch and rhythm. The joint features are then obtained through the feature fusion technique by capturing the correlation and complementary information between video and audio, enabling a comprehensive and accurate evaluation of the player’s skill level. Moreover, the proposed model can extract complex temporal and frequency features from piano performances. Firstly, ResNet18-3D is used as the backbone network for our visual branch, allowing us to extract feature information from the video data. Then, we utilize ResNet18-2D as the backbone network for the aural branch to extract feature information from the audio data. The extracted video features are then fused with the audio features, generating multimodal features for the final piano skill evaluation. The experimental results on the PISA dataset show that our proposed audio-visual fusion model, with a validation accuracy of 70.80% and an average training time of 74.02 s, outperforms the baseline model in terms of performance and operational efficiency. Furthermore, we explore the impact of different layers of ResNet on the model’s performance. In general, the model achieves optimal performance when the ratio of video features to audio features is balanced. However, the best performance achieved is 68.70% when the ratio differs significantly.

Read full abstract

Audio-visual Fusion Research Articles

Related Topics

Articles published on Audio-visual Fusion

Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging

Day2Dark: Pseudo-Supervised Activity Recognition Beyond Silent Daylight

Heterogeneous Hierarchical Fusion Network for Multimodal Sentiment Analysis in Real-World Environments

Audio-visual Fusion of Artificial Intelligence for Enhanced Human Recognition

CLIP2TF:Multimodal video–text retrieval for adolescent education

AV-FDTI: Audio-visual fusion for drone threat identification

A novel vehicle collision detection system: Integrating audio-visual fusion for enhanced performance

Audio-visual saliency prediction with multisensory perception and integration

Perceptual Training as Means to Assess the Effect of Alpha Frequency on Temporal Binding Window.

AVT$^{2}$-DWF: Improving Deepfake Detection With Audio-Visual Fusion and Dynamic Weighting Strategies

A Two-Stage Audio-Visual Fusion Piano Transcription Model Based on the Attention Mechanism

Audio-Visual Fusion Based on Interactive Attention for Person Verification.

Analyzing audiovisual data for understanding user's emotion in human−computer interaction environment

Two-Stage Fusion-Based Audiovisual Remote Sensing Scene Classification

Incremental Audio-Visual Fusion for Person Recognition in Earthquake Scene

AudioVisual Video Summarization.

Audio–Visual Fusion for Emotion Recognition in the Valence–Arousal Space Using Joint Cross-Attention

A ResNet-Based Audio-Visual Fusion Model for Piano Skill Evaluation

Lip landmark-based audio-visual speech enhancement with multimodal feature fusion network

An Investigation into Audio–Visual Speech Recognition under a Realistic Home–TV Scenario

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Audio-visual Fusion Research Articles

Related Topics

Articles published on Audio-visual Fusion

Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging

Day2Dark: Pseudo-Supervised Activity Recognition Beyond Silent Daylight

Heterogeneous Hierarchical Fusion Network for Multimodal Sentiment Analysis in Real-World Environments

Audio-visual Fusion of Artificial Intelligence for Enhanced Human Recognition

CLIP2TF:Multimodal video–text retrieval for adolescent education

AV-FDTI: Audio-visual fusion for drone threat identification

A novel vehicle collision detection system: Integrating audio-visual fusion for enhanced performance

Audio-visual saliency prediction with multisensory perception and integration

Perceptual Training as Means to Assess the Effect of Alpha Frequency on Temporal Binding Window.

AVT$^{2}$-DWF: Improving Deepfake Detection With Audio-Visual Fusion and Dynamic Weighting Strategies

A Two-Stage Audio-Visual Fusion Piano Transcription Model Based on the Attention Mechanism

Audio-Visual Fusion Based on Interactive Attention for Person Verification.

Analyzing audiovisual data for understanding user's emotion in human−computer interaction environment

Two-Stage Fusion-Based Audiovisual Remote Sensing Scene Classification

Incremental Audio-Visual Fusion for Person Recognition in Earthquake Scene

AudioVisual Video Summarization.

Audio–Visual Fusion for Emotion Recognition in the Valence–Arousal Space Using Joint Cross-Attention

A ResNet-Based Audio-Visual Fusion Model for Piano Skill Evaluation

Lip landmark-based audio-visual speech enhancement with multimodal feature fusion network

An Investigation into Audio–Visual Speech Recognition under a Realistic Home–TV Scenario