With the rapid development of social media and human–computer interaction, multimodal emotion recognition in conversations (MERC) has attracted widespread research attention. The MERC task is to extract and fuse complementary semantic information from different modalities in order to classify the speaker's emotion. However, existing feature fusion methods usually map the features of the other modalities directly into the same feature space, which cannot eliminate the heterogeneity between modalities and makes the subsequent learning of emotion class boundaries more difficult. In addition, existing graph contrastive learning methods obtain consistent feature representations by maximizing the mutual information between multiple views, which may lead to overfitting. To tackle these problems, we propose a novel Adversarial Alignment and Graph Fusion via Information Bottleneck for Multimodal Emotion Recognition in Conversations (AGF-IB) method. Firstly, we input the video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces. Secondly, we build a generator and a discriminator for each of the three modalities and use adversarial representation learning to achieve information interaction between modalities and eliminate inter-modal heterogeneity. Thirdly, we introduce graph contrastive representation learning to capture intra-modal and inter-modal complementary semantic information and to learn intra-class and inter-class boundary information of the emotion categories. Furthermore, instead of maximizing the mutual information (MI) between multiple views, we use information bottleneck theory to minimize the MI between views. Specifically, we construct a graph structure for each of the three modalities and perform contrastive representation learning on nodes with different emotions in the same modality and on nodes with the same emotion in different modalities, which improves the representation ability of the node features. Finally, we use an MLP to classify the speaker's emotion. Extensive experiments show that AGF-IB improves emotion recognition accuracy on the IEMOCAP and MELD datasets. Furthermore, since AGF-IB is a general multimodal fusion and contrastive learning method, it can be applied to other multimodal tasks, e.g., humor detection, in a plug-and-play manner.
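To make the pipeline described above more concrete, the following is a minimal PyTorch sketch, not the authors' implementation: modality-specific MLP encoders map text, audio, and video features into separate spaces, a modality discriminator drives adversarial alignment of the three spaces, and an MLP classifier predicts the emotion. All dimensions, loss weights, and module names are hypothetical placeholders; the graph construction, graph contrastive learning, and information-bottleneck terms of AGF-IB are omitted here and described in the full paper.

```python
# Hedged sketch of the high-level AGF-IB pipeline (assumed components, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """MLP that maps raw utterance features of one modality into a common-size space."""
    def __init__(self, in_dim, hid_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, hid_dim))
    def forward(self, x):
        return self.net(x)

class ModalityDiscriminator(nn.Module):
    """Predicts which modality (text / audio / video) an encoded feature came from."""
    def __init__(self, hid_dim=256, n_modalities=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, n_modalities))
    def forward(self, z):
        return self.net(z)

class EmotionClassifier(nn.Module):
    """MLP head over the concatenated, aligned modality features."""
    def __init__(self, hid_dim=256, n_classes=6):
        super().__init__()
        self.net = nn.Linear(3 * hid_dim, n_classes)
    def forward(self, z_t, z_a, z_v):
        return self.net(torch.cat([z_t, z_a, z_v], dim=-1))

# Hypothetical input sizes for text / audio / video utterance features.
enc_t, enc_a, enc_v = ModalityEncoder(768), ModalityEncoder(128), ModalityEncoder(512)
disc, clf = ModalityDiscriminator(), EmotionClassifier()

def adversarial_alignment_loss(z_t, z_a, z_v):
    """Generator-side objective: push the discriminator toward a uniform modality
    posterior, so the three feature spaces become indistinguishable. The alternating
    discriminator update (with true modality labels) is omitted for brevity."""
    z = torch.cat([z_t, z_a, z_v], dim=0)
    log_probs = F.log_softmax(disc(z), dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(-1))
    return F.kl_div(log_probs, uniform, reduction="batchmean")

# One illustrative training step on a toy batch; the graph contrastive and
# information-bottleneck losses of AGF-IB would be added to `loss` here.
x_t, x_a, x_v = torch.randn(8, 768), torch.randn(8, 128), torch.randn(8, 512)
y = torch.randint(0, 6, (8,))
z_t, z_a, z_v = enc_t(x_t), enc_a(x_a), enc_v(x_v)
loss = F.cross_entropy(clf(z_t, z_a, z_v), y) + 0.1 * adversarial_alignment_loss(z_t, z_a, z_v)
loss.backward()
```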