Abstract

Emotion recognition from multimodal data is widely adopted because combining modalities can substantially improve the quality of human interactions and of many downstream applications. We use the Multimodal EmotionLines Dataset (MELD) and present a novel method for multimodal emotion recognition based on a bi-lateral gradient graph neural network (Bi-LG-GNN) together with dedicated feature extraction and pre-processing. The dataset provides fine-grained emotion labels for the textual, audio, and visual modalities. This work aims to recover the affective states latent in the textual and audio data for emotion recognition and sentiment analysis. Pre-processing improves the quality and consistency of the data, and hence the usefulness of the dataset; it includes noise removal, normalization, and linguistic processing to handle linguistic variance and background noise in the discourse. Kernel Principal Component Analysis (K-PCA) is employed for feature extraction to derive informative attributes from each modality, and the emotion labels are encoded as array values. We propose a Bi-LG-GNN-based architecture explicitly tailored for multimodal emotion recognition that effectively fuses data from the different modalities. The feature-extracted, pre-processed representation of each modality is fed to a generator network, which produces realistic synthetic samples that capture cross-modal relationships; these synthetic samples are then passed to a discriminator network trained to distinguish genuine from synthetic data. With this adversarial scheme, the model learns discriminative features for emotion recognition and makes accurate predictions about subsequent emotional states. Evaluated on MELD, our method achieves 80% accuracy, 81% F1-score, 81% precision, and 81% recall. The pre-processing and feature-extraction steps improve the quality and discriminability of the input representations, and the Bi-LG-GNN-based approach with multimodal data synthesis outperforms contemporary techniques, demonstrating its practical utility.
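As a rough illustration of the pre-processing and feature-extraction stage described above, the following Python sketch (not the authors' implementation) applies Kernel PCA to each modality separately and encodes the emotion labels as integer array values. The feature dimensions, kernel choice, and random placeholder data are assumptions for illustration only; real MELD features would come from text, audio, and visual encoders.

```python
# Hypothetical sketch of per-modality Kernel PCA feature extraction and
# label encoding, loosely following the pipeline sketched in the abstract.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)

# Placeholder modality matrices (rows = utterances). In practice these would
# be text, audio, and visual features extracted from MELD dialogues.
text_feats = rng.normal(size=(100, 300))
audio_feats = rng.normal(size=(100, 128))
visual_feats = rng.normal(size=(100, 64))
labels = rng.choice(["anger", "joy", "neutral", "sadness"], size=100)

def kpca_features(X, n_components=32, kernel="rbf"):
    """Normalize one modality and project it with Kernel PCA."""
    X = StandardScaler().fit_transform(X)
    return KernelPCA(n_components=n_components, kernel=kernel).fit_transform(X)

# Compact per-modality representations, concatenated here as a simple
# stand-in for the multimodal fusion performed by the Bi-LG-GNN.
fused = np.hstack([kpca_features(m) for m in (text_feats, audio_feats, visual_feats)])

# Encode string emotion labels as integer array values for downstream training.
y = LabelEncoder().fit_transform(labels)
print(fused.shape, y[:10])
```

The concatenation step is only a placeholder; in the proposed method the per-modality representations are fused and passed through the generator/discriminator pair rather than simply stacked.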
