Abstract
The extraction of multimodal emotional information enables a more nuanced representation of the emotional subtleties embedded in film and television works. However, conventional approaches that extract features from images and text independently fail to capture the rich semantic interplay between these modalities, impeding the feature learning process within each modality. To address this limitation, this paper presents an innovative model for extracting emotional information from multimodal film and television content. The model utilizes DenseNet for image feature extraction, enhancing network depth via the MSC module. Text feature extraction is achieved through a Transformer encoder, while video feature extraction employs a 3D CNN model. Refinements are made to the number and placement of convolutional layers, the planar convolution size, and the 3D convolution depth. Moreover, a multi-head scaled dot-product attention mechanism is incorporated into the interaction module to compute the similarity between each image block in the visual sequence and every word in the text sequence. Experimental evaluations on the CMU-MOSEI and CMU-MOSI datasets show superior performance compared to the baseline model, achieving accuracy and F1 scores of 0.708 and 0.698, respectively. A noteworthy contribution is the proposed adaptive feature fusion module, which enriches the expressiveness of pivotal emotional features while eliminating redundant information.
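As a rough illustration of the cross-modal interaction described above, the sketch below shows image-block features attending over text-token features with multi-head scaled dot-product attention. It is a minimal PyTorch sketch under assumed settings: the embedding dimension (256), head count (4), residual connection, and the choice of image patches as queries and text tokens as keys/values are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Cross-modal interaction via multi-head scaled dot-product attention.

    Image-block features attend over text-token features; dimensions,
    head count, and the residual/normalization scheme are assumptions
    made for illustration only.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (batch, num_patches, dim) -- sequence of image blocks
        # txt_feats: (batch, num_tokens, dim)  -- sequence of word embeddings
        # Queries come from the image blocks; keys/values from the text tokens,
        # so each block is scored against every word via scaled dot products.
        attended, _ = self.attn(query=img_feats, key=txt_feats, value=txt_feats)
        return self.norm(img_feats + attended)  # residual connection (assumed)


if __name__ == "__main__":
    x_img = torch.randn(2, 49, 256)  # e.g. 7x7 = 49 image blocks per frame
    x_txt = torch.randn(2, 20, 256)  # 20 text tokens
    out = CrossModalAttention()(x_img, x_txt)
    print(out.shape)  # torch.Size([2, 49, 256])
```

In this sketch the attended output keeps the image-block sequence length, so each block's representation is re-weighted by its similarity to every word before fusion; the paper's actual interaction and adaptive fusion modules may differ in structure and hyperparameters.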