Analyzing, manipulating, and comprehending data from multiple sources (e.g., websites, software applications, files, or databases) and of diverse modalities (e.g., video, images, audio, and text) has become increasingly important in many domains. Despite recent advances in multimodal classification (MC), several challenges remain, such as combining modalities of very different natures, finding the optimal feature engineering for each modality, and achieving semantic alignment between text and images. Accordingly, the main motivation of our research lies in devising a neural architecture that effectively processes and combines text, image, video, and audio modalities so that it performs well across different MC tasks. In this regard, the Multimodal Transformer (MulT) is a cutting-edge model often employed in multimodal supervised tasks; although effective, its fixed architecture limits both its task-specific performance and its contextual understanding, so it may struggle to capture fine-grained temporal patterns in audio or to effectively model spatial relationships in images. To address these issues, our research modifies and extends the MulT model in several respects. Firstly, we leverage the Gated Multimodal Unit (GMU) module within the architecture to efficiently and dynamically weigh modalities at the instance level and to visualize how modalities are used. Secondly, to mitigate vanishing and exploding gradients, we strategically place residual connections in the architecture. The proposed architecture is evaluated on two different and complex classification tasks: movie genre categorization (MGC) and multimodal emotion recognition (MER). The results are encouraging, indicating that the proposed architecture is competitive with state-of-the-art (SOTA) models in MGC, outperforming them by up to 2% on the Moviescope dataset and by 1% on the MM-IMDB dataset. Furthermore, for the MER task we used the unaligned version of the datasets, which is considerably more difficult; we improve SOTA accuracy by up to 1% on the IEMOCAP dataset and attain competitive results on the CMU-MOSEI (Dai et al., 2021) collection, outperforming SOTA results for several emotions.
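To illustrate the kind of instance-level gating the GMU module provides, the following is a minimal sketch of a two-modality gated fusion unit in the spirit of the original GMU formulation (Arevalo et al., 2017). It assumes a PyTorch-style implementation with hypothetical feature dimensions and module names; it is not the implementation described in this work.

```python
# Minimal sketch of a two-modality Gated Multimodal Unit (GMU).
# Dimensions and names are illustrative assumptions, not taken from the paper.
import torch
import torch.nn as nn


class GMU(nn.Module):
    def __init__(self, dim_a: int, dim_b: int, dim_out: int):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)        # projects modality A (e.g., text)
        self.proj_b = nn.Linear(dim_b, dim_out)        # projects modality B (e.g., image)
        self.gate = nn.Linear(dim_a + dim_b, dim_out)  # produces the instance-level gate

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        h_a = torch.tanh(self.proj_a(x_a))
        h_b = torch.tanh(self.proj_b(x_b))
        # z in (0, 1) weighs each modality's contribution per instance;
        # inspecting z is what enables visualizing modality usage.
        z = torch.sigmoid(self.gate(torch.cat([x_a, x_b], dim=-1)))
        return z * h_a + (1.0 - z) * h_b, z


# Usage example with hypothetical feature sizes.
gmu = GMU(dim_a=300, dim_b=512, dim_out=256)
fused, gate = gmu(torch.randn(8, 300), torch.randn(8, 512))
print(fused.shape, gate.shape)  # torch.Size([8, 256]) torch.Size([8, 256])
```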