FMFNet: A Faster Multimodal Fusion Network for action recognition via efficient modality compensation


Similar Papers
  • Research Article
  • Citations: 37
  • 10.3390/rs12030464
Multi-Evidence and Multi-Modal Fusion Network for Ground-Based Cloud Recognition
  • Feb 2, 2020
  • Remote Sensing
  • Shuang Liu + 4 more

In recent years, deep neural networks have drawn much attention in ground-based cloud recognition. Yet such approaches focus only on learning global features from visual information, which yields incomplete representations of ground-based clouds. In this paper, we propose a novel method named multi-evidence and multi-modal fusion network (MMFN) for ground-based cloud recognition, which learns extended cloud information by fusing heterogeneous features in a unified framework. Specifically, MMFN exploits multiple pieces of evidence, i.e., global and local visual features, from ground-based cloud images using the main network and the attentive network. In the attentive network, local visual features are extracted from attentive maps, which are obtained by refining salient patterns from convolutional activation maps. Meanwhile, the multi-modal network in MMFN learns multi-modal features for ground-based clouds. To fully fuse the multi-modal and multi-evidence visual features, we design two fusion layers in MMFN that combine the multi-modal features with the global and local visual features, respectively. Furthermore, we release the first multi-modal ground-based cloud dataset, named MGCD, which contains not only the ground-based cloud images but also the multi-modal information corresponding to each cloud image. MMFN is evaluated on MGCD and achieves a classification accuracy of 88.63% in comparison with state-of-the-art methods, which validates its effectiveness for ground-based cloud recognition.

  • Research Article
  • Citations: 28
  • 10.1016/j.eswa.2023.122314
Human-centric multimodal fusion network for robust action recognition
  • Oct 31, 2023
  • Expert Systems with Applications
  • Zesheng Hu + 4 more

  • Research Article
  • Citations: 35
  • 10.3389/fpls.2021.809506
Citrus Huanglongbing Detection Based on Multi-Modal Feature Fusion Learning.
  • Dec 23, 2021
  • Frontiers in Plant Science
  • Dongzi Yang + 4 more

Citrus Huanglongbing (HLB), also named citrus greening disease, occurs worldwide and is known as a citrus cancer without an effective treatment. The symptoms of HLB are similar to those of nutritional deficiency or other diseases. Methods based on single-source information, such as RGB images or hyperspectral data, do not achieve strong detection performance. In this study, a multi-modal feature fusion network, combining an RGB image network and a hyperspectral band extraction network, was proposed to recognize HLB among four categories (HLB, suspected HLB, Zn-deficient, and healthy). Three contributions are introduced in this manuscript: a dimension-reduction scheme for hyperspectral data based on a soft attention mechanism, a feature fusion proposal based on a bilinear fusion method, and auxiliary classifiers to extract more useful information. The multi-modal feature fusion network can effectively classify the above four types of citrus leaves and is better than single-modal classifiers. In experiments with a limited amount of data (1,325 images of the four aforementioned types and 1,325 pieces of hyperspectral data), the highest accuracy of the multi-modal network was 97.89%, while the single-modal network with RGB images achieved only 87.98% and the single-modal network using hyperspectral information only 89%. Results show that the proposed multi-modal network, which implements the concept of multi-source information fusion, provides a better way to detect citrus HLB and citrus deficiency.
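
The abstract above names a bilinear fusion method without giving details. As a rough illustration only, one common form of bilinear fusion takes the outer product of two feature vectors and flattens it. The function name `bilinear_fusion` and the toy feature values are assumptions made for this sketch and are not taken from the paper.

```python
def bilinear_fusion(rgb_feat, hs_feat):
    """Flattened outer product of two feature vectors (row-major order).

    Hypothetical helper: a generic bilinear fusion, not the paper's exact layer.
    """
    return [a * b for a in rgb_feat for b in hs_feat]


# Toy features standing in for RGB-branch and hyperspectral-branch outputs.
rgb_feat = [0.5, 1.0]
hs_feat = [2.0, 0.0, 1.0]

fused = bilinear_fusion(rgb_feat, hs_feat)
print(len(fused))  # 6 values: every pairwise product of the two inputs
print(fused)       # [1.0, 0.0, 0.5, 2.0, 0.0, 1.0]
```

The fused vector grows as the product of the two input sizes, which is one reason a dimension-reduction step on the hyperspectral side can matter before fusion.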

  • Research Article
  • 10.52783/jes.3055
Application of Multimodal Data Fusion Attentive Dual Residual Generative Adversarial Network in Sentiment Recognition and Sentiment Analysis
  • Apr 4, 2024
  • Journal of Electrical Systems
  • Yongfang Zhang

Recent advancements in Internet technology have led to increased multi-modal data posting on social media, online shopping portals, and video repositories, which makes it important to recognize the significance of inter-modal utterances before combining multiple modes. In this manuscript, the Application of Multimodal Data Fusion Attentive Dual Residual Generative Adversarial Network in Sentiment Recognition and Sentiment Analysis (MDF-DRGAN-SR-SA) is proposed. The input data are collected from the CMU-MOSI dataset. First, the input data are preprocessed using Subaperture Keystone Transform Matched Filtering (SAKTMF) to remove unwanted data. Then, the Two-Sided Offset Quaternion Linear Canonical Transform (TSOQLCT) is used to extract unimodal features such as acoustic, textual, and visual features. The selected features are then given to the ADRGAN, which classifies sentiment as positive, negative, or neutral. In general, ADRGAN does not include an optimization strategy for determining the optimal parameters needed for accurate classification. Hence, the Northern Goshawk Optimization Algorithm (GOA) is proposed to optimize the weight parameters of ADRGAN so that it classifies sentiment as positive, negative, or neutral precisely. The proposed model is implemented, and its efficiency is evaluated using performance metrics such as accuracy, precision, specificity, sensitivity, and F1-score. The MDF-DRGAN-SR-SA method provides 25.85%, 26.79%, and 27.63% higher accuracy; 35.66%, 34.97%, and 26.57% higher precision; and 28.18%, 29.52%, and 25.68% higher specificity compared with the existing methods Two-Level Multimodal Fusion for SA in Public Security (TMDF-SA-PS), Multimodal SA Depend on Adaptive Modality-Specific Weight Fusion Network (MFN-SA-AMW), and Multimodal SA Utilizing Multi-tensor Fusion Network and Cross-modal Modeling (MTFN-SA), respectively.

  • Research Article
  • Citations: 5
  • 10.1016/j.jvcir.2025.104459
Attention mechanism based multimodal feature fusion network for human action recognition
  • Jul 1, 2025
  • Journal of Visual Communication and Image Representation
  • Xu Zhao + 5 more

  • Research Article
  • Citations: 4
  • 10.3390/s25206278
A Transformer-Based Multimodal Fusion Network for Emotion Recognition Using EEG and Facial Expressions in Hearing-Impaired Subjects
  • Oct 10, 2025
  • Sensors (Basel, Switzerland)
  • Shuni Feng + 3 more

Hearing-impaired people face challenges in expressing and perceiving emotions, and traditional single-modal emotion recognition methods demonstrate limited effectiveness in complex environments. To enhance recognition performance, this paper proposes a multimodal multi-head attention fusion neural network (MMHA-FNN). This method uses differential entropy (DE) and bilinear interpolation features as inputs, learning the spatial–temporal characteristics of brain regions through an MBConv-based module. By incorporating the Transformer-based multi-head self-attention mechanism, we dynamically model the dependencies between EEG and facial expression features, enabling adaptive weighting and deep interaction of cross-modal characteristics. The experiments addressed a four-class classification task on the MED-HI dataset (15 subjects, 300 trials). The taxonomy included happy, sad, fear, and calmness, where ‘calmness’ corresponds to a low-arousal neutral state as defined in the MED-HI protocol. Results indicate that the proposed method achieved an average accuracy of 81.14%, significantly outperforming feature concatenation (71.02%) and decision layer fusion (69.45%). This study demonstrates the complementary nature of EEG and facial expressions in emotion recognition among hearing-impaired individuals and validates the effectiveness of attention-based feature layer interaction fusion in enhancing emotion recognition performance.
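
The two baselines this abstract compares against, feature concatenation and decision layer fusion, can be sketched in a few lines. This is a minimal illustration under assumptions: the function names, the equal weighting `w_eeg=0.5`, and the toy probability values are hypothetical and do not come from the paper.

```python
LABELS = ["happy", "sad", "fear", "calmness"]  # class names listed in the abstract


def feature_concat(eeg_feat, face_feat):
    """Feature-level baseline: join the two feature vectors end to end."""
    return list(eeg_feat) + list(face_feat)


def decision_fusion(eeg_probs, face_probs, w_eeg=0.5):
    """Decision-level baseline: weighted average of per-class probabilities.

    w_eeg is an assumed mixing weight; returns (predicted index, fused scores).
    """
    fused = [w_eeg * p + (1.0 - w_eeg) * q for p, q in zip(eeg_probs, face_probs)]
    return fused.index(max(fused)), fused


# Toy classifier outputs, one probability per label.
eeg_probs = [0.1, 0.6, 0.2, 0.1]
face_probs = [0.5, 0.2, 0.2, 0.1]

pred, fused_probs = decision_fusion(eeg_probs, face_probs)
print(LABELS[pred])  # sad
```

Neither baseline lets one modality's features reweight the other's, which is the gap an attention-based interaction layer is meant to fill.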

  • Research Article
  • Citations: 1
  • 10.1007/s12539-025-00783-7
HPCSMN: A Classification Method of Chemotherapy Sensitivity of Hypopharyngeal Cancer Based on Multimodal Network.
  • Nov 18, 2025
  • Interdisciplinary sciences, computational life sciences
  • Weiqi Fu + 5 more

The treatment of hypopharyngeal cancer faces complex challenges, and accurate prediction of chemotherapy sensitivity is crucial for personalized treatment. In this study, a multimodal fusion network based on deep learning was used to classify the chemotherapy sensitivity of hypopharyngeal cancer, and the prediction accuracy was improved by integrating 3D CT images and radiomic features. The preprocessed and enhanced 3D CT images were analyzed by 3D ResNet branches to extract spatial features; the radiomic features screened by LASSO regression were processed by three layers of fully connected branches to analyze the tabular data. The extracted vectors were fused by fully connected layers, using complementary advantages to capture complex spatial dependencies and detailed radiomic features. Experiments on the manually segmented NKU-TMU-hphc dataset (containing 102 hypopharyngeal cancer CT images) showed that the multimodal fusion network had high accuracy and outperformed single-modality methods and other models in multiple evaluation indicators. Statistical analysis was performed on the extracted features and clinical characteristics. The model effectively integrates image and clinical data, provides a new method for chemotherapy sensitivity classification, and is expected to improve personalized medicine.

  • Research Article
  • 10.3389/frai.2025.1663292
Multi-modal texture fusion network for detecting AI-generated images
  • Oct 22, 2025
  • Frontiers in Artificial Intelligence
  • Haozheng Yu + 1 more

With the rapid advancement of AI-generated content, detecting synthetic images has become a critical task in digital forensics and media integrity. In this paper, we propose a novel multi-modal fusion network that leverages complementary texture and content information to improve the detection of AI-generated images. Our approach integrates three input branches: the original RGB image, a local binary pattern (LBP) map to capture micro-texture irregularities, and a gray-level co-occurrence matrix (GLCM) representation to encode statistical texture dependencies. These three streams are processed in parallel through a shared-weight convolutional backbone and subsequently fused at the feature level to enhance discrimination capability. Extensive experiments conducted on benchmark datasets demonstrate that our method outperforms existing single-modality baselines and achieves strong generalization across multiple types of generative models. The proposed fusion framework offers an interpretable and efficient solution for robust and reliable detection of AI-synthesized imagery.
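
For readers unfamiliar with the local binary pattern (LBP) map mentioned above, the following is a minimal sketch of the basic 3×3 LBP operator on a grayscale grid. The clockwise bit ordering, the `>=` comparison, and the choice to skip border pixels are conventions assumed for this example; the paper's exact variant is not described in the abstract.

```python
def lbp_map(gray):
    """Basic 3x3 LBP code for each interior pixel of a 2D list of intensities."""
    h, w = len(gray), len(gray[0])
    # Neighbor offsets (dy, dx), clockwise from the top-left neighbor.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = gray[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                # Set the bit when the neighbor is at least as bright as the center.
                if gray[y + dy][x + dx] >= center:
                    code |= 1 << bit
            out[y - 1][x - 1] = code
    return out


# Bright top row only: bits 0, 1, and 2 are set, so the code is 7.
patch = [[9, 9, 9],
         [1, 5, 1],
         [1, 1, 1]]
print(lbp_map(patch))  # [[7]]
```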

  • Research Article
  • Citations: 3
  • 10.1109/access.2020.3014691
A Multimodal Pairwise Discrimination Network for Cross-Domain Action Recognition
  • Jan 1, 2020
  • IEEE Access
  • Fuhua Shang + 4 more

In recent years, action recognition has become a hot research topic in the computer vision and machine learning domain. Although many well-designed action recognition approaches have been proposed, we point out that some limitations still exist, including the separate handling of the fusion of different spatio-temporal features and the construction of the classification model, and the requirement of similar environmental conditions when capturing the training and testing data. Thus, research interest has shifted from traditional action recognition towards cross-domain action recognition. To address these limitations, we propose a novel multimodal pairwise discrimination network (MPD for short) for cross-domain action recognition, which is an end-to-end network architecture. MPD can jointly fuse different spatio-temporal features from the video, learn domain-invariant features for different action domains (source and target domains), and build the classification model. To characterize the shift between these domains, subnetwork parameters in corresponding layers of MPD are required to be relevant, but not identical. Besides, the domain-invariant feature discrimination needs to be improved. Extensive experimental results on two different public benchmarks, covering indoor and outdoor environments, demonstrate that our MPD solution can significantly outperform state-of-the-art methods with a 4% to 20% improvement in average accuracy.

  • Research Article
  • Citations: 4
  • 10.32604/cmc.2023.037794
MFF-Net: Multimodal Feature Fusion Network for 3D Object Detection
  • Jan 1, 2023
  • Computers, Materials & Continua
  • Peicheng Shi + 3 more

In complex traffic scenarios, it is very important for autonomous vehicles to accurately perceive the dynamic information of surrounding vehicles in advance. The accuracy of 3D object detection is affected by problems such as illumination changes, object occlusion, and object detection distance. To address these challenges, we propose a multimodal feature fusion network for 3D object detection (MFF-Net). We first use the spatial transformation projection algorithm to map the image features into the feature space, so that the image features are in the same spatial dimension when fused with the point cloud features. Then, feature channel weighting is performed using an adaptive expression augmentation fusion network to enhance important network features, suppress useless features, and increase the directionality of the network to features. Finally, this paper increases the probability of false detection and missed detection in the non-maximum suppression algorithm by increasing the one-dimensional threshold. Together, these steps form a complete 3D target detection network based on multimodal feature fusion. The experimental results show that the proposed method achieves an average accuracy of 82.60% on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, outperforming previous state-of-the-art multimodal fusion networks. On the Easy, Moderate, and Hard evaluation levels, the accuracy reaches 90.96%, 81.46%, and 75.39%, respectively. This shows that MFF-Net has good performance in 3D object detection.
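
The abstract refers to a threshold inside the non-maximum suppression (NMS) step. As background, here is a minimal sketch of standard greedy NMS on 2D axis-aligned boxes. It is not the paper's modified algorithm, and the box format `(x1, y1, x2, y2)`, the function names, and the default threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep boxes in score order, dropping heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))                     # [0, 2]
print(nms(boxes, scores, iou_threshold=0.7))  # [0, 1, 2]
```

The two calls show how the threshold trades off duplicates against missed objects: the first two boxes overlap with an IoU of about 0.68, so the lower threshold suppresses one of them and the higher threshold keeps both.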

  • Research Article
  • Citations: 22
  • 10.1002/int.23084
Semantic‐enhanced multimodal fusion network for fake news detection
  • Sep 22, 2022
  • International Journal of Intelligent Systems
  • Shuo Li + 3 more

The increasing popularity of social media facilitates the propagation of fake news, posing a major threat to the government and journalism, and thereby making the detection of fake news on social media an urgent requirement. In general, multimodal methods can achieve better performance because different modalities complement each other. However, the majority of them simply concatenate features from different modalities, failing to adequately preserve the mutual information in common features. To address this issue, a novel framework named semantic-enhanced multimodal fusion network is proposed for fake news detection, which can better capture mutual features among events and thus benefit the detection of fake news. This model consists of three subnetworks: the multimodal fusion network, the event domain adaptation network, and the fake news detector. Specifically, the multimodal fusion network aims to extract deep features from texts and images and fuse them into a common semantic feature known as a snapshot. Then, the fake news detector can learn the representation of posts. Finally, the event domain adaptation network can single out and remove the peculiar features of each event and keep the features shared among events. The experimental results show that the proposed model outperforms some state-of-the-art approaches on two real-world multimedia data sets.

  • Research Article
  • Citations: 19
  • 10.1016/j.jvcir.2023.104019
Multimodal fusion hierarchical self-attention network for dynamic hand gesture recognition
  • Dec 12, 2023
  • Journal of Visual Communication and Image Representation
  • Pranav Balaji + 1 more

  • Research Article
  • Citations: 35
  • 10.1016/j.knosys.2020.106639
Multimodal deep fusion for image question answering
  • Nov 28, 2020
  • Knowledge-Based Systems
  • Weifeng Zhang + 3 more

  • Research Article
  • Citations: 5
  • 10.1007/s10278-023-00810-3
RTFusion: A Multimodal Fusion Network with Significant Information Enhancement.
  • Apr 10, 2023
  • Journal of digital imaging
  • Chao Fan + 4 more

Multimodal medical fusion images are important for clinical diagnosis because they can better reflect the location of disease and provide anatomically detailed information. Existing medical image fusion methods can cause significant information loss in fusion images to varying degrees. Therefore, we designed a residual transformer fusion network (RTFusion): a multimodal fusion network with significant information enhancement. We use the residual transformer to make the image information interact remotely to ensure the global information of the image and use the residual structure to enhance the feature information to prevent information loss. Then the channel attention and spatial attention module (CASAM) is added to the fusion process to enhance the significant information of the fusion image, and the feature interaction module is used to promote the interaction of specific information of the source image. Finally, the loss function of the block calculation is designed to drive the fusion network to retain rich texture details, structural information, and color information, to optimize the subjective visual effect of the image. Extensive experiments show that our method can better recover the significant information of the source image and outperform other advanced methods in subjective visual description and objective metric evaluation. In particular, the color information and texture information are balanced to enhance the visual effect of the fused image.

  • Conference Article
  • Citations: 154
  • 10.1109/cvprw56347.2022.00511
M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation
  • Jun 1, 2022
  • Vishal Chudasama + 5 more

Emotion Recognition in Conversations (ERC) is crucial in developing sympathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. However, due to the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging undertaking. Existing ERC research focuses mainly on using text information in a discussion, ignoring the other two modalities. We anticipate that emotion recognition accuracy can be improved by employing a multi-modal approach. Thus, in this study, we propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from visual, audio, and text modality. It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data. We introduce a new feature extractor to extract latent features from the audio and visual modality. The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data. In the domain of ERC, the existing methods perform well on one benchmark dataset but not on others. Our results show that the proposed M2FNet architecture outperforms all other methods in terms of weighted average F1 score on well-known MELD and IEMOCAP datasets and sets a new state-of-the-art performance in ERC.
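
The abstract mentions an adaptive margin-based triplet loss but does not define how the margin adapts. As background only, the sketch below shows the standard triplet loss with a fixed margin on Euclidean distances. The function names, the fixed `margin=1.0`, and the toy embeddings are assumptions for illustration and do not reproduce the paper's loss.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard fixed-margin triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [3.0, 4.0]        # distance 5 from the anchor
far_negative = [6.0, 8.0]    # distance 10: already separated by more than the margin
near_negative = [0.0, 1.0]   # distance 1: closer than the positive, so it is penalized

print(triplet_loss(anchor, positive, far_negative))   # 0.0
print(triplet_loss(anchor, positive, near_negative))  # 5.0
```

An adaptive variant would replace the constant `margin` with a value computed per triplet; how M2FNet does this is not stated in the abstract.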
