Immersive Visual Identity Authentication and Deepfake Detection Using Multimodal Feature Fusion for Secure Extended Reality Internet Service Environments

TL;DR

This study introduces a multimodal-fusion system for immersive visual identity authentication and deepfake detection in XR environments. It achieves 98.7% authentication accuracy, well above unimodal models, reduces spoofing vulnerability by 43%, and runs in real time with an average latency of 34 ms per authentication cycle.

Abstract

The rapid growth of Extended Reality (XR) Internet services has raised significant security concerns, especially for immersive visual identity authentication. Deepfake-based impersonation attacks can undermine user trust and data confidentiality. Traditional biometric systems, which rely mainly on unimodal facial recognition, are susceptible to synthetic media manipulation and adversarial spoofing. This paper proposes an immersive visual identity authentication system coupled with a deepfake detection module that leverages multimodal feature fusion to enhance security in XR settings. The proposed model integrates spatial-temporal facial representations, periocular texture representations, voice spectral representations, and behavioural motion patterns via a hybrid attention-based fusion network. The evaluation was conducted on a dataset of 18,500 authentic and 17,300 deepfake XR interaction samples. Experiments show that the multimodal fusion model achieves an authentication accuracy of 98.7%, well above that of unimodal models (face-only: 92.4%; voice-only: 89.1%). The proposed deepfake detection module achieves a precision of 97.9%, a recall of 98.3%, an F1-score of 98.1%, and a false acceptance rate (FAR) of 1.2%, representing a 43% decrease in spoofing vulnerability compared to traditional CNN-based detectors. Additionally, latency analysis confirms real-time viability: the average processing delay per authentication cycle is 34 ms, which is within the constraints of immersive XR services. The results suggest that multimodal feature fusion makes identity verification in immersive Internet ecosystems highly resistant to synthetic identity manipulation. The proposed framework contributes to a secure, scalable, and reliable authentication infrastructure for next-generation XR-enabled digital services.
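As a quick consistency check on the reported detection metrics, the F1-score can be recomputed from the stated precision and recall. The sketch below is illustrative only: the function names are hypothetical, and the counts passed to the FAR helper are made-up numbers chosen to reproduce a 1.2% rate, since the abstract does not give the underlying confusion counts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def false_acceptance_rate(false_accepts: int, impostor_attempts: int) -> float:
    """Fraction of impostor (e.g. deepfake) attempts that were wrongly accepted."""
    return false_accepts / impostor_attempts


# Precision and recall as reported in the abstract.
reported_f1 = f1_score(0.979, 0.983)

# Hypothetical counts (not from the paper) that would yield a 1.2% FAR.
example_far = false_acceptance_rate(12, 1000)

print(f"F1 = {reported_f1:.3f}, FAR = {example_far:.3f}")
```

The recomputed F1 rounds to 0.981, which agrees with the 98.1% stated in the abstract.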

Similar Papers
  • Research Article
  • Cite Count: 22
  • 10.1016/j.knosys.2024.112022
MFUR-Net: Multimodal feature fusion and unimodal feature refinement for RGB-D salient object detection
  • May 31, 2024
  • Knowledge-Based Systems
  • Zhengqian Feng + 5 more

  • Research Article
  • Cite Count: 2
  • 10.1109/tim.2025.3555712
GLFNet: An RGB-T Crowd Counting Network Based on Global–Local Multimodal Feature Fusion
  • Jan 1, 2025
  • IEEE Transactions on Instrumentation and Measurement
  • Yingxiang Hu + 3 more

RGB-T crowd counting methods aim to enhance the counting accuracy of network models under conditions of uneven lighting and low visibility by fusing features from the RGB and thermal modalities. Previous approaches primarily utilized attention mechanisms to extract and fuse complementary RGB and thermal features. However, these methods lack guidance and constraints during the extraction and fusion of multi-modal features and do not fully leverage the complementary advantages between global and local features, leading to suboptimal performance. This paper argues that, by transitioning from global attention to local attention, extracting and fusing the complementary information between global and local multi-modal features can significantly improve the model’s counting performance. To achieve this, we propose an RGB-T crowd counting network based on global-local multimodal feature fusion (GLFNet). Specifically, we first use a multi-head attention mechanism to fuse global multi-modal features and guide the global multi-modal fusion using learnable block-counting guided tokens (BCT). Next, we employ composite spatial attention mechanisms (CSAM) to focus on the local detail information of multi-modal crowd features and facilitate the fusion of local multimodal features. Finally, we utilize a detail contrast loss function ($L_d$) to capture the complementary advantages between global and local multi-modal features and to guide and constrain the fusion process of multi-modal features. Experimental results on the RGBT-CC and DroneRGBT datasets demonstrate the superior performance of our method.

  • Research Article
  • Cite Count: 28
  • 10.1109/jbhi.2022.3161466
Adaptive Multimodal Fusion With Attention Guided Deep Supervision Net for Grading Hepatocellular Carcinoma.
  • Aug 1, 2022
  • IEEE Journal of Biomedical and Health Informatics
  • Shangxuan Li + 4 more

Multimodal medical imaging plays a crucial role in the diagnosis and characterization of lesions. However, challenges remain in lesion characterization based on multimodal feature fusion. First, current fusion methods have not thoroughly studied the relative importance of characterization modals. In addition, multimodal feature fusion cannot provide the contribution of different modal information to inform critical decision-making. In this study, we propose an adaptive multimodal fusion method with an attention-guided deep supervision net for grading hepatocellular carcinoma (HCC). Specifically, our proposed framework comprises two modules: attention-based adaptive feature fusion and attention-guided deep supervision net. The former uses the attention mechanism at the feature fusion level to generate weights for adaptive feature concatenation and balances the importance of features among various modals. The latter uses the weight generated by the attention mechanism as the weight coefficient of each loss to balance the contribution of the corresponding modal to the total loss function. The experimental results of grading clinical HCC with contrast-enhanced MR demonstrated the effectiveness of the proposed method. A significant performance improvement was achieved compared with existing fusion methods. In addition, the weight coefficient of attention in multimodal fusion has demonstrated great significance in clinical interpretation.
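The two ideas described in this abstract, attention weights that scale each modality before concatenation and the same weights reused as per-modality loss coefficients, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' network: the function names are hypothetical, and the attention scores are fixed numbers here, whereas in a real model they would be produced by a learned attention layer.

```python
import math


def softmax(scores):
    """Normalize raw attention scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, scores):
    """Scale each modality's feature vector by its weight, then concatenate."""
    weights = softmax(scores)
    fused = [w * x for w, vec in zip(weights, features) for x in vec]
    return fused, weights


def weighted_loss(losses, weights):
    """Combine per-modality losses using the attention weights as coefficients."""
    return sum(w * l for w, l in zip(weights, losses))


# Two hypothetical modalities with equal attention scores.
fused, weights = fuse([[1.0, 2.0], [3.0, 4.0]], scores=[0.0, 0.0])
total_loss = weighted_loss([2.0, 4.0], weights)
```

Because the weights are explicit numbers per modality, they can be inspected directly, which is the property the abstract points to when it mentions clinical interpretation.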

  • Research Article
  • Cite Count: 13
  • 10.1016/j.bspc.2023.105756
Advancing classroom fatigue recognition: A multimodal fusion approach using self-attention mechanism
  • Nov 17, 2023
  • Biomedical Signal Processing and Control
  • Lei Cao + 3 more

  • Conference Article
  • 10.1109/cei66465.2025.11398614
Real-Time Optimization of Multimodal Visual Feature Fusion for Environmental Perception in Autonomous Driving
  • Nov 21, 2025
  • Panbo Li

Progress toward autonomous driving depends directly on the robustness and reliability of the system's environmental perception. Multimodal feature fusion, particularly of data from high-resolution cameras and high-precision LiDAR, has emerged as a way to gather the information needed to drive safely in complex urban scenarios. While this type of fusion greatly improves perception precision and copes better with single-sensor failure or poor weather, it imposes a heavy computational load and consumes excessive memory. The deep neural networks needed for feature-level fusion are often highly complex, sometimes using transformer-based attention mechanisms, and therefore require substantial resources. This makes it difficult to reach the processing rates (more than 10-20 frames per second) needed in autonomous vehicles, where rapid perception and response are critical. This gap between accuracy and computational feasibility is currently a major roadblock to real-world deployment. To address this issue, this paper proposes a comprehensive optimization scheme for speeding up multimodal fusion networks without substantially degrading recognition accuracy. We integrate three key strategies: (1) a lightweight and efficient fusion architecture, Fast-Fuse Net, which removes the heavy backbone and employs sparse-attention-based modal interaction; (2) structured pruning, which removes redundant network parameters while maintaining hardware compatibility; and (3) quantization-aware training (QAT) with low-precision (INT8) inference. We validate our method on the large-scale nuScenes dataset and demonstrate that our fully optimized model achieves $2.8\times$ faster inference and a 65% reduction in model size while retaining 98.5% of the baseline model's mAP for 3D object detection. This research provides a practical solution for deploying high-performance multimodal perception systems on the on-board computing devices of autonomous vehicles.
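To make the INT8 step concrete, the sketch below shows symmetric per-tensor quantization and dequantization of a small list of weights. It is an assumption-laden illustration rather than the paper's pipeline: the abstract does not say which quantization scheme is used, the function names are hypothetical, and quantization-aware training would additionally simulate this rounding during training, which is not shown here.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [x * scale for x in q]


# Hypothetical weights, for illustration only.
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The round-trip error per value is bounded by half the scale, which is why accuracy can often be largely retained while storing each weight in a single byte.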

  • Research Article
  • Cite Count: 2
  • 10.1016/j.entcom.2024.100709
E-Learning system application in art entrepreneurship teaching based on multimodal feature fusion and neural network
  • Jun 22, 2024
  • Entertainment Computing
  • Xinyi Wang

  • Research Article
  • 10.37349/edht.2025.101175
Multimodal feature extraction and fusion for determining RGP lens specification base-curve through Pentacam images
  • Dec 8, 2025
  • Exploration of Digital Health Technologies
  • Leyla Ebrahimi + 3 more

Aim: Patients diagnosed with irregular astigmatism often require specific methods of vision correction. Among these, the use of a rigid gas permeable (RGP) lens is considered one of the most effective treatment approaches. This study aims to propose a new automated method for accurate RGP lens base-curve detection. Methods: A multi-modal feature fusion approach was developed based on Pentacam images, incorporating image processing and machine learning techniques. Four types of features were extracted from the images and integrated through a serial feature fusion mechanism. The fused features were then evaluated using a multi-layered perceptron (MLP) network. Specifically, the features included: (1) middle-layer outputs of a convolutional autoencoder (CAE) applied to RGB map combinations; (2) ratios of colored areas in the front cornea map; (3) a feature vector from cornea front parameters; and (4) the radius of the reference sphere/ellipse in the front elevation map. Results: Evaluations were performed on a manually labeled dataset. The proposed method achieved a mean squared error (MSE) of 0.005 and a coefficient of determination of 0.79, demonstrating improved accuracy compared to existing techniques. Conclusions: The proposed multi-modal feature fusion technique provides a reliable and accurate solution for RGP lens base-curve detection. This approach reduces manual intervention in lens fitting and represents a significant step toward automated base-curve determination.

  • Research Article
  • 10.5194/isprs-archives-xlviii-g-2025-1785-2025
Remote sensing semantic segmentation based on multimodal feature alignment and fusion
  • Aug 2, 2025
  • The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
  • Boshen Chang + 1 more

Abstract. The accurate semantic segmentation of remote sensing data is of paramount importance to the success of geoscience research and applications. In comparison to traditional single-modal segmentation techniques, models based on multi-modal fusion have demonstrated superior performance and have been the subject of considerable attention in recent years. However, the majority of these models employ convolutional neural networks (CNNs) or visual transformers (ViTs) for fusion operations, which results in inadequate modelling and representation of local-global context. In this study, we propose a multi-layer multi-modal feature alignment and fusion scheme, designated as MFAFUNet, with the objective of providing a robust and effective multi-modal fusion backbone for semantic segmentation. The overarching algorithmic framework is analogous to that of the Unet model. First, the data in different modalities is aggregated and the image size is reduced through the use of multi-level downsampling modules based on the Haar wavelet transform. The high-frequency and low-frequency information of the features is extracted through a feature extraction module composed of a convolutional neural network (CNN) and a visual transformer (ViT). Second, through the semantic distribution alignment loss, the high-level features of different modal information are transformed into a common latent space, and their distributions are aligned to associate the complementary clues hidden in each modality. The effectiveness of the proposed method is demonstrated through experiments.
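The Haar-wavelet downsampling mentioned in this abstract can be illustrated in one dimension: each pair of neighbouring samples is replaced by a scaled sum (low-frequency part) and a scaled difference (high-frequency part), halving the length. This is a simplified sketch with a hypothetical function name; the paper operates on 2D images and multiple levels, and its exact module design is not given in the abstract.

```python
import math


def haar_step(signal):
    """One level of the 1D Haar transform on an even-length sequence."""
    s = 1 / math.sqrt(2)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) * s for a, b in pairs]  # low-frequency content
    detail = [(a - b) * s for a, b in pairs]  # high-frequency content
    return approx, detail


# Hypothetical input, for illustration only.
signal = [4.0, 4.0, 2.0, 6.0]
approx, detail = haar_step(signal)

# The orthonormal scaling preserves total energy (sum of squares).
energy_in = sum(x * x for x in signal)
energy_out = sum(x * x for x in approx) + sum(x * x for x in detail)
```

Applying the same step along rows and then columns gives the usual 2D variant, in which the approximation band is the half-resolution image.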

  • Preprint Article
  • 10.32920/22734290
ECG Heartbeat Classification Using Multimodal Fusion
  • May 3, 2023
  • Zeeshan Ahmad + 3 more

Electrocardiogram (ECG) is an authoritative source to diagnose and counter critical cardiovascular syndromes such as arrhythmia and myocardial infarction (MI). Current machine learning techniques either depend on manually extracted features or large and complex deep learning networks which merely utilize the 1D ECG signal directly. Since intelligent multimodal fusion can perform at the state-of-the-art level with an efficient deep network, therefore, in this paper, we propose two computationally efficient multimodal fusion frameworks for ECG heart beat classification called Multimodal Image Fusion (MIF) and Multimodal Feature Fusion (MFF). At the input of these frameworks, we convert the raw ECG data into three different images using Gramian Angular Field (GAF), Recurrence Plot (RP) and Markov Transition Field (MTF). In MIF, we first perform image fusion by combining three imaging modalities to create a single image modality which serves as input to the Convolutional Neural Network (CNN). In MFF, we extracted features from penultimate layer of CNNs and fused them to get unique and interdependent information necessary for better performance of classifier. These informational features are finally used to train a Support Vector Machine (SVM) classifier for ECG heart-beat classification. We demonstrate the superiority of the proposed fusion models by performing experiments on PhysioNet's MIT-BIH dataset for five distinct conditions of arrhythmias which are consistent with the AAMI EC57 protocols and on PTB diagnostics dataset for Myocardial Infarction (MI) classification. We achieved classification accuracy of 99.7% and 99.2% on arrhythmia and MI classification, respectively.

  • Preprint Article
  • Cite Count: 1
  • 10.32920/22734290.v1
ECG Heartbeat Classification Using Multimodal Fusion
  • May 3, 2023
  • Zeeshan Ahmad + 3 more

Electrocardiogram (ECG) is an authoritative source to diagnose and counter critical cardiovascular syndromes such as arrhythmia and myocardial infarction (MI). Current machine learning techniques either depend on manually extracted features or large and complex deep learning networks which merely utilize the 1D ECG signal directly. Since intelligent multimodal fusion can perform at the state-of-the-art level with an efficient deep network, therefore, in this paper, we propose two computationally efficient multimodal fusion frameworks for ECG heart beat classification called Multimodal Image Fusion (MIF) and Multimodal Feature Fusion (MFF). At the input of these frameworks, we convert the raw ECG data into three different images using Gramian Angular Field (GAF), Recurrence Plot (RP) and Markov Transition Field (MTF). In MIF, we first perform image fusion by combining three imaging modalities to create a single image modality which serves as input to the Convolutional Neural Network (CNN). In MFF, we extracted features from penultimate layer of CNNs and fused them to get unique and interdependent information necessary for better performance of classifier. These informational features are finally used to train a Support Vector Machine (SVM) classifier for ECG heart-beat classification. We demonstrate the superiority of the proposed fusion models by performing experiments on PhysioNet's MIT-BIH dataset for five distinct conditions of arrhythmias which are consistent with the AAMI EC57 protocols and on PTB diagnostics dataset for Myocardial Infarction (MI) classification. We achieved classification accuracy of 99.7% and 99.2% on arrhythmia and MI classification, respectively.

  • Research Article
  • Cite Count: 141
  • 10.1109/access.2021.3097614
ECG Heartbeat Classification Using Multimodal Fusion
  • Jan 1, 2021
  • IEEE Access
  • Zeeshan Ahmad + 3 more

Electrocardiogram (ECG) is an authoritative source to diagnose and counter critical cardiovascular syndromes such as arrhythmia and myocardial infarction (MI). Current machine learning techniques either depend on manually extracted features or large and complex deep learning networks which merely utilize the 1D ECG signal directly. Since intelligent multimodal fusion can perform at the state-of-the-art level with an efficient deep network, therefore, in this paper, we propose two computationally efficient multimodal fusion frameworks for ECG heart beat classification called Multimodal Image Fusion (MIF) and Multimodal Feature Fusion (MFF). At the input of these frameworks, we convert the raw ECG data into three different images using Gramian Angular Field (GAF), Recurrence Plot (RP) and Markov Transition Field (MTF). In MIF, we first perform image fusion by combining three imaging modalities to create a single image modality which serves as input to the Convolutional Neural Network (CNN). In MFF, we extracted features from penultimate layer of CNNs and fused them to get unique and interdependent information necessary for better performance of classifier. These informational features are finally used to train a Support Vector Machine (SVM) classifier for ECG heart-beat classification. We demonstrate the superiority of the proposed fusion models by performing experiments on PhysioNet's MIT-BIH dataset for five distinct conditions of arrhythmias which are consistent with the AAMI EC57 protocols and on PTB diagnostics dataset for Myocardial Infarction (MI) classification. We achieved classification accuracy of 99.7% and 99.2% on arrhythmia and MI classification, respectively. Source code at https://github.com/zaamad/ECG-Heartbeat-Classification-Using-Multimodal-Fusion.
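Of the three signal-to-image conversions named above, the Gramian Angular Field is the simplest to sketch: the series is rescaled to [-1, 1], each value is mapped to an angle with arccos, and the image entry at (i, j) is the cosine of the summed angles. The code below is a minimal pure-Python illustration of the summation form, assuming a non-constant input series; the function name is hypothetical, and the abstract does not state whether the authors use the summation or difference variant.

```python
import math


def gramian_angular_field(series):
    """Summation-form GAF: entry (i, j) is cos(phi_i + phi_j)."""
    lo, hi = min(series), max(series)
    # Rescale to [-1, 1]; assumes hi > lo (a non-constant series).
    scaled = [2 * (x - lo) / (hi - lo) - 1 for x in series]
    # Clamp before arccos to guard against floating-point overshoot.
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]


# Hypothetical three-sample "beat", for illustration only.
gaf = gramian_angular_field([0.0, 1.0, 2.0])
```

The result is a symmetric square matrix whose side equals the series length, so a 1D heartbeat segment becomes a 2D image that a CNN can consume.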

  • Conference Article
  • Cite Count: 1
  • 10.1145/3573428.3573472
Light Field Saliency Detection Based on Multi-modal Fusion
  • Oct 21, 2022
  • Ben Jiang + 2 more

Compared with RGB images, light field images contain more abundant visual information, which helps to accurately detect salient objects in complex scenes. However, most existing light field saliency detection methods use a single type of light field data or do not fully consider the differences and complementarities between different light field data, resulting in insufficient multi-modal fusion. To address these issues, a multi-modal feature fusion network is proposed, which makes full use of the rich visual information in light field images to achieve accurate salient object detection. The proposed network consists of two parallel subnets, which process the micro-lens image array and the all-focus image, respectively. The light field refinement module then refines the feature map extracted from the micro-lens array stream, and finally the light field attention module performs the multi-modal feature fusion to predict salient objects more accurately. To verify the effectiveness of the proposed method, an extensive comparison with several existing light field saliency detection algorithms is carried out on both the Lytro-Illum and LFSD datasets. Experimental results show that the proposed method is superior to the others on all evaluation metrics on the Lytro-Illum dataset and generalizes well to the LFSD dataset.

  • Conference Article
  • Cite Count: 22
  • 10.1109/itsc55140.2022.9922104
MAFF-Net: Filter False Positive for 3D Vehicle Detection with Multi-modal Adaptive Feature Fusion
  • Oct 8, 2022
  • Zehan Zhang + 7 more

3D vehicle detection based on multi-modal fusion is an important task in many applications such as autonomous driving. Although significant progress has been made, we still observe two aspects that call for further improvement. First, previous works have seldom explored what extra information can be obtained from images to complement point clouds in 3D detection tasks. Second, most fusion modules can only be used in the network they were designed for and lack universality. In this work, we propose PointAttentionFusion and DenseAttentionFusion: two end-to-end trainable single-stage multi-modal feature fusion approaches that adaptively combine RGB and point cloud modalities. Experimental results on the KITTI dataset demonstrate significant improvement in filtering false positives over approaches using only point cloud data. Furthermore, the proposed methods provide competitive results compared to the published state-of-the-art multi-modal methods in the KITTI benchmark. Both fusion modules are applicable in all voxel-based 3D detection architectures, and similar improvements are expected.

  • Research Article
  • Cite Count: 5
  • 10.1145/3672565
Multimodal Fusion for Talking Face Generation Utilizing Speech-Related Facial Action Units
  • Sep 23, 2024
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Zhilei Liu + 5 more

Talking face generation aims to synthesize a lip-synchronized talking face video from an arbitrary face image and corresponding audio clips. Current talking face models can be divided into four parts: visual feature extraction, audio feature processing, multimodal feature fusion, and a rendering module. For the visual feature extraction part, existing methods face the challenge of a complex learning task with noisy features; this article introduces an attention-based disentanglement module to disentangle the face into Audio-face and Identity-face using speech-related facial action unit (AU) information. For the multimodal feature fusion part, existing methods ignore not only the interaction and relationship of cross-modal information but also the local driving information of the mouth muscles. This study proposes a novel generative framework that incorporates a dilated non-causal temporal convolutional self-attention network as a multimodal fusion module to enhance the learning of cross-modal features. The proposed method employs both audio and speech-related facial AUs as driving information. Speech-related AU information can facilitate more accurate mouth movements. Given the high correlation between speech and speech-related AUs, we propose an audio-to-AU module to predict speech-related AU information. Finally, we present a diffusion model for the synthesis of talking face images. We verify the effectiveness of the proposed model on the GRID and TCD-TIMIT datasets. An ablation study is also conducted to verify the contribution of each component. The results of quantitative and qualitative experiments demonstrate that our method outperforms existing methods in terms of both image quality and lip-sync accuracy. Code is available at https://mftfg-au.github.io/Multimodal_Fusion/.

  • Research Article
  • Cite Count: 3
  • 10.11817/j.issn.1672-7347.2024.230248
Development and validation of a multi-modality fusion deep learning model for differentiating glioblastoma from solitary brain metastases.
  • Jan 28, 2024
  • Zhong nan da xue xue bao. Yi xue ban = Journal of Central South University. Medical sciences
  • Chunquan Li + 7 more

Glioblastoma (GBM) and brain metastases (BMs) are the two most common malignant brain tumors in adults. Magnetic resonance imaging (MRI) is commonly used for screening and evaluating the prognosis of brain tumors, but the specificity and sensitivity of conventional MRI sequences in the differential diagnosis of GBM and BMs are limited. In recent years, deep neural networks have shown great potential for diagnostic classification and for building clinical decision support systems. This study aims to apply radiomics features extracted by deep learning techniques to explore the feasibility of accurate preoperative classification of newly diagnosed GBM and solitary brain metastases (SBMs), and to further explore the impact of multimodality data fusion on classification tasks. Standard-protocol cranial MRI sequence data from 135 newly diagnosed GBM patients and 73 patients with SBMs confirmed by histopathologic or clinical diagnosis were retrospectively analyzed. First, structural T1-weighted, T1C-weighted, and T2-weighted images were selected as the 3 inputs to the model; regions of interest (ROIs) were manually delineated on the registered three-modality MR images; multimodality radiomics features were obtained; dimensionality was reduced using a random forest (RF)-based feature selection method; and the importance of each feature was further analyzed. Second, a contrast disentangled method was used to find the shared and complementary features among the different modal features. Finally, the response of each sample to GBM and SBMs was predicted by fusing 2 features from different modalities. The radiomics features using machine learning and the multi-modal fusion method had a good discriminatory ability for GBM and SBMs. Furthermore, compared with single-modal data, the multimodal fusion models using machine learning algorithms such as support vector machine (SVM), logistic regression, RF, adaptive boosting (AdaBoost), and gradient boosting decision tree (GBDT) achieved significant improvements, with area under the curve (AUC) values of 0.974, 0.978, 0.943, 0.938, and 0.947, respectively; the contrast disentangled multi-modal MR fusion method performed well, with AUC, accuracy (ACC), sensitivity (SEN), and specificity (SPE) in the test set of 0.985, 0.984, 0.900, and 0.990, respectively. Compared with other multi-modal fusion methods, the AUC, ACC, and SEN in this study were all the best. In the ablation experiment verifying the effect of each module component, AUC, ACC, and SEN increased by 1.6%, 10.9%, and 15.0%, respectively, after the 3 loss functions were used simultaneously. A deep learning-based contrast disentangled multi-modal MR radiomics feature fusion technique helps to improve the accuracy of GBM and SBMs classification.
