Accurate industrial anomaly detection with efficient multimodal fusion
- Research Article
3
- 10.1016/j.compbiomed.2024.108381
- Mar 27, 2024
- Computers in biology and medicine
LRFNet: A real-time medical image fusion method guided by detail information
- Book Chapter
- 10.71443/9788197933684-13
- Jan 31, 2025
The rapid advancements in deep neural networks (DNNs) have revolutionized multi-modal data fusion, paving the way for transformative applications in holistic security assessments. This book chapter explores the integration of diverse data modalities, such as visual, textual, and behavioral inputs, to enhance security systems' accuracy, robustness, and adaptability. The chapter delves into state-of-the-art DNN architectures, including hybrid models that combine Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers, to effectively process and fuse multi-modal data. Key challenges, such as balancing model complexity with fusion efficiency and addressing issues of scalability and real-time applicability, are critically analyzed. Advanced topics, including attention mechanisms for emphasizing relevant features and innovative fusion strategies, are discussed to provide actionable insights for developing intelligent security systems. Case studies, such as integrated facial and behavior recognition systems, demonstrate the efficacy of these approaches in real-world applications. By addressing the gaps in existing methodologies and proposing novel solutions, this chapter contributes significantly to advancing the field of multi-modal data fusion for security.
- Research Article
- 10.3390/s24227139
- Nov 6, 2024
- Sensors (Basel, Switzerland)
To address the issues of single-structured feature input channels, insufficient feature learning capabilities in noisy environments, and large model parameter sizes in intelligent diagnostic models for mechanical equipment, a lightweight and efficient multimodal feature fusion convolutional neural network (LEMFN) method is proposed. Compared with existing models, LEMFN captures rich fault features at multiple scales by combining time-domain and frequency-domain signals, thereby enhancing the model's robustness to noise and improving data adaptability under varying operating conditions. Additionally, the convolutional block attention module (CBAM) and random overlapping sampling technology (ROST) are introduced, and through a feature fusion strategy, the accurate diagnosis of mechanical equipment faults is achieved. Experimental results demonstrate that the proposed method not only possesses high diagnostic accuracy and rapid convergence but also exhibits strong robustness in noisy environments. Finally, a graphical user interface (GUI)-based mechanical equipment fault detection system was developed to promote the practical application of intelligent fault diagnosis in mechanical equipment.
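To illustrate the attention component mentioned above, here is a minimal PyTorch sketch of a CBAM-style block (channel attention followed by spatial attention). The layer sizes and reduction ratio are illustrative assumptions, not the LEMFN configuration.

```python
# Minimal sketch of a CBAM-style attention block: channel attention, then spatial attention.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # Channel attention: shared MLP over global average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 7x7 conv over concatenated channel-wise mean and max maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)   # channel attention
        mean_map = x.mean(dim=1, keepdim=True)
        max_map = x.amax(dim=1, keepdim=True)
        attn = torch.sigmoid(self.spatial(torch.cat([mean_map, max_map], dim=1)))
        return x * attn                                     # spatial attention

x = torch.randn(4, 64, 32, 32)
print(CBAM(64)(x).shape)  # torch.Size([4, 64, 32, 32])
```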
- Preprint Article
- 10.32920/22734290.v1
- May 3, 2023
Electrocardiogram (ECG) is an authoritative source to diagnose and counter critical cardiovascular syndromes such as arrhythmia and myocardial infarction (MI). Current machine learning techniques either depend on manually extracted features or on large and complex deep learning networks that merely utilize the 1D ECG signal directly. Since intelligent multimodal fusion can perform at the state-of-the-art level with an efficient deep network, in this paper we propose two computationally efficient multimodal fusion frameworks for ECG heartbeat classification, called Multimodal Image Fusion (MIF) and Multimodal Feature Fusion (MFF). At the input of these frameworks, we convert the raw ECG data into three different images using the Gramian Angular Field (GAF), Recurrence Plot (RP) and Markov Transition Field (MTF). In MIF, we first perform image fusion by combining the three imaging modalities to create a single image modality, which serves as input to the Convolutional Neural Network (CNN). In MFF, we extract features from the penultimate layer of the CNNs and fuse them to obtain the unique and interdependent information necessary for better classifier performance. These features are finally used to train a Support Vector Machine (SVM) classifier for ECG heartbeat classification. We demonstrate the superiority of the proposed fusion models by performing experiments on PhysioNet's MIT-BIH dataset for five distinct arrhythmia conditions, consistent with the AAMI EC57 protocols, and on the PTB diagnostics dataset for myocardial infarction (MI) classification. We achieved classification accuracies of 99.7% and 99.2% on arrhythmia and MI classification, respectively.
- Research Article
101
- 10.1109/access.2021.3097614
- Jan 1, 2021
- IEEE Access
Electrocardiogram (ECG) is an authoritative source to diagnose and counter critical cardiovascular syndromes such as arrhythmia and myocardial infarction (MI). Current machine learning techniques either depend on manually extracted features or on large and complex deep learning networks that merely utilize the 1D ECG signal directly. Since intelligent multimodal fusion can perform at the state-of-the-art level with an efficient deep network, in this paper we propose two computationally efficient multimodal fusion frameworks for ECG heartbeat classification, called Multimodal Image Fusion (MIF) and Multimodal Feature Fusion (MFF). At the input of these frameworks, we convert the raw ECG data into three different images using the Gramian Angular Field (GAF), Recurrence Plot (RP) and Markov Transition Field (MTF). In MIF, we first perform image fusion by combining the three imaging modalities to create a single image modality, which serves as input to the Convolutional Neural Network (CNN). In MFF, we extract features from the penultimate layer of the CNNs and fuse them to obtain the unique and interdependent information necessary for better classifier performance. These features are finally used to train a Support Vector Machine (SVM) classifier for ECG heartbeat classification. We demonstrate the superiority of the proposed fusion models by performing experiments on PhysioNet's MIT-BIH dataset for five distinct arrhythmia conditions, consistent with the AAMI EC57 protocols, and on the PTB diagnostics dataset for myocardial infarction (MI) classification. We achieved classification accuracies of 99.7% and 99.2% on arrhythmia and MI classification, respectively. Source code at https://github.com/zaamad/ECG-Heartbeat-Classification-Using-Multimodal-Fusion.
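As an illustration of the input stage described above, the following Python sketch converts a 1D beat into GAF, RP, and MTF images using the pyts library. The image size and the toy signal are assumptions for demonstration, not the paper's exact settings.

```python
# Hedged sketch: convert a 1D heartbeat into GAF, RP, and MTF images (pyts library).
import numpy as np
from pyts.image import GramianAngularField, RecurrencePlot, MarkovTransitionField

beat = np.sin(np.linspace(0, 8 * np.pi, 187))[None, :]  # stand-in for one ECG beat

gaf = GramianAngularField(image_size=64).fit_transform(beat)
rp = RecurrencePlot().fit_transform(beat)
mtf = MarkovTransitionField(image_size=64, n_bins=8).fit_transform(beat)

# RecurrencePlot keeps the original length, so resize/crop before stacking the three
# images into a multi-channel CNN input; here we simply report the shapes.
print(gaf.shape, rp.shape, mtf.shape)  # (1, 64, 64) (1, 187, 187) (1, 64, 64)
```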
- Research Article
12
- 10.1109/tnnls.2023.3311820
- Dec 1, 2024
- IEEE transactions on neural networks and learning systems
In recent years, deep-learning-based pixel-level unified image fusion methods have received increasing attention due to their practicality and robustness. However, they usually require a complex network to achieve effective fusion, leading to high computational cost. To achieve more efficient and accurate image fusion, a lightweight pixel-level unified image fusion (L-PUIF) network is proposed. Specifically, information refinement and measurement processes are used to extract gradient and intensity information and to enhance the feature extraction capability of the network. This information is then converted into weights that guide the loss function adaptively, so that more effective image fusion can be achieved while keeping the network lightweight. Extensive experiments have been conducted on four public image fusion datasets across multimodal fusion, multifocus fusion, and multiexposure fusion. Experimental results show that L-PUIF achieves better fusion efficiency and visual quality than state-of-the-art methods. In addition, the practicality of L-PUIF in high-level computer vision tasks, i.e., object detection and image segmentation, has been verified.
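The sketch below illustrates the general idea of a gradient/intensity-guided fusion loss in the spirit of L-PUIF: measured gradient and intensity information from the two sources is turned into per-pixel weights that tell the loss which source to follow. The measurement and weighting scheme here is a simplified assumption, not the paper's exact formulation.

```python
# Toy gradient/intensity-guided fusion loss (illustrative, not the L-PUIF loss).
import torch
import torch.nn.functional as F

def sobel_magnitude(img: torch.Tensor) -> torch.Tensor:
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

def guided_fusion_loss(fused, src_a, src_b):
    # Per-pixel weights favour whichever source carries the stronger gradients.
    ga, gb = sobel_magnitude(src_a), sobel_magnitude(src_b)
    w = (ga > gb).float()
    grad_loss = F.l1_loss(sobel_magnitude(fused), w * ga + (1 - w) * gb)
    int_loss = F.l1_loss(fused, torch.maximum(src_a, src_b))
    return grad_loss + int_loss

a, b = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
print(guided_fusion_loss((a + b) / 2, a, b).item())
```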
- Research Article
1
- 10.1609/aaai.v38i6.28448
- Mar 24, 2024
- Proceedings of the AAAI Conference on Artificial Intelligence
Video-and-language understanding has a variety of applications in industry, such as video question answering, text-video retrieval, and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal encoders and feature fusion modules, which incur high computational costs. In particular, they have difficulty dealing with the dense video frames or long text prevalent in industrial applications. This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model that achieves effective feature fusion and rapid adaptation to downstream tasks. Specifically, we design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules to sample long sequences and fuse multi-modal features, which reduces computational costs and addresses the performance degradation caused by previous samplers. MuLTI can therefore handle longer sequences with limited computational costs. Then, to further enhance the model's performance and address the lack of pretraining tasks for video question answering, we propose a new pretraining task named Multiple Choice Modeling. This task bridges the gap between pretraining and downstream tasks and improves the model's ability to align video and text features. Benefiting from the efficient feature fusion module and the new pretraining task, MuLTI achieves state-of-the-art performance on multiple datasets. Implementation and pretrained models will be released.
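For intuition, here is a greatly simplified sketch of a text-guided sampler: a small set of learned queries, conditioned on text features, cross-attends over a long video-feature sequence to produce a short fused sequence, with a residual from an adaptively pooled summary of the input. The module name, dimensions, and the residual path are assumptions, not MuLTI's implementation.

```python
# Simplified text-guided sampler sketch (not the MuLTI architecture).
import torch
import torch.nn as nn

class TextGuidedSampler(nn.Module):
    def __init__(self, dim: int = 256, num_queries: int = 32, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        b = video.size(0)
        # Condition the learned queries on a pooled text representation.
        q = self.queries.unsqueeze(0).expand(b, -1, -1) + text.mean(dim=1, keepdim=True)
        sampled, _ = self.attn(q, video, video)  # cross-attend over the long video sequence
        # Residual from adaptively pooled video keeps a global summary of the input.
        pooled = nn.functional.adaptive_avg_pool1d(
            video.transpose(1, 2), sampled.size(1)
        ).transpose(1, 2)
        return sampled + pooled

v = torch.randn(2, 512, 256)   # long video-feature sequence
t = torch.randn(2, 20, 256)    # text features
print(TextGuidedSampler()(v, t).shape)  # torch.Size([2, 32, 256])
```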
- Research Article
1
- 10.1016/j.cmpb.2024.108568
- Mar 1, 2025
- Computer methods and programs in biomedicine
The fusion of multi-modal data has been shown to significantly enhance the performance of deep learning models, particularly on medical data. However, missing modalities are common in medical data due to patient specificity, which poses a substantial challenge to the application of these models. This study aimed to develop a novel and efficient multi-modal fusion framework for medical datasets that maintains consistent performance even in the absence of one or more modalities. In this paper, we fused three modalities: chest X-ray radiographs, history of present illness text, and tabular data such as demographics and laboratory tests. A multi-modal fusion module based on pooled bottleneck (PB) attention was proposed in conjunction with knowledge distillation (KD) for enhancing model inference in the case of missing modalities. In addition, we introduced a gradient modulation (GM) method to deal with unbalanced optimization in multi-modal model training. Finally, we designed comparison and ablation experiments to evaluate the fusion effect, the model's robustness to missing modalities, and the contribution of each component (PB, KD, and GM). The evaluation experiments were performed on the MIMIC-IV datasets with the task of predicting in-hospital mortality risk. Model performance was assessed using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). The proposed multi-modal fusion framework achieved an AUROC of 0.886 and an AUPRC of 0.459, significantly surpassing the performance of baseline models. Even when one or two modalities were missing, our model consistently outperformed the reference models. Ablating each of the three components resulted in varying degrees of performance degradation, highlighting their distinct contributions to the model's overall effectiveness. This innovative multi-modal fusion architecture has demonstrated robustness to missing modalities and has shown excellent performance in fusing three medical modalities for patient outcome prediction. This study provides a novel approach to the challenge of missing modalities and has the potential to be scaled to additional modalities.
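The toy sketch below illustrates the general idea of gradient modulation for balancing multi-modal training: after backpropagation, the gradient of the currently dominant modality's encoder is damped so the weaker modality can catch up. The dominance estimate and scaling rule are illustrative assumptions, not the paper's GM formulation.

```python
# Toy gradient-modulation step for a two-modality model (illustrative only).
import torch
import torch.nn as nn

x_img, x_tab = torch.randn(8, 32), torch.randn(8, 16)
y = torch.randint(0, 2, (8,))

enc_img, enc_tab = nn.Linear(32, 8), nn.Linear(16, 8)
head = nn.Linear(16, 2)
criterion = nn.CrossEntropyLoss()

h_img, h_tab = enc_img(x_img), enc_tab(x_tab)
loss = criterion(head(torch.cat([h_img, h_tab], dim=1)), y)
loss.backward()

# Estimate per-modality dominance from the unimodal feature norms and damp the
# gradients of the stronger encoder.
ratio = h_img.norm().item() / (h_tab.norm().item() + 1e-8)
if ratio > 1:
    for p in enc_img.parameters():
        p.grad *= 1.0 / ratio
else:
    for p in enc_tab.parameters():
        p.grad *= ratio
```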
- Research Article
13
- 10.1007/s00500-022-07047-2
- Apr 8, 2022
- Soft Computing
With recent advancements in medical diagnostic tools, multi-modality medical images are extensively utilized as a lifesaving resource, and efficient fusion of medical images can improve the performance of various diagnostic tools. However, gathering all modalities for a given patient is an ill-posed problem, as medical images suffer from poor visibility and frequent patient dropout. Therefore, in this paper, an efficient multi-modality image fusion model is proposed to fuse multi-modality medical images. To tune the hyper-parameters of the proposed model, a multi-objective differential evolution is used, with the fusion factor and edge strength metrics forming the multi-objective fitness function. The performance of the proposed model is compared with nine competitive models over fifteen benchmark images. The analyses reveal that the proposed model outperforms the competing fusion models.
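As a compact illustration of hyper-parameter tuning with differential evolution, the sketch below optimizes two stand-in fusion hyper-parameters against a scalarised surrogate of two objectives. The real work uses a genuinely multi-objective formulation and actual fusion-factor and edge-strength metrics; everything here is a toy assumption.

```python
# Toy differential evolution loop for tuning two fusion hyper-parameters.
import numpy as np

rng = np.random.default_rng(0)

def fitness(params):
    a, b = params                      # hypothetical fusion hyper-parameters
    fusion_factor = -(a - 0.6) ** 2    # toy surrogate, peak at a = 0.6
    edge_strength = -(b - 0.3) ** 2    # toy surrogate, peak at b = 0.3
    return fusion_factor + edge_strength

pop = rng.uniform(0, 1, size=(20, 2))
F_scale, CR = 0.8, 0.9
for _ in range(100):
    for i in range(len(pop)):
        r1, r2, r3 = pop[rng.choice(len(pop), 3, replace=False)]
        mutant = np.clip(r1 + F_scale * (r2 - r3), 0, 1)   # mutation
        cross = rng.uniform(size=2) < CR                    # crossover mask
        trial = np.where(cross, mutant, pop[i])
        if fitness(trial) > fitness(pop[i]):                # greedy selection
            pop[i] = trial

print(max(pop, key=fitness))  # approaches [0.6, 0.3]
```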
- Research Article
- 10.3390/electronics14010009
- Dec 24, 2024
- Electronics
As an interdisciplinary field of natural language processing and computer vision, Visual Question Answering (VQA) has emerged as a prominent research focus in artificial intelligence. The core of the VQA task is to combine natural language understanding and image analysis to infer answers by extracting meaningful features from textual and visual inputs. However, most current models struggle to fully capture the deep semantic relationships between images and text owing to their limited capacity to comprehend feature interactions, which constrains their performance. To address these challenges, this paper proposes an innovative Trilinear Multigranularity and Multimodal Adaptive Fusion algorithm (TriMMF) that is designed to improve the efficiency of multimodal feature extraction and fusion in VQA tasks. Specifically, the TriMMF consists of three key modules: (1) an Answer Generation Module, which generates candidate answers by extracting fused features and leveraging question features to focus on critical regions within the image; (2) a Fine-grained and Coarse-grained Interaction Module, which achieves multimodal interaction between question and image features at different granularities and incorporates implicit answer information to capture complex multimodal correlations; and (3) an Adaptive Weight Fusion Module, which selectively integrates coarse-grained and fine-grained interaction features based on task requirements, thereby enhancing the model’s robustness and generalization capability. Experimental results demonstrate that the proposed TriMMF significantly outperforms existing methods on the VQA v1.0 and VQA v2.0 datasets, achieving state-of-the-art performance in question–answer accuracy. These findings indicate that the TriMMF effectively captures the deep semantic associations between images and text. The proposed approach provides new insights into multimodal interaction and fusion research, combining domain adaptation techniques to address a broader range of cross-domain visual question answering tasks.
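The sketch below shows one simple way an adaptive weight fusion step can be realized: a learned gate decides, per sample, how much to trust coarse-grained versus fine-grained interaction features. The dimensions and gating form are assumptions for illustration, not the TriMMF module.

```python
# Minimal adaptive weight fusion sketch: gated blend of coarse and fine features.
import torch
import torch.nn as nn

class AdaptiveWeightFusion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid()
        )

    def forward(self, coarse: torch.Tensor, fine: torch.Tensor) -> torch.Tensor:
        alpha = self.gate(torch.cat([coarse, fine], dim=-1))  # per-sample weight in (0, 1)
        return alpha * coarse + (1 - alpha) * fine

coarse, fine = torch.randn(4, 512), torch.randn(4, 512)
print(AdaptiveWeightFusion()(coarse, fine).shape)  # torch.Size([4, 512])
```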
- Research Article
2
- 10.1016/j.compmedimag.2024.102457
- Nov 14, 2024
- Computerized Medical Imaging and Graphics
Self-supervised multi-modal feature fusion for predicting early recurrence of hepatocellular carcinoma
- Book Chapter
1
- 10.1007/978-3-030-64559-5_13
- Jan 1, 2020
Multi-modal medical image fusion plays a significant role in clinical applications such as noninvasive diagnosis and image-guided surgery. However, designing an efficient image fusion technique is still a challenging task. In this paper, we propose an improved multi-modal medical image fusion method to enhance the visual quality and contrast of the fused image. To achieve this, the registered source images are first decomposed into low-frequency (LF) and several high-frequency (HF) sub-images via the non-subsampled shearlet transform (NSST). Afterward, the LF sub-images are combined using the proposed weighted local-features fusion rule based on local energy and standard deviation, while the HF sub-images are fused using the novel sum-modified-Laplacian (NSML) technique. Finally, the inverse NSST is applied to reconstruct the fused image. Furthermore, the proposed method is extended to color multi-modal image fusion, which effectively restrains color distortion and enhances spatial and spectral resolutions. To evaluate the performance, various experiments are conducted on different datasets of gray-scale and color images. Experimental results show that the proposed scheme achieves better performance than other state-of-the-art algorithms in both visual effects and objective criteria.
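The numpy sketch below illustrates the low-frequency fusion idea: local energy and local standard deviation are computed in a sliding window and used as weights for combining the two LF sub-images. The window size and exact weighting formula are illustrative, and the NSST decomposition itself is omitted.

```python
# Local-energy / standard-deviation weighted fusion of two low-frequency sub-images.
import numpy as np
from scipy.ndimage import uniform_filter

def local_features(img, size=3):
    mean = uniform_filter(img, size)
    energy = uniform_filter(img ** 2, size)            # local energy
    std = np.sqrt(np.maximum(energy - mean ** 2, 0))   # local standard deviation
    return energy, std

def fuse_lf(lf_a, lf_b):
    ea, sa = local_features(lf_a)
    eb, sb = local_features(lf_b)
    wa, wb = ea + sa, eb + sb
    w = wa / (wa + wb + 1e-12)          # per-pixel weight for source A
    return w * lf_a + (1 - w) * lf_b

a, b = np.random.rand(64, 64), np.random.rand(64, 64)
print(fuse_lf(a, b).shape)  # (64, 64)
```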
- Conference Article
657
- 10.18653/v1/p18-1209
- Jan 1, 2018
Multimodal research is an emerging field of artificial intelligence, and one of its main research problems is multimodal fusion: integrating multiple unimodal representations into one compact multimodal representation. Previous research in this field has exploited the expressiveness of tensors for multimodal representation. However, these methods often suffer from an exponential increase in dimensionality and computational complexity introduced by transforming the input into a tensor. In this paper, we propose the Low-rank Multimodal Fusion method, which performs multimodal fusion using low-rank tensors to improve efficiency. We evaluate our model on three different tasks: multimodal sentiment analysis, speaker trait analysis, and emotion recognition. Our model achieves competitive results on all these tasks while drastically reducing computational complexity. Additional experiments also show that our model performs robustly across a wide range of low-rank settings and is much more efficient in both training and inference than other methods that utilize tensor representations.
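The sketch below is a minimal PyTorch reading of low-rank fusion: each unimodal vector (with a constant 1 appended) is projected by r low-rank factors, the factors are summed, and the per-modality results are combined by element-wise product. The dimensions and rank are illustrative assumptions, not a reference implementation.

```python
# Minimal low-rank multimodal fusion sketch.
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    def __init__(self, dims, out_dim: int = 64, rank: int = 4):
        super().__init__()
        # One factor tensor per modality: (rank, d_m + 1, out_dim); +1 for the appended constant.
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d + 1, out_dim) * 0.1) for d in dims]
        )

    def forward(self, inputs):
        fused = None
        for x, factor in zip(inputs, self.factors):
            ones = torch.ones(x.size(0), 1, device=x.device)
            x1 = torch.cat([x, ones], dim=1)                 # append constant 1
            proj = torch.einsum("bd,rdo->bo", x1, factor)    # sum over the r low-rank factors
            fused = proj if fused is None else fused * proj  # element-wise product across modalities
        return fused

audio, video, text = torch.randn(8, 74), torch.randn(8, 35), torch.randn(8, 300)
print(LowRankFusion([74, 35, 300])([audio, video, text]).shape)  # torch.Size([8, 64])
```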
- Research Article
2
- 10.1016/j.fmre.2023.08.004
- Oct 10, 2023
- Fundamental Research
Nighttime traffic object detection via adaptively integrating event and frame domains