Enhanced multimodal deep learning framework for emotion classification with Aquila optimizer based ensemble fusion

Similar Papers
  • Dissertation
  • 10.32657/10356/182346
Data efficient deep multimodal learning
  • Jan 1, 2025
  • Meng Shen

Multimodal learning, which enables neural networks to process and integrate information from various sensory modalities such as vision, language, and sound, has become increasingly important in applications ranging from affective computing and healthcare to advanced multimodal chatbots. Despite its potential, multimodal learning faces significant challenges, particularly in the area of data efficiency. The requirement for large, high-quality datasets from multiple modalities presents a substantial barrier, limiting the scalability and accessibility of large multimodal models. This dissertation addresses several key issues in data-efficient deep multimodal learning, focusing on imbalanced multimodal data selection, the cold-start problem in multimodal active learning, and the mitigation of hallucinations in large vision-language models. Firstly, we analyze the limitations of conventional active learning strategies, which tend to favor dominant modalities, leading to unbalanced multimodal models that neglect weaker modalities. To overcome this, we propose a gradient embedding modulation method that ensures a more equitable data selection process across modalities, resulting in models that fairly utilize both weak and strong modalities. Building on our work in warm-start active learning, we tackle the cold-start problem in multimodal active learning, where no initial labels are available for warm-start data selection. We develop a two-stage approach that first reduces the modality representation gap through multimodal self-supervised learning, utilizing unimodal prototypes to harmonize representations across modalities. In the subsequent data selection stage, we introduce a regularization term to maximize modality alignment, leading to improved model performance using the same amount of data compared to existing methods.
Extending our focus from data selection to the usage of training data, we address the challenge of hallucinations in large vision-language models, where the models generate content that is incorrect in the context of input images. We investigate the relationship between hallucinations and visual dependence of tokens, revealing that certain tokens contribute disproportionately to these hallucinations. Based on this insight, we propose an approach that adjusts training weights according to the visual dependence of tokens, thereby reducing the hallucination rate without requiring additional training data or inference costs. The contributions of this thesis offer significant advancements in the field of data-efficient multimodal learning. By developing novel methods for balancing multimodal data selection, addressing the cold-start problem in multimodal active learning, and mitigating hallucinations in large vision-language models, this work paves the way for more practical and scalable multimodal learning systems that require less data and computational effort while achieving superior performance.

  • Research Article
  • Cite Count Icon 1
  • 10.1785/0120240255
A Multimodal Deep Learning Framework for Rapid Real-Time Earthquake Magnitude Classification: Applications to Earthquake Early Warning
  • May 29, 2025
  • Bulletin of the Seismological Society of America
  • Zhou Zheng + 5 more

Rapidly and accurately determining whether a seismic event is small or large is crucial for seismic hazard analysis and earthquake early warning systems (EEWSs) to predict potential damage in target areas. However, the lack of prior information about the earthquake source and the class imbalance due to the limited availability of large earthquake waveforms pose significant challenges for this task. To address this issue, we applied a random sliding window technique to extend the waveform data of large earthquakes. Meanwhile, we developed a multimodal deep learning framework (MDLFrame) to distinguish between small (M < 5.5) and large (M ≥ 5.5) earthquakes using data sets from Japan and China. The MDLFrame is capable of extracting useful temporal features and spatial features from the input three-component waveforms and ground-motion parameters. The test results demonstrated that the MDLFrame outperformed traditional empirical methods and unimodal deep learning models in both timeliness and accuracy of magnitude classification. Specifically, the MDLFrame achieved an accuracy of 96.11%. To further confirm the robustness of the MDLFrame for magnitude classification, we analyzed its performance on 220 independent seismic events. The results indicated that within 3 s after the P-wave arrival, the MDLFrame achieved an accuracy of 98.68% for small earthquakes and 93.21% for large ones. Our study demonstrates that multimodal deep learning has significant potential for extensive applications in EEW and seismology.
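
The random sliding window idea mentioned in this abstract can be illustrated with a short sketch. This is not the authors' implementation: the function name `random_windows`, the window length, and the number of crops are hypothetical, and the waveform is a plain list of samples rather than three-component data.

```python
import random

def random_windows(waveform, window_len, n_windows, seed=None):
    """Draw n_windows random fixed-length crops from one waveform.

    Each crop starts at a uniformly random offset, so a single record of a
    rare (large) event yields several distinct training samples.
    """
    if window_len > len(waveform):
        raise ValueError("window_len exceeds waveform length")
    rng = random.Random(seed)
    max_start = len(waveform) - window_len
    crops = []
    for _ in range(n_windows):
        start = rng.randint(0, max_start)  # inclusive on both ends
        crops.append(waveform[start:start + window_len])
    return crops

# Hypothetical example: a 1000-sample trace cropped into five 300-sample windows.
trace = [float(i) for i in range(1000)]
augmented = random_windows(trace, window_len=300, n_windows=5, seed=0)
```

Applying such cropping only to the minority class is one plausible way to reduce class imbalance; how the paper chose offsets and window lengths is not stated in the abstract.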

  • Research Article
  • Cite Count Icon 4
  • 10.1016/j.jdent.2023.104588
Multi-modal deep learning for automated assembly of periapical radiographs
  • Jun 21, 2023
  • Journal of Dentistry
  • L Pfänder + 5 more

  • Research Article
  • Cite Count Icon 33
  • 10.1186/s12859-019-3084-y
Multimodal deep representation learning for protein interaction identification and protein family classification
  • Dec 1, 2019
  • BMC Bioinformatics
  • Da Zhang + 1 more

Background: Protein-protein interactions (PPIs) engage in dynamic pathological and biological procedures constantly in our life. Thus, it is crucial to comprehend the PPIs thoroughly such that we are able to illuminate the disease occurrence, achieve the optimal drug-target therapeutic effect and describe the protein complex structures. However, compared to the protein sequences obtainable from various species and organisms, the number of revealed protein-protein interactions is relatively limited. To address this dilemma, considerable research effort has gone into facilitating the discovery of novel PPIs. Among these methods, PPI prediction techniques that merely rely on protein sequence data are more widespread than other methods which require extensive biological domain knowledge. Results: In this paper, we propose a multi-modal deep representation learning structure by incorporating protein physicochemical features with the graph topological features from the PPI networks. Specifically, our method not only bears in mind the protein sequence information but also discerns the topological representations for each protein node in the PPI networks. In our paper, we construct a stacked auto-encoder architecture together with a continuous bag-of-words (CBOW) model based on generated metapaths to study the PPI predictions. Following that, we utilize supervised deep neural networks to identify the PPIs and classify the protein families. The PPI prediction accuracy for eight species ranged from 96.76% to 99.77%, which signifies that our multi-modal deep representation learning framework achieves superior performance compared to other computational methods. Conclusion: To the best of our knowledge, this is the first multi-modal deep representation learning framework for examining the PPI networks.

  • Research Article
  • Cite Count Icon 2
  • 10.1051/0004-6361/202553751
Estimation of age and metallicity for galaxies based on multi-modal deep learning
  • Jun 1, 2025
  • Astronomy & Astrophysics
  • Ping Li + 4 more

Aims. This study is aimed at deriving the age and metallicity of galaxies by proposing a novel multi-modal deep learning framework. This multi-modal framework integrates spectral and photometric data, offering advantages in cases where spectra are incomplete or unavailable. Methods. We propose a multi-modal learning method for estimating the age and metallicity of galaxies (MMLforGalAM). This method uses two modalities: spectra and photometric images as training samples. Its architecture consists of four models: a spectral feature extraction model (ℳ1), a simulated spectral feature generation model (ℳ2), an image feature extraction model (ℳ3), and a multi-modal attention regression model (ℳ4). Specifically, ℳ1 extracts spectral features associated with age and metallicity from spectra observed by the Sloan Digital Sky Survey (SDSS). These features are then used as labels to train ℳ2, which generates simulated spectral features for photometric images to address the challenge of missing observed spectra for some images. Overall, ℳ1 and ℳ2 provide a transformation from photometric to spectral features, with the goal of constructing a spectral representation of data pairs (photometric and spectral features) for multi-modal learning. Once ℳ2 is trained, MMLforGalAM can then be applied to scenarios with only images, even in the absence of spectra. Then, ℳ3 processes SDSS photometric images to extract features related to age and metallicity. Finally, ℳ4 combines the simulated spectral features from ℳ2 with the extracted image features from ℳ3 to predict the age and metallicity of galaxies. Results. Trained on 36278 galaxies from SDSS, our model predicts the stellar age and metallicity, with a scatter of 1σ = 0.1506 dex for age and 1 σ = 0.1402 dex for metallicity. Compared to a single-modal model trained using only images, the multi-modal approach reduces the scatter by 27% for age and 15% for metallicity.

  • Research Article
  • Cite Count Icon 1306
  • 10.1109/tgrs.2020.3016820
More Diverse Means Better: Multimodal Deep Learning Meets Remote-Sensing Imagery Classification
  • Aug 16, 2020
  • IEEE Transactions on Geoscience and Remote Sensing
  • Danfeng Hong + 6 more

Classification and identification of the materials lying over or beneath the Earth's surface have long been a fundamental but challenging research topic in geoscience and remote sensing (RS) and have garnered a growing concern owing to the recent advancements of deep learning techniques. Although deep networks have been successfully applied in single-modality-dominated classification tasks, yet their performance inevitably meets the bottleneck in complex scenes that need to be finely classified, due to the limitation of information diversity. In this work, we provide a baseline solution to the aforementioned difficulty by developing a general multimodal deep learning (MDL) framework. In particular, we also investigate a special case of multi-modality learning (MML) -- cross-modality learning (CML) that exists widely in RS image classification applications. By focusing on "what", "where", and "how" to fuse, we show different fusion strategies as well as how to train deep networks and build the network architecture. Specifically, five fusion architectures are introduced and developed, further being unified in our MDL framework. More significantly, our framework is not only limited to pixel-wise classification tasks but also applicable to spatial information modeling with convolutional neural networks (CNNs). To validate the effectiveness and superiority of the MDL framework, extensive experiments related to the settings of MML and CML are conducted on two different multimodal RS datasets. Furthermore, the codes and datasets will be available at https://github.com/danfenghong/IEEE_TGRS_MDL-RS, contributing to the RS community.

  • Research Article
  • 10.1007/s10278-025-01788-w
A Timeseries-based Multimodal Deep Learning Approach for Lung Nodule Growth Prediction.
  • Dec 16, 2025
  • Journal of imaging informatics in medicine
  • Duc-Khanh Nguyen + 4 more

Lung nodules, while often benign, can become significant health concerns if their growth is not monitored accurately. Predicting lung nodule growth is critical for improving patient outcomes and guiding clinical decision-making. This study aims to develop a Multimodal Deep Learning Approach to enhance the accuracy of lung nodule growth prediction by integrating time-series CT image data with demographics and nodule-specific features. Data were collected from the Far Eastern Memorial Hospital, Taiwan, including CT image sequences of lung nodules and patient demographics and nodule-specific features. Using this dataset, a Multimodal Deep Learning framework was developed and optimized. The model's performance was assessed using metrics such as Accuracy, Precision, Sensitivity, F1-score, and AUC. The proposed Multimodal Deep Learning framework substantially outperformed traditional machine learning and unimodal models. Among all configurations, the repeat frame strategy achieved the best overall performance, with an accuracy of 0.929, precision of 0.878, sensitivity of 0.908, F1-score of 0.878, and AUC of 0.977. Paired t-test analysis confirmed that these improvements were statistically significant (p < 0.05) compared to other multimodal variants and baseline models. These results highlight the model's ability to effectively integrate image, demographics, and nodule-specific features, leading to superior predictive accuracy and robust clinical decision-support potential. By using the time-series of CT image data, along with demographics and nodule-specific features, the proposed Multimodal Deep Learning provides a reliable tool for predicting lung nodule growth. This advancement has significant implications for lung nodule management, offering clinicians a robust and dependable resource to support medical decision-making and improve patient care. The findings highlight the transformative potential of deep learning techniques in critical healthcare domains.
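
For readers unfamiliar with the threshold-based metrics listed in this abstract, the following minimal sketch shows how accuracy, precision, sensitivity, and F1-score are derived from binary confusion-matrix counts. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study; AUC is omitted because it requires ranked scores rather than counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion counts."""
    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    precision = safe_div(tp, tp + fp)        # share of predicted positives that are correct
    sensitivity = safe_div(tp, tp + fn)      # share of actual positives that are found (recall)
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }

# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=8, fp=2, fn=2, tn=8)
```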

  • Research Article
  • 10.1038/s41746-026-02445-4
A device-invariant multi-modal learning framework for respiratory disease classification.
  • Feb 26, 2026
  • NPJ digital medicine
  • Mo Yang + 10 more

Recent advances in cough sound analysis using deep learning techniques enable smartphone-based respiratory disease screening suitable for self-management care in a home setting, yet their utility is limited by device heterogeneity, population diversity, and challenges in multimodal integration. We propose a device-invariant, multimodal deep learning framework that jointly models cough acoustics, demographic data, and symptom descriptions for multi-label classification of adult respiratory diseases. To address the issues of device effect, an adversarial branch is embedded in the audio encoder to enforce device-invariant feature learning, while an invariant risk minimization-augmented loss enhances robustness to non-structural shifts. To evaluate the effectiveness of our proposed method, a real-world, multi-center dataset containing over 10,000 cases spanning seven major respiratory conditions was curated. On the tasks of individual respiratory disease identification for chronic obstructive pulmonary disease (COPD), lower respiratory tract infection (LRTI) and pulmonary shadows (PS), our method achieves superior performance with the area under the receiver operating characteristic curve (AUROC) of 0.9698, 0.8483 and 0.8720, respectively. It also shows promising results in identifying the presence of comorbidities for 7 respiratory diseases with an overall AUROC of 0.8907. More importantly, extensive experimental results demonstrate our method mitigates the issues of device effect and facilitates the cross-device generalization for cough-based respiratory disease diagnoses. This work demonstrates a scalable and transferable AI-based approach for cough-driven respiratory screening, emphasizing the importance of multimodal fusion and robust representation learning in advancing clinical applicability.
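
The AUROC values reported in this abstract can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch of that pairwise definition follows; the function name `auroc` and the toy labels and scores are hypothetical, and for a multi-label task this computation would be repeated once per disease label.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ranked correctly.
value = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The quadratic loop is chosen for clarity; rank-based formulations give the same value more efficiently on large datasets.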

  • Research Article
  • Cite Count Icon 163
  • 10.1145/3545572
A Review on Methods and Applications in Multimodal Deep Learning
  • Feb 17, 2023
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Summaira Jabeen + 5 more

Deep learning has been applied in a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This article focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. Detailed analysis of the baseline approaches and an in-depth study of recent advancements during the past five years (2017 to 2021) in multimodal deep learning applications has been provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Last, main issues are highlighted separately for each domain, along with their possible future research directions.

  • Research Article
  • Cite Count Icon 60
  • 10.1016/j.imavis.2025.105509
A systematic review of intermediate fusion in multimodal deep learning for biomedical applications
  • May 1, 2025
  • Image and Vision Computing
  • Valerio Guarrasi + 6 more

Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, differently from early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review comprehensively analyzes and formalizes current intermediate fusion methods in biomedical applications, highlighting their effectiveness in improving predictive performance and capturing complex inter-modal relationships. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a novel structured notation that standardizes intermediate fusion architectures, enhancing understanding and facilitating implementation across various domains. Our findings provide actionable insights and practical guidelines intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL.
  • Comprehensive review of intermediate fusion in multimodal learning in biomedicine.
  • Structured notation for categorizing intermediate fusion methods.
  • Analysis of the benefits and challenges of intermediate fusion in biomedical contexts.
  • Identification of future research directions for improving current fusion techniques.
  • Versatile framework applicable to other multimodal deep learning domains.
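
The distinction between early, intermediate, and late fusion drawn in this abstract comes down to where the modalities are combined. The toy sketch below makes that placement explicit. Everything in it is an assumption for illustration: `encode` and `head` are hypothetical stand-ins for learned networks, and the scale factors are arbitrary.

```python
def encode(x, scale):
    """Stand-in modality encoder: scales each input value (hypothetical)."""
    return [scale * v for v in x]

def head(features):
    """Stand-in prediction head: mean of the features (hypothetical)."""
    return sum(features) / len(features)

def early_fusion(image, text):
    # Concatenate the raw inputs, then run one shared encoder and head.
    return head(encode(image + text, scale=1.0))

def intermediate_fusion(image, text):
    # Encode each modality separately, fuse the features, then predict once.
    fused = encode(image, scale=0.5) + encode(text, scale=2.0)
    return head(fused)

def late_fusion(image, text):
    # Predict per modality, then combine the two decisions by averaging.
    image_pred = head(encode(image, scale=0.5))
    text_pred = head(encode(text, scale=2.0))
    return (image_pred + text_pred) / 2

# Toy inputs standing in for two modalities of different lengths.
image, text = [1.0, 3.0], [2.0, 4.0, 6.0]
results = (early_fusion(image, text),
           intermediate_fusion(image, text),
           late_fusion(image, text))
```

In a real model the fused features in `intermediate_fusion` would pass through further trainable layers, which is what lets intermediate fusion learn cross-modal interactions.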

  • Research Article
  • Cite Count Icon 10
  • 10.1016/j.ejrs.2023.10.006
SinkholeNet: A novel RGB-slope sinkhole dataset and deep weakly-supervised learning framework for sinkhole classification and localization
  • Nov 16, 2023
  • The Egyptian Journal of Remote Sensing and Space Sciences
  • Amir Yavariabdi + 6 more

This paper proposes a novel multimodal deep weakly-supervised learning framework, SinkholeNet, to classify and localize sinkhole(s) in high-resolution RGB-slope aerial images. The SinkholeNet first employs a multimodal Convolutional Neural Network (CNN) architecture that simultaneously extracts features from the input RGB image and ground slope map and then fuses the extracted features. It then uses an improved ShuffleNet architecture on the fused features to classify patches as sinkholes or non-sinkholes. Finally, the last extracted feature maps, belonging to the sinkhole class, are used as input of gradient-weighted class activation mapping (Grad-CAM) to localize sinkhole(s) in a weakly-supervised setting. The proposed weakly-supervised framework intends to increase the available labeled data for training and decrease the cost of human annotation. We also introduce a novel publicly available weakly labeled sinkhole dataset comprising RGB-slope paired image patches to support reproducible research. The experimental results on the newly introduced dataset show that the SinkholeNet outperforms the other methods considered in this paper both for sinkhole classification and localization.
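
The Grad-CAM localization step named in this abstract weights each feature map by the spatial average of its gradient, sums the weighted maps, and keeps only positive values. A minimal sketch on nested lists follows; the function name `grad_cam` and the toy maps are hypothetical, and in practice the feature maps and gradients would come from a trained CNN via automatic differentiation.

```python
def grad_cam(feature_maps, gradients):
    """Compute a Grad-CAM heatmap from K feature maps and their gradients.

    feature_maps, gradients: lists of K maps, each an H x W nested list.
    The gradients are those of the target class score w.r.t. each map.
    """
    h = len(feature_maps[0])
    w = len(feature_maps[0][0])
    # Channel weight = global average of that channel's gradient.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    cam = [[0.0] * w for _ in range(h)]
    for alpha, fmap in zip(weights, feature_maps):
        for i in range(h):
            for j in range(w):
                cam[i][j] += alpha * fmap[i][j]
    # ReLU keeps only locations that support the target class.
    return [[max(0.0, v) for v in row] for row in cam]

# Toy example with two 2x2 channels: one supports the class, one opposes it.
maps = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(maps, grads)
```

The resulting heatmap is usually upsampled to the input resolution and thresholded to obtain a localization, steps that are left out here.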

  • Supplementary Content
  • Cite Count Icon 28
  • 10.1093/genetics/iyae161
A review of multimodal deep learning methods for genomic-enabled prediction in plant breeding
  • Nov 5, 2024
  • Genetics
  • Osval A Montesinos-López + 9 more

Deep learning methods have been applied when working to enhance the prediction accuracy of traditional statistical methods in the field of plant breeding. Although deep learning seems to be a promising approach for genomic prediction, it has proven to have some limitations, since its conventional methods fail to leverage all available information. Multimodal deep learning methods aim to improve the predictive power of their unimodal counterparts by introducing several modalities (sources) of input information. In this review, we introduce some theoretical basic concepts of multimodal deep learning and provide a list of the most widely used neural network architectures in deep learning, as well as the available strategies to fuse data from different modalities. We mention some of the available computational resources for the practical implementation of multimodal deep learning problems. We finally performed a review of applications of multimodal deep learning to genomic selection in plant breeding and other related fields. We present a meta-picture of the practical performance of multimodal deep learning methods to highlight how these tools can help address complex problems in the field of plant breeding. We discussed some relevant considerations that researchers should keep in mind when applying multimodal deep learning methods. Multimodal deep learning holds significant potential for various fields, including genomic selection. While multimodal deep learning displays enhanced prediction capabilities over unimodal deep learning and other machine learning methods, it demands more computational resources. Multimodal deep learning effectively captures intermodal interactions, especially when integrating data from different sources. To apply multimodal deep learning in genomic selection, suitable architectures and fusion strategies must be chosen. It is relevant to keep in mind that multimodal deep learning, like unimodal deep learning, is a powerful tool but should be carefully applied. Given its predictive edge over traditional methods, multimodal deep learning is valuable in addressing challenges in plant breeding and food security amid a growing global population.

  • Research Article
  • Cite Count Icon 7
  • 10.1016/j.knosys.2024.111990
A Dynamic Multi-modal deep Reinforcement Learning framework for 3D Bin Packing Problem
  • May 24, 2024
  • Knowledge-Based Systems
  • Anhao Zhao + 2 more

  • Abstract
  • 10.1136/annrheumdis-2024-eular.1222
AB0875 ARTIFICIAL INTELLIGENCE TO PREDICT DISEASE ACTIVITY USING A MULTIMODAL MODEL WITH MAGNETIC RESONANCE IMAGING AND LABORATORY RESULTS IN PATIENTS WITH AXIAL SPONDYLOARTHRITIS
  • Jun 1, 2024
  • Annals of the Rheumatic Diseases
  • H S Cha + 5 more

Background: Sacral magnetic resonance imaging (MRI) helps determine whether patients with axial spondyloarthritis (axSpA) have active disease by detecting sacroiliitis. However, there is no consensus on the extent to which sacroiliitis...

  • Research Article
  • Cite Count Icon 1
  • 10.32996/jcsts.2025.7.2.29
Multimodal Deep Learning for Alzheimer’s Disease Diagnosis: Integrating Neuroimaging and Genetic Data
  • Apr 23, 2025
  • Journal of Computer Science and Technology Studies
  • Md Jubaer Rahman + 1 more

Conventional diagnosis of Alzheimer’s disease (AD) has usually relied on data from individual modalities, which inherently restricts how well the disease process can be understood. To this end, in the current study, we present a novel multimodal deep learning framework that integrates clinical assessments, genomic information, and imaging characteristics to enhance diagnosis and disease staging. This study uses a Contrastive Stack Denoising Autoencoder and 3D CNNs to represent genetic data (e.g., single nucleotide polymorphisms, or SNPs), clinical test scores, and MRI scans. In addition to correctly categorizing people into three groups (AD, MCI, and CN), and in contrast to existing interpretability methods, this method selects the most prominent features by clustering them and performing perturbation analysis. Using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), we show through experiments that our proposed deep learning framework outperforms traditional machine learning methods, including support vector machines, random forests, and k-nearest neighbors, for these imaging features. The multimodal model outperforms the single-modality models across all metrics, including accuracy, precision, recall, and F1 scores. This by itself validates the therapeutic relevance of the model, as it highlights classic AD-related features, including the hippocampus, amygdala, and the Rey Auditory Verbal Learning Test (RAVLT), which are all widely known to be impacted in AD as per conventional medical knowledge of the disease.
