Sarcasm detection in hotel reviews: a multimodal deep learning approach
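Purpose: This study detects consumer sarcasm by analyzing the inconsistency of sentiment features between the text and images of hotel reviews. Design/methodology/approach: This paper proposes a sarcasm detection model based on multimodal deep learning, using reviews of three hotel brands collected from two travel platforms; the model identifies sentiment inconsistency both within and between modalities. A graph neural network (GNN) is used to explore text-image interaction information in order to detect the key cues of sarcastic sentiment. Findings: The results show that the multimodal deep learning model outperforms the other baseline models, which helps in understanding hotel service evaluation and provides decision-making recommendations for hotel managers. Originality: The study can help hoteliers in two respects: detecting service quality and formulating strategy. By selecting a reference hotel brand, hoteliers can better assess their level of service quality (and, in turn, allocate resources optimally); sarcasm detection research therefore benefits more than just hotel managers seeking to improve service quality. The multimodal deep learning approach introduced in this study can be replicated in other industries and can help travel platforms optimize their products and services.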
- Research Article
- 10.1007/s10278-025-01788-w
- Dec 16, 2025
- Journal of imaging informatics in medicine
Lung nodules, while often benign, can become significant health concerns if their growth is not monitored accurately. Predicting lung nodule growth is critical for improving patient outcomes and guiding clinical decision-making. This study aims to develop a Multimodal Deep Learning Approach to enhance the accuracy of lung nodule growth prediction by integrating time-series CT image data with demographics and nodule-specific features. Data were collected from the Far Eastern Memorial Hospital, Taiwan, including CT image sequences of lung nodules and patient demographics and nodule-specific features. Using this dataset, a Multimodal Deep Learning framework was developed and optimized. The model's performance was assessed using metrics such as Accuracy, Precision, Sensitivity, F1-score, and AUC. The proposed Multimodal Deep Learning framework substantially outperformed traditional machine learning and unimodal models. Among all configurations, the repeat frame strategy achieved the best overall performance, with an accuracy of 0.929, precision of 0.878, sensitivity of 0.908, F1-score of 0.878, and AUC of 0.977. Paired t-test analysis confirmed that these improvements were statistically significant (p < 0.05) compared to other multimodal variants and baseline models. These results highlight the model's ability to effectively integrate image, demographics, and nodule-specific features, leading to superior predictive accuracy and robust clinical decision-support potential. By using the time-series of CT image data, along with demographics and nodule-specific features, the proposed Multimodal Deep Learning provides a reliable tool for predicting lung nodule growth. This advancement has significant implications for lung nodule management, offering clinicians a robust and dependable resource to support medical decision-making and improve patient care. The findings highlight the transformative potential of deep learning techniques in critical healthcare domains.
- Research Article
7
- 10.1038/s41598-025-10512-1
- Jul 12, 2025
- Scientific Reports
This paper presents an approach for adaptive scheduling and robustness optimization in global logistics networks by integrating multimodal deep reinforcement learning with Internet of Things (IoT) technologies. We propose an integrated framework comprising a multimodal data fusion mechanism that synthesizes heterogeneous IoT sensor data, historical records, and contextual information; an adaptive deep reinforcement learning architecture that generates dynamic scheduling policies; and a multi-objective robust optimization method that balances operational efficiency with system resilience. The framework addresses key challenges in global logistics including demand volatility, transportation disruptions, and environmental uncertainties. Comprehensive experiments conducted on real-world logistics datasets demonstrate that our approach outperforms traditional methods with an 18.7% reduction in operational costs, 12.4% improvement in service levels, and significantly enhanced robustness under various disruption scenarios. The proposed method maintains 83% performance stability during complex disruptions compared to 51–72% for alternative approaches, while keeping computational requirements feasible for practical deployment. This research demonstrates potential contributions to AI-driven logistics operations management by showing improved supply chain performance through multimodal learning and robust optimization techniques.
- Research Article
- 10.55041/ijsrem59205
- Apr 5, 2026
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
Abstract: The spread of fake news on social media is a big threat to public discourse, democracy, and trust in society. Conventional unimodal methodologies that depend exclusively on textual content have demonstrated inadequacy in encapsulating the intricate dynamics of misinformation dissemination. This paper shows a full multimodal deep learning framework that combines text with social signals to help find fake news more easily. We use the latest transformer architectures to encode text, graph neural networks to model how social information spreads, and adaptive fusion mechanisms to combine content features with social context. The proposed methodology addresses significant deficiencies in the current literature, specifically the insufficient acquisition of structural social information and the discordance between content and social modalities. By systematically analyzing recent studies, we show that multimodal approaches always do better than unimodal baselines. For example, on benchmark datasets, the accuracies were 94.3% and the F1-scores were 92.8%. This work integrates contemporary methodological trends, delineates enduring research deficiencies, and introduces an innovative framework that enhances the forefront of automated fake news detection by adeptly modeling the interaction between content semantics and social propagation dynamics. Keywords: Fake news detection, multimodal deep learning, social signals, graph neural networks, transformer models, misinformation detection
- Supplementary Content
28
- 10.1093/genetics/iyae161
- Nov 5, 2024
- Genetics
Deep learning methods have been applied when working to enhance the prediction accuracy of traditional statistical methods in the field of plant breeding. Although deep learning seems to be a promising approach for genomic prediction, it has proven to have some limitations, since its conventional methods fail to leverage all available information. Multimodal deep learning methods aim to improve the predictive power of their unimodal counterparts by introducing several modalities (sources) of input information. In this review, we introduce some theoretical basic concepts of multimodal deep learning and provide a list of the most widely used neural network architectures in deep learning, as well as the available strategies to fuse data from different modalities. We mention some of the available computational resources for the practical implementation of multimodal deep learning problems. We finally performed a review of applications of multimodal deep learning to genomic selection in plant breeding and other related fields. We present a meta-picture of the practical performance of multimodal deep learning methods to highlight how these tools can help address complex problems in the field of plant breeding. We discussed some relevant considerations that researchers should keep in mind when applying multimodal deep learning methods. Multimodal deep learning holds significant potential for various fields, including genomic selection. While multimodal deep learning displays enhanced prediction capabilities over unimodal deep learning and other machine learning methods, it demands more computational resources. Multimodal deep learning effectively captures intermodal interactions, especially when integrating data from different sources. To apply multimodal deep learning in genomic selection, suitable architectures and fusion strategies must be chosen. It is relevant to keep in mind that multimodal deep learning, like unimodal deep learning, is a powerful tool but should be carefully applied. Given its predictive edge over traditional methods, multimodal deep learning is valuable in addressing challenges in plant breeding and food security amid a growing global population.
- Research Article
160
- 10.1145/3545572
- Feb 17, 2023
- ACM Transactions on Multimedia Computing, Communications, and Applications
Deep Learning has implemented a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This article focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. Detailed analysis of the baseline approaches and an in-depth study of recent advancements during the past five years (2017 to 2021) in multimodal deep learning applications has been provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Last, main issues are highlighted separately for each domain, along with their possible future research directions.
- Dissertation
- 10.32657/10356/182346
- Jan 1, 2025
Multimodal learning, which enables neural networks to process and integrate information from various sensory modalities such as vision, language, and sound, has become increasingly important in applications ranging from affective computing and healthcare to advanced multimodal chatbots. Despite its potential, multimodal learning faces significant challenges, particularly in the area of data efficiency. The requirement for large, high-quality datasets from multiple modalities presents a substantial barrier, limiting the scalability and accessibility of large multimodal models. This dissertation addresses several key issues in data-efficient deep multimodal learning, focusing on imbalanced multimodal data selection, the cold-start problem in multimodal active learning, and the mitigation of hallucinations in large vision-language models. Firstly, we analyze the limitations of conventional active learning strategies, which tend to favor dominant modalities, leading to unbalanced multimodal models that neglect weaker modalities. To overcome this, we propose a gradient embedding modulation method that ensures a more equitable data selection process across modalities, resulting in models that fairly utilize both weak and strong modalities. Building on our work in warm-start active learning, we tackle the cold-start problem in multimodal active learning, where no initial labels are available for warm-start data selection. We develop a two-stage approach that first reduces the modality representation gap through multimodal self-supervised learning, utilizing unimodal prototypes to harmonize representations across modalities. In the subsequent data selection stage, we introduce a regularization term to maximize modality alignment, leading to improved model performance using the same amount of data compared to existing methods.
Extending our focus from data selection to the usage of training data, we address the challenge of hallucinations in large vision-language models, where the models generate content that is incorrect in the context of input images. We investigate the relationship between hallucinations and visual dependence of tokens, revealing that certain tokens contribute disproportionately to these hallucinations. Based on this insight, we propose an approach that adjusts training weights according to the visual dependence of tokens, thereby reducing the hallucination rate without requiring additional training data or inference costs. The contributions of this thesis offer significant advancements in the field of data-efficient multimodal learning. By developing novel methods for balancing multimodal data selection, addressing the cold-start problem in multimodal active learning, and mitigating hallucinations in large vision-language models, this work paves the way for more practical and scalable multimodal learning systems that require less data and computational effort while achieving superior performance.
- Research Article
34
- 10.1016/j.compbiomed.2023.107288
- Aug 1, 2023
- Computers in Biology and Medicine
DeepCIP: A multimodal deep learning method for the prediction of internal ribosome entry sites of circRNAs
- Research Article
4
- 10.1016/j.jdent.2023.104588
- Jun 21, 2023
- Journal of Dentistry
Multi-modal deep learning for automated assembly of periapical radiographs
- Research Article
5
- 10.3390/curroncol31110530
- Nov 15, 2024
- Current oncology (Toronto, Ont.)
Prostate cancer (PCa) is a clinically heterogeneous disease. Predicting clinically significant PCa with low-intermediate prostate-specific antigen (PSA), which often includes aggressive cancers, is imperative. This study evaluated the predictive accuracy of deep learning analysis using multimodal medical data focused on clinically significant PCa in patients with PSA ≤ 20 ng/mL. Our cohort study included 178 consecutive patients who underwent ultrasound-guided prostate biopsy. Deep learning analyses were applied to predict clinically significant PCa. We generated receiver operating characteristic curves and calculated the corresponding area under the curve (AUC) to assess the prediction. The AUC of the integrated medical data using our multimodal deep learning approach was 0.878 (95% confidence interval [CI]: 0.772-0.984) in all patients without PSA restriction. Despite the reduced predictive ability of PSA when restricted to PSA ≤ 20 ng/mL (n = 122), the AUC was 0.862 (95% CI: 0.723-1.000), complemented by imaging data. In addition, we assessed clinical presentations and images belonging to representative false-negative and false-positive cases. Our multimodal deep learning approach assists physicians in determining treatment strategies by predicting clinically significant PCa in patients with PSA ≤ 20 ng/mL before biopsy, contributing to personalized medical workflows for PCa management.
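The AUC values quoted in the abstract above summarize a receiver operating characteristic curve as a single number. As a minimal sketch of what that number measures (not the study's own code), the AUC can be computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are hypothetical and do not come from the study.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = clinically significant, 0 = not.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a cohort of a few hundred patients; a rank-based formulation would be the usual choice for larger datasets.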
- Research Article
1
- 10.1158/1538-7445.am2024-2313
- Mar 22, 2024
- Cancer Research
Purpose: Cancer patients routinely undergo radiologic and pathologic evaluation for their diagnostic workup. These data modalities represent a valuable and readily available resource for developing new prognostic tools. Given their vast difference in spatial scales, effective methods to integrate the two modalities are currently lacking. Here, we aim to develop a multi-modal approach to integrate radiology and pathology images for predicting outcomes in cancer patients. Methods: We propose a multi-modal weakly-supervised deep learning framework to integrate radiology and pathology images for survival prediction. We first extract multi-scale features from whole-slide H&E-stained pathology images to characterize cellular and tissue phenotypes as well as spatial cellular organization. We then build a hierarchical co-attention transformer to effectively learn the multi-modal interactions between radiology and pathology image features. Finally, a multimodal risk score is derived by combining complementary information from the two image modalities and clinical data for predicting outcome. We evaluate our approach in lung, gastric, and brain cancers with matched radiology and pathology images and clinical data available, each with separate training and external validation cohorts. Results: The multi-modal deep learning models achieved a reasonably high accuracy for predicting survival outcomes in the external validation cohorts (C-index range: 0.72-0.75 across three cancer types). The multi-modal prognostic models significantly improved upon single-modal approaches based on radiology or pathology images or clinical data alone (C-index range: 0.53-0.71, P<0.01). The multi-modal deep learning models were significantly associated with disease-free survival and overall survival (hazard ratio range: 3.23-4.46, P<0.0001).
In multivariable analyses, the models remained an independent prognostic factor (P<0.01) after adjusting for clinicopathological variables including cancer stage and tumor differentiation. Conclusions: The proposed multi-modal deep learning approach outperforms traditional methods for predicting survival outcomes by leveraging routinely available radiology and pathology images. With further independent validation, this may afford a promising approach to improve risk stratification and better inform treatment strategies for cancer patients. Citation Format: Zhe Li, Yuming Jiang, Ruijiang Li. Multi-modal deep learning to predict cancer outcomes by integrating radiology and pathology images [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 2313.
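The C-index reported in the abstract above measures how well a risk score orders patients by survival time. A minimal sketch of Harrell's concordance index follows, assuming right-censored data with an event indicator; the abstract does not say which C-index variant or software the authors used, and the function name and toy cohort here are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when patient i had an observed event
    (events[i] == 1) strictly before patient j's follow-up time. The pair is
    concordant when i also has the higher risk score; tied scores count half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical): follow-up in months, 1 = event, 0 = censored.
times = [5, 8, 12, 3]
events = [1, 0, 1, 1]
risks = [0.7, 0.2, 0.8, 0.9]
c_index = concordance_index(times, events, risks)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported ranges (0.53-0.71 for single-modal models, 0.72-0.75 for multi-modal models) should be read.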
- Research Article
58
- 10.1016/j.imavis.2025.105509
- May 1, 2025
- Image and Vision Computing
Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, differently from early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review comprehensively analyzes and formalizes current intermediate fusion methods in biomedical applications, highlighting their effectiveness in improving predictive performance and capturing complex inter-modal relationships. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a novel structured notation that standardizes intermediate fusion architectures, enhancing understanding and facilitating implementation across various domains. Our findings provide actionable insights and practical guidelines intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL. • Comprehensive review of intermediate fusion in multimodal learning in biomedicine. • Structured notation for categorizing intermediate fusion methods. • Analysis of the benefits and challenges of intermediate fusion in biomedical contexts. • Identification of future research directions for improving current fusion techniques. • Versatile framework applicable to other multimodal deep learning domains.
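The distinction the abstract above draws between early, late, and intermediate fusion can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions, not the notation or any architecture proposed in the review: each "model" is a single linear layer, the weights are fixed rather than learned, and all names (`early_fusion`, `late_fusion`, `intermediate_fusion`, the toy vectors) are hypothetical.

```python
import math


def linear(x, w):
    """Dot product of an input vector and a weight vector of equal length."""
    return sum(xi * wi for xi, wi in zip(x, w))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def early_fusion(img, tab, w):
    # Raw inputs are concatenated first; one model sees the joint vector.
    return sigmoid(linear(img + tab, w))


def late_fusion(img, tab, w_img, w_tab):
    # Each modality gets its own model; only the predictions are combined.
    return 0.5 * (sigmoid(linear(img, w_img)) + sigmoid(linear(tab, w_tab)))


def intermediate_fusion(img, tab, enc_img, enc_tab, w_head):
    # Each modality has its own encoder; the latent features are joined
    # and passed to a shared prediction head.
    h_img = [math.tanh(linear(img, row)) for row in enc_img]
    h_tab = [math.tanh(linear(tab, row)) for row in enc_tab]
    return sigmoid(linear(h_img + h_tab, w_head))


# Toy inputs (hypothetical): 3 image features and 2 tabular features.
img = [0.2, -0.1, 0.4]
tab = [1.0, 0.5]

p_early = early_fusion(img, tab, [0.5, 0.5, 0.5, -0.2, 0.1])
p_late = late_fusion(img, tab, [0.5, 0.5, 0.5], [-0.2, 0.1])
p_mid = intermediate_fusion(
    img, tab,
    enc_img=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],  # 3 inputs -> 2 latent features
    enc_tab=[[0.5, -0.5]],                       # 2 inputs -> 1 latent feature
    w_head=[0.3, 0.3, 0.4],
)
```

In a trained network the encoders and the head of the intermediate variant would be optimized jointly, so that each encoder learns features useful in combination with the other modality; the fixed weights here only show where in the pipeline the join happens.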
- Research Article
139
- 10.2991/ijcis.d.200120.001
- Jan 1, 2020
- International Journal of Computational Intelligence Systems
Traffic flow forecasting has been regarded as a key problem of intelligent transport systems. In this work, we propose a hybrid multimodal deep learning method for short-term traffic flow forecasting, which can jointly and adaptively learn the spatial–temporal correlation features and long temporal interdependence of multi-modality traffic data by an attention auxiliary multimodal deep learning architecture. According to the highly nonlinear characteristics of multi-modality traffic data, the base module of our method consists of one-dimensional convolutional neural networks (1D CNN) and gated recurrent units (GRU) with the attention mechanism. The former is to capture the local trend features and the latter is to capture the long temporal dependencies. Then, we design a hybrid multimodal deep learning framework for fusing share representation features of different modality traffic data by multiple CNN-GRU-Attention modules. The experimental results indicate that the proposed multimodal deep learning model is capable of dealing with complex nonlinear urban traffic flow forecasting with satisfying accuracy and effectiveness.
- Research Article
3
- 10.2174/0115748936289033240424071522
- May 1, 2025
- Current Bioinformatics
Background: Cancer has emerged as the "leading killer" of human health. Survival prediction is a crucial branch of cancer prognosis. It aims to estimate patients' survival risk based on their disease conditions. Accurate and efficient survival prediction is vital in cancer patients' treatment and clinical management, preventing unnecessary suffering and conserving precious medical resources. Deep learning has been extensively applied in cancer diagnosis, prognosis, and treatment management. The decreasing cost of next-generation sequencing, continuous development of related databases, and in-depth research on multimodal deep learning have provided opportunities for establishing more functionally rich and accurate survival prediction models. Objective: The current area of cancer survival prediction still lacks a review of multimodal deep learning methods. Methods: We conducted a statistical analysis of the relevant research on multimodal deep learning for cancer survival prediction. We first filtered keywords from 6 known relevant papers. Then, we searched PubMed and Google Scholar for relevant publications from 2018 to 2022 using "Multimodal", "Deep Learning" and "Cancer Survival Prediction" as keywords. Then, we further searched the related publications through the backward and forward citation search. Subsequently, we conducted a detailed analysis and review of these studies based on their datasets and methods. Results: We present a comprehensive systematic review of the multimodal deep learning research on cancer survival prediction from 2018 to 2022. Conclusion: Multimodal deep learning has demonstrated powerful data aggregation capabilities and excellent performance in improving cancer survival prediction greatly. It has made a significant positive impact on facilitating the advancement of automated cancer diagnosis and precision oncology.
- Research Article
4
- 10.2196/55825
- Feb 7, 2025
- JMIR Medical Informatics
Background: Chronic kidney disease (CKD) is a prevalent condition with significant global health implications. Early detection and management are critical to prevent disease progression and complications. Deep learning (DL) models using retinal images have emerged as potential noninvasive screening tools for CKD, though their performance may be limited, especially in identifying individuals with proteinuria and in specific subgroups. Objective: We aim to evaluate the efficacy of integrating retinal images and urine dipstick data into DL models for enhanced CKD diagnosis. Methods: Three models were developed and validated: eGFR-RIDL (estimated glomerular filtration rate–retinal image deep learning), eGFR-UDLR (logistic regression using urine dipstick data), and eGFR-MMDL (multimodal deep learning combining retinal images and urine dipstick data). All models were trained to predict an eGFR < 60 mL/min/1.73 m², a key indicator of CKD, calculated using the 2009 CKD-EPI (Chronic Kidney Disease Epidemiology Collaboration) equation. This study used a multicenter dataset of participants aged 20-79 years, including a development set (65,082 people) and an external validation set (58,284 people). Wide Residual Networks were used for DL, and saliency maps were used to visualize model attention. Sensitivity analyses assessed the impact of numerical variables. Results: eGFR-MMDL outperformed eGFR-RIDL in both the test and external validation sets, with areas under the curve of 0.94 versus 0.90 and 0.88 versus 0.77 (P<.001 for both, DeLong test). eGFR-UDLR outperformed eGFR-RIDL and was comparable to eGFR-MMDL, particularly in the external validation. However, in the subgroup analysis, eGFR-MMDL showed improvement across all subgroups, while eGFR-UDLR demonstrated no such gains. This suggested that the enhanced performance of eGFR-MMDL was not due to urine data alone, but rather from the synergistic integration of both retinal images and urine data.
The eGFR-MMDL model demonstrated the best performance in individuals younger than 65 years or those with proteinuria. Age and proteinuria were identified as critical factors influencing model performance. Saliency maps indicated that urine data and retinal images provide complementary information, with urine offering insights into retinal abnormalities and retinal images, particularly the arcade vessels, being key for predicting kidney function. Conclusions: The MMDL model integrating retinal images and urine dipstick data shows significant promise for noninvasive CKD screening, outperforming the retinal image–only model. However, routine blood tests are still recommended for individuals aged 65 years and older due to the model's limited performance in this age group.
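The prediction target in the abstract above, eGFR < 60 mL/min/1.73 m² under the 2009 CKD-EPI creatinine equation, can be sketched as follows. This is an illustration of the published equation only: the abstract does not describe how the study handled the equation's race coefficient or creatinine units, so the `black` flag, the mg/dL input, and the function names are assumptions. It is not intended for clinical use.

```python
def egfr_ckd_epi_2009(scr_mg_dl, age_years, female, black=False):
    """Estimated GFR (mL/min/1.73 m^2) from the 2009 CKD-EPI creatinine equation.

    scr_mg_dl is serum creatinine in mg/dL (assumed unit for this sketch).
    """
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr_mg_dl / kappa
    egfr = (141.0
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.209
            * 0.993 ** age_years)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr


def is_low_egfr(egfr, threshold=60.0):
    """Binary label of the kind the models were trained to predict (eGFR < 60)."""
    return egfr < threshold


# Hypothetical patients, not taken from the study's dataset.
egfr_a = egfr_ckd_epi_2009(0.9, 50, female=False)  # creatinine at kappa
egfr_b = egfr_ckd_epi_2009(2.0, 60, female=False)  # elevated creatinine
label_a = is_low_egfr(egfr_a)
label_b = is_low_egfr(egfr_b)
```

Because the label is derived from a blood test, a model that predicts it from retinal images and urine dipstick data is being asked to approximate a laboratory measurement noninvasively, which is the screening use case the abstract describes.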
- Research Article
4
- 10.1016/j.ultramic.2022.113519
- Mar 29, 2022
- Ultramicroscopy
Revealing geometrically necessary dislocation density from electron backscatter patterns via multi-modal deep learning