Analysis of Multimodal Data Using Deep Learning and Machine Learning
A modality is a way in which something happens or is experienced. Life is multimodal: we see, hear, smell, feel, and taste. Multimodal experiences involve several of these modalities. Artificial intelligence must grasp multimodal views to understand our surroundings. Multimodal machine learning models process and correlate input from several modalities. It is a multidisciplinary field with great potential. In this study, we analyze emerging multimodal machine learning technologies and categorize them scientifically rather than focusing on specific multimodal applications. Multimodal machine learning presents opportunities and problems that extend beyond classification. Most multimodal learning research collects quantitative data from polls and surveys. This research reviews a detailed body of observational studies on the use of multimodal data (MMD) for human learning with artificial intelligence-powered approaches, including machine learning and deep learning. It also describes how MMD has improved learning and in what environments, and discusses multimodal learning, its ongoing development, and approaches to improving learning. Finally, future researchers should carefully consider building a system that aligns multimodal aspects with the study and learning plan. These elements could enhance multimodal learning by facilitating theory and practice activities. This research lays the groundwork for multimodal data use in future learning technologies and development.
- Dissertation
- 10.32657/10356/182346
- Jan 1, 2025
Multimodal learning, which enables neural networks to process and integrate information from various sensory modalities such as vision, language, and sound, has become increasingly important in applications ranging from affective computing and healthcare to advanced multimodal chatbots. Despite its potential, multimodal learning faces significant challenges, particularly in the area of data efficiency. The requirement for large, high-quality datasets from multiple modalities presents a substantial barrier, limiting the scalability and accessibility of large multimodal models. This dissertation addresses several key issues in data-efficient deep multimodal learning, focusing on imbalanced multimodal data selection, the cold-start problem in multimodal active learning, and the mitigation of hallucinations in large vision-language models. First, we analyze the limitations of conventional active learning strategies, which tend to favor dominant modalities, leading to unbalanced multimodal models that neglect weaker modalities. To overcome this, we propose a gradient embedding modulation method that ensures a more equitable data selection process across modalities, resulting in models that fairly utilize both weak and strong modalities. Building on our work in warm-start active learning, we tackle the cold-start problem in multimodal active learning, where no initial labels are available for warm-start data selection. We develop a two-stage approach that first reduces the modality representation gap through multimodal self-supervised learning, utilizing unimodal prototypes to harmonize representations across modalities. In the subsequent data selection stage, we introduce a regularization term to maximize modality alignment, leading to improved model performance compared to existing methods using the same amount of data.
Extending our focus from data selection to the usage of training data, we address the challenge of hallucinations in large vision-language models, where the models generate content that is incorrect in the context of input images. We investigate the relationship between hallucinations and the visual dependence of tokens, revealing that certain tokens contribute disproportionately to these hallucinations. Based on this insight, we propose an approach that adjusts training weights according to the visual dependence of tokens, thereby reducing the hallucination rate without requiring additional training data or inference costs. The contributions of this thesis offer significant advancements in the field of data-efficient multimodal learning. By developing novel methods for balancing multimodal data selection, addressing the cold-start problem in multimodal active learning, and mitigating hallucinations in large vision-language models, this work paves the way for more practical and scalable multimodal learning systems that require less data and computational effort while achieving superior performance.
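To make the idea of adjusting training weights per token more concrete, the following is a minimal sketch of a weighted negative log-likelihood over one caption. It is not the dissertation's method: the visual-dependence scores, the `1 + score` weighting rule, and the function name `weighted_nll` are all assumptions chosen for illustration.

```python
import math

def weighted_nll(token_probs, weights):
    """Weighted average negative log-likelihood over the tokens of one caption."""
    total = sum(w * -math.log(p) for p, w in zip(token_probs, weights))
    return total / sum(weights)

# Probability the model assigns to each ground-truth token (made-up numbers).
token_probs = [0.9, 0.8, 0.3, 0.95]

# Hypothetical visual-dependence scores in [0, 1]; a higher score means the
# token depends more on the image (e.g. an object name vs. a function word).
visual_dependence = [0.1, 0.2, 0.9, 0.1]

# Baseline: every token contributes equally to the loss.
uniform = weighted_nll(token_probs, [1.0] * len(token_probs))

# Assumed weighting rule (1 + score): visually dependent tokens count more.
reweighted = weighted_nll(token_probs, [1.0 + s for s in visual_dependence])

print(f"uniform loss:    {uniform:.4f}")
print(f"reweighted loss: {reweighted:.4f}")
```

In this toy case the poorly predicted third token is also the most visually dependent one, so the reweighted loss is larger than the uniform loss and the gradient signal would concentrate on that token. The weighting changes only the training objective, which is consistent with the abstract's claim of no extra data or inference cost.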
- Research Article
4
- 10.1016/j.jdent.2023.104588
- Jun 21, 2023
- Journal of Dentistry
Multi-modal deep learning for automated assembly of periapical radiographs
- Research Article
6
- 10.13374/j.issn2095-9389.2019.03.21.003
- May 1, 2020
- SHILAP Revista de lepidopterología
“Big data” is always collected from different resources that have different data structures. With the rapid development of information technologies, today's valuable data resources are characteristically multi-modal. As a result, building on classical machine learning strategies, multi-modal learning has become a valuable research topic, enabling computers to process and understand “big data”. The cognitive processes of humans involve perception through different sense organs. Signals from eyes, ears, the nose, and hands (tactile sense) constitute a person’s understanding of a particular scene or the world as a whole. It is reasonable to believe that multi-modal methods with a higher ability to process complex heterogeneous data can further promote the progress of information technologies. The concepts of multimodality originated in psychology and pedagogy hundreds of years ago and have been popular in computer science during the past decade. In contrast to the concept of “media”, a “mode” is a more fine-grained concept that is associated with a typical data source or data form. The effective utilization of multi-modal data can help a computer understand a specific environment in a more holistic way. In this context, we first introduce the definition and main tasks of multi-modal learning. Based on this information, the mechanism and origin of multi-modal machine learning are then briefly introduced. Subsequently, statistical learning methods and deep learning methods for multi-modal tasks are comprehensively summarized. We also introduce the main styles of data fusion in multi-modal perception tasks, including feature representation, shared mapping, and co-training. Additionally, novel adversarial learning strategies for cross-modal matching or generation are reviewed. The main methods for multi-modal learning are outlined in this paper with a focus on future research issues in this field.
- Conference Article
10
- 10.1145/3580305.3599208
- Aug 4, 2023
The recent advancements in machine learning and artificial intelligence (particularly foundation models such as BERT, GPT-3, T5, ResNet, etc.) have demonstrated remarkable capabilities and driven significant revolutionary changes to the way we make inferences from complex data. These models represent a fundamental shift in the way data are approached and offer exciting new research directions and opportunities for multimodal learning and data fusion. Given the potential of foundation models to transform the field of multimodal learning, there is a need to bring together experts and researchers to discuss the latest developments in this area, exchange ideas, and identify key research questions and challenges that need to be addressed. By hosting this workshop, we aim to create a forum for researchers to share their insights and expertise on multimodal data fusion and learning using foundation models, and to explore potential new research directions and applications in the rapidly evolving field. We expect contributions from interdisciplinary researchers to study and model interactions between (but not limited to) modalities of language, graphs, time-series, vision, tabular data, sensors, and more. Our workshop will emphasize interdisciplinary work and aim at seeding cross-team collaborations around new tasks, datasets, and models.
- Research Article
55
- 10.3390/electronics12071558
- Mar 26, 2023
- Electronics
Machine Learning (ML) and Deep Learning (DL) are derivatives of Artificial Intelligence (AI) that have already demonstrated their effectiveness in a variety of domains, including healthcare, where they are now routinely integrated into patients’ daily activities. On the other hand, data heterogeneity has long been a key obstacle in AI, ML and DL. Here, Multimodal Machine Learning (Multimodal ML) has emerged as a method that enables the training of complex ML and DL models that use heterogeneous data in their learning process. In addition, Multimodal ML enables the integration of multiple models in the search for a single, comprehensive solution to a complex problem. In this review, the technical aspects of Multimodal ML are discussed, including a definition of the technology and its technical underpinnings, especially data fusion. It also outlines the differences between this technology and others, such as Ensemble Learning, as well as the various workflows that can be followed in Multimodal ML. In addition, this article examines in depth the use of Multimodal ML in the detection and prediction of Cardiovascular Diseases, highlighting the results obtained so far and the possible starting points for improving its use in the aforementioned field. Finally, a number of the most common problems hindering the development of this technology and potential solutions that could be pursued in future studies are outlined.
- Research Article
35
- 10.1145/3713070
- Feb 20, 2025
- ACM Computing Surveys
The multimodal interplay of the five fundamental senses—Sight, Hearing, Smell, Taste, and Touch—provides humans with superior environmental perception and learning skills. Adapted from the human perceptual system, multimodal machine learning tries to incorporate different forms of input, such as image, audio, and text, and determine their fundamental connections through joint modeling. As one of the future development forms of artificial intelligence, it is necessary to summarize the progress of multimodal machine learning. In this article, we start with the form of a multimodal combination and provide a comprehensive survey of the emerging subject of multimodal machine learning, covering representative research approaches, the most recent advancements, and their applications. Specifically, this article analyzes the relationship between different modalities in detail and sorts out the key issues in multimodal research from the application scenarios. Besides, we thoroughly reviewed state-of-the-art methods and datasets covered in multimodal learning research. We then identify the substantial challenges and potential developing directions in this field. Finally, given the comprehensive nature of this survey, both modality-specific and task-specific researchers can benefit from this survey and advance the field.
- Research Article
123
- 10.1016/j.inffus.2023.102217
- Dec 30, 2023
- Information Fusion
A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges
- Research Article
- 10.1038/s41598-026-36296-6
- Jan 20, 2026
- Scientific Reports
Multimodal Machine Learning (MML) methods address efficient ways of deriving insights from multiple data modalities, e.g., in healthcare settings, tabular electronic health records along with other modalities, such as medical imaging, electrocardiogram (ECG) data, and textual doctors' notes and reports. Using deep learning methods, we propose a novel MML approach for mortality prediction in healthcare settings that fuses tabular data, ECG, and written notes in various stages. To this end, this research addresses several challenges related to MML, including (1) collecting and building comprehensive data representations from modalities that may require different preprocessing steps to handle noise and distorted data, (2) ensuring data alignment across modalities, and (3) choosing the optimal fusion strategy (i.e., early, late, or hybrid). This study uses three distinct data modalities: tabular data (encompassing healthcare records, real-time vital signs, laboratory test results, procedures, and diagnosis records), ECG data, and textual notes from doctors about patients. These modalities are obtained from the MIMIC-IV, MIMIC-ECG, and MIMIC-IV-Note datasets, which include comprehensive medical records, ECG reports, and textual doctors' notes to explore and evaluate methods in all MML stages. The methodology includes data preprocessing to address noise, outliers, and missing values. It involves comparing fusion strategies (early, late, hybrid) for integrating multimodal data. In addition, novel deep learning models that use attention mechanisms are implemented for better data interaction. Model performance is evaluated with metrics such as AUC-ROC, precision, recall, and F-score. The results of our proposed multimodal neural network model using multimodal information showed a substantial increase in performance, with an AUC of 0.96, surpassing the performance of previously published single-modality models.
Using multimodal data, the aim is to make the proposed model obtain a holistic view of patient health similar to that of domain experts, resulting in better informed clinical decisions and potentially better clinical outcomes. Our promising results suggest the need to examine biases in training data, such as mortality class imbalances, to improve model performance. Future work should also address the interpretability of complex deep learning models for clinical adoption.
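The distinction between early and late fusion mentioned in this abstract can be illustrated with a minimal sketch. This is not the paper's model: the feature values, weights, and function names below are hypothetical, and a plain linear scorer with a sigmoid stands in for the deep networks the authors describe. Hybrid fusion, which combines both ideas, is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(features, weights, bias=0.0):
    return sum(f * w for f, w in zip(features, weights)) + bias

def early_fusion(modalities, weights, bias=0.0):
    """Concatenate all modality feature vectors, then score the result once."""
    fused = [x for feats in modalities for x in feats]
    return sigmoid(linear_score(fused, weights, bias))

def late_fusion(modalities, per_modality_weights):
    """Score each modality with its own model, then average the probabilities."""
    probs = [
        sigmoid(linear_score(feats, w))
        for feats, w in zip(modalities, per_modality_weights)
    ]
    return sum(probs) / len(probs)

# Hypothetical, already-preprocessed features for a single patient.
tabular = [0.2, 1.0]     # e.g. two normalized lab values
ecg = [0.5]              # e.g. one summary statistic of the ECG signal
notes = [1.0, 0.0, 0.3]  # e.g. a tiny text embedding

# Made-up weights; in practice these would be learned from data.
p_early = early_fusion([tabular, ecg, notes],
                       weights=[0.1, 0.4, -0.2, 0.3, 0.0, 0.5])
p_late = late_fusion([tabular, ecg, notes],
                     [[0.1, 0.4], [-0.2], [0.3, 0.0, 0.5]])

print(f"early fusion probability: {p_early:.3f}")
print(f"late fusion probability:  {p_late:.3f}")
```

Even with identical weights the two strategies give different outputs, because early fusion combines evidence before the nonlinearity while late fusion combines per-modality decisions after it. Early fusion can model cross-modal interactions, whereas late fusion tolerates a missing modality more easily, since the remaining per-modality scores can still be averaged.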
- Research Article
403
- 10.1007/s00371-021-02166-7
- Jun 10, 2021
- The Visual Computer
The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.
- Research Article
53
- 10.1016/j.ins.2022.12.014
- Dec 9, 2022
- Information Sciences
Analysis of multimodal data fusion from an information theory perspective
- Research Article
4
- 10.2196/72822
- May 12, 2025
- Journal of Medical Internet Research
A major challenge in sentiment analysis on social media is the increasing prevalence of image-based content, which integrates text and visuals to convey nuanced messages. Traditional text-based approaches have been widely used to assess public attitudes and beliefs; however, they often fail to fully capture the meaning of multimodal content where cultural, contextual, and visual elements play a significant role. This study aims to provide practical guidance for collecting, processing, and analyzing social media data using multimodal machine learning models. Specifically, it focuses on training and fine-tuning models to classify sentiment and detect hate speech. Social media data were collected from Facebook and Instagram using CrowdTangle, a public insights tool by Meta, and from X via its academic research application programming interface. The dataset was filtered to include only race-related terms and lesbian, gay, bisexual, transgender, queer, intersex, and asexual community-related posts with image attachments, ensuring focus on multimodal content. Human annotators labeled 13,000 posts into 4 categories: negative sentiment, positive sentiment, hate, or antihate. We evaluated unimodal (Bidirectional Encoder Representations from Transformers for text and Visual Geometry Group 16 for images) and multimodal (Contrastive Language-Image Pretraining [CLIP], Visual Bidirectional Encoder Representations from Transformers [VisualBERT], and an intermediate fusion) models. To enhance model performance, the synthetic minority oversampling technique was applied to address class imbalances, and latent Dirichlet allocation was used to improve semantic representations. Our findings highlighted key differences in model performance. Among unimodal models, Bidirectional Encoder Representations from Transformers outperformed Visual Geometry Group 16, achieving higher accuracy and macro-F1-scores across all tasks.
Among multimodal models, CLIP achieved the highest accuracy (0.86) in negative sentiment detection, followed by VisualBERT (0.84). For positive sentiment, VisualBERT outperformed other models with the highest accuracy (0.76). In hate speech detection, the intermediate fusion model demonstrated the highest accuracy (0.91) with a macro-F1-score of 0.64, ensuring balanced performance. Meanwhile, VisualBERT performed best in antihate classification, achieving an accuracy of 0.78. Applying latent Dirichlet allocation and the synthetic minority oversampling technique improved minority class detection, particularly for antihate content. Overall, the intermediate fusion model provided the most balanced performance across tasks, while CLIP excelled in accuracy-driven classifications. Although VisualBERT performed well in certain areas, it struggled to maintain a precision-recall balance. These results emphasized the effectiveness of multimodal approaches over unimodal models in analyzing social media sentiment. This study contributes to the growing research on multimodal machine learning by demonstrating how advanced models, data augmentation techniques, and diverse datasets can enhance the analysis of social media content. The findings offer valuable insights for researchers, policy makers, and public health professionals seeking to leverage artificial intelligence for social media monitoring and addressing broader societal challenges.
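The gap reported here between accuracy (0.91) and macro-F1 (0.64) for hate speech detection is typical of imbalanced classes. The sketch below shows how the two metrics are computed and why they can diverge. The labels and predictions are invented toy data, not the study's results, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class, treating it as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Toy imbalanced data: 18 "not hate" posts (0) and 2 "hate" posts (1).
y_true = [0] * 18 + [1, 1]
# The classifier gets every majority post right but misses one minority post.
y_pred = [0] * 18 + [1, 0]

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy: {acc:.3f}")
print(f"macro-F1: {mf1:.3f}")
```

A single missed minority example barely moves accuracy but halves that class's recall, which pulls macro-F1 down. This is why oversampling techniques such as the one named in the abstract are usually judged on macro-averaged metrics rather than on accuracy alone.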
- Conference Article
45
- 10.1109/icmlc48188.2019.8949228
- Jul 1, 2019
Representation learning is fundamental and crucial for subsequent tasks such as classification, regression, and recognition. The goal of representation learning is to automatically learn good features with deep models. Multimodal representation learning is a special case of representation learning that automatically learns good features from multiple modalities; these modalities are not independent, and there are correlations and associations among them. Furthermore, multimodal data are usually heterogeneous. Because of these characteristics, multimodal representation learning poses many difficulties: how to combine multimodal data from heterogeneous sources; how to jointly learn features from multimodal data; how to effectively describe the correlations and associations, etc. These difficulties have attracted great interest from researchers, and along with the upsurge of deep learning, many deep multimodal learning methods have been proposed. In this paper, we present an overview of deep multimodal learning, especially the approaches proposed within the last decades. We provide potential readers with advances, trends and challenges, which can be very helpful to researchers in the field of machine learning, especially those engaged in the study of multimodal deep machine learning.
- Supplementary Content
14
- 10.1002/advs.202406242
- Sep 11, 2024
- Advanced Science
Multimodal machine learning, as a prospective advancement in artificial intelligence, endeavors to emulate the brain's multimodal learning abilities with the objective to enhance interactions with humans. However, this approach requires simultaneous processing of diverse types of data, leading to increased model complexity, longer training times, and higher energy consumption. Multimodal neuromorphic devices have the capability to preprocess spatio‐temporal information from various physical signals into unified electrical signals with high information density, thereby enabling more biologically plausible multimodal learning with low complexity and high energy‐efficiency. Here, this work conducts a comparison between the expression of multimodal machine learning and multimodal neuromorphic computing, followed by an overview of the key characteristics associated with multimodal neuromorphic devices. The bio‐plausible operational principles and the multimodal learning abilities of emerging devices are examined, which are classified into heterogeneous and homogeneous multimodal neuromorphic devices. Subsequently, this work provides a detailed description of the multimodal learning capabilities demonstrated by neuromorphic circuits and their respective applications. Finally, this work highlights the limitations and challenges of multimodal neuromorphic computing in order to hopefully provide insight into potential future research directions.
- Research Article
739
- 10.1162/neco_a_01273
- May 1, 2020
- Neural Computation
With the wide deployment of heterogeneous networks, huge amounts of data with characteristics of high volume, high variety, high velocity, and high veracity are generated. These data, referred to as multimodal big data, contain abundant intermodality and cross-modality information and pose vast challenges to traditional data fusion methods. In this review, we present some pioneering deep learning models to fuse these multimodal big data. With the increasing exploration of multimodal big data, there are still some challenges to be addressed. Thus, this review presents a survey on deep learning for multimodal data fusion to provide readers, regardless of their original community, with the fundamentals of multimodal deep learning fusion methods and to motivate new multimodal data fusion techniques of deep learning. Specifically, representative architectures that are widely used are summarized as fundamental to the understanding of multimodal deep learning. Then the current pioneering multimodal data fusion deep learning models are summarized. Finally, some challenges and future topics of multimodal data fusion deep learning models are described.
- Research Article
- 10.55041/ijsrem52491
- Sep 9, 2025
- International Journal of Scientific Research in Engineering and Management
Abstract: Data-driven healthcare offers unprecedented opportunities to enhance diagnosis, prediction, and personalized treatment by means of multi-modal learning. The present paper discusses the integration of electronic health records (EHR), medical images, and genomic data through multi-modal deep learning. By combining the complementary strengths of heterogeneous data sources, multi-modal models are able to capture richer feature representations and more complex patterns that are not visible with unimodal processing. We propose an end-to-end protocol to align, preprocess, and fuse modalities and demonstrate an application in which deep neural networks learn jointly from structured EHR data, high-dimensional imaging features, and gene expression data. Experiments show that the proposed model performs better on disease classification and patient stratification than its single-modality counterparts. The paper highlights the need for data alignment, imputation of missing modalities, and modality-specific representation learning to fully exploit multi-modal learning in the clinical context. Keywords: Multi-modal Learning, Electronic Health Records (EHR), Medical Imaging, Genomic Data, Deep Learning, Data Fusion, Healthcare AI, Precision Medicine, Patient Stratification, Biomedical Informatics.