A Hybrid Method for Traffic Flow Forecasting Using Multimodal Deep Learning
Traffic flow forecasting has been regarded as a key problem of intelligent transport systems. In this work, we propose a hybrid multimodal deep learning method for short-term traffic flow forecasting, which jointly and adaptively learns the spatial–temporal correlation features and long temporal interdependence of multi-modality traffic data through an attention-aided multimodal deep learning architecture. Given the highly nonlinear characteristics of multi-modality traffic data, the base module of our method consists of one-dimensional convolutional neural networks (1D CNN) and gated recurrent units (GRU) with the attention mechanism. The former captures the local trend features and the latter captures the long temporal dependencies. We then design a hybrid multimodal deep learning framework that fuses the shared representation features of traffic data from different modalities using multiple CNN-GRU-Attention modules. The experimental results indicate that the proposed multimodal deep learning model is capable of dealing with complex nonlinear urban traffic flow forecasting with satisfactory accuracy and effectiveness.
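As a rough illustration of the attention step mentioned in the abstract, the following minimal sketch pools a sequence of recurrent hidden states into one context vector using softmax weights. It is not the authors' architecture: the dot-product scoring, the `query` vector, and the example hidden states are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Score each time step by its dot product with `query`, then return
    the softmax weights and the weighted sum of the hidden states."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return weights, context


# Hypothetical hidden states for three time steps (two features each).
states = [[0.1, 0.0], [0.5, 0.2], [0.9, 0.4]]
weights, context = attention_pool(states, query=[1.0, 1.0])
```

In a trained model the scoring function would be learned, so time steps that matter more for the forecast receive larger weights.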
- Research Article
3
- 10.28932/jutisi.v1i3.414
- Dec 30, 2015
- Jurnal Teknik Informatika dan Sistem Informasi
Traffic flow forecasting is an important part of Intelligent Transportation Systems. Many methods have been developed for time series and traffic flow forecasting, such as Autoregressive Integrated Moving Average (ARIMA), Artificial Neural Network (ANN), and Support Vector Regression (SVR). SVR performance depends on the kernel function, the kernel parameters, and the characteristics of the data used. This research proposes a hybrid method for traffic flow data clustering and forecasting. Fuzzy C-means is used to minimize the variance in the whole dataset. Particle Swarm Optimization (PSO) is used to select appropriate parameters for SVR. Experimental results show that the proposed method gives a MAPE below 4% at all test sites. Keywords: fuzzy c-means, particle swarm optimization, traffic data prediction, support vector regression, time series.
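The clustering step can be illustrated with the membership formula of standard fuzzy C-means. This is a minimal sketch for scalar data, not the paper's implementation: the fuzzifier `m = 2`, the cluster centres, and the sample value are hypothetical.

```python
def fcm_memberships(value, centers, m=2.0):
    """Fuzzy C-means membership of a scalar `value` in each cluster centre:
    u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1)), with fuzzifier m > 1."""
    distances = [abs(value - c) for c in centers]
    # A point that coincides with a centre belongs fully to that centre.
    if 0.0 in distances:
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** power for d_k in distances) for d_j in distances]


# Hypothetical flow value (vehicles per hour) and three cluster centres.
memberships = fcm_memberships(120.0, centers=[100.0, 200.0, 400.0])
```

The memberships sum to one, and the nearest centre receives the largest share, which is what lets each observation contribute to several clusters at once.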
- Research Article
12
- 10.3390/app12157477
- Jul 26, 2022
- Applied Sciences
This estimation method operates by integrating the input values that are redundantly collected from heterogeneous devices through the selection of a representative value and estimating missing values by using a multimodal RNN. Users use a heterogeneous healthcare platform mainly in a mobile environment. Users who pay a relatively large amount of attention to healthcare possess various types of healthcare devices and collect data through their mobile devices. The collected data may be duplicated depending on the types of these devices. This data duplication causes an ambiguity issue in that it is difficult to determine which value among multiple data should be taken as the user’s actual value. Accordingly, it is necessary to create a neural network structure that considers the data value at the time previous to the current time. RNNs are appropriate for handling data with a time series characteristic. To learn an RNN-based neural network, learning data that have the same time step are required. Therefore, an RNN in which one variable becomes single-modal was designed for each learning run. In the RNN, a cell is a gated recurrent unit (GRU) cell that presents sufficient accuracy in the small resource environment of mobile devices. The RNNs that are learned according to the variables can each operate without additional learning, even if the situation of the user’s mobile device changes. In a heterogeneous environment, missing values are generated by various types of errors, including errors caused by battery charge and discharge, sensor failure, equipment exchange, and near-field communication errors. The higher the missing value ratio, the greater the number of errors that are likely to occur. For this reason, to achieve a more stable heterogeneous health platform, missing values must be considered. 
In this study, a missing value was estimated by means of multimodal deep learning; that is, a multimodal deep learning method was designed with one neural network that was connected with each learned single-modal RNN using a fully connected network (FCN). Each RNN input value delivers mutual influence through the weights of the FCN, and thereby, it is possible to estimate an output value even if any one of the input values is missing. According to the evaluation in terms of representative value selection, when a representative value was selected by using the mean or median, the most stable service was achieved. As a result of the evaluation according to the estimation method, the accuracy of the RNN-based multimodal deep learning method is 3.91%p higher than that of the SVD method.
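The representative-value selection described above can be sketched as follows. This is a minimal sketch assuming the median is used and that missing readings are represented as `None`; the function name and the heart-rate readings are hypothetical.

```python
from statistics import median


def representative_value(readings):
    """Collapse duplicate readings of one variable at one time step into a
    single value using the median; return None when every reading is missing."""
    present = [r for r in readings if r is not None]
    return median(present) if present else None


# Hypothetical heart-rate readings from three devices at the same time step.
heart_rate = representative_value([72, 75, None])
no_data = representative_value([None, None])
```

A time step that still has no value after this step is the case the abstract hands to the multimodal network for estimation.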
- Research Article
5
- 10.1109/access.2025.3556700
- Jan 1, 2025
- IEEE Access
Chemotherapy-induced cardiotoxicity presents a major risk to cancer patients, often leading to severe cardiac complications such as heart failure, myocardial infarction, and arrhythmias. Early detection is crucial for preventing long-term damage and improving patient outcomes, yet existing diagnostic methods struggle to effectively capture the complexity of multimodal medical data and often lack interpretability. In this study, we propose an innovative approach that integrates multimodal deep learning with Explainable AI (XAI) techniques to enhance early cardiotoxicity detection. Our model combines clinical data (e.g., age and cardiovascular metrics) with Tissue Doppler Imaging (TDI), a functional imaging technique that captures myocardial velocity during the cardiac cycle. To overcome data limitations, we employed Conditional Generative Adversarial Networks (cGANs) and Conditional Tabular Generative Adversarial Networks (CTGANs) to augment the dataset, improving its diversity and balance for better model training. We developed three architectures that integrate Convolutional Neural Networks (CNNs) for feature extraction from TDI images with Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Transformer models to capture temporal dependencies and enhance prediction accuracy. Additionally, we incorporated SHapley Additive Explanations (SHAP) to interpret the contribution of input features, increasing model transparency and clinical applicability. Our Transformer-based model achieved the highest accuracy of 96%, outperforming the GRU (94%) and LSTM (89%) models, significantly surpassing traditional approaches. These findings highlight the potential of transformer-based architectures in multimodal deep learning for precise cardiotoxicity prediction, supporting early intervention and personalized treatment strategies while improving interpretability through XAI techniques such as SHAP.
- Dissertation
- 10.32657/10356/182346
- Jan 1, 2025
Multimodal learning, which enables neural networks to process and integrate information from various sensory modalities such as vision, language, and sound, has become increasingly important in applications ranging from affective computing and healthcare to advanced multimodal chatbots. Despite its potential, multimodal learning faces significant challenges, particularly in the area of data efficiency. The requirement for large, high-quality datasets from multiple modalities presents a substantial barrier, limiting the scalability and accessibility of large multimodal models. This dissertation addresses several key issues in data-efficient deep multimodal learning, focusing on imbalanced multimodal data selection, the cold-start problem in multimodal active learning, and the mitigation of hallucinations in large vision-language models. Firstly, we analyze the limitations of conventional active learning strategies, which tend to favor dominant modalities, leading to unbalanced multimodal models that neglect weaker modalities. To overcome this, we propose a gradient embedding modulation method that ensures a more equitable data selection process across modalities, resulting in models that fairly utilize both weak and strong modalities. Building on our work in warm-start active learning, we tackle the cold-start problem in multimodal active learning, where no initial labels are available for warm-start data selection. We develop a two-stage approach that first reduces the modality representation gap through multimodal self-supervised learning, utilizing unimodal prototypes to harmonize representations across modalities. In the subsequent data selection stage, we introduce a regularization term to maximize modality alignment, leading to improved model performance using the same amount of data compared to existing methods.
Extending our focus from data selection to the usage of training data, we address the challenge of hallucinations in large vision-language models, where the models generate content that is incorrect in the context of input images. We investigate the relationship between hallucinations and the visual dependence of tokens, revealing that certain tokens contribute disproportionately to these hallucinations. Based on this insight, we propose an approach that adjusts training weights according to the visual dependence of tokens, thereby reducing the hallucination rate without requiring additional training data or inference costs. The contributions of this thesis offer significant advancements in the field of data-efficient multimodal learning. By developing novel methods for balancing multimodal data selection, addressing the cold-start problem in multimodal active learning, and mitigating hallucinations in large vision-language models, this work paves the way for more practical and scalable multimodal learning systems that require less data and computational effort while achieving superior performance.
- Supplementary Content
28
- 10.1093/genetics/iyae161
- Nov 5, 2024
- Genetics
Deep learning methods have been applied when working to enhance the prediction accuracy of traditional statistical methods in the field of plant breeding. Although deep learning seems to be a promising approach for genomic prediction, it has proven to have some limitations, since its conventional methods fail to leverage all available information. Multimodal deep learning methods aim to improve the predictive power of their unimodal counterparts by introducing several modalities (sources) of input information. In this review, we introduce some theoretical basic concepts of multimodal deep learning and provide a list of the most widely used neural network architectures in deep learning, as well as the available strategies to fuse data from different modalities. We mention some of the available computational resources for the practical implementation of multimodal deep learning problems. We finally performed a review of applications of multimodal deep learning to genomic selection in plant breeding and other related fields. We present a meta-picture of the practical performance of multimodal deep learning methods to highlight how these tools can help address complex problems in the field of plant breeding. We discussed some relevant considerations that researchers should keep in mind when applying multimodal deep learning methods. Multimodal deep learning holds significant potential for various fields, including genomic selection. While multimodal deep learning displays enhanced prediction capabilities over unimodal deep learning and other machine learning methods, it demands more computational resources. Multimodal deep learning effectively captures intermodal interactions, especially when integrating data from different sources. To apply multimodal deep learning in genomic selection, suitable architectures and fusion strategies must be chosen. It is relevant to keep in mind that multimodal deep learning, like unimodal deep learning, is a powerful tool but should be carefully applied.
Given its predictive edge over traditional methods, multimodal deep learning is valuable in addressing challenges in plant breeding and food security amid a growing global population.
- Research Article
160
- 10.1145/3545572
- Feb 17, 2023
- ACM Transactions on Multimedia Computing, Communications, and Applications
Deep learning has been applied in a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This article focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. A detailed analysis of the baseline approaches and an in-depth study of recent advancements during the past five years (2017 to 2021) in multimodal deep learning applications are provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Last, the main issues are highlighted separately for each domain, along with their possible future research directions.
- Conference Article
- 10.1109/sm57895.2023.10112492
- Mar 19, 2023
The ability to predict traffic flow over time for crowded areas during rush hours is increasingly important, as it can help authorities make informed decisions for congestion mitigation or scheduling of infrastructure development in an area. However, a crucial challenge in traffic flow forecasting is the slow shifting in temporal peaks between daily and weekly cycles, which makes the traffic flow signal nonstationary and hard to forecast accurately. To address this challenge, we propose a slow-shifting-aware machine learning method for traffic flow forecasting, which includes two parts. First, we use Empirical Mode Decomposition as a feature engineering step to alleviate the nonstationarity of traffic flow data, yielding a series of stationary components. Second, given the strength of Long Short-Term Memory networks in capturing temporal features, an advanced traffic flow forecasting model is developed by taking the stationary components as inputs. Finally, we apply this method to a benchmark of real-world data and provide a comparison with other existing methods. Our proposed method outperforms the state-of-the-art results by 14.55% and 62.56% using the metrics of root mean squared error and mean absolute percentage error, respectively.
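For reference, the two metrics named in the abstract can be computed as in the minimal sketch below. The sample values are hypothetical and are not taken from the paper's benchmark.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.
    Assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical observed and forecast flows (vehicles per hour).
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
error_rmse = rmse(actual, predicted)
error_mape = mape(actual, predicted)
```

RMSE is expressed in the units of the flow itself, while MAPE is scale-free, which is why the two can report very different relative improvements for the same model.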
- Conference Article
6
- 10.1109/cac51589.2020.9327749
- Nov 6, 2020
Short-term traffic flow forecasting is at the heart of intelligent transportation systems (ITS). Accurate traffic flow prediction can help people choose their trip mode and trip time. Although the gated recurrent unit (GRU) performs well in traffic flow forecasting, setting its hyperparameters by experience reduces the predictive performance of the model. This study uses an improved particle swarm optimization (IPSO) algorithm with an adaptive learning strategy to optimize the hyperparameters of the GRU model. The characteristics of traffic data with network topology are matched by this algorithm, so the accuracy of traffic flow prediction can be improved. To verify the reliability of this algorithm, this study constructs an IPSO-GRU model using traffic flow data from the California Department of Transportation and compares it with other traffic flow forecasting models. The experimental results show that the IPSO-GRU model achieves lower mean squared error (MSE), mean absolute percentage error (MAPE), and mean absolute error (MAE) than the conventional GRU model.
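The hyperparameter search can be illustrated with a plain particle swarm optimizer. This is a minimal sketch of basic PSO, not the improved adaptive variant (IPSO) from the abstract. The objective `toy_validation_error` is a hypothetical stand-in for training a GRU and measuring its validation error, and the coefficient values are common textbook choices rather than the authors' settings.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=60, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Basic particle swarm optimisation over a box-bounded search space."""
    rng = random.Random(seed)
    dim = len(bounds)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in positions]            # personal best positions
    best_val = [objective(p) for p in positions]    # personal best values
    g = min(range(n_particles), key=best_val.__getitem__)
    global_pos, global_val = best_pos[g][:], best_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (inertia * velocities[i][d]
                                    + c_personal * r1 * (best_pos[i][d] - positions[i][d])
                                    + c_global * r2 * (global_pos[d] - positions[i][d]))
                # Keep the particle inside the search bounds.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            value = objective(positions[i])
            if value < best_val[i]:
                best_pos[i], best_val[i] = positions[i][:], value
                if value < global_val:
                    global_pos, global_val = positions[i][:], value
    return global_pos, global_val


def toy_validation_error(params):
    """Hypothetical error surface over (hidden units, learning rate)."""
    hidden_units, learning_rate = params
    return ((hidden_units - 64.0) / 64.0) ** 2 + ((learning_rate - 0.01) / 0.01) ** 2


search_bounds = [(8.0, 256.0), (0.0001, 0.1)]
best_params, best_error = pso_minimize(toy_validation_error, search_bounds)
```

In practice each objective call would train and validate a model, so the particle count and iteration budget dominate the cost of the search.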
- Research Article
4
- 10.1016/j.jdent.2023.104588
- Jun 21, 2023
- Journal of Dentistry
Multi-modal deep learning for automated assembly of periapical radiographs
- Research Article
400
- 10.1007/s00371-021-02166-7
- Jun 10, 2021
- The Visual Computer
The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.
- Research Article
- 10.36962/pahtei53052025-102
- Apr 30, 2025
- PAHTEI-Proceedings of Azerbaijan High Technical Educational Institutions
The increasing complexity of urban transportation systems and the growing volume of vehicles have made traffic congestion a persistent challenge in modern cities. Efficient traffic flow prediction is essential for mitigating congestion, improving road safety, optimizing traffic signal control, and enhancing overall transportation efficiency. In recent years, artificial intelligence (AI) has emerged as a transformative tool in the field of traffic management, offering sophisticated algorithms capable of modeling, analyzing, and predicting complex traffic patterns with high accuracy. The application of AI in traffic flow prediction leverages vast amounts of real-time and historical data to generate precise forecasts, supporting data-driven decision-making by urban planners and traffic control authorities. The prediction of traffic flow involves analyzing time-series data that exhibit nonlinear, dynamic, and often stochastic behavior. Traditional statistical models, such as autoregressive integrated moving average (ARIMA), have proven to be limited in handling the high dimensionality and variability inherent in traffic systems. In contrast, AI algorithms possess the capacity to learn and adapt from complex data inputs without the need for explicit programming, making them particularly suitable for traffic-related applications. AI algorithms used in traffic flow prediction can be broadly categorized into machine learning (ML) and deep learning (DL) approaches. Machine learning algorithms such as k-nearest neighbors (KNN), support vector machines (SVM), decision trees, and random forests have demonstrated effectiveness in short-term traffic prediction tasks. These algorithms are capable of identifying hidden patterns in traffic data and adjusting to changes in traffic behavior over time. Ensemble methods, which combine the strengths of multiple learning models, further enhance prediction accuracy and robustness. 
Deep learning algorithms, a subfield of AI inspired by the human brain’s neural architecture, have shown exceptional performance in capturing spatial-temporal dependencies in traffic data. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM) networks and gated recurrent units (GRUs), are widely used for their ability to process sequential data and retain information over extended time intervals. Convolutional neural networks (CNNs) are employed to extract spatial features from traffic sensor data or road network imagery. Hybrid models that integrate CNNs with RNNs have achieved high levels of predictive precision by simultaneously learning spatial and temporal correlations. In addition to supervised learning methods, unsupervised and reinforcement learning techniques are also applied in traffic flow prediction. Clustering algorithms, such as k-means and DBSCAN, assist in identifying traffic patterns, while reinforcement learning models optimize adaptive traffic signal control systems by learning optimal actions through environmental interaction. This study explores the different types of AI algorithms used in traffic flow prediction, examining their theoretical foundations, structural differences, and practical applications. It aims to evaluate the comparative advantages of various algorithms in addressing the challenges of real-time traffic prediction in increasingly complex transportation networks. Keywords: Machine Learning, Deep Learning, Neural Networks, Regression Models, Reinforcement Learning
- Research Article
798
- 10.1109/tits.2006.869623
- Mar 1, 2006
- IEEE Transactions on Intelligent Transportation Systems
A new approach based on Bayesian networks for traffic flow forecasting is proposed. In this paper, traffic flows among adjacent road links in a transportation network are modeled as a Bayesian network. The joint probability distribution between the cause nodes (data utilized for forecasting) and the effect node (data to be forecasted) in a constructed Bayesian network is described as a Gaussian mixture model (GMM) whose parameters are estimated via the competitive expectation maximization (CEM) algorithm. Finally, traffic flow forecasting is performed under the criterion of minimum mean square error (MMSE). The approach departs from many existing traffic flow forecasting models in that it explicitly includes information from adjacent road links to analyze the trends of the current link statistically. Furthermore, it also encompasses the issue of traffic flow forecasting when incomplete data exist. Comprehensive experiments on urban vehicular traffic flow data of Beijing and comparisons with several other methods show that the Bayesian network is a very promising and effective approach for traffic flow modeling and forecasting, both for complete data and incomplete data.
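Under the MMSE criterion the forecast is the conditional mean of the effect node given the cause nodes. For a joint Gaussian mixture this has a closed form, sketched below for a single scalar cause node. The two mixture components are hypothetical, and parameter estimation (the CEM step in the abstract) is not shown.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_mmse_forecast(x, components):
    """Return E[y | x] under a joint Gaussian mixture over (x, y).
    Each component is (weight, mean_x, mean_y, var_x, cov_xy)."""
    responsibilities, conditional_means = [], []
    for weight, mean_x, mean_y, var_x, cov_xy in components:
        # How strongly this component explains the observed x.
        responsibilities.append(weight * gaussian_pdf(x, mean_x, var_x))
        # Conditional mean of y given x within this Gaussian component.
        conditional_means.append(mean_y + cov_xy / var_x * (x - mean_x))
    total = sum(responsibilities)
    return sum(r / total * m for r, m in zip(responsibilities, conditional_means))


# Hypothetical two-regime mixture: x is the upstream flow, y the flow to forecast.
components = [
    (0.6, 300.0, 310.0, 400.0, 360.0),
    (0.4, 800.0, 780.0, 900.0, 810.0),
]
forecast = gmm_mmse_forecast(320.0, components)
```

With several cause nodes the same expression holds with vector means and covariance matrices in place of the scalars used here.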
- Research Article
40
- 10.1109/access.2022.3202976
- Jan 1, 2022
- IEEE Access
During the past decade, social media platforms have been extensively used during a disaster for information dissemination by the affected community and humanitarian agencies. Although many studies have been done recently to classify the informative and non-informative messages from social media posts, most are unimodal, i.e., have independently used textual or visual data to build the deep learning models. In the present study, we integrate the complementary information provided by the text and image messages about the same event posted by the affected community on the social media platform Twitter and build a multimodal deep learning model based on the concept of attention mechanism. The attention mechanism is a recent breakthrough that has revolutionized the field of deep learning. Just as humans pay more attention to a specific part of the text or image, ignoring the rest, neural networks can also be trained to concentrate on more relevant features through the attention mechanism. We propose a novel Cross-Attention Multi-Modal (CAMM) deep neural network for classifying multimodal disaster data, which uses the attention mask of the textual modality to highlight the features of the visual modality. We compare CAMM with unimodal models and the most popular bilinear multimodal models, MUTAN and BLOCK, generally used for visual question answering. CAMM achieves an average F1-score of 84.08%, better than the MUTAN and BLOCK methods by 6.31% and 5.91%, respectively. The proposed cross-attention-based multimodal deep learning method outperforms the current state-of-the-art fusion methods on the benchmark multimodal disaster dataset by highlighting the more relevant cross-domain features of text and image tweets. This study affirms that social media platforms become a rich source of multimodal data during a disaster. 
This data can be utilized to build automated tools for quick filtration of informative messages to assess the post-disaster needs of the affected community and provide timely help.
- Research Article
58
- 10.1016/j.imavis.2025.105509
- May 1, 2025
- Image and Vision Computing
Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, differently from early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review comprehensively analyzes and formalizes current intermediate fusion methods in biomedical applications, highlighting their effectiveness in improving predictive performance and capturing complex inter-modal relationships. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a novel structured notation that standardizes intermediate fusion architectures, enhancing understanding and facilitating implementation across various domains. Our findings provide actionable insights and practical guidelines intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL. • Comprehensive review of intermediate fusion in multimodal learning in biomedicine. • Structured notation for categorizing intermediate fusion methods. • Analysis of the benefits and challenges of intermediate fusion in biomedical contexts. • Identification of future research directions for improving current fusion techniques. • Versatile framework applicable to other multimodal deep learning domains.
- Research Article
17
- 10.3389/frai.2023.1247195
- Oct 27, 2023
- Frontiers in artificial intelligence
Hepatocellular carcinoma is a malignant neoplasm of the liver and a leading cause of cancer-related deaths worldwide. Multimodal data combines several modalities, such as medical images, clinical parameters, and electronic health record (EHR) reports, from diverse sources to accomplish the diagnosis of liver cancer. The introduction of deep learning models with multimodal data can enhance the diagnosis and improve physicians' decision-making for cancer patients. This scoping review explores the use of multimodal deep learning techniques (i.e., combining medical images and EHR data) in the diagnosis and prognosis of hepatocellular carcinoma (HCC) and cholangiocarcinoma (CCA). A comprehensive literature search was conducted in six databases, along with forward and backward reference list checking of the included studies. The PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) extension for scoping reviews guidelines were followed for the study selection process. The data was extracted and synthesized from the included studies through thematic analysis. Ten studies were included in this review. These studies utilized multimodal deep learning to predict and diagnose hepatocellular carcinoma (HCC), but no studies examined cholangiocarcinoma (CCA). Four imaging modalities (CT, MRI, WSI, and DSA) and 51 unique EHR records (clinical parameters and biomarkers) were used in these studies. The most frequently used medical imaging modalities were CT scans followed by MRI, whereas the most common EHR parameters used were age, gender, alpha-fetoprotein (AFP), albumin, coagulation factors, and bilirubin. Ten unique deep-learning techniques were applied to both EHR modalities and imaging modalities for two main purposes, prediction and diagnosis. The use of multimodal data and deep learning techniques can help in the diagnosis and prediction of HCC.
However, there is a limited number of works and available datasets for liver cancer, thus limiting the overall advancements of AI for liver cancer applications. Hence, more research should be undertaken to explore further the potential of multimodal deep learning in liver cancer applications.