Multi-modal Transformer for Indoor Human Action Recognition

Abstract

Indoor human action recognition is used in various fields. For example, it can recognize exercise movements in the fitness industry, which can significantly help improve people's health. With the development of sensors, it has become easy to acquire multiple data modalities (RGB, IR, depth, and skeleton) in the same scene. Because these modalities are complementary, proper fusion is beneficial for recognizing human actions. However, existing studies do not fully exploit the advantages of each modality. Therefore, in this work we propose a Multi-Modal Transformer (MMT) that uses RGB and skeleton data simultaneously. Using a transformer-based structure, MMT can capture the correlation between non-local joints in the skeleton data modality. In addition, MMT does not require additional training phases or multiple trained networks as the number of people in the scene changes. In experiments on public benchmark datasets, MMT shows comparable results using only eight input frames.
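
To illustrate what capturing correlation between non-local joints can mean in a transformer, the sketch below applies single-head scaled dot-product self-attention over per-joint feature vectors. It is not the authors' MMT: the function names, the identity query/key/value projections, and the four example joints are assumptions made purely for illustration, and a real model would learn its projections and use many more joints and frames.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(joints):
    """Single-head scaled dot-product self-attention over joint features.

    joints: list of J feature vectors, one per skeleton joint.
    Assumption: queries, keys, and values are the raw features
    (identity projections); a trained model would learn them.
    Returns (attended_features, attention_weights).
    """
    d = len(joints[0])
    scale = math.sqrt(d)
    weights = []
    for q in joints:
        # Every joint scores every other joint, adjacent in the skeleton or not.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in joints]
        weights.append(softmax(scores))
    attended = [
        [sum(w * v[c] for w, v in zip(row, joints)) for c in range(d)]
        for row in weights
    ]
    return attended, weights


# Hypothetical 2-D features for four joints (illustrative values only).
joints = [
    [0.9, 1.0],    # left hand
    [1.0, 0.9],    # right hand
    [0.0, 0.1],    # hip
    [-1.0, -0.9],  # left foot
]
attended, weights = joint_self_attention(joints)
```

In this toy input the two hands are not neighbours in the skeleton graph, yet the left hand's attention weight on the right hand is larger than its weight on the foot, because attention depends on feature similarity rather than on skeletal adjacency.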

Similar Papers
  • Book Chapter
  • Cite count: 26
  • 10.1007/978-3-031-25072-9_41
Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection
  • Jan 1, 2023
  • Wei-Yu Lee + 2 more

Pedestrian detection is an important challenge in computer vision due to its various applications. To achieve more accurate results, thermal images have been widely exploited as complementary information to assist conventional RGB-based detection. Although existing methods have developed numerous fusion strategies to utilize the complementary features, research that focuses on exploring features exclusive to each modality is limited. On this account, the features specific to one modality cannot be fully utilized and the fusion results could be easily dominated by the other modality, which limits the upper bound of discrimination ability. Hence, we propose the Cross-modality Attention Transformer (CAT) to explore the potential of modality-specific features. Further, we introduce the Multimodal Fusion Transformer (MFT) to identify the correlations between the modality data and perform feature fusion. In addition, a content-aware objective function is proposed to learn better feature representations. The experiments show that our method can achieve state-of-the-art detection performance on public datasets. The ablation studies also show the effectiveness of the proposed components. Keywords: Cross-modality fusion; Multimodal pedestrian detection; Transformer

  • Research Article
  • Cite count: 564
  • 10.1109/tpami.2022.3183112
Human Action Recognition From Various Data Modalities: A Review.
  • Jan 1, 2022
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Zehua Sun + 5 more

Human Action Recognition (HAR) aims to understand human behavior and assign a label to each action. It has a wide range of applications, and therefore has been attracting increasing attention in the field of computer vision. Human actions can be represented using various data modalities, such as RGB, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, radar, and WiFi signal, which encode different sources of useful yet distinct information and have various advantages depending on the application scenarios. Consequently, lots of existing works have attempted to investigate different types of approaches for HAR using various modalities. In this article, we present a comprehensive survey of recent progress in deep learning methods for HAR based on the type of input data modality. Specifically, we review the current mainstream deep learning methods for single data modalities and multiple data modalities, including the fusion-based and the co-learning-based frameworks. We also present comparative results on several benchmark datasets for HAR, together with insightful observations and inspiring future research directions.

  • Research Article
  • 10.59782/sidr.v2i1.122
Social Event Classification Based on Multimodal Masked Transformer Network
  • Oct 7, 2024
  • Scientific Insights and Discoveries Review
  • Chen Hong + 4 more

The key to multimodal social event classification is to fully and accurately utilize the features of both image and text modalities. However, most existing methods have the following limitations: (1) they simply concatenate the image features and text features of the event, and (2) there is irrelevant contextual information between different modalities, which leads to mutual interference. Therefore, it is not enough to only consider the relationship between the modalities of multimodal data, but also the irrelevant contextual information (i.e., regions or words) between the modalities. To overcome these limitations, a novel social event classification method based on multimodal masked transformer network (MMTN) is proposed. A better representation of text and image is learned through an image-text encoding network. Then, the obtained image and text representations are input into the multimodal masked transformer network to fuse the multimodal information, and the relationship between the modalities of multimodal information is modeled by calculating the similarity between the multimodal information, masking the irrelevant context between the modalities. Extensive experiments on two benchmark datasets show that the proposed multimodal masked transformer network model achieves state-of-the-art performance.

  • Research Article
  • Cite count: 3
  • 10.18178/joig.11.4.343-352
Residual Neural Networks for Human Action Recognition from RGB-D Videos
  • Dec 1, 2023
  • Journal of Image and Graphics
  • K Venkata Subbareddy + 3 more

Recently, RGB-D based Human Action Recognition (HAR) has gained significant research attention because different data modalities provide complementary information. However, current models still yield unsatisfactory results due to several problems, including noise and viewpoint variations between different actions. To address these problems, this paper proposes two new action descriptors, namely the Modified Depth Motion Map (MDMM) and the Spherical Redundant Joint Descriptor (SRJD). MDMM eliminates noise from depth maps and preserves only the action-related information. SRJD ensures resilience against viewpoint variations and reduces misclassifications between different actions with similar view properties. Further, to maximize recognition accuracy, a standard deep learning algorithm, the Residual Neural Network (ResNet), is trained on the features extracted from MDMM and SRJD. Simulation experiments show that multiple data modalities perform better than a single data modality. The proposed approach was tested on two public datasets, namely the NTU RGB+D dataset and the UTD-MHAD dataset. The results indicate that the proposed approach is superior to earlier HAR methods. On average, the proposed system achieved accuracies of 90.0442% and 92.3850% under Cross-subject and Cross-view validation, respectively.

  • Research Article
  • 10.1088/1361-6560/ae0be6
Foundation model based multimodal transformer framework for survival analysis in HER2 stratified breast cancer
  • Oct 8, 2025
  • Physics in Medicine & Biology
  • Qiang Li + 9 more

Objective. To improve survival prediction for HER2-positive breast cancer by integrating histopathological, molecular, and clinical data using a multimodal transformer framework. Approach. We propose a multimodal transformer framework for breast cancer survival prediction using HER2 stratified (SurvMBC), a foundation model-enhanced architecture that fuses three data modalities: whole-slide images, clinical narratives, and molecular features. Tumor microenvironment features are extracted using a pathology language and image pre-training (PLIP), clinical narratives are processed with BioBERT, and miRNA expression plus DNA methylation data are embedded using Gen2Vec. These representations are integrated through a cross-modal transformer with attention mechanisms for survival prediction. Main results. The model was evaluated on 1,095 HER2-positive breast cancer patients from The Cancer Genome Atlas. SurvMBC achieved a concordance index (C-index) of 0.857 (95% CI: 0.834, 0.880), a low integrated Brier score, and a strong inverse negative binomial log-likelihood. Risk stratification based on model outputs significantly separated high- and low-risk groups (log-rank p < 0.01) and showed strong associations with tumor stage, grade, and hormone receptor status (all p < 0.05). Significance. SurvMBC demonstrates the effectiveness of multimodal fusion in addressing tumor heterogeneity and improving prognostic accuracy. The attention-based integration enables context-aware learning of survival-relevant features across modalities, supporting individualized risk stratification and risk-adaptive treatment planning for HER2 stratified breast cancer patients.

  • Research Article
  • Cite count: 8
  • 10.1016/j.dib.2022.108564
AI based monitoring violent action detection data for in-vehicle scenarios
  • Aug 31, 2022
  • Data in Brief
  • Nelson R.P Rodrigues + 5 more

With the evolution of technology associated with mobility and autonomy, Shared Autonomous Vehicles will become a reality. To ensure passenger safety, there is a need for a monitoring system inside the vehicle capable of recognizing human actions. We introduce two datasets to train human action recognition inside the vehicle, focusing on violence detection. The InCar dataset captures violent actions against an in-car background, which gives more realistic data. The InVicon dataset, although it lacks the realistic background of the InCar dataset, provides skeleton (3D body joints) data. These datasets were recorded with RGB, Depth, Thermal, Event-based, and Skeleton data. The resulting dataset contains 6,400 video samples and more than 3 million frames, collected from sixteen distinct subjects. The dataset contains 58 action classes, including violent and neutral (i.e., non-violent) activities.

  • Book Chapter
  • Cite count: 2
  • 10.1007/978-3-031-20233-9_42
Gait Recognition with Various Data Modalities: A Review
  • Jan 1, 2022
  • Wei Li + 5 more

Gait recognition aims to recognize a subject by the way she/he walks without alerting the subject, and it has drawn increasing attention. Gait can be represented using various data modalities, such as RGB, skeleton, depth, infrared data, acceleration, gyroscope, etc., which have various advantages depending on the application scenarios. In this paper, we present a comprehensive survey of recent progress in gait recognition methods based on the type of input data modality. Specifically, we review commonly used gait datasets with different gait data modalities, followed by effective gait recognition methods for both single and multiple data modalities. We also present comparative results of effective gait recognition approaches, together with insightful observations and discussions. Keywords: Gait recognition; Sensor; Deep learning; Data modality

  • Research Article
  • 10.31449/inf.v49i20.10585
Two-Way Classroom Interaction Analysis via a Coupled ConvNeXt–Multimodal Transformer for Fine-grained Behavior Recognition
  • Dec 15, 2025
  • Informatica
  • Yuyan Huang + 1 more

With the deepening digital transformation of education, intelligent analysis of classroom teaching behavior has become key to improving teaching quality. Traditional methods struggle to integrate multi-source heterogeneous classroom data effectively and are limited in the joint modeling of spatiotemporal features. To this end, a bidirectional analysis framework coupling a multimodal transformer and a convolutional neural network (CNN) is proposed: ConvNeXt-T is used as the CNN backbone to extract the spatial features of teachers' body movements, students' postures, and scene layouts, while multimodal transformers capture the temporal dependence and cross-modal global correlation of teacher-student language interaction. The study uses 500 minutes of multimodal data from 10 real classrooms (4K camera, 30 frames per second, 900,000 frames in total) as the core dataset, annotates 7 types of behaviors such as teacher teaching, questioning, and student answering, and trains with the PyTorch framework on an NVIDIA GTX 4090 GPU, using AdamW as the optimizer and a mixed loss function to process 8 batches of data; the loss stabilizes at about 0.17 after 80 rounds of training. The results show that the accuracy of the multimodal fusion model is 90.2% in the behavior recognition task, which is significantly higher than that of the single-modal model. The spatio-temporal feature interaction module increases the detection rate of cross-modal correlation by 6.0% and effectively identifies the linkage between teachers' gesture pointing and students' responses. In the classification of teacher-student interaction, the F1 value of the model reached 88.4%, which was significantly higher than that of the benchmark model. In addition, the model generalizes well on public datasets, with an accuracy of 96.54% for NTU60-CV (cross-view), 98.30% for behavior recognition on UTD-MHAD, and an AUC value of 0.7478.
This framework provides new ideas for solving fine-grained behavior analysis in educational scenarios and provides technical support for intelligent teaching evaluation.

  • Research Article
  • 10.1158/1557-3265.sabcs24-ps11-08
Abstract PS11-08: MRI improves multi-modal AI system for breast cancer diagnosis and prognosis
  • Jun 13, 2025
  • Clinical Cancer Research
  • Yanqi Xu + 8 more

Background: MRI is the most sensitive imaging modality for breast cancer detection and is not affected by breast density. Screening MRI has higher specificity than mammography in high-risk populations, including women with a family history of breast cancer, BRCA1/2 mutations, and a personal history of breast cancer. The ACS screening guidelines recommend MRI supplemented with mammography for women at high risk (≥ 20%-25% lifetime risk). MRI is also used for diagnosing breast cancer when mammography and ultrasound are inconclusive. We investigate how MRI can improve cancer detection and risk prediction with a multi-modal AI system. Current standard-of-care risk models, such as the TC model, rely solely on clinical variables and do not account for the rich information in imaging data. Other existing AI systems typically analyze a single imaging modality, usually mammography. Our multi-modal transformer (MMT) learns from longitudinal imaging data of multiple modalities: FFDM, DBT, US and MRI. Methods: We utilized the NYU Multimodal Breast Cancer Dataset, comprising 1,372,455 exams from 298,670 patients (age 30-108, mean 56.55 years, SD 12.00 years) between 2010 and 2022, for MMT training and evaluation. Our objective is to predict whether a patient currently has cancer and, if not, assess the risk of developing cancer in the future, incorporating data from all available, present and prior, breast imaging. Our method involves three steps: (1) training modality-specific feature extractors separately to generate image-level and patch-level feature embeddings; (2) combining image embeddings with additional variables including age, modality, study date and view; (3) feeding the combined embeddings into a transformer for cancer prediction. The model outputs two predictions: the patient’s probability of having cancer and the patient’s risk of getting cancer within 5 years. Results: We evaluated our model on a subgroup of patients who had at least one MRI in their records.
The MMT model achieved an AUROC of 0.943 (95% CI: 0.935, 0.950) for cancer detection and 0.796 (95% CI: 0.765, 0.826) for 5-year risk prediction across all modalities. We separately compared our model’s AUROC on non-MRI exams and MRI exams with the corresponding baselines. For non-MRI exams, the MMT model with MRI data achieved an AUROC of 0.939 (95% CI: 0.929, 0.948) for cancer detection and 0.778 (95% CI: 0.742, 0.810) for 5-year risk prediction, which improved the baseline MMT model without MRI by 0.024 and 0.044 (two-sided DeLong’s test, P < 0.01 for both) respectively. These results demonstrate that incorporating MRI improves both cancer detection and risk prediction for non-MRI exams. For MRI exams, the MMT model achieved an AUROC of 0.947 (95% CI: 0.934, 0.958) for cancer detection, improving by 0.029 (two-sided DeLong’s test, P < 0.01) compared to an MRI-only baseline. This indicates that including prior imaging enhances the effectiveness of MRI in detecting cancer. However, for risk prediction on MRI exams, there was no significant improvement (ΔAUROC 0.004: two-sided DeLong’s test, P = 0.94). Additionally, MMT’s risk prediction AUROC on MRI exams was lower than other modalities (0.719, 95% CI: 0.615, 0.813), suggesting that MRI alone has less predictive power for future risk. Citation Format: Yanqi Xu, Jungkyu Park, Yiqiu Shen, Frank Yeung, Joe Cappadona, Jan Witowski, Linda Pak, Freya Schnabel, Krzysztof J. Geras. MRI improves multi-modal AI system for breast cancer diagnosis and prognosis [abstract]. In: Proceedings of the San Antonio Breast Cancer Symposium 2024; 2024 Dec 10-13; San Antonio, TX. Philadelphia (PA): AACR; Clin Cancer Res 2025;31(12 Suppl):Abstract nr PS11-08.

  • Research Article
  • Cite count: 22
  • 10.1109/bhi.2016.7455963
Integration of Multi-Modal Biomedical Data to Predict Cancer Grade and Patient Survival.
  • Feb 1, 2016
  • ... IEEE-EMBS International Conference on Biomedical and Health Informatics
  • John H Phan + 4 more

The Big Data era in Biomedical research has resulted in large-cohort data repositories such as The Cancer Genome Atlas (TCGA). These repositories routinely contain hundreds of matched patient samples for genomic, proteomic, imaging, and clinical data modalities, enabling holistic and multi-modal integrative analysis of human disease. Using TCGA renal and ovarian cancer data, we conducted a novel investigation of multi-modal data integration by combining histopathological image and RNA-seq data. We compared the performances of two integrative prediction methods: majority vote and stacked generalization. Results indicate that integration of multiple data modalities improves prediction of cancer grade and outcome. Specifically, stacked generalization, a method that integrates multiple data modalities to produce a single prediction result, outperforms both single-data-modality prediction and majority vote. Moreover, stacked generalization reveals the contribution of each data modality (and specific features within each data modality) to the final prediction result and may provide biological insights to explain prediction performance.

  • Research Article
  • Cite count: 269
  • 10.1109/tip.2018.2818328
Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection.
  • May 20, 2018
  • IEEE Transactions on Image Processing
  • Sijie Song + 4 more

Human action analytics has attracted a lot of attention for decades in computer vision. It is important to extract discriminative spatio-temporal features to model the spatial and temporal evolutions of different actions. In this paper, we propose a spatial and temporal attention model to explore the spatial and temporal discriminative features for human action recognition and detection from skeleton data. We build our networks based on the recurrent neural networks with long short-term memory units. The learned model is capable of selectively focusing on discriminative joints of skeletons within each input frame and paying different levels of attention to the outputs of different frames. To ensure effective training of the network for action recognition, we propose a regularized cross-entropy loss to drive the learning process and develop a joint training strategy accordingly. Moreover, based on temporal attention, we develop a method to generate the action temporal proposals for action detection. We evaluate the proposed method on the SBU Kinect Interaction data set, the NTU RGB + D data set, and the PKU-MMD data set, respectively. Experiment results demonstrate the effectiveness of our proposed model on both action recognition and action detection.

  • Research Article
  • 10.1200/jco.2025.43.16_suppl.4181
Multimodal machine learning predictions of treatment response and survival in advanced pancreatic cancer from the COMPASS trial.
  • Jun 1, 2025
  • Journal of Clinical Oncology
  • Wei Quan + 14 more

4181 Background: Pancreatic cancer is an aggressive malignancy with limited therapeutic options and a poor prognosis. Current approaches to prognostication are limited, especially in advanced disease. We explored whether machine learning integrating multi-modal data could predict outcomes in advanced pancreatic cancer. Methods: We developed and evaluated machine learning models predicting disease control rate and one-year survival from the COMPASS trial (NCT02750657). Data modalities included clinical features, histopathology, radiology, RNAseq, and whole-genome sequencing (WGS). After pre-processing, we applied LASSO and XGBoost to each modality and early and late fusion techniques. Hyperparameter tuning and performance assessment were performed using repeated nested cross-validation. The PurIST RNAseq classifier served as a baseline. Area under the curve (AUC) was the primary metric. Results: The cohort included 260 patients (105 female; median age 64 [IQR 58–70]; 141 treated with FOLFIRINOX, 97 with gemcitabine and nab-paclitaxel). 170 (65%) achieved disease control and 168 (65%) survived at least one year. The performance of the machine learning models is shown in the Table. Predictions from the unimodal models had limited correlation with each other (the maximum pairwise correlation averaged across folds was between clinical and histopathology models, 0.21). The late fusion models up-weighted data modalities with stronger unimodal performance. Conclusions: Multiple individual data modalities can predict outcomes in advanced pancreatic cancer, with PurIST serving as a strong baseline. Despite differing predictions across data modalities, multimodal integration did not improve prognostic performance in this cohort. AUC for the PurIST baseline, the top 2 unimodal models, and the best fusion model for each outcome. 
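| Outcome           | Data Modality  | AUC (95% confidence interval) |
|-------------------|----------------|-------------------------------|
| Disease control   | PurIST         | 0.69 (0.69, 0.70)             |
| Disease control   | Radiomics      | 0.75 (0.72, 0.79)             |
| Disease control   | RNAseq         | 0.71 (0.70, 0.72)             |
| Disease control   | Fusion (late)  | 0.71 (0.69, 0.73)             |
| One-year survival | PurIST         | 0.63 (0.62, 0.63)             |
| One-year survival | DNA mutations  | 0.64 (0.61, 0.66)             |
| One-year survival | RNAseq         | 0.57 (0.55, 0.60)             |
| One-year survival | Fusion (early) | 0.61 (0.56, 0.66)             |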

  • Dissertation
  • 10.12794/metadc2137646
Multiomics Data Integration and Multiplex Graph Neural Network Approaches
  • May 1, 2023
  • Ziynet Nesibe Kesimoglu

With increasing data and technology, multiple types of data from the same set of nodes have been generated. Since each data modality contains a unique aspect of the underlying mechanisms, multiple datatypes are integrated. In addition to multiple datatypes, networks are important to store information representing associations between entities such as genes of a protein-protein interaction network and authors of a citation network. Recently, some advanced approaches to graph-structured data leverage node associations and features simultaneously, called Graph Neural Network (GNN), but they have limitations for integrative approaches. The overall aim of this dissertation is to integrate multiple data modalities on graph-structured data to infer some context-specific gene regulation and predict outcomes of interest. To this end, first, we introduce a computational tool named CRINET to infer genome-wide competing endogenous RNA (ceRNA) networks. By integrating multiple data properly, we had a better understanding of gene regulatory circuitry addressing important drawbacks pertaining to ceRNA regulation. We tested CRINET on breast cancer data and found that ceRNA interactions and groups were significantly enriched in the cancer-related genes and processes. CRINET-inferred ceRNA groups supported the studies claiming the relation between immunotherapy and cancer. Second, we present SUPREME, a node classification framework, by comprehensively analyzing multiple data and associations between nodes with graph convolutions on multiple networks. Our results on survival analysis suggested that SUPREME could demystify the characteristics of classes with proper utilization of multiple data and networks. Finally, we introduce an attention-aware fusion approach, called GRAF, which fuses multiple networks and utilizes attention mechanisms on graph-structured data. 
Utilization of learned node- and association-level attention with network fusion allowed us to prioritize the edges properly, leading to improvement in the prediction results. Given the findings of all three tools and their outperformance over state-of-the-art methods, the proposed dissertation shows the importance of integrating multiple types of data and the exploitation of multiple graph structured data.

  • Research Article
  • Cite count: 14
  • 10.1016/j.eswa.2024.125642
Vision-based human action quality assessment: A systematic review
  • Nov 14, 2024
  • Expert Systems With Applications
  • Jiang Liu + 5 more

Human Action Quality Assessment (AQA), which aims to automatically evaluate the performance of actions executed by humans, is an emerging field of human action analysis. Although many review articles have been conducted for human action analysis fields such as action recognition and action prediction, there is a lack of up-to-date and systematic reviews related to AQA. This paper aims to provide a systematic literature review of existing papers on vision-based human AQA. This systematic review was rigorously conducted following the PRISMA guideline through the databases of Scopus, IEEE Xplore, and Web of Science in July 2024. Ninety-six research articles were selected for the final analysis after applying inclusion and exclusion criteria. This review presents an overview of various aspects of AQA, including existing applications, data acquisition methods, public datasets, state-of-the-art methods and evaluation metrics. We observe an increase in the number of studies in AQA since 2019, primarily due to the advent of deep learning methods and motion capture devices. We categorize these AQA methods into skeleton-based and video-based methods based on the data modality used. There are different evaluation metrics for various AQA tasks. SRC is the most commonly used evaluation metric, with fifty-six out of ninety-six selected papers using it to evaluate their models. Sports event scoring, surgical skill evaluation and rehabilitation assessment are the most popular three scenarios in this direction based on existing papers and there are more new scenarios being explored such as piano skill assessment. Furthermore, the existing challenges and future research directions are provided, which can be a helpful guide for researchers to explore AQA.

  • Research Article
  • Cite count: 5
  • 10.1016/j.neuroimage.2021.118854
A hierarchical Bayesian model to find brain-behaviour associations in incomplete data sets
  • Dec 29, 2021
  • Neuroimage
  • Fabio S Ferreira + 4 more

Canonical Correlation Analysis (CCA) and its regularised versions have been widely used in the neuroimaging community to uncover multivariate associations between two data modalities (e.g., brain imaging and behaviour). However, these methods have inherent limitations: (1) statistical inferences about the associations are often not robust; (2) the associations within each data modality are not modelled; (3) missing values need to be imputed or removed. Group Factor Analysis (GFA) is a hierarchical model that addresses the first two limitations by providing Bayesian inference and modelling modality-specific associations. Here, we propose an extension of GFA that handles missing data, and highlight that GFA can be used as a predictive model. We applied GFA to synthetic and real data consisting of brain connectivity and non-imaging measures from the Human Connectome Project (HCP). In synthetic data, GFA uncovered the underlying shared and specific factors and predicted correctly the non-observed data modalities in complete and incomplete data sets. In the HCP data, we identified four relevant shared factors, capturing associations between mood, alcohol and drug use, cognition, demographics and psychopathological measures and the default mode, frontoparietal control, dorsal and ventral networks and insula, as well as two factors describing associations within brain connectivity. In addition, GFA predicted a set of non-imaging measures from brain connectivity. These findings were consistent in complete and incomplete data sets, and replicated previous findings in the literature. GFA is a promising tool that can be used to uncover associations between and within multiple data modalities in benchmark datasets (such as, HCP), and easily extended to more complex models to solve more challenging tasks.
