HMMED: A Multimodal Model with Separate Head and Payload Processing for Malicious Encrypted Traffic Detection
Malicious encrypted traffic detection is a critical component of network security management. Previous detection methods fall into two classes: feature engineering methods, which construct traffic features by hand for classification, and end-to-end methods, which take the raw traffic as input and learn features for classification directly. With both classes, the obtained features cannot fully characterize the traffic. To this end, this paper proposes a hierarchical multimodal deep learning model (HMMED) for malicious encrypted traffic detection. The model uses the two feature generation methods to learn the features of the payload and the header, respectively, then fuses them to get the final traffic features, and finally feeds the fused features into a softmax classifier. In addition, because traditional deep learning depends heavily on training set size and data distribution, which yields models that generalize poorly and adapt badly to unseen encrypted traffic, the proposed model uses a large amount of unlabeled encrypted traffic in the pretraining layer to pretrain a submodel that produces a generic packet payload representation. Test results on the USTC-TFC2016 dataset show that the proposed model can effectively address the insufficient feature extraction of traditional detection methods and improve the accuracy (ACC) of malicious encrypted traffic detection.
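The fuse-then-classify step described in this abstract can be illustrated with a minimal sketch. It assumes that fusion is plain concatenation followed by a single linear layer, which the abstract does not state; the function names, weights, and toy feature values are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(header_feats, payload_feats, weights, biases):
    """Concatenate the two feature vectors (assumed fusion), apply one
    linear layer, and return softmax class probabilities."""
    fused = list(header_feats) + list(payload_feats)
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

# Hypothetical toy inputs: 2 header features, 3 payload features, 2 classes.
header = [0.2, 1.0]
payload = [0.5, -0.3, 0.8]
W = [[0.1, 0.4, -0.2, 0.3, 0.0],
     [-0.3, 0.2, 0.5, -0.1, 0.6]]
b = [0.0, 0.1]
probs = fuse_and_classify(header, payload, W, b)
```

In a trained model the weights would be learned jointly with the two feature extractors; here they are fixed only to keep the example self-contained.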
- Research Article
7
- 10.1007/s00259-024-07065-2
- Jan 27, 2025
- European journal of nuclear medicine and molecular imaging
To develop and validate a prostate-specific membrane antigen (PSMA) PET/CT based multimodal deep learning model for predicting pathological lymph node invasion (LNI) in prostate cancer (PCa) patients identified as candidates for extended pelvic lymph node dissection (ePLND) by preoperative nomograms. [68Ga]Ga-PSMA-617 PET/CT scan of 116 eligible PCa patients (82 in the training cohort and 34 in the test cohort) who underwent radical prostatectomy with ePLND were analyzed in our study. The Med3D deep learning network was utilized to extract discriminative features from the entire prostate volume of interest on the PET/CT images. Subsequently, a multimodal model i.e., Multi kernel Support Vector Machine was constructed to combine the PET/CT deep learning features, quantitative PET and clinical parameters. The performance of the multimodal models was assessed using final histopathology as the reference standard, with evaluation metrics including area under the receiver operating characteristic curve (AUC), calibration curve, decision curve analysis, and compared with available nomograms and PET/CT visual evaluation result. Our multimodal model incorporated clinical information, maximum standardized uptake value (SUVmax), and PET/CT deep learning features. The AUC for predicting LNI was 0.89 (95% confidence interval [CI] 0.81-0.97) for the final model. The proposed model demonstrated superior predictive accuracy in the test cohort compared to PET/CT visual evaluation result, the Memorial Sloan Kettering Cancer Center (MSKCC) and the Briganti-2017 nomograms (AUC 0.85 [95% CI 0.69-1.00] vs. 0.80 [95% CI 0.64-0.95] vs. 0.79 [95% CI 0.61-0.97] and 0.69 [95% CI 0.50-0.88], respectively). The proposed model showed similar calibration and higher net benefit as compared to the traditional nomograms. 
Our multimodal deep learning model, which incorporates preoperative PSMA PET/CT imaging, shows enhanced predictive capabilities for LNI in clinically localized PCa compared to PSMA PET/CT visual evaluation result and existing nomograms like the MSKCC and Briganti-2017 nomograms. This model has the potential to reduce unnecessary ePLND procedures while minimizing the risk of missing cases of LNI.
- Preprint Article
- 10.5194/egusphere-egu23-5818
- May 15, 2023
In general, water level prediction models based on deep learning have been developed using time-series water level observations from upstream and target water level stations, even though many kinds of physical data are needed to predict water level. Changes in water level are strongly affected by rainfall in the basin, so rainfall information is needed to predict the water level more accurately. In particular, radar data has the advantage of directly capturing the amount of rainfall occurring within a watershed. This study aims to develop multimodal deep learning models that predict the water level using 2D grid radar rainfall data and 1D time-series water level observation data. Two multimodal deep learning models with different structures are proposed. Both predict the water level by simultaneously using the water level observed up to the present time and the radar rainfall data that affects the future water level. The first model is a deep learning network that links 2D Average Pooling (AvgPool2D), which compresses 2D radar data to 1D data, with Long Short-Term Memory (LSTM), which predicts 1D time-series water level data. The second model is a deep learning network that predicts water levels by linking Conv2DLSTM and LSTM, which can reflect the characteristics of 2D radar data without deformation. The two proposed models were trained and evaluated in the upper basin of the Hantan River and compared with a single-modal LSTM that uses only water level data. There are three water level stations in the study area, and the objective was to predict the water level of the downstream station up to 180 minutes in advance. For training and verification of the deep learning models, 10-minute water level and radar rainfall data were collected from May 2019 to October 2021.
For the radar input, the grid cells within the target watershed were extracted from composite radar data with a resolution of 1 km operated by the Ministry of Environment. Evaluation of the trained models showed that the two multimodal models had higher prediction accuracy than the single-modal model using only water level data. In particular, the second proposed model (Conv2DLSTM+LSTM) had better predictive performance than the first proposed model (AvgPool2D+LSTM) at times of sudden rise in water level due to rainfall. Acknowledgments: Research for this paper was carried out under the KICT Research Program (project no. 202200175-001, Development of future-leading technologies solving water crisis against to water disasters affected by climate change) funded by the Ministry of Science and ICT.
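The compression of a 2D radar grid into a 1D feature vector by average pooling can be sketched as follows. This is a minimal illustration, assuming non-overlapping square windows and a simple row-major flatten; the window size, the helper names, and the toy rainfall grid are hypothetical and not taken from the study.

```python
def avg_pool_2d(grid, k):
    """Non-overlapping k x k average pooling.
    Trailing rows/columns that do not fill a full window are dropped."""
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows - k + 1, k):
        out_row = []
        for c in range(0, cols - k + 1, k):
            window = [grid[r + i][c + j] for i in range(k) for j in range(k)]
            out_row.append(sum(window) / (k * k))
        pooled.append(out_row)
    return pooled

def flatten(grid):
    """Row-major flatten of a 2D list into a 1D list."""
    return [v for row in grid for v in row]

# Hypothetical 4 x 4 radar rainfall grid for one time step (mm/h).
radar = [[0.0, 2.0, 4.0, 4.0],
         [2.0, 4.0, 0.0, 0.0],
         [1.0, 1.0, 3.0, 5.0],
         [1.0, 1.0, 7.0, 1.0]]
features = flatten(avg_pool_2d(radar, 2))  # [2.0, 2.0, 1.0, 4.0]
```

Repeating this per time step would give a 1D sequence per pooled cell that can be fed to a sequence model alongside the water level series. Pooling discards spatial detail inside each window, which is consistent with the abstract's remark that the convolutional variant keeps the 2D structure intact.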
- Research Article
1
- 10.2174/0115734056301741240903072017
- Dec 17, 2024
- Current medical imaging
This study aimed to establish a multimodal deep-learning network model to enhance the diagnosis of benign and malignant pulmonary ground glass nodules (GGNs). Retrospective data on pulmonary GGNs were collected from multiple centers across China, including North, Northeast, Northwest, South, and Southwest China. The data were divided into a training set and a validation set in an 8:2 ratio. In addition, a GGN dataset was also obtained from our hospital database and used as the test set. All patients underwent chest computed tomography (CT), and the final diagnosis of the nodules was based on postoperative pathological reports. The Residual Network (ResNet) was used to extract imaging data, the Word2Vec method for semantic information extraction, and the Self Attention method for combining imaging features and patient data to construct a multimodal classification model. Then, the diagnostic efficiency of the proposed multimodal model was compared with that of existing ResNet and VGG models and radiologists. The multicenter dataset comprised 1020 GGNs, including 265 benign and 755 malignant nodules, and the test dataset comprised 204 GGNs, with 67 benign and 137 malignant nodules. In the validation set, the proposed multimodal model achieved an accuracy of 90.2%, a sensitivity of 96.6%, and a specificity of 75.0%, which surpassed that of the VGG (73.1%, 76.7%, and 66.5%) and ResNet (78.0%, 83.3%, and 65.8%) models in diagnosing benign and malignant nodules. In the test set, the multimodal model accurately diagnosed 125 (91.18%) malignant nodules, outperforming radiologists (80.37% accuracy). Moreover, the multimodal model correctly identified 54 (accuracy, 80.70%) benign nodules, compared to radiologists' accuracy of 85.47%.
The consistency test comparing radiologists' diagnostic results with the multimodal model's results in relation to postoperative pathology showed strong agreement, with the multimodal model demonstrating closer alignment with gold standard pathological findings (Kappa=0.720, P<0.01). The multimodal deep learning network model exhibited promising diagnostic effectiveness in distinguishing benign and malignant GGNs and, therefore, holds potential as a reference tool to assist radiologists in improving the diagnostic accuracy of GGNs, potentially enhancing their work efficiency in clinical settings.
- Research Article
5
- 10.7717/peerj-cs.1460
- Jul 17, 2023
- PeerJ Computer Science
To compare the diagnostic efficiencies of single-modal and multi-modal deep learning models for the classification of benign and malignant breast mass lesions. We retrospectively collected data from 203 patients (207 lesions, 101 benign and 106 malignant) with breast tumors who underwent breast magnetic resonance imaging (MRI) before surgery or biopsy between January 2014 and October 2020. Mass segmentation was performed based on the three dimensions-region of interest (3D-ROI) minimum bounding cube at the edge of the lesion. We established single-modal convolutional neural network (CNN) models for T2WI, non-fs T1WI, and dynamic contrast-enhanced MRI (DCE-MRI) phases, where the first phase was pre-contrast T1WI (d1) and Phases 2, 4, and 6 were post-contrast T1WI (d2, d4, d6), as well as multi-modal fusion models with a Sobel operator (four_mods: T2WI, non-fs-T1WI, d1, d2). The lesions were divided into a training set (n=145), a validation set (n=22), and a test set (n=40). Five-fold cross validation was performed. Accuracy, sensitivity, specificity, negative predictive value, positive predictive value, and area under the ROC curve (AUC) were used as evaluation indicators. Delong's test compared the diagnostic performance of the multi-modal and single-modal models. All models showed good performance, and the AUC values were all greater than 0.750. Among the single-modal models, T2WI, non-fs-T1WI, d1, and d2 had specificities of 77.1%, 77.2%, 80.2%, and 78.2%, respectively. d2 had the highest accuracy of 78.5% and showed the best diagnostic performance with an AUC of 0.827. The multi-modal model with the Sobel operator performed better than single-modal models, with an AUC of 0.887, sensitivity of 79.8%, specificity of 86.1%, and positive prediction value of 85.6%.
Delong's test showed that the diagnostic performance of the multi-modal fusion models was higher than that of the six single-modal models (T2WI, non-fs-T1WI, d1, d2, d4, d6); the difference was statistically significant (p = 0.043, 0.017, 0.006, 0.017, 0.020, 0.004, all less than 0.05). Multi-modal fusion deep learning models with a Sobel operator had excellent diagnostic value in the classification of breast masses and may further increase the efficiency of diagnosis.
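For readers unfamiliar with the Sobel operator mentioned in this abstract, the sketch below computes the standard Sobel gradient magnitude on a small 2D image. How the study combined the Sobel output with the MRI sequences is not described in the abstract, so this shows only the operator itself; the function name and the toy image are hypothetical.

```python
import math

# Standard 3 x 3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(img):
    """Gradient magnitude at interior pixels only (no padding),
    so the output is 2 rows and 2 columns smaller than the input."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(1, h - 1):
        row = []
        for c in range(1, w - 1):
            gx = sum(SOBEL_X[i][j] * img[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * img[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out

# Hypothetical 4 x 4 image with a vertical edge between columns 1 and 2.
img = [[0, 0, 10, 10] for _ in range(4)]
edges = sobel_magnitude(img)  # [[40.0, 40.0], [40.0, 40.0]]
```

The response is large along the intensity step and zero in flat regions, which is why an edge map of this kind can emphasize lesion boundaries.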
- Book Chapter
6
- 10.1007/978-981-19-7874-6_46
- Jan 1, 2023
Diabetic retinopathy (DR) is one of the most important and embarrassing problems in the medical, psychological, and social aspects of the working-age population in the world. The DR severity classification problem for single modality (with image input) model and multi modality (with image and text inputs) model is considered on the basis of RetinaMNIST dataset. The influence of additional data like subjective “patient” opinion or “expert” opinions about patient health state (that provide “data leakage” on some classes) can be helpful in some practical situations. These opinions were simulated by additional (augmented) metadata from simulated questionnaires. As a result the following variants of input values and the correspondent models were prepared: single modality model (SM) with input images only, and multi modality models with input images and patient opinion text like Multi modality model with Patient opinion (MP), Multi modality model with Expert opinion (ME), and Multi modality model with Patient and Expert opinions (MPE). All these multi modality models (MP, ME, MPE) allowed us to reach the various statistically significant improvements of classification performance by AUC value for all classes in the range from 4% to 27% that are rather beyond the limits of the standard deviation of 2–3% measured by cross-validation and can be estimated as significant ones. In general, this approach based on metadata augmentation, namely, usage of the additional modalities with “data leakage” on the extreme classes, for example, with the lowest (Class 0) and highest (Class 4) DR severity, and their combinations could be useful strategy for the better classification of some hardly classified DR severities like Classes 1–3 here and in the more general context.
- Research Article
4
- 10.2196/72822
- May 12, 2025
- Journal of medical Internet research
A major challenge in sentiment analysis on social media is the increasing prevalence of image-based content, which integrates text and visuals to convey nuanced messages. Traditional text-based approaches have been widely used to assess public attitudes and beliefs; however, they often fail to fully capture the meaning of multimodal content where cultural, contextual, and visual elements play a significant role. This study aims to provide practical guidance for collecting, processing, and analyzing social media data using multimodal machine learning models. Specifically, it focuses on training and fine-tuning models to classify sentiment and detect hate speech. Social media data were collected from Facebook and Instagram using CrowdTangle, a public insights tool by Meta, and from X via its academic research application programming interface. The dataset was filtered to include only race-related terms and lesbian, gay, bisexual, transgender, queer, intersex, and asexual community-related posts with image attachments, ensuring focus on multimodal content. Human annotators labeled 13,000 posts into 4 categories: negative sentiment, positive sentiment, hate, or antihate. We evaluated unimodal (Bidirectional Encoder Representations from Transformers for text and Visual Geometry Group 16 for images) and multimodal (Contrastive Language-Image Pretraining [CLIP], Visual Bidirectional Encoder Representations from Transformers [VisualBERTs], and an intermediate fusion) models. To enhance model performance, the synthetic minority oversampling technique was applied to address class imbalances, and latent Dirichlet allocation was used to improve semantic representations. Our findings highlighted key differences in model performance. Among unimodal models, Bidirectional Encoder Representations from Transformers outperformed Visual Geometry Group 16, achieving higher accuracy and macro-F1-scores across all tasks.
Among multimodal models, CLIP achieved the highest accuracy (0.86) in negative sentiment detection, followed by VisualBERT (0.84). For positive sentiment, VisualBERT outperformed other models with the highest accuracy (0.76). In hate speech detection, the intermediate fusion model demonstrated the highest accuracy (0.91) with a macro-F1-score of 0.64, ensuring balanced performance. Meanwhile, VisualBERT performed best in antihate classification, achieving an accuracy of 0.78. Applying latent Dirichlet allocation and the synthetic minority oversampling technique improved minority class detection, particularly for antihate content. Overall, the intermediate fusion model provided the most balanced performance across tasks, while CLIP excelled in accuracy-driven classifications. Although VisualBERT performed well in certain areas, it struggled to maintain a precision-recall balance. These results emphasized the effectiveness of multimodal approaches over unimodal models in analyzing social media sentiment. This study contributes to the growing research on multimodal machine learning by demonstrating how advanced models, data augmentation techniques, and diverse datasets can enhance the analysis of social media content. The findings offer valuable insights for researchers, policy makers, and public health professionals seeking to leverage artificial intelligence for social media monitoring and addressing broader societal challenges.
- Research Article
- 10.1016/j.acra.2026.03.044
- Apr 1, 2026
- Academic radiology
Deep Learning-Based Multimodal Fusion of Ultrasound, Cytology, and Clinical Features to Distinguish Follicular Thyroid Carcinoma from Adenoma: A Multicenter Study.
- Research Article
- 10.1186/s12877-026-07005-9
- Jan 26, 2026
- BMC Geriatrics
Early detection and treatment of sarcopenia are crucial for improving patient outcomes, yet current diagnostic methods often lack the accuracy, accessibility, and efficiency needed for widespread clinical use. The aim of this study was to develop an accurate, secure, and evidence-based multimodal AI model using a point-of-care ultrasound (POCUS) framework combining muscle imaging properties with physical performance for sarcopenia diagnosis. The model uses clinical data and POCUS images. Clinical data consisted of age, gender, height, weight, body mass index (BMI) and data on physical performance by Short Physical Performance Battery (SPPB) scores. SPPB scores were chosen since it is recommended by both the European Working Group of Sarcopenia in Older People 2 and the Asian Working Group for Sarcopenia. POCUS data consisted of images from the dominant thigh, focusing on the rectus femoris muscle, using longitudinal and transverse projections. Various Machine Learning (ML) and Deep Learning (DL) algorithms and multimodal architectures were tested. Explainable AI (XAI) methods, including Grad-CAM for ultrasound images and feature-attribution analysis for clinical variables, were integrated to provide transparent interpretation of the multimodal model’s diagnostic decisions. The final model was implemented as part of the Sarcopenia Artificial Intelligence Diagnostic Decision Support System (SAID DSS). Participants (24) were mostly women (63%) with a mean age of 81 years (± 5.2), (age range: 71–91 years) a mean body mass index of 26 kg/m2 (± 4.1), and mean SPPB scores of 5 (± 1.6) and 9 (± 1.6) for sarcopenic and controls. 1060 and 2414 longitudinal and transverse ultrasound events for sarcopenic and control participants, respectively, were used, demonstrating a robust dataset despite the small number of participants. 
Comprehensive experimental results showed that a feature-level fusion technique using a multilayer perceptron network as classifier and Xception architectures for image feature extraction demonstrated the best performance. The final model yielded a diagnostic accuracy of 85%, an F1-score of 0.85 and an area under the curve (AUC) of 0.84, higher than previous models. This study is the first to introduce a clinically oriented, AI-based multimodal model for sarcopenia detection, demonstrating improved performance over existing approaches. In addition, we provided an explanation of the decisions generated by the best-performing detection model. By integrating this model into SAID DSS, we provide a practical and scalable tool with potential for direct application in clinical workflows, supporting early and accurate identification of sarcopenia.
- Research Article
- 10.21037/jtd-2025-1214
- Nov 24, 2025
- Journal of Thoracic Disease
Background: Globally, lung cancer is the most frequently diagnosed malignancy, for which solid pulmonary nodules (SPNs) are a common radiographic finding. Given the high false-positive rates of computed tomography (CT) screening, we aimed to develop a multimodal diagnostic model combining CT radiomics features and serum biomarkers via machine learning. Methods: This retrospective study included patients receiving both preoperative CT screening and serum biomarker testing. All pulmonary nodules (PNs) were divided into training and validation sets randomly at a ratio of 7:3. We developed a multimodal diagnosis model based on the CT radiomics and protein biomarkers of SPNs in the training cohort. The CT radiomics features were derived from the integration of traditional radiomics analysis methods and three-dimensional (3D) deep learning techniques. The accuracy of this multimodal diagnosis model for the prediction of SPNs was verified in the validation set. Model performances were evaluated in terms of the area under the curve (AUC), accuracy, positive predictive value (PPV), negative predictive value (NPV), decision curve analysis (DCA), and calibration curve. Results: Between February 2016 and December 2020, imaging data of 638 eligible PNs from CT scans of 633 different patients were collected. The multimodal model had satisfactory accuracy in differentiating benign and malignant SPNs in the training set [AUC =0.944; 95% confidence interval (CI): 0.924–0.964]. In the validation set, the multimodal model yielded an AUC of 0.926 (95% CI: 0.889–0.964), an accuracy of 0.885, an NPV of 0.812, and a PPV of 0.927.
The multimodal model also significantly outperformed the single-modality diagnostic models, including the traditional radiomics CT model (AUC =0.843; 95% CI: 0.780–0.906), the serum biomarker model (AUC =0.783; 95% CI: 0.718–0.847), and the 3D deep learning model (AUC =0.820; 95% CI: 0.754–0.885) (all P values <0.01). Conclusions: This study developed a novel multimodal model that demonstrated superior performance in classifying SPNs. It may thus enhance the diagnosis of benign and malignant lesions and provide support for clinical decision-making.
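The threshold-based metrics named in this abstract (accuracy, PPV, NPV) all derive from the four cells of a confusion matrix. The sketch below shows the standard definitions; the function name is hypothetical and the counts are made-up toy numbers, not the study's data.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a binary classifier.
    tp/fp/tn/fn are counts of true/false positives/negatives."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts for illustration only.
m = diagnostic_metrics(tp=8, fp=2, tn=6, fn=4)
```

Unlike sensitivity and specificity, PPV and NPV depend on the proportion of malignant cases in the evaluated set, so they should be read together with the cohort composition.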
- Research Article
4
- 10.1016/j.jdent.2023.104588
- Jun 21, 2023
- Journal of Dentistry
Multi-modal deep learning for automated assembly of periapical radiographs
- Research Article
- 10.1007/s00330-025-12315-4
- Jan 30, 2026
- European radiology
To develop and validate a multimodal deep learning model integrating clinical data, contrast-enhanced CT, and laryngoscopic images for differentiating early-stage (I-II) from advanced-stage (III-IV) laryngeal squamous cell carcinoma (LSCC). This retrospective multicenter study included 450 patients with pathologically confirmed LSCC from two Chinese medical centers. All patients had contrast-enhanced CT, white-light laryngoscopy, and clinical records. They were divided into training (n = 235), internal validation (n = 101), and external validation (n = 114) cohorts. Three single-modality models (CT-based deep learning [CT-DL], laryngoscopy-based multiple instance learning [L-MIL], and a clinical logistic regression model [CL]) and their combinations were compared. A feature-level fusion strategy was applied, and the final integrated multimodal model (CL + CT + L) was built using a stochastic gradient descent (SGD) classifier. Performance was evaluated by AUC, accuracy, sensitivity, specificity, calibration, and decision curve analysis (DCA), with prognostic value assessed by Kaplan-Meier and concordance index (C-index). A total of 450 patients were included (median age, 62 years [range, 31-88]; 365 men). The integrated multimodal model achieved AUCs of 0.902 (0.833-0.954) in the internal cohort and 0.888 (0.826-0.944) in the external cohort, outperforming all single- and dual-modality models (p < 0.05). Calibration and DCA confirmed strong consistency and clinical utility. The model categorized patients into distinct risk groups, which exhibited notable differences in progression-free survival (C-index = 0.584, p = 0.036). The integrated multimodal model showed high accuracy and generalizability for preoperative LSCC staging and may aid individualized treatment planning. Question Can a multimodal deep learning model combining clinical, CT, and laryngoscopic data improve preoperative staging accuracy of LSCC? 
Findings The integrated multimodal model achieved higher diagnostic accuracy and provided reliable prognostic stratification compared with conventional approaches. Clinical relevance This multimodal model offers a non-invasive, accurate, and generalizable tool for LSCC staging, supporting individualized treatment planning and enhancing patient management.
- Research Article
1
- 10.3389/fradi.2025.1698680
- Nov 25, 2025
- Frontiers in Radiology
Background: Aortic stenosis (AS) is diagnosed by echocardiography, the current gold standard, but examinations are often performed only after symptoms emerge, highlighting the need for earlier detection. Recently, artificial intelligence (AI)–based screening using non-invasive and widely available modalities such as electrocardiography (ECG) and chest x-ray (CXR) has gained increasing attention for valvular heart disease. However, single-modality approaches have inherent limitations, and in clinical practice, multimodality assessment is common. In this study, we developed a multimodal AI model integrating ECG and CXR within a cooperative learning framework to evaluate its utility for earlier detection of AS. Methods: We retrospectively analyzed 23,886 patient records from 7,483 patients who underwent ECG, CXR, and echocardiography. A multimodal model was developed by combining a 1D ResNet50–Transformer architecture for ECG data with an EfficientNet-based architecture for CXR. Cooperative learning was implemented using a loss function that allowed the ECG and CXR models to refine each other's predictions. We split the dataset into training, validation, and test sets, and performed 1,000 bootstrap iterations to assess model stability. AS was defined echocardiographically as peak velocity ≥2.5 m/s, mean pressure gradient ≥20 mmHg, or aortic valve area ≤1.5 cm2. Results: Among 7,483 patients, 608 (8.1%) were diagnosed with AS. The multimodal model achieved a test AUROC of 0.812 (95% CI: 0.792–0.832), outperforming the ECG model (0.775, 95% CI: 0.753–0.796) and the CXR model (0.755, 95% CI: 0.732–0.777). Visualization techniques (Grad-CAM, Transformer attention) highlighted distinct yet complementary features in AS patients. Conclusions: The multimodal AI model via cooperative learning outperformed single-modality methods in AS detection and may aid earlier diagnosis and reduce clinical burden.
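The echocardiographic labeling rule stated in the Methods is a disjunction of three thresholds and can be written as a simple predicate. This is a minimal sketch: the function and parameter names are hypothetical, while the thresholds and the patient counts are those given in the abstract.

```python
def meets_as_definition(peak_velocity_m_s, mean_gradient_mmhg, valve_area_cm2):
    """True if any one of the three echocardiographic criteria is met:
    peak velocity >= 2.5 m/s, mean pressure gradient >= 20 mmHg,
    or aortic valve area <= 1.5 cm2."""
    return (peak_velocity_m_s >= 2.5
            or mean_gradient_mmhg >= 20
            or valve_area_cm2 <= 1.5)

# Prevalence check using the counts reported in the abstract.
prevalence_pct = round(608 / 7483 * 100, 1)  # 8.1
```

Because the criteria are joined by "or", a single abnormal measurement is enough to label a record as AS, which makes the label more inclusive than a rule requiring all three.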
- Research Article
2
- 10.1016/j.clinimag.2024.110254
- Aug 9, 2024
- Clinical Imaging
Development of a multi-modal learning-based lymph node metastasis prediction model for lung cancer
- Research Article
21
- 10.1164/rccm.202304-0767oc
- Jul 15, 2024
- American journal of respiratory and critical care medicine
Rationale: The incidence of clinically undiagnosed obstructive sleep apnea (OSA) is high among the general population because of limited access to polysomnography. Computed tomography (CT) of craniofacial regions obtained for other purposes can be beneficial in predicting OSA and its severity. Objectives: To predict OSA and its severity based on paranasal CT using a three-dimensional deep learning algorithm. Methods: One internal dataset (N = 798) and two external datasets (N = 135 and N = 85) were used in this study. In the internal dataset, 92 normal participants and 159 with mild, 201 with moderate, and 346 with severe OSA were enrolled to derive the deep learning model. A multimodal deep learning model was elicited from the connection between a three-dimensional convolutional neural network-based part treating unstructured data (CT images) and a multilayer perceptron-based part treating structured data (age, sex, and body mass index) to predict OSA and its severity. Measurements and Main Results: In a four-class classification for predicting the severity of OSA, the AirwayNet-MM-H model (multimodal model with airway-highlighting preprocessing algorithm) showed an average accuracy of 87.6% (95% confidence interval [CI], 86.8-88.6%) in the internal dataset and 84.0% (95% CI, 83.0-85.1%) and 86.3% (95% CI, 85.3-87.3%) in the two external datasets, respectively. In the two-class classification for predicting significant OSA (moderate to severe OSA), the area under the receiver operating characteristic curve, accuracy, sensitivity, specificity, and F1 score were 0.910 (95% CI, 0.899-0.922), 91.0% (95% CI, 90.1-91.9%), 89.9% (95% CI, 88.8-90.9%), 93.5% (95% CI, 92.7-94.3%), and 93.2% (95% CI, 92.5-93.9%), respectively, in the internal dataset. 
Furthermore, the diagnostic performance of the AirwayNet-MM-H model outperformed that of the other six state-of-the-art deep learning models in terms of accuracy for both four- and two-class classifications and area under the receiver operating characteristic curve for two-class classification (P < 0.001). Conclusions: A novel deep learning model, including a multimodal deep learning model and an airway-highlighting preprocessing algorithm from CT images obtained for other purposes, can provide significantly precise outcomes for OSA diagnosis.
- Research Article
5
- 10.2215/cjn.0000000695
- Apr 15, 2025
- Clinical journal of the American Society of Nephrology : CJASN
Prior models for the early identification of acute kidney injury (AKI) have utilized structured data (e.g., vital signs and laboratory values). We aimed to develop and validate a deep learning model to predict moderate to severe AKI by combining structured data and information from unstructured notes. Adults (≥18 years) admitted to the University of Wisconsin (2009-20) and the University of Chicago Medicine (2016-22) were eligible for inclusion. Patients were excluded if they had no documented serum creatinine (SCr), end-stage kidney disease, an admission SCr ≥3.0 mg/dL, developed ≥Stage 2 AKI before reaching the wards or intensive care unit (ICU), or required dialysis (KRT) within the first 48 hours. Text from unstructured notes was mapped to standardized Concept Unique Identifiers (CUIs) to create predictor variables, and structured data variables were also included. An intermediate fusion deep learning recurrent neural network architecture was used to predict ≥Stage 2 AKI within the next 48 hours. This multimodal model was developed in the first 80% of the data and temporally validated in the next 20%. There were 339,998 admissions in the derivation cohort and 84,581 in the validation cohort, with 12,748 (3%) developing ≥Stage 2 AKI. Patients with ≥Stage 2 AKI were older, more likely to be male, had higher baseline SCr, and were more commonly in the ICU (p<0.001 for all). The multimodal model outperformed a model based only on structured data for all outcomes, with an area under the receiver operating characteristic curve (95% CI) of 0.88 (0.88-0.88) for predicting ≥Stage 2 AKI and 0.93 (0.93-0.94) for receiving KRT. The area under the precision-recall curve for ≥Stage 2 AKI was 0.20. Results were similar during external validation. We developed and validated a multimodal deep learning model using structured and unstructured data that predicts the development of severe AKI across the hospital stay for earlier intervention.
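The temporal derivation/validation split described in this abstract (first 80% for development, next 20% for validation) can be sketched as below. The abstract does not say how records were ordered or where exactly the cut was placed, so ordering by admission time and cutting by record count are assumptions; the function name, field names, and toy records are hypothetical.

```python
def temporal_split(records, key, train_fraction=0.8):
    """Sort records chronologically by `key`, then put the earliest
    `train_fraction` in the derivation set and the rest in validation."""
    ordered = sorted(records, key=key)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Hypothetical admissions with an integer admission day.
records = [{"id": i, "admit_day": d}
           for i, d in enumerate([5, 1, 9, 3, 7, 2, 8, 4, 10, 6])]
derivation, validation = temporal_split(records, key=lambda r: r["admit_day"])
```

Unlike a random split, every validation record is later than every derivation record, so the evaluation reflects how the model would perform on future admissions.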