Multi-modal deep learning for automated assembly of periapical radiographs
- Preprint Article
- 10.5194/egusphere-egu23-5818
- May 15, 2023
Deep learning models for water level prediction have generally been developed using time-series observations from upstream and target water level stations, even though many other physical data are needed to predict water level. Changes in water level are strongly affected by rainfall in the basin, so rainfall information is needed to predict the water level more accurately. Radar data in particular has the advantage of directly capturing the amount of rainfall occurring within a watershed. This study aims to develop a multimodal deep learning model that predicts the water level using 2D gridded radar rainfall data and 1D time-series water level observations. Two multimodal deep learning models with different structures are proposed. Both predict the water level by simultaneously using the water level observed up to the present time and the radar rainfall data that affects the future water level. The first model links 2D average pooling (AvgPool2D), which compresses 2D radar data into 1D data, with a Long Short-Term Memory (LSTM) network, which predicts the 1D water level time series. The second model links Conv2DLSTM and LSTM, which can reflect the characteristics of the 2D radar data without deformation. The two proposed models were trained and evaluated in the upper basin of the Hantan River and compared with a single-modal LSTM that uses only water level data. There are three water level stations in the study area, and the objective was to predict the water level at the downstream station up to 180 minutes in advance. For training and validation of the models, 10-minute water level and radar rainfall data were collected from May 2019 to October 2021. For the radar input, the grid cells within the target watershed were extracted from composite radar data with a resolution of 1 km operated by the Ministry of Environment. In the evaluation, both multimodal models achieved higher prediction accuracy than the single-modal model using only water level data. In particular, the second proposed model (Conv2DLSTM+LSTM) had better predictive performance than the first proposed model (AvgPool2D+LSTM) during sudden rises in water level caused by rainfall.
Acknowledgments: Research for this paper was carried out under the KICT Research Program (project no. 202200175-001, Development of future-leading technologies solving water crisis against to water disasters affected by climate change) funded by the Ministry of Science and ICT.
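The AvgPool2D step in the first model above can be illustrated with a minimal, dependency-free sketch. It only shows how a 2D rainfall grid might be pooled and joined with station water levels into one per-time-step feature vector for a sequence model. The function names, the pool size of 2, and the toy values are assumptions for illustration, not the authors' implementation.

```python
from statistics import fmean


def avg_pool_2d(grid, pool):
    """Non-overlapping average pooling over a 2D grid given as a list of rows."""
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows - pool + 1, pool):
        pooled_row = []
        for c in range(0, cols - pool + 1, pool):
            window = [grid[r + i][c + j] for i in range(pool) for j in range(pool)]
            pooled_row.append(fmean(window))
        pooled.append(pooled_row)
    return pooled


def build_step_features(radar_grid, water_levels, pool=2):
    """Flatten the pooled radar grid and append the station water levels.

    The result is one 1D feature vector for a single time step; a sequence
    of such vectors is what an LSTM-style model would consume.
    """
    pooled = avg_pool_2d(radar_grid, pool)
    flat = [value for row in pooled for value in row]
    return flat + list(water_levels)


# Toy 4x4 rainfall grid and three station water levels (hypothetical values).
radar = [
    [0.0, 2.0, 4.0, 4.0],
    [2.0, 4.0, 4.0, 4.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 8.0],
]
levels = [1.2, 1.5, 1.9]
features = build_step_features(radar, levels)
print(features)  # [2.0, 4.0, 1.0, 2.0, 1.2, 1.5, 1.9]
```

Pooling shrinks the radar input but discards spatial detail inside each window, which is consistent with the abstract's note that the Conv2DLSTM variant keeps the 2D structure instead.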
- Research Article
13
- 10.1007/s00261-024-04202-1
- Mar 3, 2024
- Abdominal radiology (New York)
This study investigated the value of a multimodal deep learning (MDL) model based on computed tomography (CT) and magnetic resonance imaging (MRI) for predicting microvascular invasion (MVI) in hepatocellular carcinoma (HCC). A total of 287 patients with HCC from our institution and 58 patients from a separate institution were included. Among these, 119 patients with only CT data and 116 patients with only MRI data were selected for single-modality deep learning model development, after which selected parameters were migrated for MDL model development with transfer learning (TL). In addition, 110 patients with simultaneous CT and MRI data were divided into a training cohort (n = 66) and a validation cohort (n = 44). We input the features extracted from DenseNet121 into an extreme learning machine (ELM) classifier to construct a classification model. The area under the curve (AUC) of the MDL model was 0.844, which was superior to that of the single-phase CT models (AUC = 0.706-0.776, P < 0.05), the single-sequence MRI models (AUC = 0.706-0.717, P < 0.05), the single-modality DL models (all-phase CT AUC = 0.722, all-sequence MRI AUC = 0.731; P < 0.05), and the clinical model (AUC = 0.648, P < 0.05), but not to that of the delay phase (DP) and in-phase (IP) MRI and portal venous phase (PVP) CT models. The MDL model achieved better performance than the models described above (P < 0.05). When combined with clinical features, the AUC of the MDL model increased from 0.844 to 0.871. A nomogram combining deep learning signatures (DLS) and clinical indicators for MDL models demonstrated a greater overall net gain than the MDL models (P < 0.05). The MDL model is a valuable noninvasive technique for preoperatively predicting MVI in HCC.
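Since the comparison above rests on AUC values, a short sketch of how an AUC can be computed may help. It uses the pairwise (Mann-Whitney) formulation: the AUC is the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half. The function name and the toy labels and scores are hypothetical and unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative case."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: 1 = MVI present, 0 = MVI absent (hypothetical values).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 0.889
```

The quadratic loop is fine for small cohorts; a rank-based version would scale better for large datasets.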
- Research Article
2
- 10.1186/s13058-025-02129-z
- Jan 1, 2025
- Breast Cancer Research : BCR
Background: Proper stratification of recurrence risk in breast cancer is crucial for guiding treatment decisions. This study aims to predict the recurrence risk of breast cancer patients using a multimodal deep learning model that integrates multiple sequence MRI imaging features with clinicopathologic characteristics. Methods: In this retrospective study, we enrolled 574 patients with non-metastatic invasive breast cancer from two Chinese institutions between September 2012 and July 2019. We developed a multimodal deep learning (MDL) model by constructing a multi-instance learning framework based on convolutional neural networks. We integrated imaging features from T2WI, DWI, and DCE-MRI sequences with clinicopathologic features for breast cancer recurrence risk stratification. Subsequently, the performance of the MDL model was evaluated using receiver operating characteristic (ROC) curves, the Hosmer–Lemeshow test, calibration curves, and decision curve analysis (DCA). Survival analysis was conducted with Kaplan–Meier survival curves to stratify breast cancer patients into high- and low-recurrence risk groups. Time-dependent ROC curves were used to assess 3-year, 5-year, and 7-year recurrence-free survival (RFS) for breast cancer patients. Additionally, we performed differential and enrichment analyses on Oncotype DX genes. We correlated these genes with clinicopathologic features and deep-learning radiographic features using univariate Cox regression and Pearson correlation analysis. Results: The MDL model demonstrated good performance in predicting breast cancer recurrence risk and accurately differentiated between high- and low-recurrence risk groups, with an AUC as high as 0.915 (95% CI 0.8448–0.9856). The C-index of prediction models was 0.803 in the testing cohort. The AUCs for 5-year and 7-year RFS were 0.936 (95% CI 0.876–0.997) and 0.956 (95% CI 0.902–1.000) in the validation cohort. In the testing cohort, these AUCs were 0.836 (95% CI 0.763–0.909) and 0.783 (95% CI 0.676–0.891). This study found a significant correlation between Oncotype DX gene expression, clinicopathologic features, and deep-learning radiographic features (p < 0.05). Conclusions: This study validated the robust predictive accuracy of the MDL model in identifying high- and low-risk groups for recurrence. The correlations identified between Oncotype DX genes, clinicopathologic features, and deep-learning radiographic features offer novel insights for future biomarker research in breast cancer. Supplementary Information: The online version contains supplementary material available at 10.1186/s13058-025-02129-z.
- Research Article
19
- 10.1007/s00330-022-09031-8
- Aug 27, 2022
- European Radiology
The prediction of primary treatment failure (PTF) is necessary for patients with diffuse large B-cell lymphoma (DLBCL) since it serves as a prominent means for improving front-line outcomes. Using interim 18F-fluoro-2-deoxyglucose ([18F]FDG) positron emission tomography/computed tomography (PET/CT) imaging data, we aimed to construct multimodal deep learning (MDL) models to predict possible PTF in low-risk DLBCL. Initially, 205 DLBCL patients undergoing interim [18F]FDG PET/CT scans and the front-line standard of care were included in the primary dataset for model development. Then, 44 other patients were included in the external dataset for generalization evaluation. Based on the powerful backbone of the Conv-LSTM network, we incorporated five different multimodal fusion strategies (pixel intermixing, separate channel, separate branch, quantitative weighting, and hybrid learning) to make full use of PET/CT features and built five corresponding MDL models. Moreover, we found the best model, that is, the hybrid learning model, and optimized it by integrating the contrastive training objective to further improve its prediction performance. The final model with contrastive objective optimization, named the contrastive hybrid learning model, performed best, with an accuracy of 91.22% and an area under the receiver operating characteristic curve (AUC) of 0.926, in the primary dataset. In the external dataset, its accuracy and AUC remained at 88.64% and 0.925, respectively, indicating its good generalization ability. The proposed model achieved good performance, validated the predictive value of interim PET/CT, and holds promise for directing individualized clinical treatment.
• The proposed multimodal models achieved accurate prediction of primary treatment failure in DLBCL patients.
• Using an appropriate feature-level fusion strategy can make the same class close to each other regardless of the modal heterogeneity of the data source domain and positively impact the prediction performance.
• Deep learning validated the predictive value of interim PET/CT in a way that exceeded human capabilities.
- Research Article
26
- 10.3390/s22197328
- Sep 27, 2022
- Sensors
This paper introduces a new dataset of a surgical knot-tying task, and a multi-modal deep learning model that achieves comparable performance to expert human raters on this skill assessment task. Seventy-two surgical trainees and faculty were recruited for the knot-tying task, and were recorded using video, kinematic, and image data. Three expert human raters conducted the skills assessment using the Objective Structured Assessment of Technical Skill (OSATS) Global Rating Scale (GRS). We also designed and developed three deep learning models: a ResNet-based image model, a ResNet-LSTM kinematic model, and a multi-modal model leveraging the image and time-series kinematic data. All three models demonstrate performance comparable to the expert human raters on most GRS domains. The multi-modal model demonstrates the best overall performance, as measured using the mean squared error (MSE) and intraclass correlation coefficient (ICC). This work is significant since it demonstrates that multi-modal deep learning has the potential to replicate human raters on a challenging human-performed knot-tying task. The study demonstrates an algorithm with state-of-the-art performance in surgical skill assessment. As objective assessment of technical skill continues to be a growing, but resource-heavy, element of surgical education, this study is an important step towards automated surgical skill assessment, ultimately leading to reduced burden on training faculty and institutes.
- Research Article
15
- 10.1186/s12911-021-01700-w
- Nov 27, 2021
- BMC Medical Informatics and Decision Making
Background: An increase in the incidence of central venous catheter (CVC)-associated deep venous thrombosis (CADVT) has been reported in pediatric patients over the past decade. At the same time, current screening guidelines for venous thromboembolism risk have low sensitivity for CADVT in hospitalized children. This study utilized a multimodal deep learning model to predict CADVT before it occurs. Methods: Children who were admitted to intensive care units (ICUs) between December 2015 and December 2018 and had CVC placement for at least 3 days were included. The variables analyzed included demographic characteristics, clinical conditions, laboratory test results, vital signs and medications. A multimodal deep learning (MMDL) model that can handle temporal data using long short-term memory (LSTM) and gated recurrent units (GRUs) was proposed for this prediction task. Four benchmark machine learning models, logistic regression (LR), random forest (RF), gradient boosting decision tree (GBDT) and a published cutting edge MMDL, were used to compare and evaluate the models with a fivefold cross-validation approach. Accuracy, recall, area under the ROC curve (AUC), and average precision (AP) were used to evaluate the discrimination of each model at three time points (24 h, 48 h and 72 h) before CADVT occurred. Brier score and Spiegelhalter’s z test were used to measure the calibration of these prediction models. Results: A total of 1830 patients were included in this study, and approximately 15% developed CADVT. In the CADVT prediction task, the model proposed in this paper significantly outperforms both traditional machine learning models and existing multimodal deep learning models at all 3 time points. It achieved 77% accuracy and 90% recall at 24 h before CADVT was discovered. It can be used to accurately predict the occurrence of CADVT 72 h in advance with an accuracy of greater than 75%, a recall of more than 87%, and an AUC value of 0.82. Conclusion: In this study, a machine learning method was successfully established to predict CADVT in advance. These findings demonstrate that artificial intelligence (AI) could provide measures for thromboprophylaxis in a pediatric intensive care setting.
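The Brier score named above as a calibration measure is the mean squared difference between predicted probabilities and observed binary outcomes, so lower values indicate better calibrated and sharper predictions. The sketch below is illustrative only; the function name and the toy outcomes and probabilities are hypothetical.

```python
def brier_score(labels, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if not labels or len(labels) != len(probs):
        raise ValueError("labels and probs must be non-empty and the same length")
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Toy example: 1 = event occurred, 0 = no event (hypothetical values).
labels = [1, 0, 0, 1]
probs = [0.9, 0.2, 0.4, 0.6]
score = brier_score(labels, probs)
print(round(score, 4))  # 0.0925
```

A model that always predicts 0.5 scores 0.25 on any binary dataset, which gives a rough reference point when reading reported Brier scores.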
- Research Article
3
- 10.1007/s10278-025-01566-8
- Jun 24, 2025
- Journal of imaging informatics in medicine
This study aimed to develop and validate a multimodal deep learning model that leverages 2D grayscale ultrasound (US) images alongside readily available clinical data to improve diagnostic performance for ovarian cancer (OC). A retrospective analysis was conducted involving 1899 patients who underwent preoperative US examinations and subsequent surgeries for adnexal masses between 2019 and 2024. A multimodal deep learning model was constructed for OC diagnosis and extracting US morphological features from the images. The model's performance was evaluated using metrics such as receiver operating characteristic (ROC) curves, accuracy, and F1 score. The multimodal deep learning model exhibited superior performance compared to the image-only model, achieving areas under the curves (AUCs) of 0.9393 (95% CI 0.9139-0.9648) and 0.9317 (95% CI 0.9062-0.9573) in the internal and external test sets, respectively. The model significantly improved the AUCs for OC diagnosis by radiologists and enhanced inter-reader agreement. Regarding US morphological feature extraction, the model demonstrated robust performance, attaining accuracies of 86.34% and 85.62% in the internal and external test sets, respectively. Multimodal deep learning has the potential to enhance the diagnostic accuracy and consistency of radiologists in identifying OC. The model's effective feature extraction from ultrasound images underscores the capability of multimodal deep learning to automate the generation of structured ultrasound reports.
- Dissertation
- 10.32657/10356/182346
- Jan 1, 2025
Multimodal learning, which enables neural networks to process and integrate information from various sensory modalities such as vision, language, and sound, has become increasingly important in applications ranging from affective computing and healthcare to advanced multimodal chatbots. Despite its potential, multimodal learning faces significant challenges, particularly in the area of data efficiency. The requirement for large, high-quality datasets from multiple modalities presents a substantial barrier, limiting the scalability and accessibility of large multimodal models. This dissertation addresses several key issues in data-efficient deep multimodal learning, focusing on imbalanced multimodal data selection, the cold-start problem in multimodal active learning, and the mitigation of hallucinations in large vision-language models. Firstly, we analyze the limitations of conventional active learning strategies, which tend to favor dominant modalities, leading to unbalanced multimodal models that neglect weaker modalities. To overcome this, we propose a gradient embedding modulation method that ensures a more equitable data selection process across modalities, resulting in models that fairly utilize both weak and strong modalities. Building on our work in warm-start active learning, we tackle the cold-start problem in multimodal active learning, where no initial labels are available for warm-start data selection. We develop a two-stage approach that first reduces the modality representation gap through multimodal self-supervised learning, utilizing unimodal prototypes to harmonize representations across modalities. In the subsequent data selection stage, we introduce a regularization term to maximize modality alignment, leading to improved model performance using the same amount of data compared to existing methods. Extending our focus from data selection to the usage of training data, we address the challenge of hallucinations in large vision-language models, where the models generate content that is incorrect in the context of input images. We investigate the relationship between hallucinations and the visual dependence of tokens, revealing that certain tokens contribute disproportionately to these hallucinations. Based on this insight, we propose an approach that adjusts training weights according to the visual dependence of tokens, thereby reducing the hallucination rate without requiring additional training data or inference costs. The contributions of this thesis offer significant advancements in the field of data-efficient multimodal learning. By developing novel methods for balancing multimodal data selection, addressing the cold-start problem in multimodal active learning, and mitigating hallucinations in large vision-language models, this work paves the way for more practical and scalable multimodal learning systems that require less data and computational effort while achieving superior performance.
- Research Article
81
- 10.1016/j.isprsjprs.2021.11.023
- Dec 6, 2021
- ISPRS Journal of Photogrammetry and Remote Sensing
Performance of deep learning in mapping water quality of Lake Simcoe with long-term Landsat archive
- Research Article
1
- 10.1158/1538-7445.am2024-2313
- Mar 22, 2024
- Cancer Research
Purpose: Cancer patients routinely undergo radiologic and pathologic evaluation for their diagnostic workup. These data modalities represent a valuable and readily available resource for developing new prognostic tools. Given their vast difference in spatial scales, effective methods to integrate the two modalities are currently lacking. Here, we aim to develop a multi-modal approach to integrate radiology and pathology images for predicting outcomes in cancer patients. Methods: We propose a multi-modal weakly-supervised deep learning framework to integrate radiology and pathology images for survival prediction. We first extract multi-scale features from whole-slide H&E-stained pathology images to characterize cellular and tissue phenotypes as well as spatial cellular organization. We then build a hierarchical co-attention transformer to effectively learn the multi-modal interactions between radiology and pathology image features. Finally, a multimodal risk score is derived by combining complementary information from the two image modalities and clinical data for predicting outcome. We evaluate our approach in lung, gastric, and brain cancers with matched radiology and pathology images and clinical data available, each with separate training and external validation cohorts. Results: The multi-modal deep learning models achieved a reasonably high accuracy for predicting survival outcomes in the external validation cohorts (C-index range: 0.72-0.75 across three cancer types). The multi-modal prognostic models significantly improved upon single-modal approaches based on radiology or pathology images or clinical data alone (C-index range: 0.53-0.71, P < 0.01). The multi-modal deep learning models were significantly associated with disease-free survival and overall survival (hazard ratio range: 3.23-4.46, P < 0.0001). In multivariable analyses, the models remained an independent prognostic factor (P < 0.01) after adjusting for clinicopathological variables including cancer stage and tumor differentiation. Conclusions: The proposed multi-modal deep learning approach outperforms traditional methods for predicting survival outcomes by leveraging routinely available radiology and pathology images. With further independent validation, this may afford a promising approach to improve risk stratification and better inform treatment strategies for cancer patients. Citation Format: Zhe Li, Yuming Jiang, Ruijiang Li. Multi-modal deep learning to predict cancer outcomes by integrating radiology and pathology images [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 2313.
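The C-index reported above measures how well a risk score orders patients by survival time. A minimal sketch of Harrell's concordance index follows: a pair is comparable when the patient with the shorter follow-up time had an observed event, and it is concordant when that patient also has the higher predicted risk. The function name and toy data are hypothetical, and tied survival times are simply skipped here, which is a simplifying assumption.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  -- follow-up time per patient
    events -- 1 if the event was observed, 0 if censored
    risks  -- predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risks count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical values); the third patient is censored.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risks)
print(round(c_index, 3))  # 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is how ranges such as 0.53-0.71 versus 0.72-0.75 can be read.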
- Research Article
- 10.3389/fonc.2026.1767885
- Mar 27, 2026
- Frontiers in oncology
This study aimed to develop a multimodal deep learning (MD DL) model integrating multiphasic computed tomography (CT) with clinical and laboratory parameters to predict early recurrence of hepatocellular carcinoma (HCC) following liver transplantation. A retrospective analysis was conducted on 147 patients with HCC who underwent liver transplantation at Tianjin First Central Hospital between June 2014 and September 2022. Patients were categorized into recurrence (n = 40) and non-recurrence (n = 107) groups. Independent risk factors for early recurrence were identified to construct a clinical-imaging model. Deep learning models were developed using both single-phase and multiphasic CT images. High-dimensional imaging features were combined with clinicoradiological parameters to establish the MD DL model. Model performance was evaluated using receiver operating characteristic curves and the DeLong test, while interpretability was assessed through SHapley Additive explanation (SHAP) analysis. Independent risk factors for early recurrence included platelet count, alpha-fetoprotein levels > 400 ng/mL, ascites, arterial peritumoral enhancement, and portal vein tumor thrombus. The MD DL model achieved area under the curve values of 0.972, 0.885, and 0.985 in the training, validation, and test sets, respectively. These values indicated significantly superior predictive performance compared with other models (all p < 0.05). SHAP analysis identified key predictive features contributing to model performance. The MD DL model integrating multiphasic CT and clinical parameters demonstrated high predictive accuracy for early recurrence of HCC after liver transplantation, with diagnostic performance exceeding that of conventional models.
- Research Article
1
- 10.1101/2025.08.08.25333333
- Aug 12, 2025
- medRxiv
Objective: Poor outcomes in acute respiratory distress syndrome (ARDS) can be alleviated with tools that support early diagnosis. Current machine learning methods for detecting ARDS do not take full advantage of the multimodality of ARDS pathophysiology. We developed a multimodal deep learning model that uses imaging data, continuously collected ventilation data, and tabular data derived from a patient’s electronic health record (EHR) to make ARDS predictions. Materials and Methods: A chest radiograph (x-ray), at least two hours of ventilator waveform data (VWD) within the first 24 hours of intubation, and EHR-derived tabular data were used from 220 patients admitted to the ICU to train a deep learning model. The model uses pretrained encoders for the x-rays and ventilation data and trains a feature extractor on tabular data. Encoded features for a patient are combined to make a single ARDS prediction. Ablation studies for each modality assessed their effect on the model’s predictive capability. Results: The trimodal model achieved an area under the receiver operating characteristic curve (AUROC) of 0.86 with a 95% confidence interval of 0.01. This was a statistically significant improvement (p<0.05) over single modality models and bimodal models trained on VWD+tabular and VWD+x-ray data. Discussion and Conclusion: Our results demonstrate the potential utility of using deep learning to address complex conditions with heterogeneous data. More work is needed to determine the additive effect of modalities on ARDS detection. Our framework can serve as a blueprint for building performant multimodal deep learning models for conditions with small, heterogeneous datasets.
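The abstract above does not say how the encoded features are combined. One common choice is late fusion by concatenation followed by a small prediction head, and the sketch below shows that idea only. The concatenation strategy, the logistic head, the function name, and all weights and feature values are assumptions for illustration, not the authors' architecture.

```python
import math


def fuse_and_predict(image_feat, waveform_feat, tabular_feat, weights, bias):
    """Concatenate per-modality feature vectors and apply a logistic head.

    Assumes each encoder has already produced a fixed-length vector; the
    weights and bias would normally be learned during training.
    """
    fused = list(image_feat) + list(waveform_feat) + list(tabular_feat)
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused feature length")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical encoded features for one patient and hand-picked weights.
prob = fuse_and_predict(
    image_feat=[0.5, -1.0],
    waveform_feat=[2.0],
    tabular_feat=[1.0, 0.0],
    weights=[1.0, 0.5, 0.25, -0.5, 2.0],
    bias=0.0,
)
print(prob)  # 0.5
```

An ablation in this setup amounts to dropping one modality's vector (and its weights) and comparing the resulting predictions.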
- Research Article
27
- 10.2196/54363
- May 2, 2024
- Journal of Medical Internet Research
Background: Clinical notes contain contextualized information beyond structured data related to patients’ past and current health status. Objective: This study aimed to design a multimodal deep learning approach to improve the evaluation precision of hospital outcomes for heart failure (HF) using admission clinical notes and easily collected tabular data. Methods: Data for the development and validation of the multimodal model were retrospectively derived from 3 open-access US databases, including the Medical Information Mart for Intensive Care III v1.4 (MIMIC-III) and MIMIC-IV v1.0, collected from a teaching hospital from 2001 to 2019, and the eICU Collaborative Research Database v1.2, collected from 208 hospitals from 2014 to 2015. The study cohorts consisted of all patients with critical HF. The clinical notes, including chief complaint, history of present illness, physical examination, medical history, and admission medication, as well as clinical variables recorded in electronic health records, were analyzed. We developed a deep learning mortality prediction model for in-hospital patients, which underwent complete internal, prospective, and external evaluation. The Integrated Gradients and SHapley Additive exPlanations (SHAP) methods were used to analyze the importance of risk factors. Results: The study included 9989 (16.4%) patients in the development set, 2497 (14.1%) patients in the internal validation set, 1896 (18.3%) in the prospective validation set, and 7432 (15%) patients in the external validation set. The area under the receiver operating characteristic curve of the models was 0.838 (95% CI 0.827-0.851), 0.849 (95% CI 0.841-0.856), and 0.767 (95% CI 0.762-0.772), for the internal, prospective, and external validation sets, respectively. The area under the receiver operating characteristic curve of the multimodal model outperformed that of the unimodal models in all test sets, and tabular data contributed to higher discrimination. The medical history and physical examination were more useful than other factors in early assessments. Conclusions: The multimodal deep learning model for combining admission notes and clinical tabular data showed promising efficacy as a potentially novel method in evaluating the risk of mortality in patients with HF, providing more accurate and timely decision support.
- Research Article
23
- 10.3390/app122010405
- Oct 15, 2022
- Applied Sciences
Many forms of air pollution increase as science and technology rapidly advance. In particular, fine dust harms the human body, causing or worsening heart and lung-related diseases. In this study, the level of fine dust in Seoul after 8 h is predicted to prevent health damage in advance. We construct a dataset by combining two modalities (i.e., numerical and image data) for accurate prediction. In addition, we propose a multimodal deep learning model combining a Long Short Term Memory (LSTM) and Convolutional Neural Network (CNN). An LSTM AutoEncoder is chosen as a model for numerical time series data processing and basic CNN. A Visual Geometry Group Neural Network (VGGNet) (VGG16, VGG19) is also chosen as a CNN model for image processing to compare performance differences according to network depth. The VGGNet is a standard deep CNN architecture with multiple layers. Our multimodal deep learning model using two modalities (i.e., numerical and image data) showed better performance than a single deep learning model using only one modality (numerical data). Specifically, the performance improved up to 14.16% when the VGG19 model, which has a deeper network, was used rather than the VGG16 model.
- Research Article
21
- 10.1164/rccm.202304-0767oc
- Jul 15, 2024
- American journal of respiratory and critical care medicine
Rationale: The incidence of clinically undiagnosed obstructive sleep apnea (OSA) is high among the general population because of limited access to polysomnography. Computed tomography (CT) of craniofacial regions obtained for other purposes can be beneficial in predicting OSA and its severity. Objectives: To predict OSA and its severity based on paranasal CT using a three-dimensional deep learning algorithm. Methods: One internal dataset (N = 798) and two external datasets (N = 135 and N = 85) were used in this study. In the internal dataset, 92 normal participants and 159 with mild, 201 with moderate, and 346 with severe OSA were enrolled to derive the deep learning model. A multimodal deep learning model was elicited from the connection between a three-dimensional convolutional neural network-based part treating unstructured data (CT images) and a multilayer perceptron-based part treating structured data (age, sex, and body mass index) to predict OSA and its severity. Measurements and Main Results: In a four-class classification for predicting the severity of OSA, the AirwayNet-MM-H model (multimodal model with airway-highlighting preprocessing algorithm) showed an average accuracy of 87.6% (95% confidence interval [CI], 86.8-88.6%) in the internal dataset and 84.0% (95% CI, 83.0-85.1%) and 86.3% (95% CI, 85.3-87.3%) in the two external datasets, respectively. In the two-class classification for predicting significant OSA (moderate to severe OSA), the area under the receiver operating characteristic curve, accuracy, sensitivity, specificity, and F1 score were 0.910 (95% CI, 0.899-0.922), 91.0% (95% CI, 90.1-91.9%), 89.9% (95% CI, 88.8-90.9%), 93.5% (95% CI, 92.7-94.3%), and 93.2% (95% CI, 92.5-93.9%), respectively, in the internal dataset. 
Furthermore, the AirwayNet-MM-H model outperformed the other six state-of-the-art deep learning models in terms of accuracy for both four- and two-class classifications and area under the receiver operating characteristic curve for two-class classification (P < 0.001). Conclusions: A novel deep learning model, including a multimodal deep learning model and an airway-highlighting preprocessing algorithm from CT images obtained for other purposes, can provide significantly precise outcomes for OSA diagnosis.
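The two-class metrics listed above (accuracy, sensitivity, specificity, and F1 score) all derive from the four confusion-matrix counts. The sketch below shows that relationship. The function name and the toy labels are hypothetical and unrelated to the study's data, and it assumes both classes and at least one positive prediction are present.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),      # recall on the positive class
        "specificity": tn / (tn + fp),      # recall on the negative class
        "f1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and recall
    }


# Toy example: 1 = significant OSA, 0 = not significant (hypothetical values).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics["accuracy"], metrics["sensitivity"], metrics["specificity"])  # 0.7 0.8 0.6
```

Because F1 ignores true negatives, it can differ noticeably from accuracy when the classes are imbalanced, as in a cohort dominated by moderate to severe cases.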