Machine learning-driven gene expression profiling for lung cancer stage determination.

Abstract

Background: Lung cancer remains a leading cause of cancer-related mortality, with accurate staging essential for guiding treatment. Advances in next-generation sequencing (NGS) and machine learning (ML) enable more precise classification, improving on traditional imaging-based methods. Objective: This retrospective study applies XGBoost with cross-validation (CV) to classify early- vs. late-stage lung cancer using RNA-Seq data from 993 patients in The Cancer Genome Atlas (TCGA) cohort. Methods: Gene selection was conducted using the Wilcoxon rank-sum test on training data, and the XGBoost model was optimized via cross-validation. Model performance was assessed using the Area Under the Curve (AUC), with sensitivity-specificity analysis across classification thresholds. Results: The XGBoost model achieved a test AUC of 0.6534, identifying 40 key genes that optimize predictive accuracy while minimizing overfitting. Thresholds of 0.3 and 0.4 were optimal, balancing sensitivity and specificity for clinical application. Conclusions: Integrating RNA-Seq data with machine learning improves lung cancer staging accuracy. Future research should focus on dataset expansion, model benchmarking, and multi-omics integration to enhance clinical applicability.
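
The threshold analysis named in the Methods can be illustrated with a minimal sketch. This is not the study's code: the labels and scores are hypothetical toy values, and the helper names `sensitivity_specificity` and `auc` are assumptions introduced here. The AUC is computed as the fraction of (late-stage, early-stage) pairs that the score ranks correctly, which is the same rank-based quantity that underlies the Wilcoxon rank-sum statistic.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the paper): 1 = late stage, 0 = early stage.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.25, 0.35, 0.60, 0.30, 0.45, 0.70, 0.90]

print("AUC:", auc(y_true, scores))
for threshold in (0.3, 0.4, 0.5):
    sens, spec = sensitivity_specificity(y_true, scores, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Lowering the threshold trades specificity for sensitivity, which is why a study of this kind reports several candidate thresholds rather than a single operating point.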

Similar Papers
  • Research Article
  • Cited by 27
  • 10.1186/s12920-018-0413-3
Genomic analyses based on pulmonary adenocarcinoma in situ reveal early lung cancer signature
  • Nov 1, 2018
  • BMC Medical Genomics
  • Dan Li + 6 more

Background: Non-small cell lung cancer (NSCLC) accounts for about 80% or more of lung cancer cases. The early stages of NSCLC can be treated with complete resection with a good prognosis. However, most cases are detected at a late stage of the disease. The average survival rate of patients with invasive lung cancer is only about 4%. Adenocarcinoma in situ (AIS) is an intermediate subtype of lung adenocarcinoma that exhibits early-stage growth patterns but can progress to invasive disease. Methods: In this study, we used RNA-seq data from normal, AIS, and invasive lung cancer tissues to identify a gene module that represents the distinguishing characteristics of AIS as AIS-specific genes. Two differential expression analysis algorithms were employed to identify the AIS-specific genes. Then, the subset of best-performing AIS-specific genes for early lung cancer prediction was selected by random forest. Finally, the performance of early lung cancer prediction was assessed using random forest, support vector machine (SVM) and artificial neural networks (ANNs) on four independent early lung cancer datasets, including one tumor-educated blood platelets (TEPs) dataset. Results: Based on the differential expression analysis, 107 AIS-specific genes, consisting of 93 protein-coding genes and 14 long non-coding RNAs (lncRNAs), were identified. The significant functions associated with these genes include angiogenesis and ECM-receptor interaction, which are highly related to cancer development and contribute to the smoking-free lung cancers. Moreover, 12 of the AIS-specific lncRNAs are involved in lung cancer progression by potentially regulating the ECM-receptor interaction pathway. The feature selection by random forest identified 20 of the AIS-specific genes as early-stage lung cancer signatures using the dataset obtained from The Cancer Genome Atlas (TCGA) lung adenocarcinoma samples. 
Of the 20 signatures, two were lncRNAs, BLACAT1 and CTD-2527I21.15, which have been reported to be associated with bladder cancer, colorectal cancer and breast cancer. In blind classification for three independent tissue sample datasets, these signature genes consistently yielded about 98% accuracy for distinguishing early-stage lung cancer from normal cases. However, the prediction accuracy for the blood platelet samples was only 64.35% (sensitivity 78.1%, specificity 50.59%, and AUROC 0.747). Conclusions: The comparison of AIS with normal and invasive tumor revealed disease-specific genes and offered new insights into the mechanism underlying AIS progression into an invasive tumor. These genes can also serve as signatures for early diagnosis of lung cancer with high accuracy. The expression profile of gene signatures identified from tissue cancer samples yielded remarkable early cancer prediction for tissue samples but relatively lower accuracy for blood platelet samples.

  • Research Article
  • Cited by 2
  • 10.1200/jco.2022.40.16_suppl.e18582
Not just biology: Comparing social determinants of health in patients diagnosed with late-stage lung, breast, and colon cancer.
  • Jun 1, 2022
  • Journal of Clinical Oncology
  • Dena Rhinehart + 2 more

e18582 Background: Lung, colorectal, and breast cancer account for the majority of cancer deaths in the U.S. Patients with late stage lung cancer have the shortest survival of the three, and lung cancer patients are more likely to be diagnosed at later stages. We undertook this study to compare social determinants of health in patients diagnosed with late stages of lung, breast, and colon cancer to assess the impact they have on health and mortality. Methods: Data from the National Cancer Database was used for this study. We compared factors including insurance, income, and residency among late stage cancer patients (Stage III and IV). We also compared baseline health status measured by comorbidity index. Descriptive statistics were used to compare patient characteristics. Statistical significance was determined on the basis of a two-sided p value < 0.05. All statistical analysis was performed using SAS, version 9.4. Results: Between 2004 - 2016, 3,005,513 patients were diagnosed with lung (1,004,999), breast (1,309,796), and colon (690,718) cancer. The racial make-up of the groups was similar. 72.5% of lung cancer patients were diagnosed at late stage compared to 48.8% of colon and 13.8% of breast cancer patients. Patients with late stage lung cancer were more likely to have income < $38,000, reside in rural locations, and less likely to have private insurance. Late stage lung cancer patients were 2 and 4 times more likely to have at least 2 comorbidities than patients with colon and breast cancer respectively. Conclusions: Patients with lung cancer are disproportionately affected by several negative social determinants of health. The association between smoking and lung cancer may help explain this because the highest smoking rates in the U.S. occur in populations with lower income, low education status, less insurance coverage, and significantly more comorbidities (e.g. COPD, Heart Disease). 
These patients are dealing with the complex interplay of negative social determinants of health and worse baseline health status, causing delays in diagnosis and treatment when compared to other cancers. Recognizing this may allow systems to better support this disadvantaged population and improve access to screening, clinical trial inclusion, and personalized treatment options while working longitudinally to reverse the systemic factors affecting disproportionate tobacco use and lack of healthcare access for this underserved population.

  • Research Article
  • Cited by 61
  • 10.1016/j.lungcan.2021.02.006
Calculated indices of volatile organic compounds (VOCs) in exhalation for lung cancer screening and early detection
  • Feb 14, 2021
  • Lung Cancer
  • Xing Chen + 9 more


  • Research Article
  • Cited by 1
  • 10.21037/tcr-24-2023
Enhanced prognostic prediction of cancer-specific mortality in elderly bladder cancer patients post-radical cystectomy: an XGBoost model study.
  • Mar 1, 2025
  • Translational cancer research
  • Gaowei Li + 1 more

Tumor stage, surgery and age are positively correlated with cancer-specific mortality (CSM) in patients diagnosed with bladder cancer (BCa). In light of the successful application of machine learning to process big data in many fields outside of medicine, we aimed to establish and validate whether machine learning models could improve our ability to predict the development of CSM in elderly BCa patients after radical cystectomy (RC). Data on eligible patients diagnosed with BCa were obtained from the Surveillance, Epidemiology, and End Results database (2000-2021) and divided into training and validation cohorts in a ratio of 7:3. First, risk factors for the development of CSM in patients were identified by Cox regression analysis. Then, iterative testing and tuning through automated hyperparameter optimization and ten-fold cross-validation were performed to generate stable extreme gradient boosting (XGBoost) models with optimal performance. Receiver operating characteristic (ROC) curve, area under the curve (AUC), calibration curve and confusion matrix were used to evaluate the performance of XGBoost model. There were 11,763 patients included, of which 5,788 died from BCa. By the comparison of different machine learning models, the final XGBoost model we constructed showed high accuracy and precision in predicting the development of CSM in BCa patients (6-month CSM: AUC =0.799, 12-month CSM: AUC =0.756, 36-month CSM: AUC =0.746, and 60-month CSM: AUC =0.745). The results of accuracy, precision, recall and F1 score confirmed the superior performance of the XGBoost model. The important scores for clinical characteristics and the Shapley Additive Explanations plots highlighted the importance of key factors: chemotherapy, tumor stage, marital status, and tumor size were the top four factors in all models. Our study validated and confirmed the feasibility and high performance of the XGBoost model in predicting CSM in elderly BCa patients after RC. 
Machine learning has the potential to contribute to accurate prediction of cancer prognosis.
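
The 7:3 cohort split and ten-fold cross-validation described in this abstract can be sketched with index bookkeeping alone. This is an illustrative sketch, not the authors' pipeline: the function names, the fixed seed, and the choice to draw the folds from the training cohort are assumptions made here, and no model is actually fitted.

```python
import random


def train_validation_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and cut them into training and validation cohorts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(indices, k=10):
    """Yield (fit, holdout) index lists so each sample is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, holdout in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, holdout


# 11,763 is the cohort size reported in the abstract; everything else is assumed.
train_idx, val_idx = train_validation_split(11763)
print("training:", len(train_idx), "validation:", len(val_idx))

for fold_no, (fit_idx, holdout_idx) in enumerate(k_fold_indices(train_idx), start=1):
    # A real workflow would fit a candidate model on fit_idx and score it on holdout_idx.
    print(f"fold {fold_no}: fit on {len(fit_idx)}, hold out {len(holdout_idx)}")
```

Keeping the validation cohort out of the fold loop means hyperparameters are tuned only on training data, so the final validation AUC is not used for model selection.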

  • Research Article
  • 10.1158/1535-7163.targ-15-lb-a05
Abstract LB-A05: Profiling cell free DNA in breast cancer and non-small cell lung cancer using broad NGS assessment
  • Dec 1, 2015
  • Molecular Cancer Therapeutics
  • Nadia Solovieff + 10 more

Introduction: Cell free DNA (cfDNA) has become a promising approach for non-invasive assessment of the tumor genome. Many cfDNA assays target hotspot alterations in a focused set of genes, but do not provide a broad characterization of the cancer. We have developed and optimized a large next generation sequencing (NGS) panel covering the coding regions of over 500 genes. Using this panel, we sequenced cell free DNA from plasma and matched tumor DNA in patients with early stage breast cancer, late stage breast cancer and late stage lung cancer. Methods: Plasma was collected from patients with cancer using a double spin protocol and, when available, matched archival tumor tissue (representing different time interval with blood collection) was obtained. Next generation sequencing libraries were generated from cell free DNA isolated from 70 plasma samples and genomic DNA from 58 matched tumor samples. The NGS libraries were enriched for the gene panel of interest and were sequenced to a targeted depth of 1,000X for plasma and 300X for matched tumors. We optimized parameters of standard bioinformatics tools to robustly call low allelic fraction events, detecting single nucleotide variants down to 1%, as well as indels and copy number alterations. Results: We identified 8 PIK3CA hotspot alterations in plasma from late stage breast and lung cancers, in addition to many alterations across driver genes such as AKT1, EGFR, IDH2, NRAS, PTEN and TP53. In plasma samples from patients with late stage breast cancer, we found 4 ESR1 mutations exclusive to the plasma samples, of which 3 are known resistance mutations to endocrine therapy. Copy number alterations in EGFR, CCND1 and KRAS were also identified in patient plasma. 
When comparing the number of alterations across tumor stages, we found that late stage breast (mean = 12.5 variants) and lung cancers (mean = 12.5 variants) had a larger number of alterations present in plasma than early stage breast cancers (mean = 4.5 variants). We compared somatic mutations calls in plasma and matched tumor samples and found a concordance of 53%-67% at the variant level across patients with late stage cancers (N = 37 pairs). Higher variant level concordance was observed among plasma-tumor pairs collected less than a year apart (N = 11 pairs; 76%-84%) versus more than 5 years apart (N = 8 pairs; 41%-50%). Conclusion: We have developed and optimized a 500+ gene panel for direct sequencing of cfDNA, and we demonstrate that this broad assessment of circulating tumor DNA can be used for non-invasive characterization of the cancer genome landscape. The number of alterations identified in patient plasma is consistent with higher levels of ctDNA being present in late stage disease than in early stage disease. The time dependent degree of concordance between plasma and tumor collection suggests that cell free DNA assays may provide a more accurate characterization of the current tumor mutational landscape than an archival tumor sample. The identification of plasma specific ESR1 alterations highlights the importance of cfDNA in the context of identifying mechanisms of resistance, particularly for metastatic disease when tumor tissue collection may not be feasible. In addition, a broad NGS panel provides the opportunity to identify lesions unevaluated by targeted assays and to discover resistance mutations. Citation Format: Nadia Solovieff, Matt Hims, Rebecca Leary, Derek Chiang, Caroline Germa, Cristian Massacesi, Samit Hirawat, Stefan J. Scherer, Michael Morrissey, Wendy Winckler, Emmanuelle di Tomaso. Profiling cell free DNA in breast cancer and non-small cell lung cancer using broad NGS assessment. [abstract]. 
In: Proceedings of the AACR-NCI-EORTC International Conference: Molecular Targets and Cancer Therapeutics; 2015 Nov 5-9; Boston, MA. Philadelphia (PA): AACR; Mol Cancer Ther 2015;14(12 Suppl 2):Abstract nr LB-A05.

  • Research Article
  • Cited by 9
  • 10.3389/fneur.2020.00364
Non-motor Clinical and Biomarker Predictors Enable High Cross-Validated Accuracy Detection of Early PD but Lesser Cross-Validated Accuracy Detection of Scans Without Evidence of Dopaminergic Deficit.
  • May 11, 2020
  • Frontiers in Neurology
  • Charles Leger + 2 more

Background: Early-stage (preclinical) detection of Parkinson's disease (PD) remains challenging yet is crucial both to differentiate it from other disorders and to facilitate timely administration of neuroprotective treatment as it becomes available. Objective: In a cross-validation paradigm, this work focused on two binary predictive probability analyses: classification of early PD vs. controls and classification of early PD vs. SWEDD (scans without evidence of dopamine deficit). It was hypothesized that five distinct model types using combined non-motor and biomarker features would distinguish early PD from controls with > 80% cross-validated (CV) accuracy, but that the diverse nature of the SWEDD category would reduce early PD vs. SWEDD CV classification accuracy and alter model-based feature selection. Methods: Cross-sectional baseline data were acquired from the Parkinson's Progression Markers Initiative (PPMI). Logistic regression, general additive (GAM), decision tree, random forest and XGBoost models were fitted using non-motor clinical and biomarker features. Randomized train and test data partitions were created. Model classification CV performance was compared using the area under the curve (AUC), sensitivity, specificity and the Kappa statistic. Results: All five models achieved >0.80 AUC CV accuracy to distinguish early PD from controls. The GAM (CV AUC 0.928, sensitivity 0.898, specificity 0.897) and XGBoost (CV AUC 0.923, sensitivity 0.875, specificity 0.897) models were the top classifiers. Performance across all models was consistently lower in the early PD/SWEDD analyses, where the highest performing models were XGBoost (CV AUC 0.863, sensitivity 0.905, specificity 0.748) and random forest (CV AUC 0.822, sensitivity 0.809, specificity 0.721). XGBoost detection of non-PD SWEDD matched 1–2 years curated diagnoses in 81.25% (13/16) cases. 
In both early PD/control and early PD/SWEDD analyses, and across all models, hyposmia was the single most important feature to classification; rapid eye movement behavior disorder (questionnaire) was the next most commonly high-ranked feature. Alpha-synuclein was a feature of import to early PD/control but not early PD/SWEDD classification, and the Epworth Sleepiness Scale was antithetically important to the latter but not the former. Interpretation: Non-motor clinical and biomarker variables enable high CV discrimination of early PD vs. controls but are less effective at discriminating early PD from SWEDD.

  • Preprint Article
  • 10.2196/preprints.66363
Interpretable prediction of hospital mortality in bleeding critically ill patients based on machine learning and SHAP (Preprint)
  • Sep 11, 2024
  • Bingkui Ren + 8 more

BACKGROUND Hemorrhage is a prevalent and critical condition in the intensive care unit (ICU), marked by high incidence, elevated mortality rates, and substantial therapeutic challenges. Accurate prediction of mortality in patients with hemorrhage is essential for the development of personalized prevention and treatment strategies. Nevertheless, the implementation of effective predictive models in clinical practice remains limited, largely due to the current gap in robust and interpretable prediction tools. OBJECTIVE This study aimed to develop an interpretable model for predicting mortality risk in critically ill patients with hemorrhage in intensive care units (ICUs). The SHapley Additive exPlanation (SHAP) method was applied to interpret the extreme gradient boosting (XGBoost) model, allowing for the exploration of key prognostic factors in this patient population. METHODS In this retrospective cohort study, we developed and evaluated the performance of a predictive model using data from the eICU Collaborative Research Database (eICU-CRD). Data from the first 24 hours of each ICU admission were extracted, with the dataset randomly split into a training set (80%) and a validation set (20%). The predictive performance of the XGBoost model was compared to four other machine learning models using the area under the curve (AUC) as the metric. The SHapley Additive exPlanation (SHAP) method was employed to interpret the XGBoost model. Following initial validation, external validation was performed using data from a Chinese retrospective cohort, Refrain, which focuses on hemorrhage and coagulopathy in critically ill patients. RESULTS A total of 10306 eligible patients with hemorrhage were included in the final cohort for this study. The observed in-hospital mortality of patients with hemorrhage was 11.5%. 
Among the five models, the XGBoost model had the highest predictive performance (AUC = 0.81), whereas logistic regression (LR) had the poorest generalization ability (AUC = 0.726). The decision curve showed that the net benefit of the XGBoost model surpassed those of the other machine learning models at threshold probabilities of 10% to 30%. The SHAP method revealed the top 15 predictors of mortality according to the importance ranking, and the bilirubin level was recognized as the most important predictor variable. Additionally, in the external validation using the REFRAIN cohort, the XGBoost model demonstrated robust predictive performance with an AUC of 0.776. CONCLUSIONS The interpretable predictive model enhances the accuracy of mortality risk prediction in ICU patients with hemorrhage, enabling clinicians to devise more effective treatment plans and optimize resource allocation. Moreover, the interpretability framework increases model transparency, thereby facilitating clinicians' understanding of and trust in the reliability of the predictive model.

  • Research Article
  • 10.3760/cma.j.cn112152-20220208-00082
Diagnostic values of conventional tumor markers and their combination with chest CT for patients with stage ⅠA lung cancer
  • Nov 23, 2023
  • Zhonghua zhong liu za zhi [Chinese journal of oncology]
  • Qing Peng + 9 more

Objective: To investigate the diagnostic efficiency of conventional serum tumor markers and their combination with chest CT for stage ⅠA lung cancer. Methods: A total of 1 155 patients with stage ⅠA lung cancer and 200 patients with benign lung lesions (confirmed by surgery) treated at the Cancer Hospital, Chinese Academy of Medical Sciences from January 2016 to October 2020 were retrospectively enrolled in this study. Six conventional serum tumor markers [carcinoembryonic antigen (CEA), carbohydrate antigen 125 (CA125), squamous cell carcinoma associated antigen (SCCA), cytokeratin 19 fragment (CYFRA21-1), neuron-specific enolase (NSE), and gastrin-releasing peptide precursor (ProGRP)] and chest thin-slice CT were performed on all patients one month before surgery. Pathology was taken as the gold standard to analyze the difference of positivity rates of tumor markers between the lung cancer group and the benign group, the moderate/poor differentiation group and the well differentiation group, the adenocarcinoma group and the squamous cell carcinoma group, the lepidic and non-lepidic predominant adenocarcinoma groups, the solid nodule group and the subsolid nodule group based on thin-slice CT, and subgroups of ⅠA1 to ⅠA3 lung cancers. The diagnostic performance of tumor markers and tumor markers combined with chest CT was analyzed using the receiver operating characteristic curve. Results: The positivity rates of six serum tumor markers in the lung cancer group and the benign group were 2.32%-20.08% and 0-13.64%, respectively; only the SCCA positivity rate in the lung cancer group was higher than that in the benign group (10.81% and 0, P=0.022). There were no significant differences in the positivity rates of other serum tumor markers between the two groups (all P>0.05). 
The combined detection of six tumor markers showed that the positivity rate of the lung cancer group was higher than that of the benign group (40.93% and 18.18%, P=0.004), and the positivity rate of the adenocarcinoma group was lower than that of the squamous cell carcinoma group (35.66% and 47.41%, P=0.045). The positivity rates in the poorly differentiated group and moderately differentiated group were higher than that in the well differentiated group (46.48%, 43.75% and 22.73%, P=0.025). The positivity rate in the non-lepidic adenocarcinoma group was higher than that in lepidic adenocarcinoma group (39.51% and 21.74%, P=0.001). The positivity rate of subsolid nodules was lower than that of solid nodules (30.01% vs 58.71%, P=0.038), and the positivity rates of stageⅠA1, ⅠA2 and ⅠA3 lung cancers were 33.33%, 48.96% and 69.23%, respectively, showing an increasing trend (P=0.005). The sensitivity and specificity of the combined detection of six tumor markers in the diagnosis of stage ⅠA lung cancer were 74.00% and 56.30%, respectively, and the area under the curve (AUC) was 0.541. The sensitivity and specificity of the combined detection of six serum tumor markers with CT in the diagnosis of stage ⅠA lung cancer were 83.0% and 78.3%, respectively, and the AUC was 0.721. Conclusions: For stage ⅠA lung cancer, the positivity rates of commonly used clinical tumor markers are generally low. The combined detection of six markers can increase the positivity rate. The positivity rate of markers tends to be higher in poorly differentiated lung cancer, squamous cell carcinoma, or solid nodules. Tumor markers combined with thin-slice CT showed limited improvement in diagnostic efficiency for early lung cancer.

  • Research Article
  • Cited by 11
  • 10.1245/s10434-024-15762-3
Machine Learning for Early Discrimination Between Lung Cancer and Benign Nodules Using Routine Clinical and Laboratory Data.
  • Jul 16, 2024
  • Annals of surgical oncology
  • Wei Wei + 8 more

Lung cancer poses a global health threat necessitating early detection and precise staging for improved patient outcomes. This study focuses on developing and validating a machine learning-based risk model for early lung cancer screening and staging, using routine clinical data. Two medical center, observational, retrospective studies were conducted, involving 2312 lung cancer patients and 653 patients with benign nodules. Machine learning techniques, including differential analysis and feature selection, were employed to identify key factors for modeling. The study focused on variables such as nodule density, carcinoembryonic antigen (CEA), age, and lifestyle habits. The Logistic Regression model was utilized for early diagnoses, and the XGBoost model was utilized for staging based on selected features. For early diagnoses, the Logistic Regression model achieved an area under the curve (AUC) of 0.716 (95% confidence interval [CI] 0.607-0.826), with 0.703 sensitivity and 0.654 specificity. The XGBoost model excelled in distinguishing late-stage from early-stage lung cancer, exhibiting an AUC of 0.913 (95% CI 0.862-0.963), with 0.909 sensitivity and 0.814 specificity. These findings highlight the model's potential for enhancing diagnostic accuracy and staging in lung cancer. This study introduces a novel machine learning-based risk model for early lung cancer screening and staging, leveraging routine clinical information and laboratory data. The model shows promise in enhancing accuracy, mitigating overdiagnosis, and improving patient outcomes.

  • Research Article
  • Cited by 4
  • 10.62347/whuq1208
Establishment and validation of a prognostic risk early-warning model for retinoblastoma based on XGBoost.
  • Jan 1, 2025
  • American journal of cancer research
  • Feng Wang

Retinoblastoma (RB) is the most common intraocular malignancy in children, and early detection and treatment are crucial for improving patient outcomes. Conventional treatments, such as enucleation and radiotherapy, have limitations in fully addressing prognosis. This study aimed to establish and validate an early-warning prognostic model for RB based on the XGBoost algorithm to improve the prediction accuracy of the 5-year survival rate in children. A retrospective analysis was conducted on 320 children with RB treated at Changzhi People's Hospital between February 2012 and April 2019. The patients were randomly divided into a training group (n=224) and a validation group (n=96). Clinical data, including age, gender, tumor characteristics, and tumor marker levels, were collected. Prognostic factors were analyzed using XGBoost and Cox regression models, and model performance was evaluated using various statistical methods. No significant differences were observed in baseline data between the two sets (P>0.05). Cox regression analysis identified tumor diameter (P=0.032), IIRC stage (P<0.001), and NSE (P=0.016) as independent prognostic factors. The XGBoost model achieved an area under the curve (AUC) of 0.951 in the training group, significantly higher than the Cox model (P=0.001), while in the validation group, the XGBoost model's AUC was 0.902, with no significant difference compared to the Cox model (P=0.117). The XGBoost model demonstrated high accuracy and clinical utility in predicting the 5-year survival of children with RB. Decision curve analysis (DCA) and calibration curves further confirmed that the XGBoost model offers higher clinical net benefits and superior calibration ability across various thresholds.

  • Research Article
  • Cited by 69
  • 10.1097/jto.0b013e3181c1274f
Endoscopic and Endobronchial Ultrasonography According to the Proposed Lymph Node Map Definition in the Seventh Edition of the Tumor, Node, Metastasis Classification for Lung Cancer
  • Dec 1, 2009
  • Journal of Thoracic Oncology
  • Kurt G Tournoy + 4 more


  • Research Article
  • Cited by 113
  • 10.1097/jto.0b013e3181a52370
The IASLC Lung Cancer Staging Project: Data Elements for the Prospective Project
  • Jun 1, 2009
  • Journal of Thoracic Oncology
  • Dorothy J Giroux + 10 more


  • Research Article
  • Cited by 16
  • 10.1038/s41598-024-57711-w
Comparison of different machine learning classification models for predicting deep vein thrombosis in lower extremity fractures
  • Mar 22, 2024
  • Scientific Reports
  • Conghui Wei + 7 more

Deep vein thrombosis (DVT) is a common complication in patients with lower extremity fractures. Once it occurs, it will seriously affect the quality of life and postoperative recovery of patients. Therefore, early prediction and prevention of DVT can effectively improve the prognosis of patients. This study constructed different machine learning models to explore their effectiveness in predicting DVT. Five prediction models were applied to the study, including Extreme Gradient Boosting (XGBoost) model, Logistic Regression (LR) model, RandomForest (RF) model, Multilayer Perceptron (MLP) model, and Support Vector Machine (SVM) model. Afterwards, the performance of the obtained prediction models was evaluated by area under the curve (AUC), accuracy, sensitivity, specificity, F1 score, and Kappa. The prediction performances of the models based on machine learning are as follows: XGBoost model (AUC = 0.979, accuracy = 0.931), LR model (AUC = 0.821, accuracy = 0.758), RF model (AUC = 0.970, accuracy = 0.921), MLP model (AUC = 0.830, accuracy = 0.756), SVM model (AUC = 0.713, accuracy = 0.661). On our data set, the XGBoost model has the best performance. However, the model still needs external verification research before clinical application.
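
Several of the metrics listed in this abstract (accuracy, sensitivity, specificity, F1 score, and Kappa) all derive from one confusion matrix. The sketch below shows that relationship; it is an illustration only, the function name `confusion_metrics` is an assumption introduced here, and the counts are hypothetical rather than taken from the study.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_both_positive = ((tp + fp) / n) * ((tp + fn) / n)
    p_both_negative = ((fn + tn) / n) * ((fp + tn) / n)
    p_chance = p_both_positive + p_both_negative
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts (not from the paper).
metrics = confusion_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Kappa is lower than accuracy here because half of the agreement in a balanced two-class problem is expected by chance, which is one reason such studies report it alongside accuracy.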

  • Research Article
  • 10.1186/s12911-025-03101-9
Interpretable prediction of hospital mortality in bleeding critically ill patients based on machine learning and SHAP.
  • Jul 15, 2025
  • BMC medical informatics and decision making
  • Bingkui Ren + 8 more

Hemorrhage is a prevalent and critical condition in the intensive care unit (ICU), characterized by high incidence, elevated mortality rates, and substantial therapeutic challenges. Accurate prediction of mortality in patients with hemorrhage is essential for developing personalized prevention and treatment strategies. Nevertheless, the implementation of effective predictive models in clinical practice remains limited, primarily due to the lack of robust and interpretable tools. This study aimed to develop an interpretable model for predicting mortality risk in critically ill patients with hemorrhage admitted to ICUs. The SHapley Additive exPlanations (SHAP) method was applied to interpret the eXtreme Gradient Boosting (XGBoost) model, identifying key prognostic factors in this population. In this retrospective cohort study, we derived data from the eICU Collaborative Research Database (eICU-CRD) to develop and evaluate a predictive model. Clinical data from the first 24 h of ICU admission were extracted, and the dataset was randomly split into training (80%) and validation (20%) sets. Model performance was compared to four other machine learning algorithms using the area under the curve (AUC). SHAP was utilized to interpret the XGBoost model. External validation was subsequently performed using data from the Chinese REFRAIN cohort, which focuses on hemorrhage and coagulopathy in critically ill patients. The study protocol was retrospectively registered in the Chinese Clinical Trial Registry (ChiCTR) on December 17, 2024 (Registration number ChiCTR2400094140). A total of 10,306 eligible patients with hemorrhage were included. The observed in-hospital mortality rate was 11.5%. Among the five models compared, XGBoost demonstrated the highest predictive performance (AUC = 0.81), whereas logistic regression (LR) showed the lowest generalizability (AUC = 0.726). 
Decision curve analysis revealed that the XGBoost model provided a greater net benefit than other models at threshold probabilities of 10-30%. SHAP analysis identified the top 15 predictors of mortality, with bilirubin level ranked as the most influential variable. External validation using the REFRAIN cohort confirmed the robustness of the model (AUC = 0.776). The interpretable predictive model improves mortality risk stratification in ICU patients with hemorrhage, supporting clinicians in optimizing treatment plans and resource allocation. Enhanced model transparency through SHAP explanations may facilitate clinical adoption by improving trust in model reliability.

  • Research Article
  • Cited by 22
  • 10.1002/cam4.4800
Prediction of lung cancer risk in Chinese population with genetic-environment factor using extreme gradient boosting.
  • May 2, 2022
  • Cancer Medicine
  • Yutao Li + 12 more

Detecting early-stage lung cancer is critical to reduce the lung cancer mortality rate; however, existing models based on germline variants perform poorly, and new models are needed. This study aimed to use extreme gradient boosting to develop a predictive model for the early diagnosis of lung cancer in a multicenter case-control study. A total of 974 cases and 1005 controls in Shanghai and Taizhou were recruited, and 61 single nucleotide polymorphisms (SNPs) were genotyped. Multivariate logistic regression was used to calculate the association between signal SNPs and lung cancer risk. Logistic regression (LR) and extreme gradient boosting (XGBoost) algorithms, a large-scale machine learning algorithm, were adopted to build the lung cancer risk model. In both models, 10-fold cross-validation was performed, and model predictive performance was evaluated by the area under the curve (AUC). After FDR adjustment, TYMS rs3819102 and BAG6 rs1077393 were significantly associated with lung cancer risk (p < 0.05). For lung cancer risk prediction, the model predicted only with epidemiology attained an AUC of 0.703 for LR and 0.744 for XGBoost. Compared with the LR model predicted only with epidemiology, further adding SNPs and applying XGBoost increased the AUC to 0.759 (p < 0.001) in the XGBoost model. BAG6 rs1077393 was the most important predictor among all SNPs in the lung cancer prediction XGBoost model, followed by TERT rs2735845 and CAMKK1 rs7214723. Further stratification in lung adenocarcinoma (ADC) showed a significantly elevated performance from 0.639 to 0.699 (p=0.009) when applying XGBoost and adding SNPs to the model, while the best model for lung squamous cell carcinoma (SCC) prediction was the LR model predicted with epidemiology and SNPs (AUC=0.833), compared with the XGBoost model (AUC=0.816). Our lung cancer risk prediction models in the Chinese population have a strong predictive ability, especially for SCC. 
Adding SNPs and applying the XGBoost algorithm to the epidemiologic-based logistic regression risk prediction model significantly improves model performance.
