Articles published on Gradient boosting
19505 Search results
- New
- Research Article
- 10.1016/j.jad.2025.120679
- Feb 15, 2026
- Journal of affective disorders
- Benojir Ahammed + 3 more
Mental health at risk: Predicting psychological distress in Australian youth through machine learning models.
- New
- Research Article
- 10.1159/000550910
- Feb 14, 2026
- Medical principles and practice : international journal of the Kuwait University, Health Science Centre
- Wenjun Zhu + 5 more
Sepsis-associated liver injury (SALI) occurs in approximately 40% of sepsis cases and is linked to high mortality, a challenge that may stem from the absence of effective prognostic models. We developed a machine learning (ML)-based prognostic model for SALI using conventional biomarkers to guide precise clinical interventions and reduce mortality. We retrospectively analyzed 307 SALI patients (2010-2024), stratified into favorable (n=139) and poor (n=168) prognosis groups by post-treatment progression. The cohort was randomly split into a training set (80%) and a validation set (20%). The routine biomarkers included hematological indices, liver/renal function parameters, and coagulation profiles. Feature selection used LASSO regression. Nine machine learning algorithms were used to construct prognostic models, including eXtreme Gradient Boosting, Logistic Regression, Light Gradient Boosting Machine, Random Forest, Adaptive Boosting, Gradient Boosting Decision Tree, Gaussian Naive Bayes, and Multilayer Perceptron. Model interpretability was evaluated via the SHapley Additive exPlanation (SHAP) algorithm. An independent cohort of 37 SALI patients was used for external validation. Key parameters influencing SALI prognosis were red blood cell distribution width-coefficient of variation (RDW-CV), anion gap (AG), and high-sensitivity cardiac troponin (hs-cTn). Among the nine models, the Random Forest prognostic model performed best, with an area under the curve (AUC) of 0.816 in the validation set and 0.781 in the external validation. The Random Forest model developed in this study may help guide clinical decision-making for SALI patients, but it requires further validation before implementation in clinical practice.
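A pipeline of this shape (LASSO-style feature selection followed by a tree-ensemble prognostic model evaluated by AUC) can be sketched as below. This is a minimal illustration on synthetic data, not the authors' implementation; the regularization strength, model settings, and 80/20 split are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 307-patient biomarker table (features are hypothetical).
X, y = make_classification(n_samples=307, n_features=30, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# LASSO-style selection: an L1-penalised logistic regression zeroes out
# uninformative biomarkers; keep the features with non-zero coefficients.
scaler = StandardScaler().fit(X_tr)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
lasso.fit(scaler.transform(X_tr), y_tr)
keep = np.flatnonzero(lasso.coef_.ravel())

# Fit the downstream prognostic model on the selected features only.
rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X_tr[:, keep], y_tr)
auc = roc_auc_score(y_te, rf.predict_proba(X_te[:, keep])[:, 1])
print(len(keep), round(auc, 3))
```

Selecting features inside the training set only, as here, avoids leaking validation information into the selection step.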
- New
- Research Article
- 10.1093/gerona/glag031
- Feb 14, 2026
- The journals of gerontology. Series A, Biological sciences and medical sciences
- Eric T Klopack + 1 more
Research suggests aging is a coordinated physiological decline occurring in multiple systems and at multiple biological levels. However, it is largely unknown how general biological aging and specific systemic aging co-occur and influence one another to affect health outcomes. There is also emerging interest in understanding how social exposures may differentially accelerate decline in individual physiological systems. We utilized data from the Health and Retirement Study, a nationally representative sample of about 4000 US adults over age 55. We used eXtreme Gradient Boosting (xgboost) in a training subsample to create system-specific mortality risk scores based on sets of biomarkers representing biological systems (e.g., brain and nervous system, adaptive immune system, cardiovascular system, renal system) as well as general multisystem aging. Results suggest that the effects of most biological systems may be well captured by one or a small number of biomarkers and that female sex appears to be a protective or risk factor depending on the specific biological system. The importance of studying both general and system-specific aging is discussed.
- New
- Research Article
- 10.2196/80156
- Feb 13, 2026
- JMIR medical informatics
- Haiquan Li + 7 more
Coal workers' pneumoconiosis (CWP) is the most prevalent occupational disease that causes irreversible lung damage. Early prediction of CWP is the key to blocking the irreversible process of pulmonary fibrosis. The prediction of CWP based on imaging data and biomarker detection is constrained due to high cost and poor convenience. The study aimed to use easily detectable clinical data to construct a prediction model for CWP through machine learning (ML) methods. A prediction framework was established using a moderate-sized dataset and multidimensional clinical features, including occupational information, lung function parameters, and blood indicators. Six ML algorithms (light gradient boosting machine, random forest, extreme gradient boosting, categorical boosting, support vector machine, and logistic regression) were trained and evaluated using a stratified 5-fold cross-validation and a held-out test set. Hyperparameter optimization was performed using a unified Optuna-based strategy to ensure fair comparison across models. Model interpretability was assessed using Shapley Additive Explanation on top-performing models. In addition, an ablation analysis was conducted by retraining models after excluding job type to assess the independent predictive value of clinical biomarkers. All 6 models achieved consistently high predictive performance, and the differences among the top-performing models were small on the test set. After Optuna-based optimization, light gradient boosting machine and categorical boosting achieved high test-set area under curve values (0.974 and 0.975, respectively), while extreme gradient boosting achieved the highest recall (0.926) and F1-score (0.952). Compared with the baseline models, hyperparameter optimization resulted in only minor performance changes, indicating robust prediction under the current feature set and evaluation protocol. 
Shapley Additive Explanation analysis consistently identified age, forced expiratory volume/forced vital capacity, and platelet count as key contributors to CWP risk prediction. The ablation analysis further showed that model performance remained strong after removing job type, supporting the independent predictive value of clinical features beyond occupational history. These results confirm the potential of combining simple multidimensional features with ML algorithms to predict CWP and suggest new avenues for early diagnosis and intervention in patients with CWP.
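The evaluation protocol described here (stratified 5-fold cross-validation plus a held-out test set, then an ablation that retrains without one feature) can be sketched as follows. The data are synthetic, the hyperparameter search (Optuna in the paper) is omitted for brevity, and column 0 merely plays the role of the ablated "job type" feature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Toy data: column 0 stands in for the occupational "job type" feature.
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=1)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
model = GradientBoostingClassifier(random_state=1)

# Stratified 5-fold CV on the training set, then one held-out evaluation.
cv_auc = cross_val_score(model, X_tr, y_tr, cv=cv, scoring="roc_auc").mean()
model.fit(X_tr, y_tr)
test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# Ablation: drop the occupational column and retrain, testing whether the
# remaining features carry independent predictive value.
ablate = np.delete(np.arange(X.shape[1]), 0)
model_ab = GradientBoostingClassifier(random_state=1).fit(X_tr[:, ablate], y_tr)
ablate_auc = roc_auc_score(y_te, model_ab.predict_proba(X_te[:, ablate])[:, 1])
print(round(cv_auc, 3), round(test_auc, 3), round(ablate_auc, 3))
```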
- New
- Research Article
- 10.1126/sciadv.aeb1323
- Feb 13, 2026
- Science advances
- Yaping Wen + 2 more
Optimizing organic photovoltaic (OPV) performance requires navigating the high-dimensional, interdependent processing parameters governing bulk heterojunction morphology. To address this, we have constructed a standardized database integrating donor/acceptor pairs, nine key fabrication parameters, and device efficiencies, consolidating over a decade of experimental results. Leveraging this resource, we developed a three-tiered machine learning framework using gradient boosting regression trees. The strategy progresses from single-parameter baseline models to stage-combined models that capture intraprocess synergies, culminating in a global nine-parameter optimization model. This final model achieves a Pearson correlation of >0.9 and a success rate of >80% in identifying optimal multiparameter configurations. Validation on 78 external systems, each containing a previously unseen donor or acceptor, demonstrates robust generalization with >75% accuracy in predicting the optimal or secondary condition for individual parameters. This work establishes a practical, data-driven framework for accelerating the rational optimization of OPV photoactive layers.
- New
- Research Article
- 10.1080/14796694.2026.2630630
- Feb 13, 2026
- Future oncology (London, England)
- Bingying Li + 6 more
To explore the role of multi-sequence magnetic resonance imaging (MRI) images in preoperative prediction of lymph node metastasis in laryngeal squamous cell carcinoma (LSCC). Patients with LSCC undergoing open surgery and lymph node dissection were enrolled (n = 224 training, n = 96 testing). Radiomic features (n = 2394) were extracted from T1-enhanced and T2-weighted images. Features were screened using least absolute shrinkage and selection operator (LASSO) regression, and the best-performing classification model was identified among Logistic Regression, Random Forest, Extreme Gradient Boosting, and Light Gradient Boosting Machine. An imaging biomarker-based nomogram integrating radiomic and clinical features was developed via logistic regression. LASSO regression identified 14 stable features (6 from T1-enhanced images, 8 from T2-weighted images). The Random Forest model showed the best radiomics-only performance (area under the receiver operating characteristic curve [AUC]: 0.877 training; 0.875 testing). The combined clinical-radiomics nomogram achieved higher discrimination (AUC: 0.942 training; 0.908 testing), outperforming standalone clinical or radiomic models. The radiomic-clinical nomogram enhances preoperative prediction of cervical lymph node metastasis in LSCC, offering the potential to optimize clinical decision-making.
- New
- Research Article
- 10.3390/su18041944
- Feb 13, 2026
- Sustainability
- Tsolmon Sodnomdavaa
Gross Primary Productivity (GPP) in grassland ecosystems is a fundamental eco-biophysical indicator for assessing carbon cycling, grazing capacity, and ecosystem responses to climatic stress. However, robust estimation of GPP in arid and semi-arid rangelands remains challenging because of pronounced spatial heterogeneity, strong climate variability, and inherent uncertainties associated with remotely sensed observations. Together, these factors constrain both modeling performance and out-of-sample generalization beyond the training domain. In this dryland grassland context, this study compares the performance of machine learning (ML) models for grassland GPP proxy-based characterization, downscaling, and predictive agreement using a multivariate dataset that integrates Sentinel-2-derived spectral and phenological features, a Moderate-Resolution Imaging Spectroradiometer (MODIS)-derived GPP proxy, and complementary climatic and geographic information. Pixel-level observations spanning multiple years are analyzed, with ordinary linear regression used as a baseline benchmark and ensemble decision-tree models, including Random Forest, Gradient Boosting, and Histogram-based Gradient Boosting (HGB), compared. Instead of relying solely on random cross-validation, model performance is systematically assessed using a combination of spatially structured validation and a leave-one-year-out scheme to explicitly examine spatial and temporal generalization. The results indicate that ensemble tree-based models outperform linear approaches, with the HGB model showing the strongest agreement with the MODIS-derived GPP proxy (R2 = 0.95, RMSE = 0.035 on the test set) and maintaining stable performance across spatial and temporal validations (R2 = 0.86–0.96 across years). 
Taken together, the findings demonstrate that integrating multi-source remote sensing data with climatic information within a rigorous validation framework enables a more reliable assessment of model generalization and gap-filling consistency with respect to a remote-sensing-based proxy target, rather than an absolute validation against ground-based measurements, thereby supporting sustainability-relevant monitoring of arid grassland ecosystems.
- New
- Research Article
- 10.3389/frai.2026.1690664
- Feb 13, 2026
- Frontiers in Artificial Intelligence
- Suraiya Akhter + 1 more
Cardiovascular disease (CVD) remains the foremost contributor to global illness and death, underscoring the critical need for effective tools that can predict risk at early stages to support preventive care and timely clinical decisions. With the growing complexity of healthcare data, machine learning has shown considerable promise in extracting insights that enhance medical decision-making. Nonetheless, the effectiveness and clarity of machine learning models largely rely on the relevance and quality of input features. In this work, we explored and compared four feature-selection strategies—Pearson correlation + Chi-squared test, Alternating Decision Tree (ADT)-based scoring, Cross-Validated Feature Evaluation (CVFE), and Hypergraph-Based Feature Evaluation (HFE)—to identify the most predictive factors for CVD risk. Our analysis utilized data from the National Health and Nutrition Examination Survey (NHANES), administered by the National Center for Health Statistics under the Centers for Disease Control and Prevention (CDC), encompassing demographic, clinical, laboratory, and survey data collected across the U.S. from August 2021 through August 2023. Distinct sets of features obtained through these selection techniques were used to develop random forest (RF), support vector machine (SVM), and eXtreme Gradient Boosting (XGBoost) models, which were then assessed for predictive effectiveness. To improve clarity and understanding of model decision-making, SHapley Additive exPlanations (SHAP) was used to interpret feature contributions in the top-performing model. Among the evaluated methods, the HFE approach combined with SVM achieved the highest overall accuracy (82.84%) and AUC (0.9027), outperforming both classical and alternative strategies. 
The most influential predictors included age, total cholesterol, history of high blood pressure, use of cholesterol-lowering medication, recent prescription medication use, lifetime smoking history, family income-to-poverty ratio, gender, educational attainment, and red cell distribution width. The web application, accessible at https://shiny.tricities.wsu.edu/cvdr-prediction/, presents predictive results, probability scores, and SHAP plots generated from the model trained using the feature set selected by the hypergraph-based approach. This study highlights the importance of strategic feature selection in refining predictive accuracy and interpretability, offering a practical data-driven approach that could aid clinicians in evaluating cardiovascular risk and tailoring preventive care.
- New
- Research Article
- 10.1017/neu.2026.10063
- Feb 13, 2026
- Acta neuropsychiatrica
- Lasse Hansen + 4 more
Electroconvulsive therapy (ECT) is an effective treatment for severe manifestations of mental illness. Since delay in initiation of ECT can have detrimental effects, prediction of the need for ECT could improve outcomes via more timely treatment initiation. Therefore, this study aimed to predict the need for ECT following admission to a psychiatric hospital. This study was based on electronic health record (EHR) data from routine clinical practice. Adult patients admitted to a hospital within the Psychiatric Services of the Central Denmark Region between January 2013 and November 2021 were included in the study. The outcome was initiation of ECT >7 days (to not include patients admitted for planned ECT) and ≤67 days after admission. The data was randomly split into an 85% training set and a 15% test set. On the 7th day of the inpatient stay, machine learning models (extreme gradient boosting) were trained to predict initiation of ECT and subsequently tested on the test set. The cohort consisted of 41,610 patients with 164,961 admissions. In the held-out test set, the trained model predicted ECT initiation with an area under the receiver operating characteristic curve of 0.94, 47% sensitivity, 98% specificity, positive predictive value of 24% and negative predictive value of 99%. The top predictors were the highest suicide assessment score and mean Brøset violence checklist score in the preceding three months. EHR data from routine clinical practice may be used to predict the need for ECT. This may lead to more timely treatment initiation.
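The four clinical metrics reported here (sensitivity, specificity, PPV, NPV) all fall out of one confusion matrix at a chosen decision threshold. A minimal sketch on synthetic rare-outcome data, with an arbitrary 0.5 threshold rather than whatever operating point the authors used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Rare-outcome toy data (~5% positives) mimicking a low ECT base rate.
X, y = make_classification(n_samples=4000, n_features=15, weights=[0.95],
                           random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15,
                                          stratify=y, random_state=2)
clf = GradientBoostingClassifier(random_state=2).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]

# Pick a decision threshold, then report the four clinical metrics.
pred = (prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
sens = tp / (tp + fn)                        # sensitivity (recall)
spec = tn / (tn + fp)                        # specificity
ppv = tp / (tp + fp) if tp + fp else 0.0     # positive predictive value
npv = tn / (tn + fn)                         # negative predictive value
print(round(roc_auc_score(y_te, prob), 3), round(sens, 2), round(spec, 2))
```

The pattern in the abstract (high specificity and NPV, modest sensitivity and PPV) is typical when the positive class is rare and the threshold is conservative.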
- New
- Research Article
- 10.1785/0120250211
- Feb 13, 2026
- Bulletin of the Seismological Society of America
- Dong-Hoon Sheen + 2 more
ABSTRACT The accurate determination of earthquake focal depth is critical for understanding regional seismic processes, characterizing seismogenic behavior, and assessing seismic hazard. However, precise focal depth determination remains a significant challenge owing to theoretical limitations, sparse station geometry, uncertainties in velocity models, and errors in arrival-time picking. This study evaluated the uncertainty of earthquake focal depths in the southern Korean Peninsula using seismic station geometry and investigated regional seismogenic characteristics. We selected 304 earthquakes with magnitudes between 2.0 and 4.9 recorded from 2018 to 2022, for which P- and S-wave arrivals were manually picked from dense seismic stations. To quantify uncertainty, we conducted Monte Carlo simulations using synthetic datasets generated with multiple velocity models, random sampling of station locations, and varied initial focal depths. Gradient boosting analysis identified the minimum distance to a station and the number of near-epicentral P and S arrivals as the dominant factors reducing focal depth errors. We propose geometry-based criteria that allow approximately 95% of local crustal events to be located with focal depth errors within 5 km and epicentral errors within 2 km: (1) at least seven stations within 100 km, including two within 50 km and one within 10 km of the epicenter; (2) at least one S wave within 50 km; and (3) primary and secondary azimuthal gaps less than 160° and 220°, respectively. These criteria are region-specific, and further validation is required for application elsewhere. Applying these constraints revealed a bimodal focal depth distribution in the southern Korean Peninsula, with primary concentrations at 5–12 and 12–22 km. Shallow earthquakes occur widely across the southern Korean Peninsula, whereas deeper events are preferentially located near the boundaries of the Okcheon fold belt and in the Gyeongsang basin. 
Our findings also highlight the need for caution when interpreting offshore focal depths, which may be biased by insufficient station geometry.
- New
- Research Article
- 10.1088/2053-1591/ae45f4
- Feb 13, 2026
- Materials Research Express
- Ajit Mohan Gaonkar + 1 more
Abstract Within the United Nations Sustainable Development Goals (SDGs), Goal 9 calls for industries to be made sustainable through increased resource-use efficiency and adoption of environmentally sound processes, while Goal 12 recognises that sustainable manufacturing promotes responsible production and reduces waste. Building on this motivation, the present study demonstrates the feasibility of using unprocessed beach sand as an abrasive for precision through-hole drilling of mild steel in Abrasive Jet Machining (AJM), supported by predictive modelling and multi-objective optimisation. A custom AJM system was developed, and experiments were conducted using a Taguchi L27 orthogonal array to investigate the effects of control parameters on responses: Material Removal Rate (MRR) and Kerf Taper Angle (KTA). Experimental runs revealed stable MRR and reduced KTA, outcomes which were not previously reported for ductile steel using natural abrasives. Random Forest and Extreme Gradient Boosting (XGBoost) models achieved high predictive accuracy (R² > 0.95 for MRR, > 0.80 for KTA), while independent multi-criteria decision-making (MCDM) methods converged on the same optimal parameter set. Scanning electron microscopy (SEM) images at these settings showed sharper edges and improved surface integrity. By replacing conventional abrasives such as silicon carbide and aluminium oxide with locally available beach sand, the work addresses resource efficiency and waste reduction. The integration of sustainable abrasive selection, robust predictive modelling, and decision-driven optimisation may thus present a viable pathway towards greener, high-precision machining of ductile metals.
- New
- Research Article
- 10.3390/agriengineering8020065
- Feb 12, 2026
- AgriEngineering
- Miguel Tueros + 9 more
The cultivation of potatoes is essential for rural food security, and the use of Unmanned Aerial Vehicle Red-Green-Blue (UAV-RGB) imagery allows for precise and cost-effective estimation of yield and identification of varieties, overcoming the limitations of manual assessment. We evaluated four INIA varieties (Bicentenario, Canchán, Shulay and Tahuaqueña) by integrating agronomic measurements (height, number and weight of tubers, leaf health) with color and textural indices derived from RGB orthomosaics. Yield prediction was modeled using Random Forest (RF) and Gradient Boosting (GB); varietal identification was approached with (i) a Convolutional Neural Network (CNN) that classifies RGB images and (ii) classical models such as Random Forest, Support Vector Machines (SVMs), K-Nearest Neighbors (KNNs), Decision Trees and Logistic Regression trained on EfficientNetB0 embeddings. The results showed significant genotypic differences in yield (p < 0.001): Tahuaqueña 13.86 ± 0.27 t ha−1 and Bicentenario 6.65 ± 0.27 t ha−1. The number of tubers (r = 0.52) and plant height (r = 0.23) correlated with yield; RGB indices showed low correlations (r < 0.3) and high redundancy (r > 0.9). RF achieved a better fit (Coefficient of determination, R2 = 0.54; Root Mean Square Error, RMSE = 2.72 t ha−1), excelling in stolon development (R2 = 0.66) and losing precision in maturation due to foliar senescence. In classification, the CNN and RF on embeddings achieved F1-macro ≈ 0.69 and 0.66 (Receiver Operating Characteristic—Area Under the Curve, ROC AUC RF = 0.89), with better identification of Bicentenario and Shulay. We conclude that UAV-RGB is a cost-effective alternative for phenotypic monitoring and varietal selection in high Andean contexts. These findings support the integration of UAV-RGB imagery into breeding and monitoring pipelines in resource-limited Andean systems.
- New
- Research Article
- 10.1080/15435075.2026.2628952
- Feb 12, 2026
- International Journal of Green Energy
- Gürkan Aydemir + 1 more
ABSTRACT Offshore wind power is a critical component of the global transition to renewable energy. However, the accuracy of power curve prediction, essential for both resource assessment and operational monitoring, is significantly hindered by the unique challenges of the marine environment, such as volatile wind conditions and complex nonlinear turbine dynamics. To overcome these limitations, this study presents a novel framework with a twofold methodological contribution. First, a meticulously optimized eXtreme Gradient Boosting (XGBoost) model is developed, establishing a new state-of-the-art performance benchmark for predicting offshore wind power using only standard environmental sensor data. Second, this high-performing model is leveraged to conduct a novel comparative analysis that reveals the fundamentally different feature dependencies of offshore versus inland turbines. This analysis uncovers the distinct environmental drivers crucial for context-specific modeling, an insight previously unexplored in the literature. Validation against real-world data demonstrates the model’s superiority; the proposed XGBoost approach achieved a Root Mean Square Error (RMSE) of 0.07422 for offshore prediction. This represents a significant performance improvement, reducing the error by 4.7% compared to the next-best model, k-Nearest Neighbor regression (kNN, RMSE 0.0777), and by up to 39% compared to the traditional Binning method (RMSE 0.12117). Consequently, the engineering value of this work lies in its dual achievement: it significantly improves the accuracy of power curve modeling for crucial industry tasks while accomplishing this with low-cost, readily available data. This positions the proposed approach as a practical and economically viable tool for enhancing the operational efficiency and reliability of offshore wind farms.
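The comparison above (a boosted model and kNN against the traditional binning method for power-curve estimation) can be sketched as below. The logistic power curve, noise level, and bin width are illustrative assumptions, and scikit-learn's `GradientBoostingRegressor` stands in for the tuned XGBoost model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
wind = rng.uniform(0, 25, 3000)                       # wind speed, m/s
# Idealised logistic power curve with noise (normalised power in [0, 1]).
power = 1 / (1 + np.exp(-(wind - 10))) + rng.normal(scale=0.05, size=3000)
X = wind.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = X[:2400], X[2400:], power[:2400], power[2400:]

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Traditional binning baseline: mean power within fixed 1 m/s wind-speed bins.
bins = np.linspace(0, 25, 26)
idx = np.clip(np.digitize(X_tr.ravel(), bins) - 1, 0, 24)
bin_mean = np.array([y_tr[idx == b].mean() if (idx == b).any() else 0.0
                     for b in range(25)])
idx_te = np.clip(np.digitize(X_te.ravel(), bins) - 1, 0, 24)
rmse_bin = rmse(y_te, bin_mean[idx_te])

rmse_gb = rmse(y_te, GradientBoostingRegressor(random_state=3)
               .fit(X_tr, y_tr).predict(X_te))
rmse_knn = rmse(y_te, KNeighborsRegressor(n_neighbors=15)
                .fit(X_tr, y_tr).predict(X_te))
print(round(rmse_gb, 4), round(rmse_knn, 4), round(rmse_bin, 4))
```

Binning suffers most where the curve is steep, since a single bin mean cannot follow the within-bin slope; smooth regressors do not have that constraint.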
- New
- Research Article
- 10.1038/s41598-026-36424-2
- Feb 12, 2026
- Scientific Reports
- Maytha Al-Ali + 5 more
Employee attrition poses significant challenges to organizations, impacting productivity, morale, and financial stability. Predicting attrition and understanding its underlying drivers are critical for implementing effective retention strategies. In this study, we propose a comprehensive framework that utilizes advanced machine learning techniques to predict employee attrition and job change likelihood. The framework integrates robust preprocessing pipelines, state-of-the-art predictive models, and explainability tools such as SHAP (SHapley Additive exPlanations) to ensure transparency and fairness in HR analytics. By addressing key challenges such as class imbalance, feature selection, and model interpretability, our approach provides actionable insights for proactive talent management. We evaluate the framework on multiple datasets (including the IBM HR Analytics Employee Attrition & Performance dataset and the HR Analytics: Job Change of Data Scientists dataset), achieving near-optimal performance metrics across diverse scenarios. Notably, the Adaptive Boosting (AB) and Histogram Gradient Boosting (HGB) models demonstrate superior performance, with high Precision, Recall, F1-score, and Accuracy. Global and local interpretability analyses using SHAP visualizations reveal critical predictors of attrition, such as OverTime, JobLevel, and JobSatisfaction, enabling targeted interventions. The results underscore the framework’s adaptability, scalability, and potential for real-time deployment in organizational settings. This study contributes to advancing HR analytics by bridging gaps in predictive accuracy, interpretability, and generalizability; offering practical solutions for mitigating employee turnover and safeguarding human capital investments.
- New
- Research Article
- 10.1142/s0218213026400038
- Feb 11, 2026
- International Journal on Artificial Intelligence Tools
- Alina Lazar + 2 more
The goal of this study was to evaluate the performance of traditional gradient boosting (GB) and neural network models on diverse tabular datasets that differ in scale, class balance, and feature composition (numerical, categorical, or mixed). We focused on six representative datasets: adult census income, bank marketing, credit card fraud, breast cancer diagnosis, diabetes, and in-vehicle coupon recommendation, each with distinct challenges related to dimensionality, sample size, and heterogeneity. We benchmark the predictive performance of XGBoost and LightGBM (gradient boosting models) against Multilayer Perceptrons (MLP), Tabular Transformers, and tabular prior-data fitted network (TabPFN), using metrics such as accuracy, F1 score, ROC-AUC, and log loss. To ensure transparency and interpretability, we applied SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) to all models and evaluated the explanation quality using stability, fidelity, and consistency criteria. Our findings confirm that gradient boosting models consistently achieve the best balance of performance, calibration, and interpretability across heterogeneous and imbalanced datasets. SHAP-based insights show that gradient boosting models provide more stable and interpretable feature attributions, making them well suited for high-stakes domains such as finance and healthcare. These results emphasize the practical advantages of gradient boosting methods for structured data tasks and highlight the interpretability limitations of deep learning models when applied to tabular datasets. Future work will explore hybrid architectures and pretraining strategies to close this performance gap.
- New
- Research Article
- 10.3390/rs18040563
- Feb 11, 2026
- Remote Sensing
- Mayra Perez-Flores + 9 more
To improve crop yields and incomes, farmers consistently adapt their practices to climate and market fluctuations, resulting in highly variable crop field distribution and coverage in space and time. As these dynamics illustrate farmers’ challenges, up-to-date crop-type mapping is essential for understanding farmers’ needs and supporting their adoption of sustainable practices. With global coverage and frequent temporal observations, remote sensing data are generally integrated into machine learning models to monitor crop dynamics. Unlike physics-based models, implementing machine learning models requires extensive user interaction. In this context, this study assesses how sensitive the models’ outputs are to feature selection and hyperparameter tuning, as both processes rely on user judgment. To achieve this, Sentinel-1 (S1) and Sentinel-2 (S2) features are integrated into five distinct models (Random Forest (RF), Support Vector Machine (SVM), Light Gradient Boosting (LGB), Histogram-based Gradient Boosting (HGB), and Extreme Gradient Boosting (XGB)), considering several feature selection (Variance Inflation Factor (VIF) and Sequential Feature Selector (SFS)) and hyperparameter tuning (Grid-Search) setups. Results show that the preprocessing feature selection (VIF) discards features that the wrapper method (SFS) retains, resulting in less reliable crop-type mapping. Additionally, hyperparameter tuning appears to be sensitive to the input features, and applying it after any feature selection improved the crop-type mapping. In this context, a three-step nested modeling setup, consisting of initial hyperparameter tuning, followed by wrapper feature selection (SFS) and additional hyperparameter tuning, leads to the most reliable model outputs. 
For the study region, LGB and XGB (SVM) are the most (least) suitable models for crop-type mapping, and model reliability improves when integrating S1 and S2 features rather than considering S1 or S2 alone. Finally, crop-type maps are derived across different regions and time periods to highlight the benefits of the proposed method for monitoring crop dynamics in space and time.
- New
- Research Article
- 10.4108/airo.10265
- Feb 11, 2026
- EAI Endorsed Transactions on AI and Robotics
- Jahid Hassan Akash + 5 more
The identification of tumor-homing peptides (THPs) plays a pivotal role in the development of targeted cancer therapies and precision medicine. Current THP identification methods still suffer from limited feature representation, moderate predictive performance, and insufficient generalization, highlighting the need for more robust ensemble frameworks. In this study, we propose STHPP, an innovative stacking-based ensemble machine learning approach designed to improve the accuracy and reliability of THP discovery. Two benchmark datasets, referred to as the "main" and "small" datasets of Shoombuatong, were collected, merged, and pre-processed to create a larger dataset, which was then split into training and test sets. The STHPP model applies a two-layer ensemble architecture: the first layer aggregates three heterogeneous base classifiers, Random Forest (RF), Light Gradient Boosting Machine (LightGBM), and Extreme Gradient Boosting (XGBoost), and the second layer applies CatBoost as a meta-classifier to post-process the predictions of the base models. The two-layer architecture exploits model diversity and ensemble-learning principles to enhance generalization performance. The proposed STHPP framework achieved outstanding performance, with accuracy 0.98, precision 0.97, sensitivity 0.99, specificity 0.97, and a Matthews Correlation Coefficient (MCC) of 0.98. These results surpass current state-of-the-art approaches, illustrating the effectiveness of the stacking strategy for complex peptide classification problems. These findings showcase the potential of STHPP as a robust and scalable computational platform for advancing peptide-based drug discovery and targeted oncology.
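A two-layer stacking ensemble of this kind can be sketched with scikit-learn's `StackingClassifier`, which trains the meta-classifier on out-of-fold predictions of the base learners. To keep the sketch dependency-free, sklearn's `GradientBoostingClassifier` stands in for LightGBM/XGBoost and a logistic regression replaces the CatBoost meta-classifier; the data are synthetic stand-ins for peptide features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# Synthetic stand-in for encoded peptide features.
X, y = make_classification(n_samples=800, n_features=20, n_informative=8,
                           random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=5)

# Layer 1: heterogeneous base learners; layer 2: a meta-classifier trained on
# their out-of-fold predictions (cv=5 handled inside StackingClassifier).
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=5)),
                ("gb", GradientBoostingClassifier(random_state=5))],
    final_estimator=LogisticRegression(),
    cv=5)
stack.fit(X_tr, y_tr)
mcc = matthews_corrcoef(y_te, stack.predict(X_te))
print(round(mcc, 3))
```

MCC is reported here because, as in the abstract, it summarizes all four confusion-matrix cells and stays informative under class imbalance.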
- New
- Research Article
- 10.1007/s12094-026-04230-x
- Feb 11, 2026
- Clinical & translational oncology : official publication of the Federation of Spanish Oncology Societies and of the National Cancer Institute of Mexico
- Siyun Lu + 5 more
Endometrial cancer (EC) is a common gynecological tumor. Insulin resistance (IR) increases the risk of EC. However, the common molecular basis between the two remains unclear. This study aims to screen the common differentially expressed genes (DEGs) between the two diseases and construct a prognostic risk model. We obtained gene expression profiles and clinical information of patients with IR and EC from the GEO and TCGA datasets. We performed differential analysis to discover the shared DEGs between IR and EC. Subsequently, the interactions among overlapping DEGs, along with their biological functions and genetic mutations in EC, were comprehensively analyzed via a protein-protein interaction (PPI) network, function enrichment analyses, and genetic mutation analyses. Then, machine-learning algorithms were employed to identify genes significantly associated with survival. For clinical application, we constructed a prognostic risk model and also compared tumor-infiltrating immune cells (TIICs) and genetic mutation between high- and low-risk groups. Finally, we screened one of the most important markers in the prognostic signature to investigate its expression-prognosis pattern, biological function, and underlying mechanism. Our analysis identified 20 co-upregulated genes and 32 co-downregulated genes shared by IR and EC. In addition, two subnetworks and the top 20 genes were obtained through PPI analysis, while extracellular matrix construction and immune response were the most enriched functions of the DEGs. Filtered by random forest, gradient boosting machine, and extreme gradient boosting, six upregulated markers (ACTL8, WNT7A, CTSV, MMP9, CNIH2, and PLAUR) and four downregulated markers (COL6A6, MYOC, PHLDB1, and FIBIN) were defined as the characteristic genes for the prognosis of EC patients. The risk prediction model constructed from these ten genes had good predictive value for the prognosis of EC patients and was related to immune regulation and genetic mutation.
ACTL8 was further studied as the most significant marker in the 10-gene signature. The correlation between the upregulation of ACTL8 and the poor prognosis of EC patients suggested a carcinogenic effect, which was correlated with its regulation of cilium movement. Our findings suggest that there are common molecular profiles between IR and EC. The IR-related prognostic model represents an excellent prognosis predictor and immune-related biomarker, which can be applied to risk stratification and precise treatment of EC patients with IR.
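The risk-model idea described above can be sketched as a weighted sum of the ten signature genes' expression, with patients split into high- and low-risk groups at the median score. The coefficients and expression values below are hypothetical illustrations, not the fitted values from the study; only the signs follow the reported up/down direction of each gene.

```python
# Minimal sketch of a gene-signature risk score: a weighted sum of
# per-patient expression values, stratified at the median score.
# The coefficient magnitudes and expression matrix are made up for
# illustration; only the signs reflect the reported signature
# (positive for upregulated risk genes, negative for downregulated).
import numpy as np

genes = ["ACTL8", "WNT7A", "CTSV", "MMP9", "CNIH2", "PLAUR",
         "COL6A6", "MYOC", "PHLDB1", "FIBIN"]
coef = np.array([0.40, 0.30, 0.25, 0.20, 0.15, 0.10,   # upregulated
                 -0.30, -0.25, -0.20, -0.15])          # downregulated

rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 10))        # 50 patients x 10 genes
risk = expr @ coef                      # per-patient risk score
high_risk = risk > np.median(risk)      # median split into two groups
print(f"high-risk patients: {high_risk.sum()} / {len(risk)}")
```

With an even number of patients and continuous scores, the median split yields two equal groups, which is the usual setup for the downstream immune-infiltration and mutation comparisons.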
- New
- Research Article
- 10.3390/math14040626
- Feb 11, 2026
- Mathematics
- Hongwen Gu + 1 more
Understanding and preventing student dropout presents a decision-critical modeling problem involving heterogeneous variables, nonlinear relationships, and the need for transparent inference. This study addresses the prediction of undergraduate academic outcomes, including Graduation, Enrolled, and Dropout, by proposing an efficient and interpretable machine learning framework that explicitly balances predictive performance, feature efficiency, and algorithmic explainability. The empirical analysis relies on a dataset of 4424 student records across 17 undergraduate programs from the Polytechnic Institute of Portalegre, Portugal. In contrast to existing approaches that rely on high-dimensional input spaces and opaque predictive architectures, we develop a reduced-dimensional classification pipeline based on recursive feature elimination with Gradient Boosting and Random Forest models. Starting from a comprehensive set of demographic, academic, and financial indicators, only 20 informative predictors are retained for model construction, substantially reducing input complexity while preserving predictive capacity. Comparative evaluation across multiple learning algorithms identifies Gradient Boosting as the most effective model, achieving an AUC of 0.891. Beyond predictive accuracy, the proposed framework emphasizes model interpretability through the integration of SHapley Additive exPlanations (SHAP), enabling quantitative attribution of feature contributions at both global and instance levels. The analysis reveals that second-semester academic engagement variables, including the number of courses approved, evaluated, and enrolled, as well as tuition fee payment status and age at enrollment, are the dominant factors shaping student outcomes. Overall, the results demonstrate that strong classification performance can be achieved using a compact feature set while maintaining transparent and explainable model behavior.
By combining mathematically grounded feature selection with principled model explanation, this study advances methodological understanding of how efficiency, interpretability, and predictive accuracy can be jointly optimized in applied machine learning, with implications for decision-support systems in educational analytics.
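The reduced-dimensional pipeline can be sketched with scikit-learn's `RFE` wrapping a Gradient Boosting ranker, retaining 20 predictors before the final classifier. This is a simplified binary sketch under assumptions: synthetic data stands in for the Portalegre student records, the three-class outcome is collapsed to two classes, and the SHAP explanation step is omitted.

```python
# Sketch of recursive feature elimination (RFE) with Gradient Boosting:
# rank features by importance, drop the weakest recursively, and fit
# the final classifier on the 20 retained predictors. Synthetic data
# replaces the student records (an assumption for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=36,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

pipe = Pipeline([
    # Recursively eliminate features using the booster's
    # feature_importances_ until 20 predictors remain.
    ("rfe", RFE(GradientBoostingClassifier(random_state=0),
                n_features_to_select=20)),
    # Final classifier trained on the reduced feature set.
    ("clf", GradientBoostingClassifier(random_state=0)),
])
pipe.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
print(f"AUC on held-out data: {auc:.3f}")
```

Wrapping both steps in a `Pipeline` keeps the feature selection inside the training fold, avoiding the leakage that occurs when features are selected on the full dataset before splitting.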
- New
- Research Article
- 10.3389/fepid.2026.1696282
- Feb 11, 2026
- Frontiers in Epidemiology
- Claris Shoko + 2 more
Introduction The COVID-19 pandemic posed significant challenges for public health systems, especially in Africa, where data scarcity, inadequate healthcare infrastructure, and regional disparities hindered effective forecasting and response efforts. Conventional forecasting methods have struggled to capture the complexity and detail necessary for effective policy interventions at various administrative levels. This study examines the challenge of producing accurate and coherent forecasts of COVID-19 cases within the hierarchical structure of Africa, which includes the continental, regional, and national levels. Methods We establish a comprehensive forecasting model that uses hierarchical time series forecasting (HTSF) with a bottom-up reconciliation approach, augmented by machine learning algorithms. We employ extreme gradient boosting (XGBoost) and random forest models, and further improve predictive accuracy via a weighted average ensemble method. Forecasts are produced at the national level and then aggregated to ensure consistency across all hierarchical levels. The models are evaluated against conventional methods such as ARIMA and exponential smoothing. Results Empirical findings indicate that XGBoost is the best of the single forecast models used in this study; combining the XGBoost and random forest forecasts, with greater weight assigned to XGBoost, surpasses all other models in mean absolute error, root mean square error, and mean absolute scaled error. Results further revealed that Southern Africa, despite its low population density, reported the highest number of cases, indicating underlying health vulnerabilities and socioeconomic factors. In summary, the bottom-up HTSF method, when combined with machine learning, serves as an effective tool for forecasting in environments with limited data availability.
Discussion It is advisable to apply similar models to other infectious diseases and to expand their use to guide health interventions, resource allocation, and early warning systems in future pandemics.
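The bottom-up reconciliation and weighted-average ensemble can be sketched in a few lines of NumPy: two models' national-level forecasts are blended (with more weight on the stronger model, as with XGBoost in the study) and then summed up the hierarchy so regional and continental totals stay coherent with the national figures. The numbers, weight, and country-to-region map below are illustrative assumptions, not real COVID-19 data.

```python
# Sketch of a weighted-average ensemble followed by bottom-up
# hierarchical reconciliation. All values are illustrative; the
# 0.7 weight and the region mapping are assumptions, not fitted.
import numpy as np

# Two models' national forecasts for five countries (one horizon step)
xgb_like = np.array([120.0, 80.0, 45.0, 200.0, 60.0])
rf_like  = np.array([110.0, 95.0, 50.0, 180.0, 70.0])

# Weighted-average ensemble, favouring the stronger model
w = 0.7
national = w * xgb_like + (1 - w) * rf_like

# Bottom-up reconciliation: aggregate national forecasts to regions,
# then sum the regions to obtain the continental forecast.
region_of = np.array([0, 0, 1, 1, 1])      # country -> region index
regional = np.bincount(region_of, weights=national)
continental = regional.sum()

print("national:", national)
print("regional:", regional)
print("continental:", continental)
```

Because every higher level is computed as a sum of the national base forecasts, the hierarchy is coherent by construction: regional forecasts add up to the continental total exactly, which is the defining property of the bottom-up approach.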