Important Predictor Variables Research Articles

Background and Objective: Colorectal cancer is a major health concern. It is now the third most common cancer and the fourth leading cause of cancer mortality worldwide. The aim of this study was to evaluate the performance of machine learning algorithms for predicting the survival of colorectal cancer patients 1 to 5 years after diagnosis, and to identify the most important variables.

Methods: A sample of 1236 patients diagnosed with colorectal cancer and 118 predictor variables was used. The outcome of interest was a binary variable indicating whether or not the patient survived the number of years in question. Twenty predictor variables were selected using the mutual information score with the outcome. We implemented 11 machine learning algorithms and evaluated their performance with a 5 by 2-fold cross-validation with stratified folds and with paired Student's t-tests. We compared the results with the Kaplan-Meier estimator and Cox proportional hazards regression.

Results: Using the 20 most important predictor variables for each of the survival years, the logistic regression algorithm achieved an area under the receiver operating characteristic curve of 0.850 (0.014 SD, 0.840-0.860 95% CI) for the 1-year and 0.872 (0.014 SD, 0.861-0.882 95% CI) for the 5-year survival prediction. Using only the 5 most important predictor variables, the corresponding values are 0.793 (0.020 SD, 0.778-0.807 95% CI) and 0.794 (0.011 SD, 0.785-0.802 95% CI). The most important variables for 1-year prediction were number of R residual, M distant metastasis, overall stage, probable recurrence within 5 years, and tumour length, whereas for 5-year prediction the most important were probable recurrence within 5 years, R residual, M distant metastasis, number of positive lymph nodes, and palliative chemotherapy. Biomarkers do not appear among the 20 most important variables. For all survival intervals, the probability of the top model agrees with the Kaplan-Meier estimate, both within one standard deviation and within the 95% confidence interval.

Conclusions: The findings suggest that machine learning algorithms can predict the survival probability of colorectal cancer patients and can be used to inform patients and assist decision-making in clinical care management. In addition, this study reveals the most essential variables for estimating short- and long-term survival among patients with colorectal cancer.
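The abstract does not say how the 95% confidence intervals were derived. As a worked check, the sketch below assumes they are Student's t intervals computed over the 10 scores produced by a 5 by 2-fold cross-validation (9 degrees of freedom). The function name `t_interval` and the hard-coded critical value are illustrative choices, not taken from the article.

```python
import math

# Two-sided 95% critical value of Student's t with 9 degrees of freedom.
# Assumption: the interval is built from 10 scores (5 repetitions x 2 folds).
T_CRIT_95_DF9 = 2.262


def t_interval(mean, sd, n=10, t_crit=T_CRIT_95_DF9):
    """Return (low, high) for mean +/- t_crit * sd / sqrt(n)."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# 1-year survival, 20 predictor variables: mean AUC 0.850, SD 0.014.
low, high = t_interval(0.850, 0.014)
print(f"{low:.3f}-{high:.3f}")  # prints 0.840-0.860
```

Under this assumption the 1-year interval reproduces the reported 0.840-0.860. The other three reported intervals come out within 0.001 of the same formula, which would be consistent with the published means and standard deviations having been rounded before printing.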


Background: In individuals or animals suffering from genetic or acquired diseases, it is important to identify which clinical or phenotypic variables can be used to discriminate between disease and non-disease states, responses to treatments, or sexual dimorphism. However, the data often suffer from a low number of samples, a high number of variables, or unbalanced experimental designs. Moreover, several parameters can be recorded in the same test. Thus, correlations should be assessed, and a more complex statistical framework is necessary for the analysis. Packages that provide analysis tools already exist, but they are not found together, which makes choosing and implementing a method difficult for non-statisticians.

Results: We present Gdaphen, a fast joint pipeline for identifying the most important qualitative and quantitative predictor variables to discriminate between genotypes, treatments, or sex. Gdaphen takes behavioral/clinical data as input and uses a Multiple Factor Analysis (MFA) to deal with groups of variables recorded from the same individuals or to anonymize genotype-based recordings. As optimized input, Gdaphen uses the non-correlated variables with a correlation of 30% or higher on the MFA-Principal Component Analysis (PCA), which increases the discriminative power and the efficiency of the classifier's predictive model. Gdaphen can determine the strongest variables that predict gene dosage effects using General Linear Model (GLM)-based classifiers, or determine the most discriminative non-linearly distributed variables using a Random Forest (RF) implementation. Moreover, Gdaphen reports the efficacy of each classifier and offers several visualization options to help understand and support the results, as easily readable plots ready to be included in publications. We demonstrate Gdaphen's capabilities on several datasets and provide easy-to-follow vignettes.

Conclusions: Gdaphen makes the analysis of phenotypic data much easier for medical or preclinical behavioral researchers, providing an integrated framework to perform: (1) pre-processing steps such as data imputation or anonymization; (2) a full statistical assessment to identify which variables are the most important discriminators; and (3) state-of-the-art visualizations ready for publication to support the conclusions of the analyses. Gdaphen is open-source and freely available at https://github.com/munizmom/gdaphen, together with vignettes, documentation for the functions, and examples to guide your own implementation.
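The 30% correlation criterion mentioned in the abstract can be illustrated with a small sketch. This is not Gdaphen's code or API (Gdaphen is an R pipeline); it is a plain Python illustration that assumes the component scores have already been computed by some MFA/PCA step, and that a variable is kept when its absolute Pearson correlation with at least one component reaches 0.30. The names `pearson` and `select_variables` and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def select_variables(variables, components, threshold=0.30):
    """Keep variables whose |r| with at least one component reaches the threshold."""
    return [
        name
        for name, values in variables.items()
        if any(abs(pearson(values, comp)) >= threshold for comp in components)
    ]


# Toy data, for illustration only: one precomputed component, two variables.
component_scores = [[1, 2, 3, 4, 5, 6]]
toy_variables = {
    "weight": [2, 4, 5, 8, 10, 12],  # tracks the component closely
    "noise": [1, 3, 3, 1, 1, 3],     # weakly related to the component
}
print(select_variables(toy_variables, component_scores))  # prints ['weight']
```

The threshold is exposed as a parameter because the abstract states the 30% value but not how the correlation is computed or compared; the exact rule should be checked in the Gdaphen documentation.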



Articles published on Important Predictor Variables

University Students Attitudes toward Same-Sex Marriage Adoption in Taiwan

Interpretable Machine Learning Model Predicting Early Neurological Deterioration in Ischemic Stroke Patients Treated with Mechanical Thrombectomy: A Retrospective Study

An international comparison of haemoglobin deferral prediction models for blood banking.

Functional Predictor Variables for the Leaching Potential of Arsenic and Selenium from Coal Fly Ash

Using ensemble model to predict isothermal hydration heat of fly ash cement paste considering fly ash content, water to binder ratio and curing temperature

Machine learning-based modeling of surface sediment concentration in Doce river basin

Challenge, threat, coping potential: How primary and secondary appraisals of job demands predict nurses' affective states during the COVID-19 pandemic.

Artificial intelligence based personalized predictive survival among colorectal cancer patients

Surrogate tree ensemble model representing 2D population doses over complex terrain in the event of a radiological release into the air

Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data

Comparative performance of Sentinel-2 MSI and Landsat-8 OLI data in canopy cover prediction using Random Forest model: Comparing model performance and tuning parameters

On the value of expert knowledge in estimation and forecasting of solar photovoltaic power generation

New insights into the benthic macrofauna composition and structure in a southern-west Mediterranean coastal lagoon after restoration actions: Spatial and Seasonal patterns

Estimating the mean cutting force of conical picks using random forest with salp swarm algorithm

Predicting the occurrence of an endangered salamander in a highly urbanized landscape

A data-driven on-site injury severity assessment model for car-to-electric-bicycle collisions based on positional relationship and random forest

The Quality of Life of Seniors with Eye Diseases during COVID-19.

Spatial modeling of two mosquito vectors of West Nile virus using integrated nested Laplace approximations

Extreme Rainfall and Flood Risk Prediction over the East Coast of South Africa

Analysis of the Correlation Between Healthy Lifestyle Patterns and Work Stress in Psychiatric Nurses

