Comparing Logistic Regression, Multinomial Regression, Classification Trees and Random Forests Applied to Ternary Variables

Abstract

The authors apply logistic regression, multinomial regression, classification trees and random forests to a ternary outcome variable: the variation between the ’s-genitive, the of-genitive and functionally equivalent noun + noun combinations. The statistical approaches discussed fall into regression models on the one hand and tree-based models on the other. Specifically, as an alternative to successive binomial regression analyses, the authors implement a multinomial model, which can analyse the entire dataset with all three outcome categories simultaneously. Further, a basic classification tree is calculated alongside a more complex (and more robust) random forest. The chapter not only weighs the advantages and shortcomings of all four models but also explicates the different rationales and interpretations that come with them. As a major insight, it emerges that the nature of the dataset, the analytic purpose and the statistical model are interdependent and condition each other in several non-trivial respects.
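
As a rough illustration of how a multinomial model covers all three outcome categories at once, the following minimal sketch computes category probabilities under a baseline-category logit. The choice of reference category, the two predictors and every coefficient value are hypothetical and are not taken from the chapter.

```python
import math


def multinomial_probs(x, coefs, reference="of-genitive"):
    """Outcome probabilities under a baseline-category multinomial logit.

    coefs maps each non-reference category to (intercept, weights).
    The reference category has a linear predictor fixed at 0.
    """
    scores = {reference: 0.0}
    for category, (intercept, weights) in coefs.items():
        scores[category] = intercept + sum(w * xi for w, xi in zip(weights, x))
    # Softmax; subtracting the maximum score keeps exp() numerically stable.
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Hypothetical coefficients for two made-up numeric predictors.
coefs = {
    "s-genitive": (0.2, [1.5, -0.4]),
    "noun+noun": (-0.5, [-0.8, -0.1]),
}
probs = multinomial_probs([1.0, 2.0], coefs)
print(probs)
```

Each non-reference category has its own coefficient vector, so a single model yields three probabilities that sum to 1 for every observation.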

Similar Papers
  • Research Article
  • Cite count: 2
  • 10.25165/ijabe.v12i3.4325
Spectral difference analysis and identification of different maturity blueberry fruit based on hyperspectral imaging using spectral index
  • Jun 5, 2019
  • International Journal of Agricultural and Biological Engineering
  • Hao Ma + 5 more

Hyperspectral imaging, with its many narrow spectral bands, is well suited to detecting and classifying objects, and it has become a research hotspot in near-ground remote sensing. However, its high computing demands and complex instrument operation remain a bottleneck for applying the technology in the field. Band selection is a common way to reduce the dimensionality of the hyperspectral image cube and to simplify the design of spectral imaging instruments. In this research, hyperspectral images of blueberry fruit were collected both in the laboratory and in the field. A set of spectral bands was selected by analyzing the differences among blueberry fruits at different growth stages and backgrounds. Furthermore, a normalized spectral index was constructed from the selected bands to identify the three growth stages of blueberry fruit, aiming to eliminate the impact of the background, including leaves, branches, soil and illumination variation. Three classifiers, spectral angle mapping (SAM), multinomial logistic regression (MLR) and a classification tree, were used to verify the identification results. The detection accuracy was 82.1% for the SAM classifier using all spectral bands, 88.5% for the MLR classifier using the selected bands and 89.8% for the decision tree using the spectral index. The results indicated that the normalized spectral index can both lower computational complexity and reduce the impact of noisy backgrounds in the field.
Keywords: spectral difference analysis, hyperspectral imaging, spectral index, band selection, blueberry fruit identification
DOI: 10.25165/j.ijabe.20191203.4325
Citation: Ma H, Zhao K X, Jin X, Ji J T, Qiu Z M, Gao S. Spectral difference analysis and identification of different maturity blueberry fruit based on hyperspectral imaging using spectral index. Int J Agric & Biol Eng, 2019; 12(3): 134–140.

  • Research Article
  • Cite count: 1
  • 10.1007/s00586-024-08132-w
Development and external validation of a predictive model for prolonged length of hospital stay in elderly patients undergoing lumbar fusion surgery: comparison of three predictive models.
  • Jan 30, 2024
  • European spine journal : official publication of the European Spine Society, the European Spinal Deformity Society, and the European Section of the Cervical Spine Research Society
  • Shuai-Kang Wang + 6 more

This study aimed to develop a predictive model for prolonged length of hospital stay (pLOS) in elderly patients undergoing lumbar fusion surgery, utilizing multivariate logistic regression, single classification and regression tree (hereafter, "classification tree") and random forest machine-learning algorithms. This study was a retrospective review of a prospective Geriatric Lumbar Disease Database. The primary outcome measure was pLOS, defined as a LOS greater than the 75th percentile. All patients were assigned to either the pLOS group or the non-pLOS group. Three models (logistic regression, single classification tree and random forest) for predicting pLOS were developed using a training dataset and internally validated using a testing dataset. Finally, an online tool based on our model was developed to assess its validity in the clinical setting (external validation). The development set included 1025 patients (mean [SD] age, 72.8 [5.6] years; 632 [61.7%] female), and the external validation set included 175 patients (73.2 [5.9] years; 97 [55.4%] female). Multivariate logistic analyses revealed that older age (odds ratio [OR] 1.06, p < 0.001), higher BMI (OR 1.08, p = 0.002), number of fused segments (OR 1.41, p < 0.001), longer operative time (OR 1.02, p < 0.001), and diabetes (OR 1.05, p = 0.046) were independent risk factors for pLOS in elderly patients undergoing lumbar fusion surgery. The single classification tree identified operative time ≥ 232 min, delayed ambulation, and BMI ≥ 30 kg/m² as particularly influential predictors of pLOS. A random forest model was developed using the remaining 14 variables. Intraoperative EBL, operative time, delayed ambulation, age, number of fused segments, BMI, and RBC count were the most significant variables in the final model. The predictive ability of our three models was comparable, with no significant differences in AUC (0.73 vs. 0.71 vs. 0.70, respectively). The logistic regression model had a higher net benefit for clinical intervention than the other models. A nomogram was developed, and the C-index of external validation for pLOS was 0.69 (95% CI, 0.65-0.76). This investigation produced three predictive models for pLOS in elderly patients undergoing lumbar fusion surgery. The predictive ability of our three models was comparable. The logistic regression model had a higher net benefit for clinical intervention than the other models. Our predictive model could inform physicians about elderly patients with a high risk of pLOS after surgery.
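
The outcome definition in this abstract (LOS above the 75th percentile) can be sketched as follows. This is a minimal illustration under assumptions: the stay lengths are invented, and the quartile method ("inclusive") is one of several conventions, which the abstract does not specify.

```python
import statistics


def label_plos(los_days):
    """Flag stays longer than the 75th percentile of the sample.

    Returns the threshold and one boolean per patient.
    The "inclusive" quantile method is an assumption; other
    percentile conventions give slightly different cut-offs.
    """
    q3 = statistics.quantiles(los_days, n=4, method="inclusive")[2]
    return q3, [days > q3 for days in los_days]


# Hypothetical lengths of stay in days (not data from the study).
los_days = [3, 4, 5, 6, 7, 8, 9, 10, 21]
threshold, is_plos = label_plos(los_days)
print(threshold, is_plos)
```

Because the comparison is strict, a stay exactly at the threshold is not labelled pLOS, which matches the wording "greater than the 75th percentile".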

  • Research Article
  • Cite count: 54
  • 10.1016/j.geodrs.2014.07.001
Refining a reconnaissance soil map by calibrating regression models with data from the same map (Normandy, France)
  • Jul 23, 2014
  • Geoderma Regional
  • Fanny Collard + 7 more


  • Research Article
  • Cite count: 5
  • 10.30865/mib.v6i4.4546
Work Readiness Prediction of Telkom University Students Using Multinomial Logistic Regression and Random Forest Method
  • Oct 25, 2022
  • JURNAL MEDIA INFORMATIKA BUDIDARMA
  • Haura Athaya Salka + 1 more

Work readiness is essential for college graduates who want to find a job immediately after graduation. In practice, however, many graduates are unemployed after graduation or do not get jobs that match the majors they studied for more than four years. Therefore, using a people analytics approach, this study aims to predict the work readiness of Telkom University students and to find out which factors affect student work readiness after graduation. The model built is a multi-class classification model. It uses the chi-square test for feature selection, Multinomial Logistic Regression and Random Forest as classification methods, and a confusion matrix as the evaluation method. Multinomial Logistic Regression is used because several studies apply this algorithm to categorical data, while Random Forest is used to compare which model produces better accuracy. This study conducted several test scenarios and obtained the best model by performing hyperparameter tuning and handling imbalanced data with SMOTE-ENN. SMOTE-ENN is used to improve accuracy scores and to predict classes well, especially the minority class. The best accuracy of the Multinomial Logistic Regression method is 53.9%, and that of Random Forest is 48.5%.

  • Research Article
  • Cite count: 2
  • 10.36390/telos271.08
Estudio del rendimiento académico mediante la comparación de modelos de regresión y árboles de clasificación
  • Jan 15, 2025
  • Telos: Revista de Estudios Interdisciplinarios en Ciencias Sociales
  • Johanna Enith Aguilar-Reyes + 3 more

This article aims to identify the factors that affect academic performance by comparing regression models and decision trees. The methodology adopted is quantitative in nature, focused on the collection of numerical data and its statistical analysis, in order to evaluate the relationships between different variables and determine which factors influence academic performance. The population studied comprises remedial students in the statistics degree program, whose data underwent an exploratory and descriptive analysis. Two modeling techniques were used: multinomial logistic regression and classification trees. The variables evaluated included sociodemographic factors, previous academic performance, and characteristics of the educational environment. The results showed that the logistic regression model achieved 100% accuracy with an AUC of 1, indicating perfect classification ability. In comparison, the classification tree model had an accuracy of 70.83% with an AUC of 0.7042, reflecting moderate classification ability. From these results, key factors that affect academic performance were identified, such as study habits, interest in the degree program and psychological aspects. In conclusion, multinomial logistic regression was more effective and accurate in analyzing the quantitative relationships between the variables that affect academic performance, outperforming the classification tree method.

  • Research Article
  • Cite count: 133
  • 10.1016/j.geomorph.2017.02.015
Comparing the efficiency of digital and conventional soil mapping to predict soil types in a semi-arid region in Iran
  • Feb 24, 2017
  • Geomorphology
  • Mojtaba Zeraatpisheh + 3 more


  • Research Article
  • Cite count: 9
  • 10.1007/s00787-019-01334-4
Predicting mental health improvement and deterioration in a large community sample of 11- to 13-year-olds
  • May 3, 2019
  • European Child & Adolescent Psychiatry
  • Miranda Wolpert + 9 more

Of children with mental health problems who access specialist help, 50% show reliable improvement on self-report measures at case closure and 10% reliable deterioration. To contextualise these figures it is necessary to consider rates of improvement for those in the general population. This study examined rates of reliable improvement/deterioration for children in a school sample over time. N = 9074 children (mean age 12; 52% female; 79% white) from 118 secondary schools across England provided self-report mental health (SDQ), quality of life and demographic data (age, ethnicity and free school meals (FSM)) at baseline and 1 year, and self-report data on access to mental health support at 1 year. Multinomial logistic regressions and classification trees were used to analyse the data. Of 2270 (25%) scoring above threshold for mental health problems at outset, 27% reliably improved and 9% reliably deteriorated at 1-year follow up. Of 6804 (75%) scoring below threshold, 4% reliably improved and 12% reliably deteriorated. Greater emotional difficulties at outset were associated with greater rates of reliable improvement for both groups (above threshold group: OR = 1.89, p < 0.001, 95% CI [1.64, 2.17], below threshold group: OR = 2.23, p < 0.001, 95% CI [1.93, 2.57]). For those above threshold, higher baseline quality of life was associated with greater likelihood of reliable improvement (OR = 1.28, p < 0.001, 95% CI [1.13, 1.46]), whilst being in receipt of FSM was associated with reduced likelihood of reliable improvement (OR = 0.68, p < 0.01, 95% CI [0.53, 0.88]). For the group below threshold, being female was associated with increased likelihood of reliable deterioration (OR = 1.20, p < 0.025, 95% CI [1.00, 1.42]), whereas being from a non-white ethnic background was associated with decreased likelihood of reliable deterioration (OR = 0.66, p < 0.001, 95% CI [0.54, 0.80]). For those above threshold, almost one in three children showed reliable improvement at 1 year. The extent of emotional difficulties at outset showed the highest associations with rates of reliable improvement.

  • Research Article
  • Cite count: 6
  • 10.2147/oams.s69707
Identifying determinants and estimating the risk of inadequate and excess gestational weight gain using a multinomial logistic regression model
  • Dec 1, 2014
  • Open Access Medical Statistics
  • Joseph Beyene + 2 more

Binod Neupane (1), Sarah D McDonald (1,2), Joseph Beyene (1)
(1) Department of Clinical Epidemiology and Biostatistics, (2) Department of Obstetrics and Gynecology and Radiology, McMaster University, Hamilton, ON, Canada

Abstract: When there are three or more nominal categories of a response variable, the binomial logistic regression approach is widely used to model the relationships of exposure variables with different binomial responses one at a time. However, some of the separate binomial comparisons would be redundant. This approach is also suboptimal because of the loss of information that results when only a subset of the data is analyzed at a time and because of the multiple testing problems that arise from analyzing several pairs of categories. These drawbacks of fitting separate binomial regression models to a multicategory nominal outcome variable can be overcome using a single multinomial regression modeling framework. In this study, we compared the results of a multinomial regression with those of two separate binomial regressions to determine factors associated with excess and inadequate weight gain during pregnancy in a data set from a gestational weight gain study involving a cross-sectional survey of 312 women with singleton pregnancies. We found that both approaches identified the same set of predictors of excessive (versus appropriate) weight gain with P-values ≤0.05, ie, higher neuroticism, planning to gain more weight than the recommended level, and bedtime television watching; for this comparison the subgroup size was moderate. The final list of significant predictors of inadequate (versus appropriate) weight gain identified by multinomial regression comprised planned weight gain below the recommended range, being overweight or obese, and bedtime television watching, while the predictors identified by a separate binomial approach were self-efficacy towards achieving healthy weight, lack of weight satisfaction, and bedtime television watching. The two approaches thus differed where the final set of predictors was identified by a variable selection process and the comparisons were made in a small subgroup. A multinomial approach is a useful analytical framework that researchers may consider when they have multinomial response categories: it allows nonredundant comparisons to be made, avoids the need to analyze a subset of the data one at a time, allows for risk prediction of multinomial categories from a well validated multinomial model, and does not lead to multiple testing problems.
Keywords: gestational weight gain, pregnancy, multinomial logistic regression, binomial logistic regression, risk factors

  • Research Article
  • Cite count: 12
  • 10.1111/jgh.15478
Prediction model for bleeding after endoscopic submucosal dissection of gastric neoplasms from a high-volume center.
  • Mar 9, 2021
  • Journal of gastroenterology and hepatology
  • Yeon Hwa Choe + 6 more

Bleeding after endoscopic submucosal dissection (ESD) is a main adverse event. Although there have been several studies of risk factors for post-ESD bleeding, there have been few predictive models for post-ESD bleeding based on a large number of cases. We aimed to design a prediction model for post-ESD bleeding using a classification tree model. We analyzed a prospectively established cohort of patients with gastric neoplasms treated with ESD from 2007 to 2016. Baseline characteristics were collected for a total of 5080 patients, and the bleeding risk was estimated using various statistical methods such as logistic regression, AdaBoost, and random forest. To investigate how bleeding was affected by independent predictors, the classification and regression tree (CART) method was used. The prediction tree developed for the cohort was internally validated. Post-ESD bleeding occurred in 262 of 5080 patients (5.1%). In multivariate logistic regression, ongoing antithrombotic use during the procedure, cancer pathology, and piecemeal resection were significant risk factors for post-ESD bleeding. In the CART model, the decisive variables were ongoing antithrombotic agent use, resected specimen size ≥ 49 mm, and patient age < 62 years. The CART model accuracy was 94.9%, and the cross-validation accuracy was 94.8%. We developed a simple and easy-to-apply predictive tree model based on three risk factors that could help endoscopists identify patients at a high risk of bleeding. This model will enable clinicians to establish precise management strategies for patients at a high risk of bleeding and to prevent post-ESD bleeding.
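
When reading an accuracy figure for a rare outcome, it can help to compare it with the accuracy of always predicting the majority class. The sketch below does only that arithmetic, using the two counts reported in this abstract; it is offered as context and makes no claim about how the authors' model behaves.

```python
# Counts reported in the abstract.
events = 262   # patients with post-ESD bleeding
total = 5080   # all patients in the cohort

event_rate = events / total

# Accuracy of a trivial rule that predicts "no bleeding" for everyone.
baseline_accuracy = 1 - event_rate

print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

For imbalanced outcomes like this one, measures such as sensitivity, specificity or AUC are often reported alongside accuracy for this reason.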

  • Research Article
  • Cite count: 324
  • 10.1016/j.jclinepi.2012.11.008
Using methods from the data-mining and machine-learning literature for disease classification and prediction: a case study examining classification of heart failure subtypes
  • Feb 4, 2013
  • Journal of Clinical Epidemiology
  • Peter C Austin + 4 more


  • Research Article
  • Cite count: 29
  • 10.1038/s41374-021-00662-x
Multimetric feature selection for analyzing multicategory outcomes of colorectal cancer: random forest and multinomial logistic regression models
  • Mar 1, 2022
  • Laboratory Investigation
  • Catherine H Feng + 3 more


  • Research Article
  • Cite count: 33
  • 10.1016/s1088-467x(99)00003-7
The application of non-parametric techniques to solve classification problems in complex data sets in veterinary epidemiology – An example
  • May 1, 1999
  • Intelligent Data Analysis
  • Katharina D.C Stärk


  • Book Chapter
  • Cite count: 3
  • 10.1007/978-981-99-0741-0_24
Multi-class Classification for Breast Cancer with High Dimensional Microarray Data Using Machine Learning Classifier
  • Jan 1, 2023
  • Mohammad Nasir Abdullah + 3 more

Breast cancer is one of the leading causes of cancer-related deaths among women. Early detection of breast cancer is very important for proper treatment and for decreasing the risk of death among women. Most cancer prediction studies have focused on binary classification of breast cancer. This study focused on multi-class classification of breast cancer with high-dimensional microarray data. The dataset involved 38 cancer patients in 3 categories, normal (9), early tumour (12) and late tumour (17), and 39,426 microarray biomarkers. The Boruta feature selection algorithm selected 28 important microarray biomarkers. The performance of support vector machine, multinomial logistic regression, Naïve Bayes, and random forest was evaluated based on macro and micro accuracy, sensitivity, and precision. Results showed that multinomial logistic regression, Naïve Bayes and random forest exhibited overfitting. However, support vector machine performed well in multi-class classification of breast cancer (test-set macro accuracy = 86.7%, macro sensitivity = 77.8%, and macro precision = 62.0%). In future work, bagging and boosting with oversampling techniques can be considered to improve multi-class classification of breast cancer using high-dimensional microarray data.
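
The macro and micro averages mentioned in this abstract can be illustrated with a small sketch. The 3×3 confusion matrix below is hypothetical (rows are true classes, columns are predicted classes) and is not taken from the study; sensitivity is computed as recall.

```python
def macro_micro(cm):
    """Macro precision, macro sensitivity and micro-averaged score.

    cm[i][j] counts cases of true class i predicted as class j.
    """
    n = len(cm)
    precisions, sensitivities = [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))  # column total
        actual_k = sum(cm[k])                          # row total
        precisions.append(tp / predicted_k if predicted_k else 0.0)
        sensitivities.append(tp / actual_k if actual_k else 0.0)
    # Macro: unweighted mean over classes, so small classes count equally.
    macro_precision = sum(precisions) / n
    macro_sensitivity = sum(sensitivities) / n
    # Micro: pool all decisions; for single-label data this equals accuracy.
    total = sum(sum(row) for row in cm)
    micro = sum(cm[k][k] for k in range(n)) / total
    return macro_precision, macro_sensitivity, micro


# Hypothetical counts for three classes.
cm = [
    [2, 1, 0],
    [0, 3, 1],
    [0, 1, 4],
]
macro_p, macro_s, micro = macro_micro(cm)
print(macro_p, macro_s, micro)
```

Macro averaging weights each class equally regardless of size, which matters when class counts are as uneven and small as 9, 12 and 17.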

  • Research Article
  • Cite count: 10
  • 10.1016/j.jpsychires.2022.05.021
Correlates of cannabis use disorder in the United States: A comparison of logistic regression, classification trees, and random forests
  • May 23, 2022
  • Journal of Psychiatric Research
  • Nathaniel A Dell + 4 more


  • Research Article
  • 10.1111/ocr.70100
Artificial Intelligence-Assisted Clinical Decision Model for Managing Retained Second Deciduous Molars With No Permanent Successors.
  • Jan 16, 2026
  • Orthodontics & craniofacial research
  • Ozge Colak + 3 more

The aim of this study was to develop and apply an artificial intelligence (AI) algorithm to aid the clinical decision-making process for managing mandibular retained second deciduous molars (SDM) with no permanent successors using machine learning. This retrospective study consisted of patients who were diagnosed with at least one congenitally missing (agenic) mandibular permanent second premolar with a retained SDM. Pretreatment clinical records from each patient were collected and three sets of input features (radiographic, photographic and clinical) were used. The sample was divided into three groups, each representing a distinct treatment decision: (1) extraction of the SDM with space closure; (2) extraction of the SDM with space maintenance; and (3) retention of the SDM. The treatment decisions were based on majority treatment determination by three experienced clinicians. Four machine learning models were built and evaluated: Multinomial Logistic Regression, Multilayer Perceptron, Decision Tree and Random Forest classifier. Random Forest classifier showed the highest accuracy in treatment planning while Decision Tree showed the lowest accuracy. Features such as patient preference for restoration, amount of mandibular arch crowding and ankylosis were the strongest predictors, having the greatest influence on treatment decision accuracy in the Random Forest classifier model. The Random Forest classifier demonstrated the highest accuracy in aiding the clinical decision-making process for managing retained SDM with no permanent successors. Key factors influencing treatment decision accuracy included patient preference for restoration, mandibular arch crowding and ankylosis.
