Comparing Generalised Linear Mixed-Effects Models, Generalised Linear Mixed-Effects Model Trees and Random Forests

Abstract

In a comparison of generalised linear mixed-effects models, generalised linear mixed-effects model trees and random forests, the author applies the three methodologies to a binary variable from the field of interactional pragmatics: the choice between filled and unfilled pauses across varieties of English represented by components of the International Corpus of English. Based on a large number of examples annotated for linguistic and extralinguistic factors, the steps and decisions involved in the analyses are demonstrated. Though different in essence, the three resulting models share central trends. A more fine-grained evaluation of results and interpretations shows, however, that the three approaches differ in how systematically they handle multiple observations from the same source, in that only the mixed-effects models explicitly account for and systematically partial out the relatedness of data points contributed by the same speaker. The approaches also differ substantially in how they balance researcher involvement and control of the outcome. A modelling choice can thus lead to notably different perspectives on an identical set of data and variables.

Similar Papers
  • Research Article
  • Citations: 23
  • 10.1016/j.cherd.2018.12.002
A data-centric predictive control approach for nonlinear chemical processes
  • Dec 10, 2018
  • Chemical Engineering Research and Design
  • Ruigang Wang + 2 more

  • Research Article
  • Citations: 4
  • 10.46989/001c.25146
Haematological parameters of Cyprinus carpio with reference to probiotic feed: A machine learning approach
  • Jun 25, 2021
  • Israeli Journal of Aquaculture - Bamidgeh
  • Shree Rama Mani + 5 more

The study aims to analyze the haematological parameters of Cyprinus carpio with reference to the formulation of probiotic-fortified feeds using a machine learning approach. C. carpio were fed with pelletized feed, probiotic pelletized feed (5% Lysinibacillus macroides), probiotic pearl beads (5% L. macroides) and probiotic rice puff (5% L. macroides) for 60 days. At the end of the experiments, haematological indices such as leucocytes, erythrocytes, hemoglobin, hematocrit and packed-cell-volume were analyzed using blood samples. Duncan’s Multiple Range Test showed that the haematological parameters in the control feeding regimes were significantly lower (P<0.05) than those in the probiotic feeding regimes. The data sets of the different feeding regimes were classified using machine learning methods. In the present study, the Random Forest, Linear Model, and Decision Tree classifiers were employed. To identify the relationship between the features, the correlation coefficient and a dendrogram were applied. The machine learning results showed the highest accuracy (98%) for the random forest method, followed by the decision tree method. The correlation coefficients between the haematological indices were positive. However, the calculated values of mean corpuscular volume, mean corpuscular hemoglobin and mean corpuscular haemoglobin concentration were either weakly positively or negatively correlated with the other haematological indices. Based on the results, Random Forest, Linear Model and Decision Tree analysis might be considered for classification of fish haematological data sets.

  • Research Article
  • Citations: 11
  • 10.1007/s10994-024-06590-3
Fast linear model trees by PILOT
  • Jul 8, 2024
  • Machine Learning
  • Jakob Raymaekers + 3 more

Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees and at the same time enables them to better capture linear relationships, which is hard for standard decision trees. But most existing methods for fitting linear model trees are time consuming and therefore not scalable to large data sets. In addition, they are more prone to overfitting and extrapolation issues than standard regression trees. In this paper we introduce PILOT, a new algorithm for linear model trees that is fast, regularized, stable and interpretable. PILOT trains in a greedy fashion like classic regression trees, but incorporates an L2 boosting approach and a model selection rule for fitting linear models in the nodes. The abbreviation PILOT stands for PIecewise Linear Organic Tree, where ‘organic’ refers to the fact that no pruning is carried out. PILOT has the same low time and space complexity as CART without its pruning. An empirical study indicates that PILOT tends to outperform standard decision trees and other linear model trees on a variety of data sets. Moreover, we prove its consistency in an additive model setting under weak assumptions. When the data is generated by a linear model, the convergence rate is polynomial.

  • Research Article
  • Citations: 8
  • 10.1109/access.2022.3233194
Random Interaction Forest (RIF)–A Novel Machine Learning Strategy Accounting for Feature Interaction
  • Jan 1, 2023
  • IEEE Access
  • Chao-Yu Guo + 1 more

If an interaction exists in medical and health sciences, a proper statistical approach is required to avoid an erroneous conclusion. For example, different genders may introduce modified therapeutic effects of drugs, or an adverse interaction between two medicines changes the pharmacological activity, reduces the therapeutic effect, or induces toxicity. Therefore, if the analysis does not account for the impact of the interaction, it may introduce significant prediction errors or bias. Regression models deal with a two-way interaction by adding the product of the two interactive variables. Since machine learning models demonstrate a superior predictive ability to regression models, this study proposes a new method based on the random forest to account for interaction, called random interaction forest (RIF). This new strategy modifies the structure of the random forest, where the interaction features are forced to be in the first two nodes. Simulation studies examined the predictive ability of the linear regression model, logistic regression model, random forest, and the RIF under various scenarios. The results showed that the RIF consistently outperforms random forest and logistic regression when interactions are present. The RIF also performs better in many scenarios than the linear regression model. When the effect of interaction is more significant, the performance of RIF could be superior.
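
The product-term approach mentioned in the abstract above can be sketched as follows. This is a minimal illustration only, not code from the paper: the function names `add_interaction` and `predict`, the binary toy data, and the coefficient values are all assumptions chosen for the example.

```python
# Minimal sketch of a two-way interaction in a linear predictor.
# All names and numbers here are hypothetical, for illustration only.

def add_interaction(rows):
    """Append the product x1 * x2 to each (x1, x2) row."""
    return [(x1, x2, x1 * x2) for x1, x2 in rows]


def predict(row, intercept, b1, b2, b12):
    """Linear predictor with main effects and one interaction coefficient."""
    x1, x2, x1x2 = row
    return intercept + b1 * x1 + b2 * x2 + b12 * x1x2


# Two binary predictors, e.g. two medicines that are each given or not given.
rows = [(0, 0), (1, 0), (0, 1), (1, 1)]
design = add_interaction(rows)

# Only the row where both predictors are 1 receives the interaction effect.
for row in design:
    print(row, predict(row, intercept=1.0, b1=2.0, b2=3.0, b12=4.0))
```

In this sketch the interaction coefficient `b12` changes the prediction only when both predictors are non-zero, which is the effect a model without the product column cannot represent.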

  • Research Article
  • Citations: 42
  • 10.1038/s41598-021-99164-5
Using machine learning methods for supporting GR2M model in runoff estimation in an ungauged basin
  • Oct 7, 2021
  • Scientific Reports
  • Pakorn Ditthakit + 5 more

Estimating monthly runoff variation, especially in ungauged basins, is essential for water resource planning and management. The present study aimed to evaluate regionalization methods for determining regional parameters of the rainfall-runoff model (i.e., the GR2M model). Two types of regionalization methods (regression-based and distance-based) were investigated. The three regression-based methods selected were Multiple Linear Regression (MLR), Random Forest (RF), and M5 Model Tree (M5), and the two distance-based methods were the Spatial Proximity Approach (SPA) and the Physical Similarity Approach (PSA). Hydrological data and the basins' physical attributes were analyzed from 37 runoff stations in Thailand's southern basin. The results showed that using hydrological data for estimating the GR2M model parameters is better than using the basins' physical attributes. RF was the most accurate in estimating the regional GR2M model parameters, giving the lowest error, followed by M5, MLR, SPA, and PSA. These regional parameters were then applied to estimate monthly runoff using the GR2M model, and their performance was evaluated using three criteria: Nash–Sutcliffe Efficiency (NSE), Correlation Coefficient (r), and Overall Index (OI). The regionalized monthly runoff with RF performed the best, followed by SPA, M5, MLR, and PSA. The Taylor diagram was also used to graphically evaluate the results, and it indicated that RF provided the products closest to GR2M's results, followed by SPA, M5, PSA, and MLR. Our findings revealed the applicability of machine learning for estimating monthly runoff in ungauged basins. However, the SPA would be recommended in areas lacking the basin's physical attributes and hydrological information.
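
For readers unfamiliar with the Nash–Sutcliffe Efficiency named in the abstract above, a minimal sketch of its standard definition follows (one minus the ratio of squared simulation errors to the squared deviations of the observations from their mean). The function name `nash_sutcliffe` and the sample values are assumptions for illustration; the abstract does not say how the study computed the criterion.

```python
# Minimal sketch of the standard Nash-Sutcliffe Efficiency (NSE).
# The function name and the sample runoff values are hypothetical.

def nash_sutcliffe(observed, simulated):
    """Return 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if not observed or len(observed) != len(simulated):
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined when all observations are equal")
    return 1 - error / spread


observed = [1.0, 2.0, 3.0]
print(nash_sutcliffe(observed, [1.0, 2.0, 3.0]))  # perfect simulation
print(nash_sutcliffe(observed, [2.0, 2.0, 2.0]))  # predicting the observed mean
```

Under this definition a perfect simulation scores 1, and a simulation that always predicts the observed mean scores 0, so higher values indicate a closer match to the observations.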

  • Research Article
  • Citations: 15
  • 10.1016/j.knosys.2024.112694
Intelligent fault diagnosis for tribo-mechanical systems by machine learning: Multi-feature extraction and ensemble voting methods
  • Oct 28, 2024
  • Knowledge-Based Systems
  • V Shandhoosh + 5 more

Timely fault detection is crucial for preventing issues like worn clutch plates and excessive friction material degradation, enhancing fuel efficiency, and prolonging clutch lifespan. This study focuses on early fault diagnosis in dry friction clutch systems using machine learning (ML) techniques. Vibration data is analyzed under different load and fault conditions, extracting statistical, histogram, and auto-regressive moving average (ARMA) features. Feature selection employs the J48 decision tree algorithm, evaluated with eight ML classifiers: support vector machines (SVM), k-nearest neighbor (kNN), linear model tree (LMT), random forest (RF), multilayer perceptron (MLP), logistic regression (LR), J48, and Naive Bayes. The evaluation revealed that individual classifiers achieved the highest testing accuracies with statistical feature selection as 83% for both MLP and LR at no load, 90% for MLP at 5 kg, and 93% for KNN at 10 kg. For histogram feature selection, KNN and MLP both reached 85% at no load, MLP achieved 91% at 5 kg, and RF attained 97% at 10 kg. With ARMA feature selection, KNN reached 93% at no load, LR achieved 94% at 5 kg, and RF reached 86% at 10 kg. The voting strategy notably improved these results, with the RF-KNN-J48 ensemble reaching 98% for histogram features at 10 kg, the KNN-LMT-RF ensemble achieving 94% for ARMA features at no load, and the SVM-MLP-LMT ensemble attaining 95% for ARMA features at 5 kg. Hence, a combination of three classifiers using the majority voting rule consistently outperforms standalone classifiers, striking a balance between diversity and complexity, facilitating robust decision-making. In practical applications, selecting the optimal combination of feature selection method and classifier is vital for accurate fault classification. This study provides valuable guidance for engineers and practitioners implementing robust load classification systems in industrial settings.
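
The majority voting rule over three classifiers mentioned in the abstract above can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function names, the class labels, and the tie-breaking behaviour (the first label seen wins a tie) are all hypothetical choices.

```python
# Minimal sketch of majority voting across classifiers.
# Function names, labels, and tie-breaking are assumptions for illustration.
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_classifier_predictions):
    """Combine per-classifier prediction lists sample by sample."""
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]


# Hypothetical predictions of three classifiers on three samples.
clf_a = ["worn", "healthy", "worn"]
clf_b = ["worn", "healthy", "healthy"]
clf_c = ["healthy", "healthy", "worn"]

print(ensemble_predict([clf_a, clf_b, clf_c]))
```

With three voters and two classes a tie cannot occur; with more classes a three-way split is possible, which is why the tie-breaking rule is stated explicitly in the sketch.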

  • Research Article
  • Citations: 72
  • 10.3390/rs11111371
Estimation of Rice Growth Parameters Based on Linear Mixed-Effect Model Using Multispectral Images from Fixed-Wing Unmanned Aerial Vehicles
  • Jun 8, 2019
  • Remote Sensing
  • Yanyu Wang + 7 more

The accurate estimation of aboveground biomass (AGB) and leaf area index (LAI) is critical to characterize crop growth status and predict grain yield. Unmanned aerial vehicle (UAV)-based remote sensing has attracted significant interest due to its high flexibility and ease of operation. The mixed effect model introduced in this study can capture secondary factors that cannot be captured by standard empirical relationships. The objective of this study was to explore the potential benefit of using a linear mixed-effect (LME) model and multispectral images from a fixed-wing UAV to estimate both AGB and LAI of rice. Field experiments involving different N rates, planting patterns and rice cultivars were conducted over two consecutive years (2017–2018). Images were collected by a compact multispectral camera mounted on a fixed-wing UAV during key rice growth stages. LME, simple regression (SR), artificial neural network (ANN) and random forest (RF) models were developed to relate growth parameters (AGB and LAI) to spectral information. Cultivar (C), growth stage (S) and planting pattern (P) were selected as candidates of random effects for the LME models due to their significant effects on rice growth. Compared to other regression models (SR, ANN and RF), the LME model improved the AGB estimation accuracy for all stage groups to varying degrees: the R² increased by 0.14–0.35 and the RMSE decreased by 0.88–1.80 t ha⁻¹ for the whole season, the R² increased by 0.07–0.15 and the RMSE decreased by 0.31–0.61 t ha⁻¹ for pre-heading stages, and the R² increased by 0.21–0.53 and the RMSE decreased by 0.72–1.52 t ha⁻¹ for post-heading stages. Further analysis suggested that the LME model also successfully predicted within the groups when the number of groups was suitable. More importantly, depending on the availability of C, S, P or combinations thereof, mixed effects could outperform baseline retrieval methods (SR, ANN or RF) due to the inclusion of secondary effects. Satisfactory results were also obtained for the LAI estimation, while the superiority of the LME model was not as significant as that for AGB estimation. This study demonstrates that the LME model could accurately estimate rice AGB and LAI and that fixed-wing UAVs are promising for monitoring crop growth status over large-scale farmland.

  • Conference Article
  • Citations: 10
  • 10.1109/icts52701.2021.9608492
Developing Accurate Predictive Model Using Computational Intelligence for Optimal Inventory Management
  • Oct 20, 2021
  • Michael Siek + 1 more

We live in a world where data has changed how companies think, act and plan. Data, if used correctly, can become a company's sharpest weapon in competing with other companies. Inventory cost is one of the most burdensome costs in the food and beverage industry, which handles items such as degradable raw materials and fresh ingredients. If not managed correctly, these items might become waste, causing loss to the company. Degraded ingredients might also lower overall food quality, which might result in unsatisfied customers. Managing inventory, however, is not as easy as it seems, especially with traditional methods. This paper focuses on the development of an accurate predictive model using computational intelligence for optimal inventory management, with a case study of restaurant ingredient management. Several machine learning algorithms, such as linear regression, multi-layer perceptron, random tree, random forest, and model trees, were utilized to build accurate predictive models from time series data of the restaurant inventory. With a good prediction system using computational intelligence, the inventory cost and wasted ingredients can be significantly reduced, which eventually maximizes profit.

  • Research Article
  • 10.17849/insm-47-01-23-30.1
Regular Expressions: Mixed Effects Models.
  • Jan 1, 2017
  • Journal of Insurance Medicine
  • David Wesley

  • Research Article
  • Citations: 2
  • 10.11591/ijeecs.v29.i3.pp1560-1566
Machine learning prediction of video-based learning with technology acceptance model
  • Mar 1, 2023
  • Indonesian Journal of Electrical Engineering and Computer Science
  • Rahayu Abdul Rahman + 4 more

The COVID-19 outbreak has had significant impacts on education systems, as almost all countries have shifted to a new way of teaching and learning: online learning. In this new environment, various innovative teaching methods, such as video content, have been created to deliver educational material and ensure learning outcomes. Thus, this research aims to implement machine learning prediction models for video-based learning in higher education institutions. Using survey data from 103 final-year accounting students at a Malaysian public university, this paper presents the fundamental frameworks for evaluating three machine learning models, namely generalized linear model, random forest and decision tree. Besides demographic attributes, the performance of each machine learning algorithm on video-based learning usage was observed based on the attributes of the technology acceptance model, namely perceived ease of use, perceived usefulness and attitude. The findings revealed that perceived ease of use contributed the highest weight to the generalized linear model and random forest, while the attitude variable had the major effect in the decision tree. However, the generalized linear model outperformed the other two algorithms in terms of prediction accuracy.

  • Research Article
  • Citations: 45
  • 10.1016/j.ecoinf.2014.10.002
Predicting potential impacts of climate change on freshwater fish in Korea
  • Oct 24, 2014
  • Ecological Informatics
  • Yong-Su Kwon + 4 more

  • Research Article
  • Citations: 23
  • 10.3389/fgene.2023.1290036
Machine learning-based integrated identification of predictive combined diagnostic biomarkers for endometriosis
  • Nov 27, 2023
  • Frontiers in Genetics
  • Haolong Zhang + 5 more

Background: Endometriosis (EM) is a common gynecological condition in women of reproductive age, with diverse causes and a not yet fully understood pathogenesis. Traditional diagnostics rely on single diagnostic biomarkers and do not integrate a variety of different biomarkers. This study introduces multiple machine learning techniques, enhancing the accuracy of predictive models. A novel diagnostic approach that combines various biomarkers provides a new clinical perspective for improving the diagnostic efficiency of endometriosis, holding significant potential for clinical application. Methods: In this study, GSE51981 was used as a test set, and 11 machine learning algorithms (Lasso, Stepglm, glmBoost, Support Vector Machine, Ridge, Enet, plsRglm, Random Forest, LDA, XGBoost, and NaiveBayes) were employed to construct 113 predictive models for endometriosis. The optimal model was determined based on the AUC values derived from various algorithms. These genes were then evaluated using nine machine learning algorithms (Random Forest, SVM, Gradient Boosting Machine, LASSO, XGB, NNET, Generalized Linear Model, KNN, and Decision Tree) to assess significance scores and identify diagnostic genes for each algorithm. The diagnostic value of these genes was further validated in external datasets from GSE7305, GSE11691, and GSE120103. Results: Analysis of the GSE51981 dataset revealed 62 DEGs. The Stepglm [Both] and plsRglm algorithms identified 30 genes with the most potential using the AUC evaluation. Subsequently, nine machine learning algorithms were applied to select diagnostic genes, leading to the identification of five key diagnostic genes using the LASSO algorithm. The ADAT1 gene exhibited the best single-gene predictive performance, with an AUC of 0.785. A combination of genes (FOS, EPHX1, DLGAP5, PCSK5, and ADAT1) achieved an AUC of 0.836 in the test dataset. Moreover, these genes consistently exhibited an AUC exceeding 0.78 in all validation datasets, demonstrating superior predictive performance. Furthermore, correlation analysis with immune infiltration strengthened their predictive value by demonstrating the close relationship of the diagnostic genes with immune infiltrating cells. Conclusion: A combination of biomarkers consisting of FOS, EPHX1, DLGAP5, PCSK5, and ADAT1 can serve as a diagnostic tool for endometriosis, enhancing diagnostic efficiency. The association of these genes with immune infiltrating cells reveals their potential role in the pathogenesis of endometriosis, providing new insights for early detection and treatment.

  • Research Article
  • Citations: 7
  • 10.3758/s13428-024-02389-1
Subgroup detection in linear growth curve models with generalized linear mixed model (GLMM) trees
  • Jan 1, 2024
  • Behavior Research Methods
  • Marjolein Fokkema + 1 more

Growth curve models are popular tools for studying the development of a response variable within subjects over time. Heterogeneity between subjects is common in such models, and researchers are typically interested in explaining or predicting this heterogeneity. We show how generalized linear mixed-effects model (GLMM) trees can be used to identify subgroups with different trajectories in linear growth curve models. Originally developed for clustered cross-sectional data, GLMM trees are extended here to longitudinal data. The resulting extended GLMM trees are directly applicable to growth curve models as an important special case. In simulated and real-world data, we assess performance of the extensions and compare against other partitioning methods for growth curve models. Extended GLMM trees perform more accurately than the original algorithm and LongCART, and similarly accurate compared to structural equation model (SEM) trees. In addition, GLMM trees allow for modeling both discrete and continuous time series, are less sensitive to (mis-)specification of the random-effects structure and are much faster to compute.

  • Research Article
  • Citations: 31
  • 10.1080/00273171.2022.2146638
Gradient Tree Boosting for Hierarchical Data
  • Nov 14, 2022
  • Multivariate Behavioral Research
  • Marie Salditt + 2 more

Gradient tree boosting is a powerful machine learning technique that has shown good performance in predicting a variety of outcomes. However, when applied to hierarchical (e.g., longitudinal or clustered) data, the predictive performance of gradient tree boosting may be harmed by ignoring the hierarchical structure, and may be improved by accounting for it. Tree-based methods such as regression trees and random forests have already been extended to hierarchical data settings by combining them with the linear mixed effects model (MEM). In the present article, we add to this literature by proposing two algorithms to estimate a combination of the MEM and gradient tree boosting. We report on two simulation studies that (i) investigate the predictive performance of the two MEM boosting algorithms and (ii) compare them to standard gradient tree boosting, standard random forest, and other existing methods for hierarchical data (MEM, MEM random forests, model-based boosting, Bayesian additive regression trees [BART]). We found substantial improvements in the predictive performance of our MEM boosting algorithms over standard boosting when the random effects were non-negligible. MEM boosting as well as BART showed a predictive performance similar to the correctly specified MEM (i.e., the benchmark model), and overall outperformed the model-based boosting and random forest approaches.

  • Book Chapter
  • Citations: 4
  • 10.1007/978-3-319-95162-1_43
Predicting Particulate Matter for Assessing Air Quality in Delhi Using Meteorological Features
  • Jan 1, 2018
  • Apeksha Aggarwal + 1 more

Air pollution is one of the biggest threats to the environment. According to statistics of the World Health Organization, more than 80% of people living in urban areas are exposed to poor air quality. Hence, assessing air quality is important, especially in urban areas where people suffer more health problems due to poor air quality. Data mining techniques can be very useful for analyzing air quality data. In the past, several research works were done for various developing countries of the world, except a few for developing countries, like India. This is the case specifically for Delhi, where high concentrations of Oxides of Nitrogen, Oxides of Sulphur, Benzene, Toluene, Particulate Matter etc. are reported in the atmosphere. Certain meteorological conditions in the atmosphere can be very helpful for identifying the presence of such pollutants. This work focuses on particulate matter with a diameter of 2.5 µm or less (PM2.5). Data mining techniques such as multivariate linear regression and regression trees are deployed to identify the relationship between meteorological features and air quality. Further, the use of ensemble techniques such as random forests is also presented. Evaluation is done using the root mean square error metric, and the results are found to be promising.
