Evaluating performance and determining optimum sample size for regression tree and automatic linear modeling

S Genç,M Mendeş

doi:10.1590/1678-4162-12413

Abstract

This study had two purposes: to compare the performance of Regression Tree and Automatic Linear Modeling, and to determine the optimum sample size for these methods under different experimental conditions. A comprehensive Monte Carlo simulation study was designed for these purposes. The simulation results showed that the percentage of explained variation estimated by both Regression Tree and Automatic Linear Modeling was influenced by sample size, number of variables, and the structure of the variance-covariance matrix. Automatic Linear Modeling performed better than Regression Tree under all experimental conditions. It was concluded that Regression Tree requires much larger samples than Automatic Linear Modeling to produce stable estimates.

Highlights

This study had two purposes: to compare the performance of Regression Tree and Automatic Linear Modeling, and to determine the optimum sample size for these methods under different experimental conditions
The simulation results showed that the percentage of explained variation estimated by both Regression Tree and Automatic Linear Modeling was influenced by sample size, number of variables, and the structure of the variance-covariance matrix
Because the Classification and Regression Tree (CART) method can show statistically which factors are important in a model or relationship in terms of explanatory power and variance, it has become increasingly popular and is commonly used in multidisciplinary fields (Lin et al., 2008; Kaur and Pulugurta, 2008)

Summary

Introduction

This study had two purposes: to compare the performance of Regression Tree and Automatic Linear Modeling, and to determine the optimum sample size for these methods under different experimental conditions. An examination of the literature shows that researchers generally compare the performance of different data mining techniques or machine learning algorithms on a single data set. Although this practice is widespread, it is not sufficient to establish the reliability and stability of the results, because many factors (such as p, n, and correlation) can affect the performance of these algorithms, and the effects of these factors cannot be investigated when only a single data set is considered. In light of these points, this study has two basic goals: a) to compare the performance of Regression Tree and Automatic Linear Modeling under different experimental conditions via a comprehensive Monte Carlo simulation study, and to determine which method gives more reliable results under which experimental conditions; and b) to determine the optimum sample size.
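To make the idea of comparing methods across experimental conditions concrete, the following is a minimal sketch of such a Monte Carlo loop. It is not the authors' design. The data-generating model (equicorrelated standard-normal predictors, unit slopes, unit noise), the grid values, the tree settings (`depth=3`, `min_leaf=5`), and all function names are assumptions made for illustration. Plain ordinary least squares stands in for Automatic Linear Modeling, and a small CART-style tree stands in for the Regression Tree procedure.

```python
import random
import statistics


def make_data(n, p, rho, rng):
    """Draw n rows of p equicorrelated N(0, 1) predictors and a linear response."""
    X, y = [], []
    for _ in range(n):
        shared = rng.gauss(0, 1)  # common factor that induces correlation rho
        row = [rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1)
               for _ in range(p)]
        X.append(row)
        y.append(sum(row) + rng.gauss(0, 1))  # assumed model: unit slopes, unit noise
    return X, y


def r_squared(y, fitted):
    """Proportion of variation in y explained by the fitted values."""
    mean_y = statistics.fmean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, fitted))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def ols_fit_predict(X, y):
    """In-sample OLS fitted values via the normal equations (Gauss-Jordan)."""
    A = [[1.0] + list(row) for row in X]  # prepend intercept column
    k = len(A[0])
    # Augmented matrix [A'A | A'y]
    M = [[sum(r[i] * r[j] for r in A) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(A, y))] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    beta = [M[i][k] / M[i][i] for i in range(k)]
    return [sum(b * v for b, v in zip(beta, r)) for r in A]


def tree_fit_predict(X, y, idx=None, depth=3, min_leaf=5, out=None):
    """In-sample fitted values from a small CART-style regression tree."""
    if idx is None:
        idx, out = list(range(len(y))), [0.0] * len(y)
    best = None
    if depth > 0 and len(idx) >= 2 * min_leaf:
        for j in range(len(X[0])):
            order = sorted(idx, key=lambda i: X[i][j])
            total = sum(y[i] for i in order)
            left_sum = 0.0
            for pos in range(1, len(order)):
                left_sum += y[order[pos - 1]]
                if pos < min_leaf or len(order) - pos < min_leaf:
                    continue
                # Maximizing this quantity is equivalent to minimizing the split SSE.
                gain = left_sum ** 2 / pos + (total - left_sum) ** 2 / (len(order) - pos)
                if best is None or gain > best[0]:
                    best = (gain, order, pos)
    if best is None:  # leaf: predict the mean response of the node
        leaf_mean = statistics.fmean([y[i] for i in idx])
        for i in idx:
            out[i] = leaf_mean
    else:
        _, order, pos = best
        tree_fit_predict(X, y, order[:pos], depth - 1, min_leaf, out)
        tree_fit_predict(X, y, order[pos:], depth - 1, min_leaf, out)
    return out


def simulate(n, p, rho, reps=20, seed=1):
    """Average in-sample R^2 of both methods over reps simulated data sets."""
    rng = random.Random(seed)
    lin, tree = [], []
    for _ in range(reps):
        X, y = make_data(n, p, rho, rng)
        lin.append(r_squared(y, ols_fit_predict(X, y)))
        tree.append(r_squared(y, tree_fit_predict(X, y)))
    return statistics.fmean(lin), statistics.fmean(tree)


if __name__ == "__main__":
    # Hypothetical grid; the conditions used in the study are not listed here.
    for n in (50, 200):
        for rho in (0.0, 0.6):
            lin, tree = simulate(n, p=3, rho=rho)
            print(f"n={n:4d} rho={rho:.1f}  linear R2={lin:.3f}  tree R2={tree:.3f}")
```

Each grid cell averages the explained variation over many simulated data sets, so changes in n, p, or correlation can be linked to changes in each method's estimates, which a single data set cannot show. The response here is linear by construction, which may favor the linear fit; a fuller study would presumably also vary the form of the relationship and the spread of the estimates across replications.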

Methods

Results

Conclusion