Nature create variables using its character component, and variables are sharing characters from a vary small to relatively large scale. This results, variables to have from a vary different to a more similar character, and leads to have a relation ship. Literature suggested different relation measures based on the nature of variable and type of relation ship exist. Today, due to having high variety of frequently produced large data size, currently suggested variable filtering and selection methods have gaps to full fill the need. This research desires to fill this gap by comparing literature suggested methods to finding out a better variable selection and dimension reduction methods. The result from regression analysis using all literature suggested factors shows that none of the predictors for development status of enterprise are significant, and only 10 predictors for number of employer in an enterprise are significant out of 81 factors. Since, variable selection and dimension reduction methods are applied to find out predictors of a response by removing variable redundancy, and complexity of incorporating large number variable. Based on statistical power, for the results from variable selection methods, specially association and correlation methods showed that, CANOVA more efficiently detects non-linear or non-monotonic correlation between a continuous–continuous and a continuous-categorical variables. Spearman’s correlation coefficient more efficiently detects a monotonic correlation between a continuous with a continuous, and a continuous with a categorical variable. Pearson correlation coefficient more efficiently detects the linear correlation between continuous variables. MIC efficiently detects non-linear or non-monotonic relation between continuous variables. Chi-square test of independence efficiently detects relation between a continuous with a continuous, and categorical with categorical variables, but the non linear or non monotonic relation between a continuous with a categorical are not well detected. On the other hand, the result from lasso and stepwise methods reveals that, the relation between the predictor and response due to interaction effect not detected by correlation and association methods are detected by stepwise variable selection method, and the multicollinearity is detected and removed by lasso method. Regressing the response variable “number of employer in an enterprise” based on variables selected by lasso and stepwise method does bring greater model fitness (based on adjusted R-squared value) than variables selected by association and correlation methods. Similarly, regressing the response variable “development status of an enterprise” based on variables selected by association and correlation methods does bring 12 significant variables, where none of variables are significant from variables selected by lasso and stepwise methods. As a result, 51 predictors for number of employment in an enterprise, and 40 predictors for development status of an enterprise are detected as significantly related variables. And, lasso and stepwise methods are preferred to select predictors of a continuous response variable “number of employers in an enterprise”, and association and correlation methods are preferred to select predictors of a categorical response variable “development status of an enterprise”. Finally, the reduced regression models result reveals that, 20 predictors have causal relation with number of employment in an enterprise, and 12 predictors have causal relation with development status of an enterprise. On the other hand, based on model fitness, information lost, and number of significant factors, principal factor is preferred and applied in dimension reduction for a categorical response variable “development status of an enterprise”, and factor score based regression is preferred and applied for a continuous response variable “number of employers in an enterprise”. However, the comparison of the results in variable selection and dimension reduction indicates that, variable selection methods gave more gain in model fitness than dimension reduction methods. Hence, the suggested variable selection methods are more preferred than dimension reduction methods, and applied to find out predictors. In general, the suggested procedure for variable selection methods are recommended when small number of variables are studied, and the suggested dimension reduction methods are recommended for large number of variant variables (Big data case).
Read full abstract