Abstract
Variable selection is an essential task in statistical modeling. Several studies have tried to develop and standardize the process of variable selection, but doing so is difficult. The first question a researcher needs to ask is: which variables are the most significant for describing a given dataset's response? In this paper, a new method for variable selection using Gibbs sampler techniques is developed. First, the model is defined, and the posterior distributions of all the parameters are derived. The new variable selection method is tested on four simulated datasets and compared with some existing techniques: Ordinary Least Squares (OLS), Least Absolute Shrinkage and Selection Operator (Lasso), and Tikhonov Regularization (Ridge). The simulation studies show that our method performs better than the others in terms of error and time complexity. These methods are then applied to a real dataset, the Rock Strength Dataset. The new approach implemented using the Gibbs sampler is more powerful and effective than the other approaches. All the statistical computations for this paper were done using R version 4.0.3 on a single-processor computer.
Highlights
Forward and backward selection methods select the best subsets of variables by adding or removing variables in a sequence of steps [3]. These methods are slow with large datasets [4].
The true values of the parameters are close to the estimated parameters. The covariates associated with β0, β2, β4, β6, and β7 were selected as the most significant covariates because their estimates were close to the true model coefficients, as shown in Table 1. In the Least Absolute Shrinkage and Selection Operator (Lasso) and Ridge methods, all the covariates were selected as important variables. Computationally, selecting all the variables as important is inefficient because both the error and the computation time will increase for large datasets.
The Gibbs sampler is discussed in this article. The posterior distributions for β and σ² are derived, and the Gibbs sampler algorithm is used to sample from the corresponding distributions.
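As an illustration only, the sampling scheme mentioned in the last highlight can be sketched in Python (the paper's own computations were done in R). The priors are not stated in this excerpt, so the sketch assumes independent N(0, tau2) priors on each coefficient and an Inverse-Gamma(a, b) prior on σ². It also updates one coefficient at a time, which may differ from the update used in the paper. All function names, hyperparameter values, and the toy data are hypothetical.

```python
import random


def gibbs_linear_regression(X, y, n_iter=2000, burn_in=500, tau2=100.0,
                            a=0.01, b=0.01, seed=0):
    """Component-wise Gibbs sampler for y = X beta + noise, noise ~ N(0, sigma2).

    Assumed priors (not taken from the paper):
      beta_j ~ N(0, tau2) independently, sigma2 ~ Inverse-Gamma(a, b).
    Returns the post-burn-in draws as (beta_draws, sigma2_draws).
    """
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]    # column-wise access
    col_ss = [sum(v * v for v in col) for col in cols]  # x_j' x_j
    beta = [0.0] * p
    sigma2 = 1.0
    resid = list(y)  # y - X beta, with beta = 0 initially
    beta_draws, sigma2_draws = [], []

    for it in range(n_iter):
        for j in range(p):
            xj = cols[j]
            # Add x_j * beta_j back so resid excludes covariate j's contribution.
            for i in range(n):
                resid[i] += xj[i] * beta[j]
            # Full conditional: beta_j | rest ~ N(mean, 1 / precision).
            precision = col_ss[j] / sigma2 + 1.0 / tau2
            mean = (sum(xj[i] * resid[i] for i in range(n)) / sigma2) / precision
            beta[j] = rng.gauss(mean, (1.0 / precision) ** 0.5)
            # Restore the full residual using the new beta_j.
            for i in range(n):
                resid[i] -= xj[i] * beta[j]
        # Full conditional: sigma2 | rest ~ Inverse-Gamma(a + n/2, b + RSS/2).
        rss = sum(r * r for r in resid)
        shape, rate = a + n / 2.0, b + rss / 2.0
        # gammavariate takes a scale parameter, so scale = 1 / rate.
        sigma2 = 1.0 / rng.gammavariate(shape, 1.0 / rate)
        if it >= burn_in:
            beta_draws.append(list(beta))
            sigma2_draws.append(sigma2)
    return beta_draws, sigma2_draws


def posterior_means(beta_draws):
    """Average each coefficient over the retained draws."""
    m = len(beta_draws)
    return [sum(d[j] for d in beta_draws) / m for j in range(len(beta_draws[0]))]


def simulate(n=150, true_beta=(1.0, 2.0, 0.0), sigma=1.0, seed=1):
    """Toy data (hypothetical): intercept column plus standard-normal covariates."""
    rng = random.Random(seed)
    X = [[1.0] + [rng.gauss(0, 1) for _ in true_beta[1:]] for _ in range(n)]
    y = [sum(b * x for b, x in zip(true_beta, row)) + rng.gauss(0, sigma)
         for row in X]
    return X, y


if __name__ == "__main__":
    X, y = simulate()
    draws, s2 = gibbs_linear_regression(X, y)
    print("posterior means of beta:", posterior_means(draws))
    print("posterior mean of sigma2:", sum(s2) / len(s2))
```

In this toy setup the third true coefficient is zero, so its posterior mean should sit near zero while the other two sit near their true values. How the paper turns such posterior summaries into a selection decision is not specified here, so no selection rule is included.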
Summary
Simulated samples are thinned by keeping every 5th sample to reduce the correlation between successive samples. Both the Gibbs sampler and Lasso methods are used to identify the most important variables among the 8 variables. Parameters are summarized by their posterior means, and some of these are very good estimators of the corresponding true values. The true values of the parameters are close to the estimated parameters. The covariates associated with β0, β2, β4, β6, and β7 were selected as the most significant covariates because their estimates were close to the true model coefficients, as shown in Table 1. In the Lasso and Ridge methods, all the covariates were selected as important variables. Computationally, selecting all the variables as important is inefficient because both the error and the computation time will increase for large datasets.

Boxplots are plotted. In Fig. 3b, some outliers in the dataset are visible, so they are removed before running the Gibbs and Lasso variable selection methods. The correlation matrix for the 8 predictors in the real dataset (RSD) is given in Fig. 4.

Figure 4. Correlation matrix for the 8 predictors in RSD and their distributions.
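The thinning step described in this summary can be illustrated with a short Python sketch. It is not the authors' code: the helper names are hypothetical, and the autocorrelated AR(1) chain is only a stand-in for real Gibbs output, chosen so that the effect of keeping every 5th draw is easy to see.

```python
import random


def thin(samples, step=5, burn_in=0):
    """Drop the first burn_in draws, then keep every step-th remaining draw."""
    return samples[burn_in:][step - 1::step]


def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation of a sequence of scalar draws."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


def ar1_chain(n=5000, phi=0.9, seed=0):
    """Toy autocorrelated chain standing in for MCMC output (an assumption)."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0, 1)
        out.append(x)
    return out


if __name__ == "__main__":
    chain = ar1_chain()
    print("lag-1 autocorrelation, full chain:", lag1_autocorrelation(chain))
    print("lag-1 autocorrelation, thinned:   ", lag1_autocorrelation(thin(chain)))
```

Thinning trades sample count for lower correlation between retained draws: with a step of 5, only one fifth of the draws are kept.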