Abstract
Summary. We propose a Bayesian variable selection approach for ultrahigh dimensional linear regression based on a split-and-merge strategy. The proposed approach consists of two stages: first, split the ultrahigh dimensional data set into a number of lower dimensional subsets and select relevant variables from each subset; second, aggregate the variables selected from each subset and select relevant variables from the aggregated data set. Because the approach has an embarrassingly parallel structure, it can be easily implemented on a parallel architecture and applied to big data problems with millions or more explanatory variables. Under mild conditions, we show that the approach is consistent, i.e. the true explanatory variables are correctly identified as the sample size becomes large. We compare the approach extensively with penalized likelihood approaches, such as the lasso, elastic net, sure independence screening and iterative sure independence screening. The numerical results show that it generally outperforms these approaches: the models that it selects tend to be sparser and closer to the true model.
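The two-stage structure described in the summary can be illustrated with a minimal sketch. It is not the authors' method: the Bayesian selection step is replaced here by a simple greedy forward selection scored by BIC, used only as a stand-in so that the split and merge stages are visible. The function names (`forward_bic`, `split_and_merge`, `simulate`), the strided way of forming subsets, the no-intercept model and the simulated data are all assumptions made for illustration.

```python
import math
import random


def forward_bic(X, y, cols):
    """Greedy forward selection by BIC over the candidate columns `cols`.

    X[j] is the j-th explanatory variable (a list of n values), y the response.
    Stand-in (assumption) for the per-subset Bayesian selection step; no intercept.
    """
    n = len(y)
    resid = list(y)            # residual of y given the selected columns
    basis = []                 # orthonormal basis spanning the selected columns
    selected = []
    rss = sum(r * r for r in resid)
    bic = n * math.log(rss / n)
    remaining = list(cols)
    while remaining:
        best = None
        for j in remaining:
            v = list(X[j])
            for q in basis:    # orthogonalise the candidate against the basis
                c = sum(a * b for a, b in zip(v, q))
                v = [a - c * b for a, b in zip(v, q)]
            norm = math.sqrt(sum(a * a for a in v))
            if norm < 1e-10:   # candidate is (numerically) collinear, skip it
                continue
            q_new = [a / norm for a in v]
            proj = sum(a * b for a, b in zip(resid, q_new))
            if best is None or proj * proj > best[0]:
                best = (proj * proj, j, q_new, proj)
        if best is None:
            break
        gain, j, q_new, proj = best
        new_rss = max(rss - gain, 1e-12)
        new_bic = n * math.log(new_rss / n) + (len(selected) + 1) * math.log(n)
        if new_bic >= bic:     # stop when BIC no longer improves
            break
        selected.append(j)
        remaining.remove(j)
        basis.append(q_new)
        resid = [r - proj * q for r, q in zip(resid, q_new)]
        rss, bic = new_rss, new_bic
    return selected


def split_and_merge(X, y, n_subsets):
    """Stage 1: select within each subset. Stage 2: select from the merged set."""
    p = len(X)
    subsets = [list(range(k, p, n_subsets)) for k in range(n_subsets)]
    # Each subset is handled independently, so this loop could run in parallel.
    stage1 = [forward_bic(X, y, s) for s in subsets]
    merged = sorted(j for sel in stage1 for j in sel)
    return sorted(forward_bic(X, y, merged))


def simulate(n, p, true_vars, beta, seed):
    """Hypothetical test data: independent Gaussian columns, unit-variance noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    y = [sum(beta * X[j][i] for j in true_vars) + rng.gauss(0, 1)
         for i in range(n)]
    return X, y


X, y = simulate(n=200, p=60, true_vars=[3, 17, 42], beta=3.0, seed=0)
print(split_and_merge(X, y, n_subsets=3))
```

With signals this strong, the printed indices should include 3, 17 and 42, possibly alongside a few spurious variables that survive the BIC criterion. Because the stage 1 calls do not depend on one another, they could be distributed across workers, for example with Python's `concurrent.futures`, which is where the embarrassingly parallel structure would pay off.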
More From: Journal of the Royal Statistical Society Series B: Statistical Methodology