Abstract

Bayesian models are generally computed with Markov Chain Monte Carlo (MCMC) methods. The main disadvantage of MCMC methods is the large number of iterations they need to sample the posterior distributions of model parameters, especially for large data sets. On the other hand, variable selection remains a challenging problem due to its combinatorial search space, for which Bayesian models are a promising solution. In this work, we study how to accelerate Bayesian model computation for variable selection in linear regression. We propose a fast algorithm based on the Gibbs sampler, a widely used MCMC method, that incorporates several optimizations. We use non-informative and conjugate prior distributions on several model parameters, which enable summarizing the data set in one pass by exploiting an augmented set of sufficient statistics; thereafter the algorithm can iterate in main memory. The sufficient statistics are indexed with a sparse binary vector to efficiently compute matrix projections based on the selected variables. The probabilities of discovered variable subsets, selecting and discarding each variable, are stored in a hash table for fast retrieval in later iterations. We study how to integrate our algorithm into a database management system (DBMS), exploiting aggregate User-Defined Functions for parallel data summarization and stored procedures to manipulate matrices as arrays. An experimental evaluation with real data sets assesses accuracy and time performance, comparing our DBMS-based algorithm with the R package. Our algorithm is shown to produce accurate results, scale linearly with data set size, and run orders of magnitude faster than the R package.
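
To make the summarization and projection steps concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the augmented sufficient statistics are the row count n, the column sums L, and the cross-product matrix Q of the augmented matrix Z = [X, y]; that a binary vector gamma encodes the selected variable subset; and that a Python dict stands in for the hash table of subset probabilities. The names summarize and project are hypothetical.

```python
import numpy as np

def summarize(X, y):
    """One pass over the data set: compute augmented sufficient
    statistics n, L, Q on Z = [X, y]. Per-subset quantities can then
    be derived in main memory without rescanning the data.
    (Hypothetical function name, illustrating the technique.)"""
    Z = np.hstack([X, y.reshape(-1, 1)])  # augment X with the response
    n = Z.shape[0]                        # row count
    L = Z.sum(axis=0)                     # linear sums, one per column
    Q = Z.T @ Z                           # quadratic cross-products
    return n, L, Q

def project(Q, gamma):
    """Project Q onto the variable subset encoded by the sparse binary
    vector gamma; nonzero positions select rows/columns of Q."""
    idx = np.flatnonzero(gamma)
    return Q[np.ix_(idx, idx)]

# Toy usage: 1000 rows, 5 candidate variables; the last entry of gamma
# keeps the response column of Z in the projection.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, 0.0, 2.0, 0.0, 0.0]) + rng.normal(size=1000)
n, L, Q = summarize(X, y)

gamma = np.array([1, 0, 1, 0, 0, 1])  # select variables 0 and 2 (+ y)
cache = {}                            # hash table keyed by the subset
key = gamma.tobytes()
if key not in cache:
    # Stand-in for the subset's posterior computation; an actual
    # sampler would evaluate the marginal likelihood of the
    # projected model here and reuse it on revisits.
    cache[key] = float(np.linalg.slogdet(project(Q, gamma))[1])
score = cache[key]
```

Caching by the byte representation of gamma means a Gibbs iteration that revisits an already-scored subset pays only a hash lookup instead of recomputing the projection and its posterior quantities.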
