Large-scale Regression Problems Research Articles

We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as $O(N^{-1} + (N/m)^{-2})$. Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as $O(N^{-1} + (N/m)^{-3})$, and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as $O(N^{-1} + (N/m)^{-3/2})$, easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4×10^8 samples and d ≈ 740,000 covariates.
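The averaging procedure described in this abstract (split the samples evenly, minimize on each subset, average the estimates) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the one-parameter least-squares model, the synthetic Gaussian data, and the names `local_least_squares` and `average_mixture` are all assumptions introduced here for the example.

```python
import random


def local_least_squares(xs, ys):
    """Minimize sum((y - theta * x)^2) over theta for one machine's subset."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sxx


def average_mixture(xs, ys, m):
    """Split the samples evenly across m machines, solve locally, then average."""
    n = len(xs) // m  # samples per machine; any remainder is dropped
    estimates = [
        local_least_squares(xs[i * n:(i + 1) * n], ys[i * n:(i + 1) * n])
        for i in range(m)
    ]
    return sum(estimates) / m


# Synthetic data (assumed for illustration): y = theta_true * x + noise.
rng = random.Random(0)
theta_true = 2.0
N, m = 10_000, 10  # m is well below sqrt(N) = 100
xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
ys = [theta_true * x + rng.gauss(0.0, 1.0) for x in xs]

theta_avg = average_mixture(xs, ys, m)        # one round of communication
theta_central = local_least_squares(xs, ys)   # centralized baseline on all N samples
print(theta_avg, theta_central)
```

With N = 10,000 and m = 10, the two terms of the stated bound are N^{-1} = 10^{-4} and (N/m)^{-2} = 10^{-6}, so the N^{-1} term dominates, which is the regime (m ≤ √N) in which the abstract says averaging matches the centralized rate.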

We consider applying Bayesian Variable Selection Regression, or BVSR, to genome-wide association studies and similar large-scale regression problems. Currently, typical genome-wide association studies measure hundreds of thousands, or millions, of genetic variants (SNPs), in thousands or tens of thousands of individuals, and attempt to identify regions harboring SNPs that affect some phenotype or outcome of interest. This goal can naturally be cast as a variable selection regression problem, with the SNPs as the covariates in the regression. Characteristic features of genome-wide association studies include the following: (i) a focus primarily on identifying relevant variables, rather than on prediction; and (ii) many relevant covariates may have tiny effects, making it effectively impossible to confidently identify the complete "correct" subset of variables. Taken together, these factors put a premium on having interpretable measures of confidence for individual covariates being included in the model, which we argue is a strength of BVSR compared with alternatives such as penalized regression methods. Here we focus primarily on analysis of quantitative phenotypes, and on appropriate prior specification for BVSR in this setting, emphasizing the idea of considering what the priors imply about the total proportion of variance in outcome explained by relevant covariates. We also emphasize the potential for BVSR to estimate this proportion of variance explained, and hence shed light on the issue of "missing heritability" in genome-wide association studies.

Articles published on Large-scale Regression Problems

New Efficient Approach to Solve Big Data Systems Using Parallel Gauss–Seidel Algorithms

StereoSpike: Depth Learning With a Spiking Neural Network

Tree-aggregated predictive modeling of microbiome data

Efficient and Scalable Multi-Task Regression on Massive Number of Tasks

Sparse network estimation for dynamical spatio-temporal array models

Large-Scale Regression: A Partition Analysis of the Least Squares Multisplitting

Hypergraph Learning and Reweighted $\ell_1$-Norm Minimization for Hyperspectral Unmixing

Approximate large-scale Bayesian spatial modeling with application to quantitative magnetic resonance imaging

A Highly Efficient Semismooth Newton Augmented Lagrangian Method for Solving Lasso Problems

METSK-HDe: A multiobjective evolutionary algorithm to learn accurate TSK-fuzzy systems in high-dimensional and large-scale regression problems

Communication-efficient algorithms for statistical optimization

Bayesian variable selection regression for genome-wide association studies and other large-scale problems

Fixed-size Least Squares Support Vector Machines: A Large Scale Application in Electrical Load Forecasting

10.1162/15324430152733142
