Abstract

High-dimensional datasets often display heterogeneity due to heteroskedasticity or other forms of non-location-scale covariance effects. When such datasets become very large, it may be infeasible to store them on a single machine, or at least to keep them in memory. In this paper, we consider penalized expectile regression with smoothly clipped absolute deviation (SCAD) and adaptive LASSO penalties, which can effectively detect heteroskedasticity in high-dimensional data. We propose a communication-efficient approach to distributed sparsity learning in which observations are randomly partitioned across machines. With appropriately chosen tuning parameters, we show that the proposed estimators enjoy oracle properties. Extensive numerical experiments on both synthetic and real data validate the theoretical results and demonstrate the superior performance of the proposed method.
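For concreteness, expectile regression replaces the squared-error loss with the asymmetric least-squares loss of Newey and Powell, which weights positive and negative residuals differently. A minimal NumPy sketch of that loss follows; the function and parameter names are illustrative, not the paper's:

```python
import numpy as np

def expectile_loss(residual, tau=0.5):
    """Asymmetric squared-error loss: positive residuals are weighted
    by tau, negative residuals by 1 - tau. tau = 0.5 recovers ordinary
    least squares up to a constant factor."""
    weight = np.where(residual < 0, 1.0 - tau, tau)
    return weight * residual ** 2
```

Choosing tau away from 0.5 targets upper or lower expectiles of the conditional distribution, which is what allows the method to pick up heteroskedastic effects rather than only the conditional mean.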

Highlights

  • The explosive growth in the size of modern datasets has stimulated interest in distributed statistical learning [2], [4], [29]

  • Expectile regression is a popular alternative when working with heterogeneous data: it characterizes the entire conditional distribution of the response given the predictors, not just its center

  • We propose penalized large-scale expectile regression with smoothly clipped absolute deviation (SCAD) and adaptive LASSO penalties and establish oracle properties in the sense of Fan and Li [6] and Zou [31] (a sketch of both penalties follows this list)
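The two penalties named above have standard closed forms, sketched below in NumPy. This is for reference only; `a = 3.7` is the value recommended by Fan and Li, while `gamma` and the small `eps` guard are our illustrative choices, not the paper's:

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of Fan and Li, evaluated elementwise: behaves like
    the l1 penalty near zero, bends quadratically, and is constant for
    |theta| > a * lam, so large coefficients are not over-shrunk."""
    t = np.abs(theta)
    p1 = lam * t
    p2 = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    p3 = lam ** 2 * (a + 1) / 2
    return np.where(t <= lam, p1, np.where(t <= a * lam, p2, p3))

def adaptive_lasso_weights(beta_init, gamma=1.0, eps=1e-8):
    """Adaptive LASSO weights of Zou: w_j = 1 / |beta_init_j|^gamma,
    built from a consistent pilot estimate; eps guards exact zeros."""
    return 1.0 / (np.abs(beta_init) + eps) ** gamma
```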

Summary

INTRODUCTION

The explosive growth in the size of modern datasets has stimulated interest in distributed statistical learning [2], [4], [29]. This paper studies a distributed optimization approach for large-scale expectile regression with SCAD [6] and adaptive LASSO [31] penalties. We propose a distributed estimator based on the GEL function with SCAD and adaptive LASSO penalties, and we apply the ideas of Zhao and Zhang [30] and Jordan et al. [11] to prove the oracle properties of penalized expectile regression with independent and identically distributed random errors. Simulation and empirical studies show that the estimation (or prediction) errors and variable selection results of the proposed approach are comparable to those of the centralized method and better than those of Pan [18], which is based on the CSL function. They also show that our proposed method can effectively handle data heterogeneity and reduce the cost of data storage and transmission.
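The GEL-based estimator itself is not reproduced here, but the communication pattern such distributed schemes build on is easy to illustrate: each machine evaluates the gradient of its local expectile loss at a common pilot estimate and ships only that p-dimensional vector to the central machine, which averages. A hedged sketch under these assumptions (names and the aggregation step are illustrative, not the paper's algorithm):

```python
import numpy as np

def local_expectile_gradient(X, y, beta, tau):
    """Gradient of the local expectile loss (1/n) * sum_i w_i * r_i^2,
    with r = y - X beta and w_i = tau (r_i >= 0) or 1 - tau (r_i < 0)."""
    r = y - X @ beta
    w = np.where(r < 0, 1.0 - tau, tau)
    return -2.0 * X.T @ (w * r) / len(y)

def aggregated_gradient(machines, beta, tau=0.9):
    """One communication round: each machine sends only its local
    p-dimensional gradient; the central machine averages them."""
    grads = [local_expectile_gradient(X, y, beta, tau) for X, y in machines]
    return np.mean(grads, axis=0)

# Illustrative usage: four machines, each holding 100 local observations.
rng = np.random.default_rng(0)
machines = [(rng.normal(size=(100, 5)), rng.normal(size=100)) for _ in range(4)]
g = aggregated_gradient(machines, beta=np.zeros(5), tau=0.9)
```

Because only gradients cross machine boundaries rather than the observations themselves, the per-round communication cost is proportional to p per machine, which is the kind of storage-and-transmission saving the paper refers to.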

DISTRIBUTED ESTIMATION IN LARGE-SCALE EXPECTILE REGRESSION
ASYMPTOTIC PROPERTIES
ADAPTIVE LASSO
AUGMENTED PROXIMAL ADMM ALGORITHM
SIMULATION STUDIES
REAL DATA ANALYSIS
CONCLUSION AND DISCUSSION