Abstract

Analyzing large volumes of data is complicated not only by pronounced skewness and heteroscedasticity but also by missing data. Expectile regression is a popular method for analyzing heterogeneous data. In this paper, we fit a linear expectile regression model to estimate conditional expectiles from large-scale data with covariates missing at random. We construct a communication-efficient surrogate loss (CSL) function to estimate the model parameters and establish the asymptotic normality of the proposed estimator. A proximal alternating direction method of multipliers (ADMM) algorithm is developed for distributed statistical optimization on large-scale data. Simulation studies assess the finite-sample performance of the proposed method, and survey data from the Behavioral Risk Factor Surveillance System (BRFSS) are used to demonstrate its practical utility.

Highlights

  • Large-scale data, which arise in many fields such as online surveys, genomics and economics, are characterized by pronounced skewness, heteroscedasticity and missing information

  • We study a distributed optimization approach to analyzing large-scale data based on expectile regression with covariates missing at random

  • We study an efficient approach in an expectile regression framework for analyzing large-scale data with covariates missing at random


Summary

INTRODUCTION

Large-scale data, which arise in many fields such as online surveys, genomics and economics, are characterized by pronounced skewness, heteroscedasticity and missing information. Y. Pan et al.: Large-Scale Expectile Regression With Covariates Missing at Random performed [16], [25]. We study a distributed optimization approach to analyzing large-scale data based on expectile regression with covariates missing at random. The CSL function can be regarded as a communication-efficient surrogate for the weighted global loss function, and it can effectively address the problems caused by large-scale data stored randomly on multiple machines. To establish the asymptotic properties of the proposed estimator, we apply distributed optimization theory [10] and the Lindeberg-Feller central limit theorem. Another challenge arises from the numerical calculation of the proposed estimator. The proofs of the asymptotic properties are given in the Appendix.
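
The idea of a communication-efficient surrogate for a weighted global loss can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes the standard asymmetric squared (expectile) loss, treats the observation weights `w` as hypothetical placeholders (for example, inverse-probability weights for the missing covariates), uses the usual gradient-corrected surrogate form, and swaps in plain gradient descent where the paper develops a proximal ADMM algorithm. The names `expectile_loss_grad` and `csl_estimate` are made up for this illustration.

```python
import numpy as np

def expectile_loss_grad(beta, X, y, tau, w=None):
    """Weighted asymmetric squared loss and its gradient at beta.

    w holds per-observation weights (an assumption of this sketch);
    if omitted, all observations are weighted equally.
    """
    r = y - X @ beta
    a = np.where(r < 0, 1.0 - tau, tau)  # |tau - 1{r < 0}|
    if w is None:
        w = np.ones_like(y)
    loss = np.mean(w * a * r**2)
    grad = -2.0 * X.T @ (w * a * r) / len(y)
    return loss, grad

def csl_estimate(machines, tau, beta_init, n_iter=500, lr=0.1):
    """Minimise a gradient-corrected surrogate on the first machine.

    surrogate(beta) = L_1(beta) - <grad L_1(beta_init) - grad L_N(beta_init), beta>
    where L_1 is the first machine's loss and L_N the pooled loss.
    machines is a list of (X, y, w) tuples, one per machine.
    """
    X1, y1, w1 = machines[0]
    # One communication round: every machine reports its gradient at beta_init.
    local_grads = [expectile_loss_grad(beta_init, X, y, tau, w)[1]
                   for X, y, w in machines]
    sizes = np.array([len(y) for _, y, _ in machines])
    global_grad = np.average(local_grads, axis=0, weights=sizes)
    shift = local_grads[0] - global_grad
    # Plain gradient descent on the surrogate (not the paper's proximal ADMM).
    beta = beta_init.copy()
    for _ in range(n_iter):
        _, g1 = expectile_loss_grad(beta, X1, y1, tau, w1)
        beta = beta - lr * (g1 - shift)
    return beta
```

In this sketch only gradients at the initial estimate travel between machines, so the communication cost is one vector per machine; the step size and iteration count are illustrative defaults that suit roughly standardized covariates.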

DESIGN AND ESTIMATION
ASYMPTOTIC PROPERTIES
PROXIMAL ADMM ALGORITHM
SIMULATION STUDIES
DISCUSSION