Abstract
Analyzing large volumes of data is difficult not only because of pronounced skewness and heteroscedasticity but also because of missing data. Expectile regression is a popular alternative method for analyzing heterogeneous data. In this paper, we consider fitting a linear expectile regression model to estimate conditional expectiles from a large quantity of data with covariates missing at random. We construct a communication-efficient surrogate loss (CSL) function to estimate the model parameters, and we establish the asymptotic normality of the proposed estimator. A proximal alternating direction method of multipliers (ADMM) algorithm is developed for distributed statistical optimization on a large quantity of data. Simulation studies are performed to assess the finite-sample performance of the proposed method. Survey data from the Behavioral Risk Factor Surveillance System (BRFSS) are used to demonstrate the utility of the proposed method in practice.
Highlights
Large-scale data, which arise in many fields such as online surveys, genomics and economics, are characterized by a high level of skewness, heteroscedasticity of variance and the phenomenon of missing information
We study a distributed optimization approach to analyzing large-scale data based on expectile regression with covariates missing at random
We study an efficient approach in an expectile regression framework for analyzing large-scale data with covariates missing at random
Summary
Large-scale data, which arise in many fields such as online surveys, genomics and economics, are characterized by a high level of skewness, heteroscedasticity and missing information. Y. Pan et al.: Large-Scale Expectile Regression With Covariates Missing at Random performed [16], [25]. We study a distributed optimization approach to analyzing large-scale data based on expectile regression with covariates missing at random. The communication-efficient surrogate loss (CSL) function can be regarded as a communication-efficient surrogate for the weighted global loss function, and it can effectively solve the problems caused by large-scale data stored randomly on multiple machines. To establish the asymptotic properties of the proposed estimator, we apply the distributed optimization theory [10] and the Lindeberg-Feller central limit theorem. Another challenge arises from the numerical calculation of the proposed estimator. The proofs of the asymptotic properties are given in the Appendix.
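As a point of reference for the linear expectile regression model discussed above, the following is a minimal single-machine sketch in Python. It uses the standard asymmetric squared loss |tau - 1(r < 0)| * r^2 and fits the coefficients by iteratively reweighted least squares. It is not the paper's CSL estimator or its proximal ADMM algorithm. The function names `expectile_loss` and `fit_linear_expectile` and the optional `obs_weights` argument are hypothetical; `obs_weights` only stands in for whatever per-observation weights the weighted global loss uses, which this summary does not specify.

```python
import numpy as np


def expectile_loss(residuals, tau):
    """Mean asymmetric squared loss: |tau - 1(r < 0)| * r**2."""
    asym = np.where(residuals < 0, 1.0 - tau, tau)
    return np.mean(asym * residuals ** 2)


def fit_linear_expectile(X, y, tau, obs_weights=None, n_iter=100, tol=1e-10):
    """Fit y ~ X @ beta at expectile level tau by iteratively
    reweighted least squares (single machine, complete data).

    obs_weights is a hypothetical placeholder for per-observation
    weights; the source does not state how its weights are built.
    """
    n, _ = X.shape
    if obs_weights is None:
        obs_weights = np.ones(n)

    # Start from the ordinary least-squares solution.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]

    for _ in range(n_iter):
        r = y - X @ beta
        # Residuals below the fit get weight 1 - tau, the rest get tau.
        w = obs_weights * np.where(r < 0, 1.0 - tau, tau)
        XtW = X.T * w  # scales column i of X.T by w[i]
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta
```

With `tau = 0.5` the asymmetric weights are constant, so the fit reduces to ordinary least squares; values of `tau` above 0.5 penalize under-prediction more heavily and shift the fitted line upward.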