Abstract

We study random design linear regression with no assumptions on the distribution of the covariates and with a heavy-tailed response variable. In this distribution-free regression setting, we show that boundedness of the conditional second moment of the response given the covariates is a necessary and sufficient condition for achieving nontrivial guarantees. As a starting point, we prove an optimal version of the classical in-expectation bound for the truncated least squares estimator due to Györfi, Kohler, Krzyżak, and Walk. However, we show that this procedure fails with constant probability for some distributions despite its optimal in-expectation performance. Then, combining the ideas of truncated least squares, median-of-means procedures, and aggregation theory, we construct a non-linear estimator achieving excess risk of order $d/n$ with an optimal sub-exponential tail. While existing approaches to linear regression for heavy-tailed distributions focus on proper estimators that return linear functions, we highlight that the improperness of our procedure is necessary for attaining nontrivial guarantees in the distribution-free setting.

Highlights

  • In the random design regression problem, one has access to $n$ input-output pairs $(X_i, Y_i) \in \mathbb{R}^d \times \mathbb{R}$ sampled i.i.d. from some unknown distribution $P$

  • Since the risk is relative to the problem difficulty, it is customary to compare it with the best possible risk achievable via some reference class of functions; in this work, we mainly focus on the class of all linear functions $\mathcal{F}_{\mathrm{lin}} = \{\langle w, \cdot \rangle : w \in \mathbb{R}^d\}$

  • One can assume without loss of generality that the infimum of the risk over $\mathcal{F}_{\mathrm{lin}}$ is attained by some linear function $\langle w^*, \cdot \rangle$, where $w^* \in \mathbb{R}^d$
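
For reference, the risk and the infimum mentioned in these highlights can be written out as follows. This is a sketch assuming the squared loss, which is the usual choice for least squares regression; the symbols $R$ and $\mathcal{E}$ are notation introduced here for illustration and may differ from the paper's.

```latex
\[
R(f) = \mathbb{E}\,\bigl(f(X) - Y\bigr)^2,
\qquad
\mathcal{E}(f) = R(f) - \inf_{w \in \mathbb{R}^d} R\bigl(\langle w, \cdot \rangle\bigr),
\qquad (X, Y) \sim P .
\]
```

Under this notation, an excess risk of order $d/n$ means that $\mathcal{E}(\widehat{f})$ is bounded by a constant multiple of $d/n$ for the estimator $\widehat{f}$ built from the $n$ samples.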


Summary

Introduction

In the random design regression problem, one has access to $n$ input-output pairs $(X_i, Y_i) \in \mathbb{R}^d \times \mathbb{R}$ sampled i.i.d. from some unknown distribution $P$. If we only impose Assumption 1, any statistical estimator that selects predictors from $\mathcal{F}_{\mathrm{lin}}$ (such an estimator is called proper) is bound to fail. This fact can be established using the recent result of Shamir [69, Theorem 3], and it remains true even when $d = 1$ and the response variable $Y$ is bounded almost surely. This observation separates our setup from the existing literature, where only proper estimators are studied for convex classes such as $\mathcal{F}_{\mathrm{lin}}$, even in heavy-tailed scenarios (see, for example, [14, 47, 53, 54]).
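
The contrast between proper and improper estimators can be made concrete with the truncated least squares idea named in the abstract: fit ordinary least squares, then clip the predictions to a bounded range. The following is a minimal sketch and not the paper's procedure; the function name `truncated_least_squares`, the truncation level `m`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def truncated_least_squares(X, y, m):
    """Fit ordinary least squares, then clip predictions to [-m, m].

    m is a hypothetical, user-chosen truncation level (an assumption of
    this sketch, not a value taken from the paper).
    """
    # Minimum-norm least squares solution; also works if X is rank deficient.
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    def predict(X_new):
        # Clipping makes the predictor non-linear in x, so the resulting
        # estimator is improper: it does not belong to the linear class.
        return np.clip(np.asarray(X_new) @ w_hat, -m, m)

    return w_hat, predict


# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.standard_normal((n, d))
w_star = np.array([1.0, -2.0, 0.5])
# Student-t noise with 3 degrees of freedom: heavy-tailed, finite variance.
y = X @ w_star + rng.standard_t(df=3, size=n)

w_hat, predict = truncated_least_squares(X, y, m=5.0)
preds = predict(X)
```

The returned `predict` function agrees with the linear fit wherever the linear prediction lies inside `[-m, m]` and is constant outside that range, which is what makes it a non-linear function of the covariates.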

Summary of contributions and structure of the paper
Related work
Notation
Distribution-free linear regression: known results
An improved bound for truncated least squares
Failure of previous estimators with constant probability
An optimal robust estimator in the high-probability regime
Warm-up: known covariance structure
Deviation-optimal robust estimator
Some extensions of Theorem 3
Statistical lower bounds and the necessity of Assumption 1
Deferred Proofs
Proof of Lemma 1
Proof of Lemma 2
Proof of Lemma 3
Proof of Lemma 4
