
We study the problem of selecting a subset of k random variables to observe that will yield the best linear prediction of another variable of interest, given the pairwise correlations between the observation variables and the predictor variable. Under approximation-preserving reductions, this problem is equivalent to the sparse approximation problem of approximating signals concisely. The subset selection problem is NP-hard in general; in this paper, we propose and analyze exact and approximation algorithms for several special cases of practical interest. Specifically, we give an FPTAS when the covariance matrix has constant bandwidth, and exact algorithms when the associated covariance graph, consisting of edges for pairs of variables with non-zero correlation, forms a tree or has a large (known) independent set. Furthermore, we give an exact algorithm when the variables can be embedded into a line such that the covariance decreases exponentially in the distance, and a constant-factor approximation when the variables have no conditional suppressor variables. Much of our reasoning is based on perturbation results for the R² multiple correlation measure, which is frequently used as a natural goodness-of-fit statistic. This perturbation analysis lies at the core of our FPTAS, and also allows us to extend our exact algorithms to approximation algorithms when the matrix nearly falls into one of the above classes. We also use our perturbation analysis to prove approximation guarantees for the widely used Forward Regression heuristic under the assumption that the observation variables are nearly independent.
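To make the objective and the Forward Regression heuristic concrete, here is a minimal sketch and not the paper's own implementation. It assumes all variables are normalized to zero mean and unit variance, so that the R² of a subset S is b_S^T C_S^{-1} b_S, where C holds the covariances between observation variables and b holds their covariances with the variable to be predicted. The function names (`solve`, `r_squared`, `forward_regression`) and the toy inputs are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def solve(a: List[List[float]], y: List[float]) -> List[float]:
    """Solve a x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [y[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance submatrix is numerically singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(C: Sequence[Sequence[float]], b: Sequence[float],
              S: Sequence[int]) -> float:
    """R^2 of the best linear predictor using only the variables in S.

    Assumes normalized variables, so R^2 = b_S^T C_S^{-1} b_S.
    """
    if not S:
        return 0.0
    C_S = [[C[i][j] for j in S] for i in S]
    b_S = [b[i] for i in S]
    w = solve(C_S, b_S)  # optimal regression coefficients for S
    return sum(wi * bi for wi, bi in zip(w, b_S))


def forward_regression(C: Sequence[Sequence[float]], b: Sequence[float],
                       k: int) -> List[int]:
    """Greedily add the variable that most increases R^2, k times."""
    selected: List[int] = []
    for _ in range(min(k, len(b))):
        candidates = [i for i in range(len(b)) if i not in selected]
        best = max(candidates, key=lambda i: r_squared(C, b, selected + [i]))
        selected.append(best)
    return selected


# Toy example: three independent observation variables (hypothetical data).
C = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [0.5, 0.1, 0.7]
print(forward_regression(C, b, 2))  # prints [2, 0]
```

With independent variables the greedy choice is optimal, since R² is then just the sum of squared correlations. The interesting cases are those with correlated variables, where each greedy step can only be guaranteed to be approximately optimal.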

