Abstract

We study the problem of detecting a structured, low-rank signal matrix corrupted with additive Gaussian noise. This includes clustering in a Gaussian mixture model, sparse PCA, and submatrix localization. Each of these problems is conjectured to exhibit a sharp information-theoretic threshold, below which the signal is too weak for any algorithm to detect. We derive upper and lower bounds on these thresholds by applying the first and second moment methods to the likelihood ratio between these “planted models” and null models where the signal matrix is zero. For sparse PCA and submatrix localization, we determine this threshold exactly in the limit where the number of blocks is large or the signal matrix is very sparse; for the clustering problem, our bounds differ by a factor of $\sqrt{2}$ when the number of clusters is large. Moreover, our upper bounds show that for each of these problems there is a significant regime where reliable detection is information-theoretically possible but where known algorithms such as PCA fail completely, since the spectrum of the observed matrix is uninformative. This regime is analogous to the conjectured “hard but detectable” regime for community detection in sparse graphs.

Highlights

  • Many problems in machine learning, signal processing, and statistical inference have a common, unifying goal: reconstruct a low-rank signal matrix observed through a noisy channel.

  • Our results verify the conjecture that the information-theoretic threshold dips below the spectral one when the signal is sufficiently sparse, or when the number of clusters or blocks is sufficiently large. This corresponds to recent results [1, 9, 17, 23] showing that, in the stochastic block model, the information-theoretic detectability threshold falls below the Kesten-Stigum bound above which efficient spectral and message-passing algorithms succeed [19, 18, 39, 33, 14].

  • We find that if the number of clusters is large, clustering is information-theoretically possible even below the spectral phase transition threshold, and we conjecture that in this regime it is computationally hard to identify the clusters.


Summary

Introduction

Many problems in machine learning, signal processing, and statistical inference have a common, unifying goal: reconstruct a low-rank signal matrix observed through a noisy channel. Our results verify the conjecture that the information-theoretic threshold dips below the spectral one when the signal is sufficiently sparse, or when the number of clusters or blocks is sufficiently large. This corresponds to recent results [1, 9, 17, 23] showing that, in the stochastic block model, the information-theoretic detectability threshold falls below the Kesten-Stigum bound above which efficient spectral and message-passing algorithms succeed [19, 18, 39, 33, 14]. Their characterization of reconstruction thresholds does not directly apply to detection.
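To make the "spectral threshold" concrete, the following is a minimal illustrative sketch, not the paper's exact model or normalization. It assumes a generic rank-one spiked symmetric matrix: Gaussian noise scaled so that its spectrum fills roughly [-2, 2], plus snr times the outer product of a sparse unit vector with itself. The function names and the parameters n, snr, and sparsity are hypothetical. In this assumed setup the largest eigenvalue separates from the noise bulk only when snr exceeds 1; below that point the top eigenvalue looks the same with or without a planted signal, which is the sense in which the spectrum becomes uninformative.

```python
import numpy as np

def sample_observation(n, snr, sparsity, rng, planted=True):
    """Return a symmetric n x n matrix: noise alone, or noise plus a rank-one spike.

    Assumed normalization (illustrative only): off-diagonal noise entries
    have variance 1/n, so the noise spectrum concentrates on [-2, 2].
    """
    g = rng.standard_normal((n, n))
    noise = (g + g.T) / np.sqrt(2 * n)
    if not planted:
        return noise
    # Sparse signal: each coordinate is nonzero with probability `sparsity`,
    # with a random sign; the vector is then scaled to unit length.
    support = rng.random(n) < sparsity
    signs = rng.choice([-1.0, 1.0], size=n)
    x = np.where(support, signs, 0.0)
    norm = np.linalg.norm(x)
    if norm > 0:
        x = x / norm
    return noise + snr * np.outer(x, x)

def top_eigenvalue(m):
    """Largest eigenvalue of a symmetric matrix (eigvalsh sorts ascending)."""
    return np.linalg.eigvalsh(m)[-1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, sparsity = 400, 0.1
    null_top = top_eigenvalue(sample_observation(n, 0.0, sparsity, rng, planted=False))
    print("null model:", null_top)
    for snr in (0.5, 3.0):
        planted_top = top_eigenvalue(sample_observation(n, snr, sparsity, rng))
        print("planted, snr =", snr, ":", planted_top)
```

In this sketch the top eigenvalue at snr = 0.5 should sit near the edge of the noise bulk, as it does under the null model, while at snr = 3.0 it should lie near snr + 1/snr. A detector that looks only at the spectrum cannot tell the weak planted case from pure noise, even though other tests might.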

Sparse PCA
Submatrix Localization
Gaussian Mixture Clustering
The likelihood ratio and hypothesis testing
Second moment bounds and contiguity
Conditional second moment method
Non-reconstructibility
Notation and preliminary lemmas
First moment upper bound for sparse PCA
Second moment lower bound for sparse PCA
Conditional second moment lower bound for sparse PCA
First moment upper bound for submatrix localization
Second moment lower bound for submatrix localization
Conditional second moment lower bound for submatrix localization
First moment upper bound for Gaussian mixture clustering
Second moment lower bound for Gaussian mixture clustering
Proof of Theorem 4
Proof of Theorem 5