Abstract

We derandomize Valiant’s (J ACM 62, Article 13, 2015) subquadratic-time algorithm for finding outlier correlations in binary data. This demonstrates that it is possible to perform a deterministic subquadratic-time similarity join of high dimensionality. Our derandomized algorithm gives deterministic subquadratic scaling essentially for the same parameter range as Valiant’s randomized algorithm, but the precise constants we save over quadratic scaling are more modest. Our main technical tool for derandomization is an explicit family of correlation amplifiers built via a family of zigzag-product expanders by Reingold et al. (Ann Math 155(1):157–187, 2002). We say that a function f : {−1, 1}^d → {−1, 1}^D is a correlation amplifier with threshold 0 ≤ τ ≤ 1, error γ ≥ 1, and strength p an even positive integer if for all pairs of vectors x, y ∈ {−1, 1}^d it holds that (i) |⟨x, y⟩| < τd implies |⟨f(x), f(y)⟩| ≤ (τγ)^p D; and (ii) |⟨x, y⟩| ≥ τd implies (⟨x, y⟩/(γd))^p D ≤ ⟨f(x), f(y)⟩ ≤ (γ⟨x, y⟩/d)^p D.
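
To make the definition concrete, here is a minimal sketch in Python with NumPy (our own illustration, not part of the paper; the names satisfies_amplifier_properties and sampled_amplifier are hypothetical). It checks properties (i) and (ii) for a single pair of vectors, and includes a simple sampling-based amplifier of the randomized kind the paper derandomizes, where each output coordinate is a product of p sampled input coordinates.

```python
import numpy as np

def satisfies_amplifier_properties(x, y, fx, fy, tau, gamma, p):
    """Check properties (i) and (ii) of the correlation-amplifier definition
    for one pair x, y in {-1,1}^d with images fx, fy in {-1,1}^D."""
    d, D = len(x), len(fx)
    ip, amp_ip = int(np.dot(x, y)), int(np.dot(fx, fy))
    if abs(ip) < tau * d:
        # (i) background pairs: amplified correlation stays below (tau*gamma)^p
        return abs(amp_ip) <= (tau * gamma) ** p * D
    # (ii) outlier pairs: amplified inner product is sandwiched between
    # (ip/(gamma*d))^p * D and (gamma*ip/d)^p * D (p is even, so both are >= 0)
    return (ip / (gamma * d)) ** p * D <= amp_ip <= (gamma * ip / d) ** p * D

def sampled_amplifier(idx):
    """Randomized amplification for intuition only. idx is a fixed (D, p)
    array of coordinate indices; the SAME idx must be reused for every input
    vector so that the result is a single function f. Each output coordinate
    is a product of p coordinates, so over the random choice of idx,
    E[f(x)_j * f(y)_j] = (<x,y>/d)^p."""
    def f(x):
        return np.prod(x[idx], axis=1)
    return f
```

For instance, one can draw idx = np.random.default_rng(0).integers(0, d, size=(D, p)) once, set f = sampled_amplifier(idx), and test pairs with satisfies_amplifier_properties: the sampled map satisfies the two properties only with high probability for a fixed pair, whereas the paper’s explicit expander-based construction guarantees them deterministically for all pairs.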

Highlights

  • We consider the task of identifying outlier-correlated pairs from large collections of weakly correlated binary vectors in {−1, 1}^d

  • The main result of this paper is that sufficiently powerful explicit amplifiers exist to find outlier correlations in deterministic subquadratic time

  • As a corollary we obtain a deterministic algorithm for finding outlier correlations in subquadratic time using bucketing and fast matrix multiplication

Summary

Introduction

We consider the task of identifying outlier-correlated pairs from large collections of weakly correlated binary vectors in {−1, 1}^d. Given sets X and Y of such vectors and thresholds 0 < τ < ρ < 1, our task is to output all outlier pairs (x, y) ∈ X × Y with ⟨x, y⟩ ≥ ρd, subject to the assumption that at most q of the pairs (x, y) ∈ X × Y satisfy ⟨x, y⟩ > τd.

Remark: This setting of binary vectors and (Pearson) correlation is directly motivated, among others, by the connection to Hamming distance.

Algorithms whose running time degrades as the outlier correlation ρ becomes small suffer from a “curse of weak outliers”; our interest is in algorithms that avoid this curse and run in subquadratic time essentially independently of the magnitude of ρ, provided that ρ is sufficiently separated from τ. Such an ability to identify weak outliers from large amounts of data is useful, among others, in machine learning from noisy data. A strategy of this form is oblivious to q until we start searching inside the buckets, which enables adjusting the bucketing parameters based on the number of large aggregate inner products.
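
The following is a minimal sketch of such a bucketing strategy (our own simplified Python/NumPy illustration; the function name, bucket_size, agg_threshold, and the amplifier f are hypothetical, and the paper’s precise parameterization and use of fast matrix multiplication are developed in the sections below): amplify all vectors, sum each bucket of amplified vectors into one aggregate vector, compare all bucket pairs with a single matrix multiplication, and search by brute force only inside bucket pairs whose aggregate inner product is large.

```python
import numpy as np

def outliers_by_bucketing(X, Y, f, bucket_size, agg_threshold, rho):
    """Simplified bucketing strategy (illustration only).

    X, Y: (n, d) arrays of +-1 vectors, with n divisible by bucket_size.
    f:    amplifier mapping a length-d vector to a length-D vector.
    Returns all pairs (i, j) with <X[i], Y[j]> >= rho * d that lie in a
    bucket pair whose aggregate inner product reaches agg_threshold."""
    n, d = X.shape
    FX = np.stack([f(x) for x in X])   # amplified copies of the inputs
    FY = np.stack([f(y) for y in Y])
    nb = n // bucket_size
    # sum each bucket of amplified vectors into one aggregate vector
    AX = FX.reshape(nb, bucket_size, -1).sum(axis=1)
    AY = FY.reshape(nb, bucket_size, -1).sum(axis=1)
    G = AX @ AY.T                      # one matrix multiplication over buckets
    out = []
    for a, b in zip(*np.nonzero(np.abs(G) >= agg_threshold)):
        # brute-force search inside the flagged bucket pair only
        for i in range(a * bucket_size, (a + 1) * bucket_size):
            for j in range(b * bucket_size, (b + 1) * bucket_size):
                if np.dot(X[i], Y[j]) >= rho * d:
                    out.append((i, j))
    return out
```

Amplification is what makes the aggregates informative: after amplification, the at most q outlier pairs dominate the aggregate inner products, while the many background pairs contribute only a small amount below the threshold, so few bucket pairs need to be searched.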

Randomized Amplification
Explicit Amplification
Our Results
Overview and Discussion of Techniques
Related Work and Applications
Preliminaries
Explicit Amplifiers by Approximate Squaring
Preliminaries on Expansion and Mixing
Main Construction
Copy-and-Truncate Preprocessing of the Input Dimension
Completing the Proof of Theorem 1
The Algorithm
Parameterization and Correctness
Running Time
The Light Bulb Problem
Learning Parities with Noise
Nonconstructive Existence and a Lower Bound
Low-Dimensional Amplifiers Exist
