Finding Subcube Heavy Hitters in Analytics Data Streams

Branislav Kveton,Yikun Xian,Hoa T Vu,S Muthukrishnan

doi:10.1145/3178876.3186082

Abstract

Modern data streams typically have high dimensionality. For example, digital analytics streams consist of user online activities (e.g., web browsing activity, commercial site activity, apps and social behavior, and response to ads). An important problem is to find frequent joint values (heavy hitters) of subsets of dimensions. Formally, the data stream consists of d-dimensional items and a \em k-dimensional subcube T is a subset of k distinct coordinates. Given a theshold γ, a \em subcube heavy hitter query $\rm Query (T,v)$ outputs YES if $f_T(v) \geq γ$ and NO if $f_T(v) ∠ γ/4$ where $f_T$ is the ratio of the number of stream items whose coordinates T have joint values v. The \em all subcube heavy hitters query $\rm AllQuery(T)$ outputs all joint values v that return YES to $\rm Query (T,v)$. The problem is to answer these queries correctly for all T and v. We present a simple one-pass sampling algorithm to solve the subcube heavy hitters problem in $\tildeO (kd/γ)$ space. $\tildeO(\cdot)$ suppresses polylogarithmic factors. This is optimal up to polylogarithmic factors based on the lower bound of Liberty et al. In the worst case, this bound becomes Θ(d^2/γ)$ which is prohibitive for large d. Our main contribution is to circumvent this quadratic bottleneck via a model-based approach. In particular, we assume that the dimensions are related to each other via the Naive Bayes model. We present a new two-pass, $\tildeO (d/γ)$-space algorithm for our problem, and a fast algorithm for answering $\rm AllQuery (T)$ in $\tO((k/γ)^2)$ time. We demonstrate the effectiveness of our approach on a synthetic dataset as well as real datasets from Adobe and Yandex. Our work shows the potential of model-based approach to data streams.

Full Text