Abstract
Association rules require models to understand their relationship to statistical properties of the data set. In this work, we study mathematical relationships between association rules and two fundamental techniques: clustering and correlation. Each cluster represents an important itemset. We show the sufficient statistics for clustering and correlation on binary data sets are the linear sum of points and the quadratic sum of points, respectively. We prove itemset support can be bounded and approximated from both models. Support bounds and support estimation obey the set downward closure property for fast bottom-up search for frequent itemsets. Both models can be efficiently computed with sparse matrix computations. Experiments with real and synthetic data sets evaluate model accuracy and speed. The clustering model is accurate to estimate support, given a sufficiently large number of clusters and it is more accurate than correlation, except for sets of two items. Accuracy increases as the number of clusters grows, but decreases as the minimum support threshold decreases. Once built, the clustering model represents a faster alternative than the traditional A-priori algorithm and the correlation model to mine associations. The correlation model is faster to compute than clustering, but it is less accurate. Time complexity to compute both models is linear on data set size, whereas dimensionality marginally impacts time when analyzing large transaction data sets.
Paper version not known (
Free)
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have