Abstract
We investigate the problem of tuning and selecting among interestingness measures for association rules. We first derive a parametric normalization factor for such measures that addresses imbalanced itemset sizes, and show how it can be generalized across many previously derived measures. Next, we develop a validation-based framework for both the normalization and selection tasks, based upon mutual information measures over attributes. We then apply this framework to market basket data and to user profile data in weblogs, to automatically choose among or fine-tune alternative measures for generating and ranking rules. Finally, we show how the derived normalization factor can significantly improve the sensitivity of interestingness measures, both for pure association rule mining and for a classification task. We also consider how this data-driven approach can be used for fusion of association rule sets: either those elicited from subject matter experts, or those found using prior background knowledge.

INTRODUCTION
One of the most important aspects of association rule mining is ranking rules by their significance, according to some quantitative measure that expresses their interestingness with respect to a decision support or associative reasoning task. Rules take the form X → Y, where both X and Y are subsets of an observed itemset L = {I1, I2, ..., Ik}. Two well-known measures of association rule interestingness are the support, P(X), and the confidence, P(Y | X). These probabilistic measures have been combined with other statistical formulae to derive compound measures for discovering the most significant rules. One limitation of existing binary measures of rule interestingness is that they do not account for the relative size of the itemsets to which each candidate pair of associated subsets (X, Y) belongs. Moreover, some associations remain hidden because the candidates involved appear only in small groups.
Thus, giving some attention and weight to these small groups may lead to a different perspective on the relationships in the data. This kind of behavior can be seen, for example, in social network data, where each user record consists of features such as interests, communities, and schools attended. In particular, each user has a list of interests, each of which corresponds to a list of interest holders. Some interests, such as “DNA replication”, have low membership; whether this is because the interests are less popular or more specialized, it often suggests a more significant association between users naming them than between those who have interests such as “Music” or “Games” in
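
As a small illustration of the support and confidence measures defined in the introduction, the following sketch estimates P(X) and P(Y | X) from a toy list of transactions. The function names `support` and `confidence`, and the basket data, are hypothetical and chosen only for this example; the sketch follows the text in treating support as P(X), and it does not implement the normalization factor, whose form is not given in this section.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Estimate P(itemset): the fraction of transactions containing every item."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Estimate P(Y | X) as P(X and Y) / P(X); returns 0.0 if X never occurs."""
    x_items, y_items = frozenset(x), frozenset(y)
    p_x = support(x_items, transactions)
    if p_x == 0.0:
        return 0.0
    return support(x_items | y_items, transactions) / p_x


# Toy market-basket data (an assumption made purely for illustration).
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

print(support({"bread"}, baskets))               # 3 of 4 baskets contain bread
print(confidence({"bread"}, {"milk"}, baskets))  # share of bread baskets that also hold milk
```

Note that neither quantity depends on how large the itemsets containing X and Y are, which is the limitation the introduction points out.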