Solving inverse frequent itemset mining with infrequency constraints via large-scale linear programs

Antonella Guzzo,Luigi Moccia,Domenico Saccà,Edoardo Serra

doi:10.1145/2541268.2541271

Abstract

Inverse frequent set mining (IFM) is the problem of computing a transaction database D satisfying given support constraints for some itemsets, which are typically the frequent ones. This article proposes a new formulation of IFM, called IFM I (IFM with infrequency constraints ), where the itemsets that are not listed as frequent are constrained to be infrequent; that is, they must have a support less than or equal to a specified unique threshold. An instance of IFM I can be seen as an instance of the original IFM by making explicit the infrequency constraints for the minimal infrequent itemsets, corresponding to the so-called negative generator border defined in the literature. The complexity increase from PSPACE (complexity of IFM) to NEXP (complexity of IFM I ) is caused by the cardinality of the negative generator border, which can be exponential in the original input size. Therefore, the article introduces a specific problem parameter κ that computes an upper bound to this cardinality using a hypergraph interpretation for which minimal infrequent itemsets correspond to minimal transversals. By fixing a constant k , the article formulates a k -bounded definition of the problem, called k -IFM I , that collects all instances for which the value of the parameter κ is less than or equal to k —its complexity is in PSPACE as for IFM. The bounded problem is encoded as an integer linear program with a large number of variables (actually exponential w.r.t. the number of constraints), which is thereafter approximated by relaxing integer constraints—the decision problem of solving the linear program is proven to be in NP. In order to solve the linear program, a column generation technique is used that is a variation of the simplex method designed to solve large-scale linear programs, in particular with a huge number of variables. The method at each step requires the solution of an auxiliary integer linear program, which is proven to be NP hard in this case and for which a greedy heuristic is presented. The resulting overall column generation solution algorithm enjoys very good scaling as evidenced by the intensive experimentation, thereby paving the way for its application in real-life scenarios.

Full Text