Abstract

Logical Analysis of Data is a procedure aimed at identifying relevant features in data sets with both positive and negative samples. The goal is to build Boolean formulas, represented by strings over {0,1,-} called patterns, which can be used to classify new samples as positive or negative. Since a data set can be explained in alternative ways, many computational problems arise related to the choice of a particular set of patterns. In this paper we study the computational complexity of several of these pattern problems (showing that they are, in general, computationally hard) and propose some integer programming models that appear to be effective. We describe an ILP model for finding the minimum-size set of patterns explaining a given set of samples, and another for the problem of determining whether two sets of patterns are equivalent, i.e., whether they explain exactly the same samples. Our first model builds on a polynomial procedure that computes all patterns compatible with a given set of samples. Computational experiments substantiate the effectiveness of our models on fairly large instances. Finally, we conjecture that an effective ILP model for finding a minimum-size set of patterns equivalent to a given set of patterns is unlikely to exist, since the problem is both NP-hard and co-NP-hard.
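As a concrete illustration of the pattern semantics sketched above (the function name below is our own, not taken from the paper), a pattern over {0,1,-} covers a binary vector when every fixed position of the pattern matches the vector, with '-' acting as a wildcard:

```python
def covers(pattern: str, vector: str) -> bool:
    """A pattern over {0,1,-} covers a binary vector if every
    non-'-' position of the pattern equals the vector's bit."""
    return len(pattern) == len(vector) and all(
        p == '-' or p == v for p, v in zip(pattern, vector)
    )

# A set of patterns "explains" a sample if at least one pattern covers it.
patterns = {"1-0", "01-"}
print(covers("1-0", "110"))  # True: positions 0 and 2 match, '-' is free
print(covers("1-0", "011"))  # False: position 0 disagrees
print(any(covers(p, "010") for p in patterns))  # True, via "01-"
```

Under this reading, a classifier built from patterns labels a new sample positive when some pattern in the chosen set covers it.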

Highlights

  • One of the main consequences of the constant progress of technology together with the massive use of computers in many aspects of our lives has been the creation of large repositories of data storing information of all sorts

  • In this paper we focus on some mathematical issues that arise from data mining problems

  • A very common situation for data mining problems is to represent the starting information by a two-dimensional array, in which the rows correspond to samples while the columns correspond to their characteristics


Summary

Introduction

One of the main consequences of the constant progress of technology, together with the massive use of computers in many aspects of our lives, has been the creation of large repositories of data storing information of all sorts. In this setting, finding a minimum-size set of patterns that covers a given set of vectors is called the Pattern Cover Minimality problem. Other problems arising from the analysis of patterns concern whether two different sets of rules explain the same data set, or, in other words, whether the two pattern sets are equivalent. In particular, we would like to know whether a given set of rules explains all possible data, and is therefore in some sense "useless", and, given a set of patterns, whether there exists a smaller set of patterns that explains the same data set.
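The equivalence question above can be made concrete by brute force: two pattern sets are equivalent exactly when they cover the same set of binary vectors, which for small dimension n can be checked by enumerating all 2^n vectors. The sketch below (our own illustration, exponential in n and thus only a sanity check, in contrast to the ILP approach the paper develops) makes that definition executable:

```python
from itertools import product

def covers(pattern: str, vector: str) -> bool:
    """A pattern over {0,1,-} covers a vector if every fixed bit matches."""
    return all(p == '-' or p == v for p, v in zip(pattern, vector))

def explained(pattern_set, n: int) -> set:
    """All length-n binary vectors covered by at least one pattern."""
    return {
        ''.join(bits)
        for bits in product('01', repeat=n)
        if any(covers(p, ''.join(bits)) for p in pattern_set)
    }

def equivalent(set_a, set_b, n: int) -> bool:
    """Brute-force equivalence: both sets explain the same vectors."""
    return explained(set_a, n) == explained(set_b, n)

# {"--"} covers every length-2 vector, and so does {"0-", "1-"}.
print(equivalent({"--"}, {"0-", "1-"}, 2))  # True
print(equivalent({"--"}, {"0-"}, 2))        # False: {"0-"} misses 10 and 11
```

A set like {"--"} that explains all 2^n vectors is exactly the "useless" case mentioned above.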

Basic Definitions
Computational Complexity Results
Compatible Patterns
ILP Models
ILP for Pattern Cover Minimality
ILP for Pattern Equivalence
Computational Experiments
Pattern Cover Minimality
Pattern Equivalence
Diagonal Instances
Generating Equivalent Pattern Sets in General
How to Boost Instances of Pattern Equivalence
Experiments
Conclusions