Abstract

The current state of the art in supervised descriptive pattern mining is very good at automatically finding subsets of the dataset at hand that are exceptional in some sense. The most common form, subgroup discovery, generally finds subgroups where a single target variable has an unusual distribution. Exceptional model mining (EMM) typically finds subgroups where a pair of target variables displays an unusual interaction. What these methods have in common is that one specific exceptionality is enough to flag up a subgroup as exceptional. This naturally leads to the question: can we also find multiple instances of exceptional behaviour simultaneously in the same subgroup? This paper provides a first, affirmative answer to that question in the form of the SPEC (Subsets of Pairwise Exceptional Correlations) model class for EMM. Given a set of predefined numeric target variables, SPEC will flag up subgroups as interesting if multiple target pairs display an unusual rank correlation. This is a fundamental extension of the EMM toolbox, which comes with additional algorithmic challenges. To address these challenges, we provide a series of algorithmic solutions whose strengths and flaws are empirically analysed.

Highlights

  • We are living in a Golden Age of data science, where data mining techniques designed to discover valuable insights from a collection of records [1] are employed to transform tons of facts into useful information in fields as diverse as education [2], health care [3], and Internet of Things [4].

  • This deviation is quantified according to different measures: in terms of a relatively high/low occurrence, which is known as frequent/infrequent itemset mining [7], or an unusual distribution for a specific target variable, known as Subgroup Discovery (SD) [8], or even considering patterns of high utility for a specific aim [9]

  • It means that when a house includes a full finished basement, even when it is not located in a preferred neighbourhood and lacks some extras, the price is not related to the lot size since extra square meters are ... (Windsor Housing dataset). ... when this assertion is fine, this knowledge is partial and does not illustrate all the information that can be obtained, i.e. ... provides additional unusual interactions among target variables.

Summary

INTRODUCTION

We are living in a Golden Age of data science, where data mining techniques designed to discover valuable insights from a collection of records [1] are employed to transform tons of facts into useful information in fields as diverse as education [2], health care [3], and Internet of Things [4]. Taking the example widely used in EMM about the analysis of the housing price per square meter [11], the general know-how is that a larger lot size coincides with a higher sales price. At this point, an investor might wonder whether it is possible to find specific data subsets where the price of an additional square meter is significantly less than the norm, or even zero. SPEC fundamentally extends the typical EMM toolbox and, as such, requires fundamental algorithmic contributions as well, which this paper provides. In this regard, the contribution of this research work can be summarized as follows: 1) SPEC describes reasons to understand the cause of unusual interaction among multiple targets in data. It looks for good descriptors extracting interesting subsets of data on contrasting scenarios.
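The idea of flagging a subgroup when several target pairs show an unusual rank correlation can be illustrated with a minimal sketch. This is not the paper's quality measure or search algorithm: it assumes, purely for illustration, that a pair counts as "unusual" when its Spearman correlation inside the subgroup differs from the correlation in the complement by at least a hypothetical threshold `delta`, and that a subgroup is interesting when more than `beta` pairs are unusual. The function names, the row format (dictionaries) and the use of the complement as reference are all assumptions.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant target as uncorrelated
    return cov / (var_x * var_y) ** 0.5


def unusual_pairs(rows, targets, in_subgroup, delta):
    """List target pairs whose rank correlation inside the subgroup differs
    from the one in the complement by at least delta (illustrative criterion)."""
    inside = [r for r in rows if in_subgroup(r)]
    outside = [r for r in rows if not in_subgroup(r)]
    flagged = []
    for a, b in combinations(targets, 2):
        rho_in = spearman([r[a] for r in inside], [r[b] for r in inside])
        rho_out = spearman([r[a] for r in outside], [r[b] for r in outside])
        if abs(rho_in - rho_out) >= delta:
            flagged.append((a, b))
    return flagged


def is_exceptional(rows, targets, in_subgroup, delta, beta):
    """Flag the subgroup when more than beta target pairs are unusual."""
    return len(unusual_pairs(rows, targets, in_subgroup, delta)) > beta
```

With housing-style records, `in_subgroup` could be a description such as `lambda r: r["basement"]`, and `targets` a list such as `["lot", "price", "rooms"]`; both names are hypothetical and only serve the illustration.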

PRELIMINARIES
TASK COMPLEXITY
26: More than β unusual interactions
30: Update set P with D considering the maximum number n of solutions
EXPERIMENTAL ANALYSIS
DATASETS AND EXPERIMENTAL SET-UP
ANALYSIS OF THE GENETIC OPERATORS
ANALYSIS OF THE PERFORMANCE
LESSONS LEARNT
CONCLUSION
Findings
REFERENCE EXAMPLES REFERENCES
