Complexity of Rule Sets Mined from Incomplete Data Using Probabilistic Approximations Based on Characteristic Sets

Teresa Mroczek, Patrick G. Clark, Jerzy W. Grzymala-Busse, Zdzislaw S. Hippe

doi:10.1016/j.procs.2023.10.029

Abstract

Real data sets are often incomplete. In mining such data it is important to identify the type of missing data, such as missing completely at random, missing at random, or missing not at random. In this paper we focus on two interpretations of missing attribute values: lost values and “do not care” conditions. Using those interpretations, global and saturated probabilistic approximations are constructed from characteristic sets. Thus four data mining methods are considered, combining two interpretations of missing attribute values with two kinds of probabilistic approximations. In our previous study, it was shown that pairwise differences in error rate between these four methods, evaluated by ten-fold cross validation, are statistically insignificant (5% level of significance). Hence, we explore the next important problem: which method yields the smallest rule set complexity. We show that the rule set complexity is the smallest when missing attribute values are interpreted as “do not care” conditions. The difference between the two kinds of probabilistic approximations is insignificant.
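To make the two interpretations concrete, the following is a minimal Python sketch of how characteristic sets can be computed from an incomplete decision table. It is not the authors' implementation. It assumes the convention common in the rough-set literature, where a lost value is written "?" and a "do not care" condition is written "*": a case with a lost value is excluded from every attribute-value block of that attribute, while a case with a "do not care" condition is included in all of them. The function names, the toy table, and the concept X are hypothetical. The sketch stops at the conditional probability Pr(X | K(x)) on which probabilistic approximations are based and does not attempt the global or saturated constructions.

```python
LOST = "?"        # assumed symbol for a lost value
DO_NOT_CARE = "*"  # assumed symbol for a "do not care" condition


def attribute_value_blocks(cases, attributes):
    """Map each (attribute, value) pair to the set of case indices in its block."""
    blocks = {}
    for a in attributes:
        specified = {c[a] for c in cases if c[a] not in (LOST, DO_NOT_CARE)}
        for v in specified:
            # "*" matches every specified value; "?" matches none.
            blocks[(a, v)] = {
                i for i, c in enumerate(cases) if c[a] == v or c[a] == DO_NOT_CARE
            }
    return blocks


def characteristic_set(i, cases, attributes, blocks):
    """Intersect the blocks of case i's specified values.

    A missing value of either kind contributes the whole universe,
    so it does not narrow the intersection.
    """
    k = set(range(len(cases)))
    for a in attributes:
        v = cases[i][a]
        if v not in (LOST, DO_NOT_CARE):
            k &= blocks[(a, v)]
    return k


def conditional_probability(concept, k):
    """Pr(X | K) = |X intersect K| / |K|; K always contains its own case."""
    return len(concept & k) / len(k)


# Hypothetical toy table with two attributes.
attributes = ["Temperature", "Headache"]
cases = [
    {"Temperature": "high", "Headache": "?"},
    {"Temperature": "*", "Headache": "yes"},
    {"Temperature": "normal", "Headache": "yes"},
    {"Temperature": "high", "Headache": "no"},
]
blocks = attribute_value_blocks(cases, attributes)
char_sets = [characteristic_set(i, cases, attributes, blocks) for i in range(len(cases))]

X = {0, 1}  # hypothetical concept: cases sharing one decision value
probs = [conditional_probability(X, k) for k in char_sets]
```

In this toy table, case 1 (a "do not care" condition on Temperature) appears in both Temperature blocks, whereas case 0 (a lost Headache value) appears in no Headache block. A probabilistic approximation with threshold alpha would then be built from characteristic sets whose conditional probability is at least alpha.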

Full Text