In this paper, we discuss a rough set approach to missing attribute values. Among the many ways of interpreting missing values, we focus on two: lost values and “do not care” conditions. Using these interpretations, global and saturated probabilistic approximations are constructed from two types of granules: characteristic sets and maximal consistent blocks. We compare eight approaches, each combining one of the two interpretations of missing attribute values, one of the two types of probabilistic approximations, and one of the two types of granules. The approaches are compared by error rate, computed using ten-fold cross-validation. Using a 5% level of statistical significance, we present the experimental results for these eight approaches, which show statistically significant differences between all approaches to mining incomplete data. The results also show that no single approach is best for every data set, so all eight approaches should be attempted. The final section of the paper presents the idea of concept-compatible data sets. We show that for such data sets, the global and saturated probabilistic approximations of a concept are identical to the concept itself. We also show that an incomplete data set with no duplicate rows is concept-compatible under the lost-value interpretation of missing attribute values.
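As an illustration of the two interpretations and of one granule type, the following minimal Python sketch computes characteristic sets for a toy incomplete data set. It follows definitions common in the rough set literature, which are assumed here to match those used in the paper: "?" marks a lost value and "*" a "do not care" condition; a case with a lost value for an attribute belongs to no attribute-value block of that attribute, while a case with a "do not care" condition belongs to every such block. The table, attribute names, and function names are hypothetical. The sketch does not construct the global or saturated probabilistic approximations themselves; it only shows the conditional probability Pr(X | K(x)) on which probabilistic approximations are typically based.

```python
LOST, DONT_CARE = "?", "*"

def attribute_value_blocks(table, attr):
    """Return {v: block} where block is the set of case indices in [(attr, v)]."""
    values = {row[attr] for row in table} - {LOST, DONT_CARE}
    blocks = {v: set() for v in values}
    for i, row in enumerate(table):
        val = row[attr]
        if val == DONT_CARE:
            # A "do not care" case is placed in every block of this attribute.
            for v in values:
                blocks[v].add(i)
        elif val != LOST:
            # A specified value places the case in exactly one block.
            blocks[val].add(i)
        # A lost value places the case in no block of this attribute.
    return blocks

def characteristic_set(table, attrs, x):
    """Intersect K(x, a) over all attributes a; K(x, a) is the universe if a(x) is missing."""
    result = set(range(len(table)))
    for a in attrs:
        val = table[x][a]
        if val in (LOST, DONT_CARE):
            continue  # K(x, a) = U, so the intersection is unchanged
        result &= attribute_value_blocks(table, a)[val]
    return result

def conditional_probability(concept, granule):
    """Pr(X | K) = |X ∩ K| / |K| for a non-empty granule K."""
    return len(concept & granule) / len(granule)

# Hypothetical toy data: four cases, two attributes.
ATTRS = ["Temperature", "Headache"]
TABLE = [
    {"Temperature": "high",   "Headache": "?"},    # case 0: lost value
    {"Temperature": "*",      "Headache": "yes"},  # case 1: "do not care"
    {"Temperature": "normal", "Headache": "no"},   # case 2
    {"Temperature": "high",   "Headache": "yes"},  # case 3
]
FLU = {0, 3}  # hypothetical concept: cases with decision value "flu"
```

With this table, `characteristic_set(TABLE, ATTRS, 0)` evaluates to `{0, 1, 3}`, and `conditional_probability(FLU, {0, 1, 3})` is 2/3, so under this definition case 0's characteristic set would qualify for a probabilistic approximation of the concept only when the threshold is at most 2/3.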