Abstract
Missing or incomplete data sets are a common problem in data mining. When structured data of this type are mined, the interpretation of missing attribute values affects both the accuracy and the complexity of the induced rule sets. This paper studies two interpretations of missing attribute values: lost values and “do not care” conditions. It also studies two new types of probabilistic approximations: global and saturated. These are combined to produce four primary data mining experiments: rule induction with each of the two types of approximations under each of the two interpretations of missing attribute values. The main objective of this work is to compare the complexity of the rule sets induced by the four approaches and to identify the approach that yields the lowest complexity. This complements previous research, in which experimental evidence shows that none of the four approaches induces rules with the lowest error in all scenarios; the best approach depends on the data set being mined. The results of this paper’s complexity experiments show that the “do not care” condition interpretation provides simpler rule sets than the lost value interpretation of missing attribute values. Furthermore, there are no statistically significant differences in complexity between global and saturated probabilistic approximations.