Abstract

An experiment was conducted to investigate the effect of the inconsistency rate (IR) of granulated datasets on classification performance. An unsupervised technique (equal-width interval, EWI) and a supervised technique (minimum description length, MDL) were used to granulate 36 datasets. An algorithm was developed to divide each granulated dataset into consistent and inconsistent subsets. Five classifiers, one simple tree-based and four ensemble-based, were applied to the datasets before granulation (BG), after granulation but before removal of the inconsistent subsets (AGBR), and after removal of the inconsistent subsets (AR), followed by testing and comparison of prediction accuracy (PA). The experimental results showed the following: (1) 24 of the 36 datasets granulated with EWI and 28 of the 36 granulated with MDL contained inconsistent subsets. (2) The PA on AR datasets tends to be higher than on BG and AGBR datasets for all classifiers, with both EWI and MDL. (3) The mean PA improvement ranges from 5.74% to 10.01% with EWI and from 8.74% to 13.73% with MDL. (4) The correlation coefficient between IR and PA improvement ranges from 0.7413 to 0.7901 with EWI and from 0.7870 to 0.9683 with MDL. These results demonstrate the value of uncovering the effect of IR on classification performance in the domain of machine learning.
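
As a rough illustration of the granulation and consistency-splitting steps summarized above, the sketch below shows one way such a split could be implemented in Python with pandas. The column names, bin count, and function names are assumptions made for illustration, not the algorithm used in the study.

```python
# A minimal sketch (Python/pandas), assuming a table of numeric features and a
# class label; names and the bin count are illustrative assumptions, not the
# authors' implementation.
import pandas as pd

def equal_width_granulate(df, feature_cols, n_bins=5):
    """Unsupervised equal-width interval (EWI) granulation of numeric features."""
    granulated = df.copy()
    for col in feature_cols:
        granulated[col] = pd.cut(df[col], bins=n_bins, labels=False, include_lowest=True)
    return granulated

def split_by_consistency(granulated, feature_cols, label_col):
    """Split a granulated dataset into consistent and inconsistent subsets.

    An instance is treated as inconsistent when another instance shares the
    same granulated feature vector but carries a different class label.
    """
    labels_per_granule = granulated.groupby(feature_cols)[label_col].transform("nunique")
    inconsistent_mask = labels_per_granule > 1
    consistent = granulated[~inconsistent_mask]
    inconsistent = granulated[inconsistent_mask]
    inconsistency_rate = inconsistent_mask.mean()  # IR: share of inconsistent instances
    return consistent, inconsistent, inconsistency_rate

# Hypothetical usage on a dataset with numeric features "f1", "f2" and label "class":
# granulated = equal_width_granulate(df, feature_cols=["f1", "f2"], n_bins=5)
# consistent, inconsistent, ir = split_by_consistency(granulated, ["f1", "f2"], "class")
```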
