Merging of Numerical Intervals in Entropy-Based Discretization.

Jerzy Grzymala-Busse, Teresa Mroczek

doi:10.3390/e20110880

Abstract

As previous research indicates, a multiple-scanning methodology for discretization of numerical datasets, based on entropy, is very competitive. Discretization is a process of converting numerical values of the data records into discrete values associated with numerical intervals defined over the domains of the data records. In multiple-scanning discretization, the last step is the merging of neighboring intervals in discretized datasets as a kind of postprocessing. Our objective is to check how the error rate, measured by tenfold cross validation within the C4.5 system, is affected by such merging. We conducted experiments on 17 numerical datasets, using the same setup of multiple scanning, with three different options for merging: no merging at all, merging based on the smallest entropy, and merging based on the biggest entropy. As a result of the Friedman rank sum test (5% significance level), we concluded that the differences between all three approaches are statistically insignificant. There is no universally best approach. Then, we repeated all experiments 30 times, recording averages and standard deviations. The test of the difference between averages shows that, for a comparison of no merging with merging based on the smallest entropy, there are statistically highly significant differences (with a 1% significance level). In some cases, the smaller error rate is associated with no merging; in others, it is associated with merging based on the smallest entropy. A comparison of no merging with merging based on the biggest entropy showed similar results. So, our final conclusion was that there are highly significant differences between no merging and merging, depending on the dataset. The best approach should be chosen by trying all three approaches.
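The three merging options named in the abstract can be illustrated with a small sketch. The abstract does not spell out the merging rule, so everything below is an assumption made for illustration: merging two neighboring intervals is modeled as dropping the cutpoint between them, each candidate merge is scored by the conditional entropy of the decision given the re-discretized attribute, and "smallest" or "biggest" picks the candidate with the minimum or maximum score. The names `interval_index`, `conditional_entropy`, and `merge_once`, and the toy data, are hypothetical and not taken from the paper.

```python
from bisect import bisect_right
from collections import Counter, defaultdict
from math import log2


def interval_index(value, cutpoints):
    """Return the index of the interval holding value, given sorted cutpoints."""
    return bisect_right(cutpoints, value)


def conditional_entropy(values, labels, cutpoints):
    """Entropy of the decision given the attribute discretized by cutpoints."""
    blocks = defaultdict(list)
    for v, d in zip(values, labels):
        blocks[interval_index(v, cutpoints)].append(d)
    n = len(labels)
    total = 0.0
    for block in blocks.values():
        weight = len(block) / n
        for count in Counter(block).values():
            p = count / len(block)
            total -= weight * p * log2(p)
    return total


def merge_once(values, labels, cutpoints, pick=min):
    """Merge one pair of neighboring intervals by dropping one cutpoint.

    ASSUMPTION: pick=min stands in for "smallest entropy" merging and
    pick=max for "biggest entropy" merging; the paper's exact rule may differ.
    """
    options = [cutpoints[:i] + cutpoints[i + 1:] for i in range(len(cutpoints))]
    return pick(options, key=lambda cps: conditional_entropy(values, labels, cps))


# Toy, made-up data: one numerical attribute and a two-valued decision.
values = [1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 4.0]
labels = ["low", "low", "low", "high", "high", "high", "high"]
cutpoints = [1.5, 2.5, 3.5]  # "no merging" keeps all three cutpoints

merged_smallest = merge_once(values, labels, cutpoints, pick=min)
merged_biggest = merge_once(values, labels, cutpoints, pick=max)
```

In this sketch, "no merging" corresponds to using `cutpoints` unchanged, while the two merged variants each have one interval fewer. Comparing error rates, as the abstract describes, would require training a classifier on each variant, which is outside this sketch.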

Highlights

Discretization of numerical attributes is an important technique used in data mining.
Discretization is the process of converting numerical values of data records into discrete values associated with numerical intervals defined over the domains of the data records.
Our results show that such interval merging is crucial for the quality of discretization.

Summary

Introduction

Discretization of numerical attributes is an important technique used in data mining. As follows from recent research [13,34,35], one of the discretization methods, called Multiple Scanning and based on entropy, is especially successful. The quality of a cutpoint is estimated by the conditional entropy of the decision given an attribute, and the best cutpoint is the one associated with the smallest conditional entropy. If the stopping condition is not satisfied, discretization is completed by another discretization method called Dominant Attribute [34,35]. Four discretization methods were compared in Reference [35]: the original C4.5 approach to discretization, the globalized versions of the Equal Interval Width and Equal Frequency per Interval methods, and Multiple Scanning. In that comparison, data mining was based on the C4.5 generation of decision trees. It was shown that the best discretization method is Multiple Scanning.
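The cutpoint criterion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes candidate cutpoints are the midpoints between consecutive distinct attribute values, that a cutpoint splits the cases into two blocks, and that the toy data is made up. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of decision values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(values, labels, cutpoint):
    """Entropy of the decision given the attribute split at cutpoint."""
    left = [d for v, d in zip(values, labels) if v < cutpoint]
    right = [d for v, d in zip(values, labels) if v >= cutpoint]
    n = len(labels)
    return sum(len(part) / n * entropy(part) for part in (left, right) if part)


def best_cutpoint(values, labels):
    """Return the candidate cutpoint with the smallest conditional entropy.

    ASSUMPTION: candidates are midpoints between consecutive distinct values.
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda q: conditional_entropy(values, labels, q))


# Toy, made-up data: one numerical attribute and a two-valued decision.
values = [1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 4.0]
labels = ["low", "low", "low", "high", "high", "high", "high"]

cut = best_cutpoint(values, labels)  # 2.5 for this toy data
```

Here the candidates are 1.5, 2.5, and 3.5. Splitting at 2.5 leaves one pure block and one block with a single minority case, which gives the smallest weighted entropy of the three.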

Discretization

Multiple Scanning

Interval Merging

Experiments

Findings

Conclusions

Journal: Entropy (Basel, Switzerland)
Publication Date: Nov 16, 2018
Citations: 2
License type: CC BY 4.0

Similar Papers

On the mixture maximum likelihood approach to estimation and clustering
Selvanayagam Ganesalingam
Bulletin of the Australian Mathematical Society | VOL. 24
01 Oct 1981

Rule Set Complexity in Mining Incomplete Data Using Global and Saturated Probabilistic Approximations
Patrick G Clark ... Teresa Mroczek
01 Jan 2019

Is Human Factors Ready for the Automobile?
Lyman M Forbes
Proceedings of the Human Factors Society Annual Meeting | VOL. 29
01 Oct 1985

Attribute Selection Based on Reduction of Numerical Attributes During Discretization
Jerzy W Grzymała-Busse ... Teresa Mroczek
17 Nov 2017
