Abstract

The advance and use of technology in all walks of life has produced tremendous growth in the data available for data mining. This wealth of data can be exploited to improve decision-making processes. However, training data typically contains some noise or outliers, which degrades the classification performance of any classifier built on it, and learning on a large data set is slow because the entire data set must be processed serially. It has been shown that random data-reduction techniques can be used to build optimal decision trees. We therefore integrate data cleaning and data sampling to overcome the problems of handling large data sets. In the proposed technique, outlier data is first filtered out to obtain clean data of improved quality; a random sampling technique is then applied to this clean data set to obtain a reduced data set, from which an optimal decision tree is constructed. Experiments on several data sets show that the proposed technique builds decision trees with higher classification accuracy than trees trained on the complete data set. The classification filter improves data quality and the sampling reduces the size of the data set, so the proposed method constructs more accurate, optimally sized decision trees while avoiding overloading memory and processor with large data sets. In addition, the time required to build a model on clean data is significantly reduced, providing substantial speedup.
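The pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses scikit-learn, the Iris data set as a stand-in training set, a decision tree as the classification filter, and an arbitrary sampling fraction of 0.8.

```python
# Hedged sketch of the proposed pipeline: filter noisy instances with a
# classification filter, randomly sample the cleaned data, then build the
# final decision tree on the reduced set. Data set, filter model, and
# sampling fraction are illustrative assumptions, not the paper's choices.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 1) Classification filter: instances misclassified under cross-validation
#    are treated as likely noise/outliers and removed.
filter_preds = cross_val_predict(
    DecisionTreeClassifier(random_state=0), X_train, y_train, cv=5)
keep = filter_preds == y_train
X_clean, y_clean = X_train[keep], y_train[keep]

# 2) Random sampling: retain a random fraction of the cleaned data.
sample_frac = 0.8  # assumed fraction, for illustration only
idx = rng.choice(len(X_clean), size=int(sample_frac * len(X_clean)),
                 replace=False)
X_small, y_small = X_clean[idx], y_clean[idx]

# 3) Build the final decision tree on the reduced, cleaned data.
tree = DecisionTreeClassifier(random_state=0).fit(X_small, y_small)
print(f"kept {len(X_small)}/{len(X_train)} training instances, "
      f"test accuracy = {tree.score(X_test, y_test):.2f}")
```

On this toy data the tree trains on a strictly smaller set than the original while remaining competitive in accuracy, which is the trade-off the abstract claims at scale.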

Highlights

  • The volume of data in databases is growing to large sizes, both in the number of attributes and instances

  • Hybrid learning methodologies that integrate genetic algorithms (GAs) and decision tree learning for evolving optimal decision trees have been proposed by different authors

  • Prior to applying the proposed technique to large data sets, we found it appropriate to first test it on normal-sized benchmark data sets from the UCI repository [33]

Summary

INTRODUCTION

The volume of data in databases is growing rapidly, both in the number of attributes and in the number of instances. Data mining provides tools to infer knowledge from databases, and this knowledge can be used to improve business decisions. Mining a very large data set, however, may overload a computer system's memory and processor, making the learning process very slow. Data cleaning and sampling reduce the time complexity of decision tree learning: cleaning reduces overfitting, and because erroneous data is removed, the time complexity of the pruning process is also reduced significantly. In the proposed approach, a classification filter is first applied to the training data to improve its quality, and incremental random sampling is then applied to the filtered data.
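The incremental random sampling step can be sketched as follows. This is a hedged sketch, not the paper's algorithm: the geometric growth factor, starting size, and stopping tolerance are illustrative assumptions, and the Iris data set again stands in for a large training set.

```python
# Hedged sketch of incremental random sampling: draw progressively larger
# random samples from the (already filtered) pool and stop growing once
# validation accuracy no longer improves by more than a tolerance.
# Schedule, tolerance, and data set are assumptions for illustration.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = load_iris(return_X_y=True)
X_pool, X_val, y_pool, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

size, tol = 20, 0.01          # assumed starting size and tolerance
best_acc, chosen_size = 0.0, 20
while size <= len(X_pool):
    idx = rng.choice(len(X_pool), size=size, replace=False)
    acc = (DecisionTreeClassifier(random_state=0)
           .fit(X_pool[idx], y_pool[idx])
           .score(X_val, y_val))
    if acc <= best_acc + tol:  # no meaningful improvement: stop growing
        break
    best_acc, chosen_size = acc, size
    size = int(size * 1.5)     # geometric growth of the sample size

print(f"selected sample size: {chosen_size}, "
      f"validation accuracy: {best_acc:.2f}")
```

Stopping once accuracy plateaus is what lets the reduced data set stay small: additional instances are only paid for while they still buy accuracy.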

Decision Tree Construction
DATA CLEANING
DATA SAMPLING
PROPOSED MODEL
METHOD OF EXPERIMENTATION
EFFECT OF CLEANING ON DATA
EFFECT OF SAMPLING ON DATA
EFFECT OF CLEANING AND SAMPLING ON DATA
EXPERIMENTS WITH LARGE DATA SET
ANALYSIS ON ACCURACY OF THE TREE AND SIZE OF TRAINING DATA
ANALYSIS ON TIME REQUIRED TO BUILD THE MODEL
Findings
CONCLUSION