Abstract

The advance and use of technology in all walks of life has produced tremendous growth in the data available for data mining. This wealth of data can be exploited to improve decision-making processes. However, training data typically contains some noise or outliers, which degrades the classification performance of any classifier built on it, and learning on a large data set is slow because the entire data set must be processed serially. It has been shown that random data-reduction techniques can be used to build optimal decision trees. We therefore integrate data cleaning and data sampling to overcome the problems of handling large data sets. In the proposed technique, outlier data is first filtered out to obtain clean data of improved quality; a random sampling technique is then applied to this clean data set to obtain a reduced data set, from which an optimal decision tree is constructed. Experiments on several data sets show that the proposed technique builds decision trees with higher classification accuracy than trees trained on the complete data set. The classification filter improves data quality and the sampling reduces the size of the data set, so the proposed method constructs more accurate, optimally sized decision trees while avoiding overloading memory and processor with large data sets. In addition, the time required to build a model on clean data is significantly reduced, providing substantial speedup.
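The pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses scikit-learn, the Iris data set as a stand-in training set, a decision tree as the classification filter, and an arbitrary sampling fraction of 0.8.

```python
# Hedged sketch of the proposed pipeline: filter noisy instances with a
# classification filter, randomly sample the cleaned data, then build the
# final decision tree on the reduced set. Data set, filter model, and
# sampling fraction are illustrative assumptions, not the paper's choices.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 1) Classification filter: instances misclassified under cross-validation
#    are treated as likely noise/outliers and removed.
filter_preds = cross_val_predict(
    DecisionTreeClassifier(random_state=0), X_train, y_train, cv=5)
keep = filter_preds == y_train
X_clean, y_clean = X_train[keep], y_train[keep]

# 2) Random sampling: retain a random fraction of the cleaned data.
sample_frac = 0.8  # assumed fraction, for illustration only
idx = rng.choice(len(X_clean), size=int(sample_frac * len(X_clean)),
                 replace=False)
X_small, y_small = X_clean[idx], y_clean[idx]

# 3) Build the final decision tree on the reduced, cleaned data.
tree = DecisionTreeClassifier(random_state=0).fit(X_small, y_small)
print(f"kept {len(X_small)}/{len(X_train)} training instances, "
      f"test accuracy = {tree.score(X_test, y_test):.2f}")
```

On this toy data the tree trains on a strictly smaller set than the original while remaining competitive in accuracy, which is the trade-off the abstract claims at scale.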

Highlights

  • The volume of data in databases is growing to large sizes, both in the number of attributes and instances

  • Hybrid learning methodologies that integrate genetic algorithms (GAs) and decision tree learning for evolving optimal decision trees have been proposed by different authors

  • Prior to applying the proposed technique to large data sets, we found it appropriate to first test it on normal-sized benchmark data sets from the UCI repository [33]

Summary

INTRODUCTION

The volume of data in databases is growing rapidly, both in the number of attributes and in the number of instances. Data mining provides tools to infer knowledge from databases, and this knowledge can be used to improve business decisions. Mining a very large data set, however, may overload a computer system's memory and processor, making the learning process very slow. Data cleaning and sampling reduce the time complexity of decision tree learning: cleaning reduces overfitting, and because erroneous data is removed, the time complexity of the pruning process is also reduced significantly. In the proposed approach, a classification filter is first applied to the training data to improve its quality, and incremental random sampling is then applied to the filtered data.
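The incremental random sampling step can be sketched as follows. This is a hedged sketch, not the paper's algorithm: the geometric growth factor, starting size, and stopping tolerance are illustrative assumptions, and the Iris data set again stands in for a large training set.

```python
# Hedged sketch of incremental random sampling: draw progressively larger
# random samples from the (already filtered) pool and stop growing once
# validation accuracy no longer improves by more than a tolerance.
# Schedule, tolerance, and data set are assumptions for illustration.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = load_iris(return_X_y=True)
X_pool, X_val, y_pool, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

size, tol = 20, 0.01          # assumed starting size and tolerance
best_acc, chosen_size = 0.0, 20
while size <= len(X_pool):
    idx = rng.choice(len(X_pool), size=size, replace=False)
    acc = (DecisionTreeClassifier(random_state=0)
           .fit(X_pool[idx], y_pool[idx])
           .score(X_val, y_val))
    if acc <= best_acc + tol:  # no meaningful improvement: stop growing
        break
    best_acc, chosen_size = acc, size
    size = int(size * 1.5)     # geometric growth of the sample size

print(f"selected sample size: {chosen_size}, "
      f"validation accuracy: {best_acc:.2f}")
```

Stopping once accuracy plateaus is what lets the reduced data set stay small: additional instances are only paid for while they still buy accuracy.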

Decision Tree Construction
DATA CLEANING
DATA SAMPLING
PROPOSED MODEL
METHOD OF EXPERIMENTATION
EFFECT OF CLEANING ON DATA
EFFECT OF SAMPLING ON DATA
EFFECT OF CLEANING AND SAMPLING ON DATA
EXPERIMENTS WITH LARGE DATA SET
ANALYSIS ON ACCURACY OF THE TREE AND SIZE OF TRAINING DATA
ANALYSIS ON TIME REQUIRED TO BUILD THE MODEL
Findings
CONCLUSION