Abstract

The use of the Internet for various purposes leads to the collection of large volumes of data. The knowledge content of these large data sets can be used to improve an organization's decision-making process. Knowledge discovery on such high-volume data becomes very slow, as it has to be done serially on currently available terabyte-plus data sets. In some cases, mining a large data set may become impossible due to processor and memory limitations. The proposed algorithm is based on Tim Oates and David Jensen's (1) finding that increasing the size of the training data does not considerably increase the classification accuracy of a classifier. The proposed algorithm also follows the survival-of-the-fittest principle used in genetic algorithms. The solution is a partitioning algorithm in which decision trees are learned on disjoint subsets of the complete data set. These learned decision trees have accuracies comparable to each other and to the tree learned on the complete data set. The algorithm selects the single tree with the highest accuracy among the learned decision trees, and this tree is used to classify unseen data. Results on 12 benchmark data sets from the UCI data repository indicate that the final learned decision tree has accuracy equal to, and in many cases significantly better than, a decision tree learned on the entire data set. An experiment on the big data set Census-Income (KDD) also supports this claim. The most important aspect of this approach is that it is very simple compared to other methods, while offering enhanced classification performance.

1. INTRODUCTION

The volume of data in databases is growing to quite large sizes, both in the number of attributes and in the number of instances. Data mining provides tools for discovering relationships, patterns, and knowledge in databases. Organizations use the rules inferred from these databases to boost their businesses. Data mining on a very large set of records is a complex task: the number of records may overload a computer system's memory and processor, making the learning process very slow, and the data sets used for inference may reach terabytes in size. The solution proposed here for handling a large training data set is a divide-and-conquer technique: the given data set is partitioned horizontally into a number of non-overlapping subsets, a classifier is trained on each subset, and the fittest solution among them is used for the data mining task.
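The partition-train-select scheme described above can be illustrated with a short sketch. The following Python code is a minimal sketch, not the authors' implementation; it assumes scikit-learn's DecisionTreeClassifier and uses a held-out validation split (an assumed evaluation choice) to score each candidate tree. It partitions the training data into disjoint subsets, learns one decision tree per subset, and keeps the fittest tree for classifying unseen data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def fittest_partition_tree(X, y, n_partitions=4, random_state=0):
    """Learn one decision tree per disjoint partition and return the most
    accurate one (a sketch of the partition-and-select idea, not the
    paper's exact procedure)."""
    # Hold out a validation set on which the candidate trees are compared.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=random_state, stratify=y)

    # Shuffle indices and split them into disjoint, non-overlapping subsets.
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(X_train))
    partitions = np.array_split(order, n_partitions)

    best_tree, best_acc = None, -1.0
    for part in partitions:
        # Each tree sees only its own horizontal partition of the data.
        tree = DecisionTreeClassifier(random_state=random_state)
        tree.fit(X_train[part], y_train[part])
        acc = accuracy_score(y_val, tree.predict(X_val))
        # Keep the "fittest" tree, i.e. the one with the highest accuracy.
        if acc > best_acc:
            best_tree, best_acc = tree, acc
    return best_tree, best_acc

Because each tree is trained on only one subset, only a fraction of the data needs to be held in memory at a time, which reflects the practical motivation given in the abstract for data sets such as Census-Income (KDD).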
