A New Automated Big Data Partitioning Approach to Improve Condensation Methods Performance

Mohamed Malhat, Hamdy Mousa, Ashraf El-Sisi, Mohamed El-Menshawy

doi:10.1109/icenco.2018.8636145

Abstract

The enormous amount of structured and unstructured data produced in many fields has led to the era of big data. Existing mining algorithms cannot process data at this scale effectively, so data reduction techniques are commonly applied before data mining. Instance selection is one of the promising reduction techniques: it reduces the size of the training dataset by selecting the most relevant instances. However, traditional instance selection methods do not scale to large datasets because of memory limitations. Recent approaches partition the training dataset into subsets and apply instance selection methods to each subset individually. Most of these approaches rely on random partitioning, which negatively affects the performance of the instance selection methods, especially when the number of subsets is high. In this work, we propose a new partitioning approach called automated overlapped distance-based partitioning. Our approach assigns instances to subsets according to distance, and an instance can be assigned to two subsets based on a defined threshold. We implement the proposed approach and test it experimentally using six standard datasets and the condensed nearest neighbor (CNN) method, a standard instance-selection condensation method. The results demonstrate that our approach outperforms current random partitioning approaches in terms of the reduction rate and effectiveness criteria. Moreover, our approach maintains a high reduction rate and effectiveness as the number of subsets increases.
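The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how subset reference points are chosen or how the threshold is applied, so the `centers` argument, the rule "also assign to the second-nearest subset when the two distances differ by at most `threshold`", and the function names `overlapped_partition` and `cnn_select` are all assumptions made for illustration. The condensation step follows the classic condensed nearest neighbor rule, which keeps an instance only if the instances kept so far misclassify it.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def overlapped_partition(points, centers, threshold):
    """Assign each point index to the subset of its nearest center.

    Assumed overlap rule (not from the paper): the point is also placed in
    the second-nearest subset when the two distances differ by <= threshold.
    """
    subsets = [[] for _ in centers]
    for idx, p in enumerate(points):
        ranked = sorted(range(len(centers)),
                        key=lambda c: euclidean(p, centers[c]))
        nearest = ranked[0]
        subsets[nearest].append(idx)
        if len(ranked) > 1:
            second = ranked[1]
            gap = euclidean(p, centers[second]) - euclidean(p, centers[nearest])
            if gap <= threshold:
                subsets[second].append(idx)
    return subsets


def cnn_select(points, labels, indices):
    """Condensed nearest neighbor on one subset.

    Starts from the first instance and repeatedly adds any instance that the
    current store misclassifies under 1-NN, until a full pass adds nothing.
    """
    if not indices:
        return []
    store = [indices[0]]
    changed = True
    while changed:
        changed = False
        for i in indices:
            if i in store:
                continue
            nn = min(store, key=lambda s: euclidean(points[i], points[s]))
            if labels[nn] != labels[i]:
                store.append(i)
                changed = True
    return store


# Toy data: five 2-D points in two classes, two hypothetical subset centers.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.9, 0.0), (2.0, 0.0)]
labels = [0, 0, 0, 1, 1]
centers = [(0.0, 0.0), (2.0, 0.0)]

subsets = overlapped_partition(points, centers, threshold=0.5)
selected = sorted({i for s in subsets for i in cnn_select(points, labels, s)})
reduction_rate = 1 - len(selected) / len(points)
```

In this toy run the point at (1.0, 0.0) is equidistant from both centers, so it lands in both subsets. That is the kind of boundary instance overlap is meant to preserve: each subset still sees the instances near its border when condensation runs on it independently.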

Full Text