Abstract

Windowing is a sub-sampling method, originally proposed to cope with large datasets when inducing decision trees with the ID3 and C4.5 algorithms. The method exhibits a strong negative correlation between the accuracy of the learned models and the number of examples used to induce them, i.e., the higher the accuracy of the obtained model, the fewer the examples used to induce it. This paper contributes to a better understanding of this behavior in order to promote windowing as a sub-sampling method for Distributed Data Mining. For this, the generalization of the behavior of windowing beyond decision trees is established by corroborating the observed negative correlation when adopting inductive algorithms of a different nature. Then, focusing on decision trees, the windows (samples) and the obtained models are analyzed in terms of Minimum Description Length (MDL), Area Under the ROC Curve (AUC), Kullback–Leibler divergence, and the similitude metric Sim1, and compared to those obtained with traditional sampling methods: random, balanced, and stratified sampling. It is shown that the aggressive sampling performed by windowing, up to 3% of the original dataset, induces models that are significantly more accurate than those obtained from the traditional sampling methods, among which only balanced sampling is comparable in terms of AUC. Although the considered informational properties did not correlate with the obtained accuracy, they provide clues about the behavior of windowing and suggest further experiments to enhance both this understanding and the performance of the method, e.g., studying the evolution of the windows over time.

Highlights

  • Windowing is a sub-sampling method that enabled the decision tree inductive algorithms ID3 [1,2,3] and C4.5 [4,5] to cope with large datasets, i.e., those whose size precludes loading them in memory

  • Statistical tests confirm the significant gains produced by windowing in terms of the studied metrics

  • Independently of the inductive method used with windowing, high accuracies correlate with aggressive samplings, up to 3% of the original datasets


Introduction

Windowing is a sub-sampling method that enabled the decision tree inductive algorithms ID3 [1,2,3] and C4.5 [4,5] to cope with large datasets, i.e., those whose size precludes loading them in memory. Algorithm 1 defines the method: First, a window is created by extracting a small random sample of the available examples in the full dataset. The main step consists of inducing a model with that window and of testing it on the remaining examples, such that all misclassified examples are moved to the window. This step iterates until a stop condition is reached, e.g., all the available examples are correctly classified or a desired level of accuracy is reached.
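A minimal sketch of this loop is given below, assuming a scikit-learn-style classifier with fit/predict methods and NumPy arrays as inputs. The function name windowing, the initial window fraction, the iteration cap, and the use of DecisionTreeClassifier are illustrative assumptions, not the authors' implementation.

# Illustrative sketch of the windowing loop described above.
# Assumptions (not from the paper's code): a scikit-learn-style classifier,
# NumPy arrays X (features) and y (labels), and accuracy on the remaining
# examples as the stop test.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def windowing(X, y, init_frac=0.03, min_accuracy=1.0, max_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    # Step 1: start the window with a small random sample of the dataset.
    in_window = np.zeros(n, dtype=bool)
    in_window[rng.choice(n, size=max(1, int(init_frac * n)), replace=False)] = True
    model = None
    for _ in range(max_iters):
        # Step 2: induce a model from the current window only.
        model = DecisionTreeClassifier(random_state=seed)
        model.fit(X[in_window], y[in_window])
        # Step 3: test on the remaining examples and move the
        # misclassified ones into the window.
        rest = ~in_window
        if not rest.any():
            break  # every example is already in the window
        wrong = rest.copy()
        wrong[rest] = model.predict(X[rest]) != y[rest]
        accuracy = 1.0 - wrong.sum() / rest.sum()
        if accuracy >= min_accuracy:
            break  # stop condition reached on the remaining examples
        in_window |= wrong
    return model, in_window

With min_accuracy=1.0 the loop mirrors the stop condition of Algorithm 1, iterating until all remaining examples are correctly classified; relaxing this threshold trades accuracy for a smaller window.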
