Abstract
A prime objective in constructing data stream mining models is to achieve good accuracy, fast learning, and robustness to noise. Although many techniques have been proposed in the past, efforts to improve the accuracy of classification models have been somewhat disparate. These techniques include, but are not limited to, feature selection, dimensionality reduction, and the removal of noise from training data. One limitation common to all of these techniques is the assumption that the full training dataset must be available for processing. Although this has been effective for traditional batch training, it may not be practical for incremental classifier learning, also known as data stream mining, where only a single pass over the data stream is seen at a time. Because data streams are potentially unbounded, a consequence of the so-called big data phenomenon, the data preprocessing time must be kept to a minimum. This paper introduces a new data preprocessing strategy suitable for the progressive purging of noisy data from the training dataset without the need to process the whole dataset at one time. This strategy is shown via a computer simulation to provide the significant benefit of allowing for the dynamic removal of bad records from the incremental classifier learning process.
Highlights
Data preprocessing has traditionally referred to cleaning up the training dataset prior to sending it to the classifier construction learning process
Noise is known to be a cause of confusion in the construction of classification models and as a factor leading to a deterioration in accuracy
We regard noise as a contradicting instance that disagrees with the majority of the data; this disagreement leads to the establishment of erroneous rules in classification models and disrupts homogeneous meta-knowledge or statistical patterns by distorting the training dataset
Summary
Data preprocessing has traditionally referred to cleaning up the training dataset prior to sending it to the classifier construction learning process. At worst, this function doubles the total model learning time: a test run of one round of classification is needed to filter out instances that have been misclassified, and the actual model is then built only on the correctly classified instances. This is not an elegant solution for a user who opts for a lightweight data preprocessing step that should not take as much time as building one dummy classifier in advance to distinguish between good and bad data. These techniques are all principle-based solutions, and they are likely to share the same shortcoming when applied to data stream mining.
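The paper's own purging algorithm is not reproduced in this summary. As an illustrative sketch only, the idea of progressively purging contradicting instances in a single pass can be approximated with a sliding window and a local majority vote: an incoming record whose label disagrees with most of its nearest neighbours in the window is treated as noise and withheld from the learner. The function names, the k-nearest-neighbour vote, and all parameter values below are assumptions for illustration, not the authors' method.

```python
from collections import deque


def is_contradicting(x, y, window, k):
    """Assumed noise test: an instance contradicts the stream if its label
    disagrees with the majority label of its k nearest neighbours
    (squared Euclidean distance) in the current window."""
    neighbours = sorted(
        window,
        key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], x)),
    )[:k]
    agree = sum(1 for _, lbl in neighbours if lbl == y)
    return agree < k / 2


def purge_stream(stream, window_size=50, k=5):
    """Single-pass, bounded-memory filter: yield only instances whose label
    agrees with the local majority; purge the rest as noise."""
    window = deque(maxlen=window_size)  # bounded memory, no full-dataset scan
    for x, y in stream:
        # Withhold judgement until the window holds enough context.
        if len(window) < window_size or not is_contradicting(x, y, window, k):
            yield x, y
        window.append((x, y))  # every record still contributes context
```

Because the window is bounded and each record is examined once, the preprocessing cost stays proportional to the stream length rather than requiring a full dummy-classifier pass over the whole dataset, which is the lightweight behaviour the summary argues for.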