Abstract

With a growing number of tools and applications constantly producing massive amounts of data, processing and correctly classifying that data is becoming both harder and more important. The task is hindered by changes in the data distribution over time, known as concept drift, and by class imbalance, as found in network attack detection or fraud detection. In this work, we propose modifications to two existing stream processing methods, Accuracy Weighted Ensemble (AWE) and Accuracy Updated Ensemble (AUE), which have proven effective at adapting to time-varying class distributions. The changes aim to improve their quality in binary classification of imbalanced data. The proposed modifications include aggregate metrics, such as F1-score, G-mean and balanced accuracy score, in the calculation of the member classifiers' weights, which affects the ensemble composition and the final prediction. The impact of data sampling on the algorithms' effectiveness was also examined. Comprehensive experiments were conducted to identify the most promising type of modification and to compare the proposed methods with existing solutions. The experimental evaluation shows an improvement in classification quality over the underlying algorithms and over other solutions for processing imbalanced data streams.

Highlights

  • Data stream analysis has recently become an increasingly popular topic in the pattern recognition field [1,2]

  • This paper presents a novel approach that extends state-of-the-art data stream processing methods with modified weighting metrics for member classifiers, taking into account the prior probabilities of the classes over the course of a data stream that contains various types of concept drift

  • An in-depth experimental analysis of the proposed methods was carried out, using three standard aggregate metrics for assessing the quality of prediction models built for imbalanced classification problems, together with statistical tests to verify the significance of differences between models

Summary

Introduction

Data stream analysis has recently become an increasingly popular topic in the pattern recognition field [1,2]. The characteristics of data streams lead to some strict requirements for classifiers operating in this setting: fast data processing in which each object may be presented for training only once, low memory consumption, the ability to produce a prediction at any time, and the ability to adapt to the changing distribution of problem classes [13]. This is to reduce imbalance based on non-synthetic data (as opposed to artificially increasing the number). This solution does not take into account the possibility that the minority class distribution changes over time, and it violates the principle that each sample should be used only once. The aim of this work is to modify popular ensemble models so that they use imbalanced classification metrics when weighting the member classifiers, and to compare them with existing data stream processing solutions. The paper presents preliminary research on the topic and focuses on the binary classification task.
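The idea of weighting ensemble members by an imbalance-aware metric can be illustrated with a minimal Python sketch. It is not the weighting scheme used by AWE or AUE, nor the exact formula from this paper; the function names (`imbalance_metrics`, `weighted_vote`) and the toy labels are assumptions made for illustration only. It computes the three metrics named above from a binary confusion matrix and uses one of them as a vote weight.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return F1, G-mean and balanced accuracy for binary labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    # Guard every ratio against an empty denominator.
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
        "balanced_accuracy": (recall + specificity) / 2,
    }


def weighted_vote(member_predictions, weights, positive=1, negative=0):
    """Combine one prediction per member using per-member weights."""
    pos = sum(w for p, w in zip(member_predictions, weights) if p == positive)
    neg = sum(weights) - pos
    return positive if pos > neg else negative


# Hypothetical data chunk: 8 majority (0) and 2 minority (1) samples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_negative = [0] * 10                      # ignores the minority class
sensitive = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]      # finds both minority samples

m_a = imbalance_metrics(y_true, always_negative)
m_b = imbalance_metrics(y_true, sensitive)

# Use G-mean as the member weight (an illustrative choice).
decision = weighted_vote([0, 1], [m_a["g_mean"], m_b["g_mean"]])
```

Both hypothetical members classify 8 of the 10 samples correctly, so plain accuracy cannot tell them apart. The imbalance-aware metrics can: the member that never predicts the minority class gets a G-mean and F1-score of zero, so its vote carries no weight in this sketch.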

Accuracy Weighted Ensemble
WEIGHTING METHOD
Experimental Evaluation
METHOD
Findings
Conclusions