An ensemble-based semi-supervised learning approach for non-stationary imbalanced data streams with label scarcity
- Research Article
69
- 10.1016/j.knosys.2020.105694
- Feb 27, 2020
- Knowledge-Based Systems
Incremental learning imbalanced data streams with concept drift: The dynamic updated ensemble algorithm
- Book Chapter
6
- 10.1007/978-3-030-43887-6_30
- Jan 1, 2020
Learning from non-stationary imbalanced data streams is a serious challenge to the machine learning community. A significant number of works address the classification of non-stationary data streams, but most of them do not take into account that real-life data streams may exhibit a high and changing class imbalance ratio, which may complicate the classification task. This work attempts to connect two important, yet rarely combined, research trends in data analysis, i.e., non-stationary data stream classification and imbalanced data classification. We propose a novel framework for training base classifiers and preparing the dynamic selection dataset (DSEL) to integrate data preprocessing and dynamic ensemble selection (DES) methods for imbalanced data stream classification. The proposed approach has been evaluated on the basis of computer experiments carried out on 72 artificially generated data streams with various imbalance ratios, levels of label noise and types of concept drift. In addition, we consider six variations of preprocessing methods and four DES methods. Experimental results showed that dynamic ensemble selection, even without the use of any data preprocessing, can outperform a naive combination of the whole pool generated with the use of preprocessing methods. Combining DES with preprocessing further improves the obtained results.
- Conference Article
19
- 10.1109/ijcnn48605.2020.9207118
- Jul 1, 2020
Learning from imbalanced data and data stream mining are among the most popular areas in contemporary machine learning. There is a strong interplay between these domains, as data streams are frequently characterized by skewed distributions. However, most existing works focus on binary problems, omitting the significantly more challenging multi-class imbalanced data. In this paper, we propose a novel framework for learning from multi-class imbalanced data streams that simultaneously tackles three major problems in this area: (i) changing imbalance ratios among multiple classes; (ii) concept drift; and (iii) limited access to ground truth. We use active learning combined with streaming-based oversampling that uses both information about current class ratios and classifier errors on each class to create new instances in a meaningful way. The experimental study shows that our single-classifier framework is capable of outperforming state-of-the-art ensembles dedicated to multi-class imbalanced data streams in both fully supervised and sparsely labeled learning scenarios.
- Research Article
29
- 10.1007/s00521-012-1071-6
- Jul 18, 2012
- Neural Computing and Applications
Classifying non-stationary and imbalanced data streams encompasses two important challenges, namely concept drift and class imbalance. Concept drift is a change in the underlying function being learnt, and class imbalance is a vast difference between the numbers of instances in different classes of data. Class imbalance is an obstacle to the efficiency of most classifiers. Previous methods for classifying non-stationary and imbalanced data streams mainly focus on batch solutions, in which the classification model is trained using a chunk of data. Here, we propose two online classifiers. The classifiers are one-layer neural networks (NNs). In the proposed classifiers, class imbalance is handled with two separate cost-sensitive strategies. The first one incorporates a fixed misclassification cost matrix and the second one an adaptive misclassification cost matrix. The proposed classifiers are evaluated on 3 synthetic and 8 real-world datasets. The results show statistically significant improvements in imbalanced data metrics.
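As a rough illustration of the adaptive cost-sensitive idea in the abstract above, the following sketch trains a single logistic unit online and scales each gradient step by a per-class cost derived from running class counts. It is not the method from the paper: the class name `CostSensitiveOnlineLogistic`, the logistic loss, the learning rate, and the inverse-frequency cost rule are all assumptions chosen for brevity.

```python
import math


class CostSensitiveOnlineLogistic:
    """Hypothetical one-layer online classifier with class-dependent costs."""

    def __init__(self, n_features, lr=0.1, adaptive=True, fixed_costs=(1.0, 1.0)):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr
        self.adaptive = adaptive
        self.costs = list(fixed_costs)  # used only when adaptive=False
        self.counts = [1, 1]            # smoothed running class counts

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(min(z, 30.0), -30.0)    # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def learn_one(self, x, y):
        """Update on a single labeled instance (y is 0 or 1)."""
        self.counts[y] += 1
        if self.adaptive:
            # Assumed rule: cost is inversely proportional to the class
            # frequency seen so far, so the rarer class weighs more.
            cost = sum(self.counts) / (2.0 * self.counts[y])
        else:
            cost = self.costs[y]
        grad = cost * (self.predict_proba(x) - y)  # cost-weighted log-loss gradient
        self.w = [wi - self.lr * grad * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * grad
```

Because the cost is recomputed from the running counts at every step, it follows changes in the imbalance ratio over time, whereas the fixed variant keeps the costs passed in `fixed_costs`.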
- Book Chapter
9
- 10.1007/978-3-030-19738-4_36
- May 8, 2019
The classification of data streams is a frequently considered problem. Data arriving over time tends to change its characteristics, and we usually also encounter difficulties in the data distribution, such as an unequal number of learning examples from the considered classes. The combination of these two phenomena is an additional challenge. In this article, we propose MSRS (Multi Sampling Random Subspace Ensemble), a novel chunk-based ensemble method for imbalanced non-stationary data stream classification. The proposed algorithm employs the random subspace approach and balances data using various sampling methods to ensure an appropriate diversity of the classifier ensemble. MSRS has been evaluated on the basis of computer experiments carried out on a diverse pool of non-stationary imbalanced data streams.
- Research Article
65
- 10.1016/j.neucom.2013.05.003
- Jun 6, 2013
- Neurocomputing
Ensemble of online neural networks for non-stationary and imbalanced data streams
- Research Article
54
- 10.1016/j.knosys.2018.09.032
- Oct 3, 2018
- Knowledge-Based Systems
Selection-based resampling ensemble algorithm for nonstationary imbalanced stream data learning
- Conference Article
7
- 10.1109/icetet-sip-2254415.2022.9791638
- Apr 29, 2022
Data stream classification is a complex task in the real world due to its varying characteristics. The most common challenges are concept drift and class imbalance. Concept drift refers to shifts in the underlying function generating the data. The biggest obstacle to achieving an effective classifier is class imbalance. In general, batch and ensemble solutions are mainly used to train the classifier with chunks of imbalanced data streams. In this paper, we propose a framework called Self Organizing Auto-Encoder based 1D-CNN for Weather data forecasting. The objective of this framework is to classify imbalanced non-stationary data streams. The proposed model's concept drift handling strength is assessed using the Population Stability Index (PSI). The Weather dataset is used for the evaluation of the proposed model. The results show an improvement in prediction accuracy, and the classifier is stable with a good PSI.
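For readers unfamiliar with the Population Stability Index mentioned above, here is a minimal sketch of its usual definition, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the proportions of a reference sample and a recent sample falling into bin i. The function name, the equal-width binning over the reference range, and the `eps` floor for empty bins are assumptions; the abstract does not say how the authors computed PSI.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a recent sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        total = len(values)
        # Floor each proportion at eps so the logarithm is always defined.
        return [max(c / total, eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

A commonly quoted rule of thumb treats values below about 0.1 as stable and values above about 0.25 as a substantial shift, though these thresholds are conventions rather than part of the definition.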
- Addendum
26
- 10.1007/s12652-020-01934-y
- Apr 11, 2020
- Journal of Ambient Intelligence and Humanized Computing
In many information system applications, the environment is dynamic and a tremendous amount of streaming data is generated. This scenario places additional computational demands on the algorithm, which must process incoming instances incrementally using restricted memory and time compared to static data mining. Moreover, when the streams of data are collected from different sources, they may exhibit concept drift, which means variation in the distribution of the data, and they can have a high degree of class imbalance. The problem of class imbalance occurs when far fewer examples represent one class than the other. Concept drift and imbalanced streaming data are commonly found in real-world applications such as fraud detection, intrusion detection, decision support systems and disease prediction. In this paper, different concept drift detectors and handling approaches are analysed when dealing with imbalanced data. A comparative analysis of concept drift is performed on various data sets, such as the SEA synthetic data stream and real-world datasets. The Massive Online Analysis (MOA) tool is used to carry out the comparative study of different learners in a concept-drifting environment. Performance measures such as Accuracy, Precision, Recall, F1-score and the Kappa statistic have been used to evaluate the various learners on the SEA synthetic data stream and the real-world datasets. Ensemble classifiers and single learners are employed and tested on data samples of the SEA synthetic data stream and the electrical and KDD intrusion data sets. The ensemble classifiers provide better accuracy than the single classifiers, and ensemble-based methods have shown good performance compared to strong single learners when dealing with concept drift and class-imbalanced data.
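The measures listed in the abstract above can all be derived from a binary confusion matrix. The sketch below is a plain-Python illustration, not MOA's implementation; the function name and the convention of returning 0.0 when a ratio is undefined are assumptions. Cohen's kappa is computed as (p_o − p_e) / (1 − p_e), where p_o is the observed accuracy and p_e is the agreement expected by chance.

```python
def binary_stream_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, F1 and Cohen's kappa for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    n = tp + fp + fn + tn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Chance agreement: product of the marginal rates, summed over both classes.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e < 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}
```

On a stream with 95 negatives and 5 positives, a classifier that always predicts the majority class reaches an accuracy of 0.95 while its recall, F1 and kappa are all zero, which is why accuracy alone is a weak measure under class imbalance.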
- Research Article
5
- 10.14257/ijunesst.2015.8.1.30
- Jan 31, 2015
- International Journal of u- and e-Service, Science and Technology
In order to lower the classification cost and improve the performance of the classifier, this paper proposes a dynamic cost-sensitive ensemble classification approach based on the extreme learning machine for imbalanced massive data streams (DCECIMDS). Firstly, this paper gives a method of concept drift detection that extracts the attributive characters of imbalanced massive data streams. If the change in the attributive characters exceeds a threshold value, a concept drift has occurred. Secondly, we give a cost-sensitive extreme learning machine algorithm, in which the optimal cost function is defined by the dynamic cost matrix. We build the cost-sensitive classifier model for imbalanced massive data streams under MapReduce, and the data streams are processed in parallel. Finally, the weighted cost-sensitive ensemble classifier is constructed, and the dynamic cost-sensitive ensemble classification based on the extreme learning machine is given. The experiments demonstrate that the proposed ensemble classifier under the MapReduce framework can reduce the average misclassification cost and can make the classification results more reliable. DCECIMDS performs well compared to other classification algorithms for imbalanced data streams and can effectively deal with concept drift.
- Research Article
5
- 10.1080/09720529.2015.1013709
- Mar 4, 2015
- Journal of Discrete Mathematical Sciences and Cryptography
To address the limitations of incremental learning for imbalanced massive data streams, this paper proposes a cost-sensitive incremental classification approach under the MapReduce framework for imbalanced massive data streams (CILCIDS). Firstly, this paper gives cost-sensitive concept drift detection for massive data streams under the MapReduce framework by counting the recession numbers of the pure and tolerance clusters. Secondly, we give a cost-sensitive SVM algorithm based on incremental learning. The new incremental samples can be divided into two parts, and only the samples that violate the KKT conditions are used for the incremental learning. Finally, the imbalanced massive data streams are divided under the MapReduce framework and are processed in parallel. The cost-sensitive incremental learning classification based on a cloud computing platform is developed, and the weighted cost-sensitive ensemble classifier is constructed. The experiments show that the proposed incremental learning algorithm under the MapReduce framework is feasible and correct. CILCIDS performs well compared to other classification algorithms for imbalanced data streams, and can deal effectively with imbalanced data streams with concept drift.
- Book Chapter
5
- 10.1007/978-981-13-5802-9_54
- Jan 1, 2019
In recent years, data streams have been considered one of the primary sources of big data, and they have grown very rapidly over the last decades. The data stream environment has many features that distinguish it from batch learning, as the data arrives on the fly at high speed. Data stream mining has attracted research focus due to its presence in many real-time applications such as telecommunication, networking, and banking. One of the most important challenges in data streams is that the distribution of the data changes continuously, which leads to the phenomenon called “concept drift.” Another issue for streaming data is dealing with imbalanced classes in the dataset. Many classification algorithms have been designed to cope with concept drift; however, many of them deal with drift in balanced data. In this paper, we propose a model called “CD2A: Concept Drift Detection Approach Toward Imbalanced Data Stream,” which aims to handle imbalanced data, detect concept drift, and behave equally well with different types of drift. The algorithm was evaluated on real and synthetic datasets and compared with the leading-edge methods AWE, SMOTE, SERA, and OOB. Our method achieves significantly better average prediction accuracy than the other compared methods.
- Conference Article
5
- 10.1109/inventive.2016.7824874
- Aug 1, 2016
Learning in data streams has practical significance in today's knowledge-intensive era. Unlike static data mining, data stream mining requires handling critical issues related to unbounded memory, the one-scan nature of the data, high arrival rates and few labels. In real nonstationary environments, enormous amounts of data arrive at very high speed and with scarce labels. Manual labeling of such data is impractical considering the expertise, time and cost it requires. Consequently, learning in nonstationary data streams with label scarcity is considered a challenging task in the field of data stream mining. The present overview describes various semi-supervised learning techniques for classifying data streams with limited labeled data.
- Conference Article
2
- 10.1109/bigdata.2018.8622108
- Dec 1, 2018
Learning patterns from evolving data streams is challenging due to the characteristics of such streams: they are continuous, unbounded, high-speed data of a non-stationary nature, which must be processed on the fly using minimal computational resources. An additional challenge is imposed by the imbalanced data streams found in many real-world applications; this difficulty becomes more prominent in multi-class learning tasks. This paper investigates the multi-class imbalance problem in non-stationary streams and develops a method to exploit real-time stream data and capture the dynamics of patterns from heterogeneous streams. In particular, we seek to extend concept drift adaptation techniques to imbalanced-class scenarios, and accordingly, we use an adaptive learner to classify multiple streams over a sequence of tilted time windows. We include examples of the falsely classified instances in the training set, then we propose using a dynamic support threshold to discover the frequent patterns in these streams. We conduct an experiment on the car parking lot environment of a typical university with three simulated streams from sensors, smart pay stations and a mobile application. The results indicate the efficiency of applying adaptive learner approaches and modifying the training set to cope with concept drift in multi-class imbalance scenarios. They also show the merit of using a dynamic threshold to detect rare patterns in evolving streams.
- Research Article
- 10.4172/2324-9307.1000148
- Jan 1, 2016
- Computer Engineering & Information Technology
An Effective Framework for Imbalanced Data Stream Classification
Classifying data streams with skewed distribution finds many applications in realistic environments; however, only a few methods address this joint problem of data stream classification and imbalanced data learning. In this paper, we propose a novel importance sampling driven, dynamic feature group weighting framework (DFGW-IS) to tackle this problem. Our approach addresses the intrinsic characteristics of concept-drifting, imbalanced streaming data. Specifically, the ever-evolving concept is handled by an ensemble trained on a set of feature groups with each sub-classifier (i.e., a single classifier or an ensemble) being weighted by its discriminative power and stable level. The uneven class distribution, on the other hand, is battled by the sub-classifier built in a specific feature group with the underlying distribution rebalanced by the importance sampling technique. We provide the theoretical analysis on the generalization error bound of the proposed algorithm. Extensive experiments on multiple skewed data streams demonstrate that the proposed algorithm not only outperforms the competing methods on standard evaluation metrics, but also adapts well in different learning scenarios.