Imbalanced Data Distribution Research Articles

Decision support systems for surveillance increasingly rely on face recognition (FR) to detect target individuals of interest captured with video cameras. FR is a challenging problem in video surveillance due to variations in capture conditions, to camera interoperability, and to the limited representativeness of target facial models used for matching. Although adaptive classifier ensembles have been applied for robust face matching, it is often assumed that the proportions of faces captured for target and non-target individuals are balanced, known a priori, and do not change over time. Recently, some techniques have been proposed to adapt the fusion function of an ensemble according to the class imbalance of the input data stream. For instance, Skew-Sensitive Boolean Combination (SSBC) is an active approach that estimates target vs. non-target proportions periodically during operations using the Hellinger distance, and adapts its ensemble fusion function to the operational class imbalance. Beyond the challenges of estimating class imbalance, such techniques commonly generate diverse pools of classifiers by selecting balanced training data, limiting the potential diversity produced using the abundant non-target data. In this paper, adaptive skew-sensitive ensembles are proposed to combine classifiers trained by selecting data with varying levels of imbalance and complexity, to sustain a high level of performance for video-to-video FR. Faces captured for each person in the scene are tracked and regrouped into trajectories. During enrollment, captures in a reference trajectory are combined with selected non-target captures to generate a pool of 2-class classifiers using data with various levels of imbalance and complexity. During operations, the level of imbalance is periodically estimated from the input trajectories using the HDx quantification method and pre-computed histogram representations of imbalanced data distributions.
This approach allows one to adapt pre-computed histograms and ensemble fusion functions based on the imbalance and complexity of operational data. Finally, the ensemble scores are accumulated over trajectories for robust spatio-temporal recognition. Results on synthetic data show that adapting the fusion function of ensembles trained with different complexities and levels of imbalance can significantly improve performance. Results on the Face in Action video data show that the proposed method can outperform reference techniques (including SSBC and meta-classification) in imbalanced video surveillance environments. Transaction-based analysis shows that performance is consistently higher across operational imbalances. Individual-specific analysis indicates that goat- and lamb-like individuals can benefit the most from adaptation to the operational imbalance. Finally, trajectory-based analysis shows that a video-to-video FR system based on the proposed approach can maintain, and even improve, overall system discrimination.
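The periodic imbalance estimate described in this abstract can be illustrated with a minimal sketch of Hellinger-distance quantification in the spirit of HDx. It is not the authors' implementation: the function names, the bin count, and the grid search over candidate proportions are assumptions made for illustration. For each feature, the sketch builds normalized histograms of target and non-target reference data, mixes them with a candidate target proportion, and keeps the proportion whose mixture is closest, on average over features, to the histogram of the operational data.

```python
import math


def histogram(values, lo, hi, n_bins):
    """Normalized equal-width histogram of `values` over [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))


def column(rows, j):
    """Values of feature j across a list of feature vectors."""
    return [row[j] for row in rows]


def estimate_target_proportion(pos, neg, operational, n_bins=10, n_steps=101):
    """Estimate the proportion of target samples in `operational`.

    pos, neg, operational: lists of feature vectors (lists of floats).
    n_bins and n_steps are illustrative defaults, not values from the paper.
    """
    n_features = len(pos[0])
    hists = []
    for j in range(n_features):
        all_vals = column(pos, j) + column(neg, j) + column(operational, j)
        lo, hi = min(all_vals), max(all_vals)
        hists.append(tuple(histogram(column(rows, j), lo, hi, n_bins)
                           for rows in (pos, neg, operational)))

    best_alpha, best_dist = 0.0, float("inf")
    for step in range(n_steps):
        alpha = step / (n_steps - 1)  # candidate target proportion
        dist = 0.0
        for h_pos, h_neg, h_op in hists:
            mixture = [alpha * p + (1 - alpha) * n
                       for p, n in zip(h_pos, h_neg)]
            dist += hellinger(mixture, h_op)
        dist /= n_features  # average distance over features
        if dist < best_dist:
            best_alpha, best_dist = alpha, dist
    return best_alpha
```

In a system like the one described, the reference histograms would presumably be pre-computed once at enrollment, so that only the operational histogram and the grid search are evaluated at each estimation period.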

A customer response prediction model makes it possible to effectively select the target customers who will receive a marketing promotion, thereby maximizing the promotion's effect. In today's big data environment, customer response prediction models are built by applying data mining techniques, and this study presents a customer response prediction model based on case-based reasoning (CBR). CBR-based prediction models are generally known to perform worse than other artificial intelligence techniques, but predictive performance can be improved by applying different weights to the input variables according to their importance. In this study, input variable weights were computed and applied according to each variable's importance in influencing whether a customer responds to a promotion, and the performance was compared with that of a prediction model using equal weights. The customer response prediction model was developed using bath soap sales data, and the coefficients of a logit model were used to compute the weights according to the importance of the input variables. The empirical analysis showed that the prediction model with importance-based weights achieved higher predictive performance than the model with equal weights. In addition, in real-world classification problems such as customer response prediction, most data are imbalanced, with a marked difference in the number of instances belonging to the two classes. This data imbalance degrades the performance of machine learning algorithms, so we verified whether the Weighted CBR proposed in this study can also be applied stably in imbalanced settings. Comparing predictive performance over 100 repetitions in an imbalanced setting, in which 100 records were randomly drawn from the full data set, the proposed Weighted CBR showed consistently superior performance even under imbalance.
Response modeling is a well-known research topic for those seeking better performance in predicting customers' responses to marketing promotions. A customer response model reduces marketing cost by identifying prospective customers in a very large customer database and predicting the purchase intention of the selected customers, whereas a promotion derived from an undifferentiated marketing strategy incurs unnecessary cost. In addition, the big data environment has accelerated the development of response models using data mining techniques such as CBR, neural networks, and support vector machines. CBR is one of the major tools in business because it is known to be simple and robust when applied to response modeling. It remains an attractive technique for business data mining applications even though it has not shown high performance compared with other machine learning techniques.
Thus, many studies have tried to improve CBR and apply it to business data mining with enhanced algorithms or with the support of other techniques such as genetic algorithms, decision trees, and AHP (Analytic Hierarchy Process). Ahn and Kim (2008) used logit, neural networks, and CBR to predict which customers would purchase the items promoted by a marketing department, and optimized the number of neighbors k for k-nearest neighbor with a genetic algorithm to improve the performance of the integrated model. Hong and Park (2009) noted that an integrated approach combining CBR with logit, neural networks, and support vector machines (SVM) predicted customers' responses to marketing promotions better than the individual data mining models (logit, neural networks, and SVM). This paper presents an approach to predicting customers' responses to marketing promotions with case-based reasoning. The proposed model was developed by applying a different weight to each feature. We fitted a logit model to a database containing promotion and purchase data for bath soap, and then used its coefficients to assign the feature weights in CBR. We empirically compared the proposed weighted CBR model with neural networks and a pure CBR model, and found that the weighted CBR model outperformed the pure CBR model. Imbalanced data is a common problem when building data mining models to classify real-world data, as in bankruptcy prediction, intrusion detection, fraud detection, churn management, and response modeling. Data are imbalanced when the number of instances in one class is remarkably small or large compared with the number of instances in the other classes.
A classification model such as a response model has difficulty learning patterns from such data because it tends to ignore the minority class while classifying the majority class correctly. Sampling is one of the most representative approaches to resolving the problems caused by an imbalanced data distribution, and sampling methods can be categorized into undersampling and oversampling. CBR, however, is not sensitive to the data distribution because, unlike machine learning algorithms, it does not learn from the data. In this study, we investigated the robustness of the proposed model while varying the ratio of response customers to nonresponse customers, because in the real world the customers who respond to a promotion are always few relative to those who do not. We ran the proposed model 100 times to validate its robustness under different ratios of response customers to nonresponse customers. Finally, we found that the proposed CBR-based model outperformed the compared models on the imbalanced data sets. Our study is expected to improve the performance of CBR-based response models for promotion programs under the imbalanced data distributions found in the real world.
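The weighting idea in this abstract, logit coefficients used as feature weights for case retrieval, can be sketched as follows. This is a hedged illustration rather than the paper's procedure: the abstract does not say how coefficients are turned into weights, so normalizing their absolute values is an assumption, as are the function names, the gradient-descent fit, the weighted Euclidean distance, and the majority vote over k retrieved cases. The sketch also assumes features on comparable scales, since coefficient magnitudes are otherwise not comparable.

```python
import math


def fit_logit(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent.

    Returns (coefficients, intercept). lr and epochs are illustrative.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob. minus label
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def feature_weights(coefs):
    """Assumed scheme: absolute coefficients normalized to sum to 1."""
    total = sum(abs(c) for c in coefs) or 1.0
    return [abs(c) / total for c in coefs]


def weighted_cbr_predict(query, cases, labels, weights, k=5):
    """Retrieve the k nearest stored cases under a weighted Euclidean
    distance and return the majority label (0 or 1)."""
    ranked = sorted(
        (math.sqrt(sum(w * (q - c) ** 2
                       for w, q, c in zip(weights, query, case))), label)
        for case, label in zip(cases, labels)
    )
    votes = sum(label for _, label in ranked[:k])
    return 1 if 2 * votes > k else 0
```

Passing equal weights to `weighted_cbr_predict` gives the unweighted baseline that the abstract compares against, so the same retrieval function can serve both models.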

Articles published on Imbalanced Data Distribution

Visual Perception-Based Statistical Modeling of Complex Grain Image for Product Quality Monitoring and Supervision on Assembly Production Line.

Restricted Boltzmann machines based oversampling and semi-supervised learning for false positive reduction in breast CAD

Perfopticon: Visual Query Analysis for Distributed Databases

Adaptive skew-sensitive ensembles for face recognition in video surveillance

불균형 데이터 환경에서 변수가중치를 적용한 사례기반추론 기반의 고객반응 예측

An effective Weighted Multi-class Least Squares Twin Support Vector Machine for Imbalanced data classification

Improving Knowledge Based Spam Detection Methods: The Effect of Malicious Related Features in Imbalance Data Distribution

An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets.

An Empirical Study for Software Fault-Proneness Prediction with Ensemble Learning Models on Imbalanced Data Sets

Binary Classification of a Large Collection of Environmental Chemicals from Estrogen Receptor Assays by Quantitative Structure–Activity Relationship and Machine Learning Methods

Image-level and group-level models for Drosophila gene expression pattern annotation

Skew-sensitive boolean combination for adaptive ensembles – An application to face recognition in video surveillance

A Cluster-Based Extra-Feature-Adding Approach for Imbalanced Data Distributions

Lattice-based clustering and genetic programming for coordinate transformation in GPS applications

Churn prediction in telecom using Random Forest and PSO based data balancing in combination with various feature selection strategies

Index Filtering Algorithm Based on Minimum Enclosing Circle Partition

MAPLSC: A novel multi-class classifier for medical diagnosis

An unsupervised self-organizing learning with support vector ranking for imbalanced datasets

Cluster-based under-sampling approaches for imbalanced data distributions

A study in machine learning from imbalanced data for sentence boundary detection in speech
