Feature selection for high-dimensional imbalanced class datasets using Harmony Search and Kullback–Leibler divergence
- Research Article
31
- 10.3390/genes11070717
- Jun 27, 2020
- Genes
Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes more demanding with limited samples but a massive number of features (high dimensionality). High dimensional and imbalanced data sets have posed severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers have investigated either imbalanced class or high dimensional data sets and come up with various methods. Nonetheless, few approaches reported in the literature have addressed the intersection of the high dimensional and imbalanced class problems, due to their complicated interactions. Lately, feature selection has become a well-known technique used to overcome this problem by selecting discriminative features that represent the minority and majority classes. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper optimisation algorithm (BGOA) is used to frame feature selection as an optimisation problem and select the best (near-optimal) combination of features from the majority and minority classes. The obtained results, supported by proper statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high dimensional and imbalanced data sets in terms of the G-mean and Area Under the Curve (AUC) performance metrics.
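Wrapper approaches such as BGOA score each candidate binary feature mask by the classification performance it yields, and the G-mean metric cited above can serve as that fitness signal. A minimal sketch of the G-mean computation (generic, not the rCBR-BGOA implementation):

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels.

    In a wrapper feature-selection loop, a classifier would be trained
    on the masked features X[:, mask] and its held-out predictions fed
    to this function as the fitness of the mask.
    """
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sens = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    spec = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    return float(np.sqrt(sens * spec))
```

Unlike plain accuracy, the G-mean collapses to zero when either class is entirely misclassified, which is why it is favored for imbalanced data.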
- Conference Article
10
- 10.1109/icassp.2008.4518669
- Mar 1, 2008
Kullback Leibler (KL) divergence is widely used as a measure of dissimilarity between two probability distributions; however, the required integral is not tractable for Gaussian mixture models (GMMs), and naive Monte-Carlo sampling methods can be expensive. Our work aims to improve the estimation of KL divergence for GMMs by sampling methods. We show how to accelerate Monte-Carlo sampling using variational approximations of the KL divergence. To this end we employ two different methodologies, control variates and importance sampling. With control variates we use sampling to estimate the difference between the variational approximation and the unknown KL divergence. With importance sampling, we estimate the KL divergence directly, using a sampling distribution derived from the variational approximation. We show that with these techniques we can achieve improvements in accuracy equivalent to using a factor of 30 times more samples.
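The plain Monte-Carlo baseline that this paper accelerates draws samples from f and averages the log-density ratio. A rough sketch for 1-D mixtures, assuming NumPy/SciPy (not the authors' code):

```python
import numpy as np
from scipy.stats import norm

def gmm_logpdf(x, weights, means, stds):
    """Log-density of a 1-D Gaussian mixture at points x."""
    comp = np.stack([w * norm.pdf(x, m, s)
                     for w, m, s in zip(weights, means, stds)])
    return np.log(comp.sum(axis=0))

def mc_kl(f, g, n=10000, rng=None):
    """Plain Monte-Carlo estimate of KL(f || g) for 1-D GMMs.

    f, g: dicts with 'weights', 'means', 'stds'.
    Samples x_i ~ f and averages log f(x_i) - log g(x_i).
    """
    rng = np.random.default_rng(rng)
    # Sample component indices, then draw from the chosen Gaussians.
    idx = rng.choice(len(f["weights"]), size=n, p=f["weights"])
    x = rng.normal(np.asarray(f["means"])[idx], np.asarray(f["stds"])[idx])
    return float(np.mean(gmm_logpdf(x, **f) - gmm_logpdf(x, **g)))
```

The estimator's variance shrinks only as 1/n, which is exactly the cost that control variates and importance sampling are used to reduce.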
- Conference Article
1
- 10.23919/ecc.2013.6669156
- Jul 1, 2013
This article proposes to monitor industrial process faults using Kullback Leibler (KL) divergence. The main idea is to measure the difference between the distributions of normal and faulty data. Sensitivity analysis on the KL divergence under Gaussian distribution assumption is performed, which shows that the sensitivity of KL divergence increases with the number of samples. For non-Gaussian data, a recently proposed kernel method for density ratio estimation is used to estimate the KL divergence. The density ratio estimation method does not involve direct estimation of probability density functions, hence is fast and efficient. For monitoring of non-Gaussian data, the confidence limits are obtained through a window based strategy. Application studies involving a simulation example and an industrial melter process show that the performance of the proposed monitoring strategy is better than the principal component analysis (PCA) based statistical local approach.
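The Gaussian-case sensitivity analysis rests on the closed-form KL divergence between two univariate Gaussians, which can be computed directly:

```python
import math

def kl_gaussian(mu0, sigma0, mu1, sigma1):
    """Closed-form KL(N(mu0, sigma0^2) || N(mu1, sigma1^2)):
    log(s1/s0) + (s0^2 + (mu0 - mu1)^2) / (2 s1^2) - 1/2.
    """
    return (math.log(sigma1 / sigma0)
            + (sigma0**2 + (mu0 - mu1)**2) / (2 * sigma1**2)
            - 0.5)
```

For non-Gaussian data no such closed form exists, which is why the article falls back on kernel density-ratio estimation.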
- Conference Article
987
- 10.1109/icassp.2007.366913
- Apr 1, 2007
The Kullback Leibler (KL) divergence is a widely used tool in statistics and pattern recognition. The KL divergence between two Gaussian mixture models (GMMs) is frequently needed in the fields of speech and image recognition. Unfortunately the KL divergence between two GMMs is not analytically tractable, nor does any efficient computational algorithm exist. Some techniques cope with this problem by replacing the KL divergence with other functions that can be computed efficiently. We introduce two new methods, the variational approximation and the variational upper bound, and compare them to existing methods. We discuss seven different techniques in total and weigh the benefits of each one against the others. To conclude we evaluate the performance of each one through numerical experiments.
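The variational approximation discussed here has a simple closed form built from pairwise component KLs. A sketch for 1-D mixtures (the component-list encoding is an assumption):

```python
import math

def kl_gauss(m0, s0, m1, s1):
    # Closed-form KL between two univariate Gaussians.
    return math.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5

def kl_variational(f, g):
    """Variational approximation to KL(f || g) for 1-D GMMs.

    f, g: lists of (weight, mean, std) components.
    D_var = sum_a pi_a * log( sum_a' pi_a' exp(-KL(f_a||f_a'))
                              / sum_b  w_b  exp(-KL(f_a||g_b)) )
    """
    total = 0.0
    for pa, ma, sa in f:
        num = sum(pb * math.exp(-kl_gauss(ma, sa, mb, sb)) for pb, mb, sb in f)
        den = sum(wb * math.exp(-kl_gauss(ma, sa, mb, sb)) for wb, mb, sb in g)
        total += pa * math.log(num / den)
    return total
```

Note that for single-component mixtures the approximation reduces to the exact Gaussian KL, and it is exactly zero when the two mixtures coincide.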
- Research Article
30
- 10.1609/aaai.v33i01.33015066
- Jul 17, 2019
- Proceedings of the AAAI Conference on Artificial Intelligence
The variational autoencoder (VAE) is a powerful generative model that can estimate the probability of a data point by using latent variables. In the VAE, the posterior of the latent variable given the data point is regularized by the prior of the latent variable using Kullback Leibler (KL) divergence. Although the standard Gaussian distribution is usually used for the prior, this simple prior incurs over-regularization. As a sophisticated prior, the aggregated posterior has been introduced, which is the expectation of the posterior over the data distribution. This prior is optimal for the VAE in terms of maximizing the training objective function. However, KL divergence with the aggregated posterior cannot be calculated in a closed form, which prevents us from using this optimal prior. With the proposed method, we introduce the density ratio trick to estimate this KL divergence without modeling the aggregated posterior explicitly. Since the density ratio trick does not work well in high dimensions, we rewrite this KL divergence that contains the high-dimensional density ratio into the sum of the analytically calculable term and the low-dimensional density ratio term, to which the density ratio trick is applied. Experiments on various datasets show that the VAE with this implicit optimal prior achieves high density estimation performance.
- Research Article
2
- 10.1007/s11222-024-10480-y
- Aug 13, 2024
- Statistics and Computing
The Kullback–Leibler (KL) divergence is frequently used in data science. For discrete distributions on large state spaces, approximations of probability vectors may result in a few small negative entries, rendering the KL divergence undefined. We address this problem by introducing a parameterized family of substitute divergence measures, the shifted KL (sKL) divergence measures. Our approach is generic and does not increase the computational overhead. We show that the sKL divergence shares important theoretical properties with the KL divergence and discuss how its shift parameters should be chosen. If Gaussian noise is added to a probability vector, we prove that the average sKL divergence converges to the KL divergence for small enough noise. We also show that our method solves the problem of negative entries in an application from computational oncology, the optimization of Mutual Hazard Networks for cancer progression using tensor-train approximations.
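The abstract does not spell out the sKL definition, so the following is only an illustrative shift-and-renormalize substitute that shows the failure mode (small negative entries make the log undefined) and the general repair idea; the paper's sKL family is defined more carefully:

```python
import math

def kl(p, q):
    """Standard discrete KL divergence.

    math.log raises ValueError for non-positive arguments, so a small
    negative entry in p (or q) makes the KL divergence undefined.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def skl(p, q, s):
    """Illustrative shift-based substitute: add a per-entry shift
    s[i] > 0, renormalize, then apply KL.  This is only a sketch of
    the idea, not the paper's exact sKL definition.
    """
    ps = [pi + si for pi, si in zip(p, s)]
    qs = [qi + si for qi, si in zip(q, s)]
    zp, zq = sum(ps), sum(qs)
    return sum((pi / zp) * math.log((pi / zp) / (qi / zq))
               for pi, qi in zip(ps, qs))
```

With shifts large enough to make every entry positive, the substitute stays finite on vectors that break the plain KL divergence.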
- Research Article
61
- 10.1016/j.datak.2012.08.001
- Aug 17, 2012
- Data & Knowledge Engineering
DBFS: An effective Density Based Feature Selection scheme for small sample size and high dimensional imbalanced data sets
- Conference Article
2
- 10.1109/iai50351.2020.9262170
- Oct 23, 2020
This paper focuses on abnormal condition detection using Kullback Leibler (KL) divergence with a relative importance function. The process has multiple working conditions, such as normal and abnormal conditions. The KL method has been shown to be more sensitive to initial faults than Hotelling's T-squared statistic. Relative importance function estimation for condition detection is demonstrated; the relative importance function is always smoother than the corresponding ordinary density ratios. In the cement raw meal calcination process, we sampled several important variables, such as calciner temperature, preheater C1 outlet temperature, raw meal flow, and C1 and C5 cone pressure. In the actual process, the product quality index is low and the preheater C5 feeding tube is easily blocked. To detect abnormal conditions, an abnormal condition detection method based on Kullback Leibler divergence with a relative importance function is proposed. Practical application results show that the proposed model can detect abnormal conditions from current operating data.
- Conference Article
10
- 10.1109/bibe.2014.61
- Nov 1, 2014
One of the more prevalent problems when working with bioinformatics datasets is class imbalance, when there are more instances in one class compared to the other class(es). This problem is made worse because frequently the class of interest is also the minority class. A possible solution is data sampling, a powerful tool for combating class imbalance by adding or removing instances to make the dataset more balanced. In addition to the choice of whether to include data sampling, one of the most important decisions when applying it is what the final class ratio should be. Commonly, the final class ratio after sampling is 50:50; however, it is an open question whether other ratios are more appropriate for certain imbalanced datasets (all datasets in this paper have 25.16% minority instances or less) where a 50:50 ratio requires extreme modification to the dataset. In this work we compare six different data sampling approaches (feature selection with the pairwise combinations of three sampling techniques and two final class ratios) with feature selection without sampling, with the goal of determining if the inclusion of sampling is beneficial and, if so, what the final class ratio should be. In order to test the six sampling approaches and feature selection alone thoroughly, we utilize seven imbalanced and high-dimensional datasets, three feature selection techniques, and six classifiers. Our results show that for a majority of scenarios, random undersampling along with either a 35:65 or 50:50 ratio is the best approach. Statistical analysis shows that there is a significant difference between the sampling approaches. Despite this, we still recommend random undersampling with a 35:65 final class ratio, because random undersampling and 35:65 are, respectively, the most frequent top-performing sampling technique and class ratio.
Additionally, 35:65 has fewer negative impacts than 50:50 (less data loss or overfitting, which makes it a better choice if all other factors are equal), and random undersampling is more computationally efficient than any other form of sampling, including no sampling (both by not requiring any internal calculations and by producing a reduced, easier-to-work-with dataset). To our knowledge, this is the most comprehensive work focusing on the choice of the inclusion and implementation of data sampling with different final class ratios on bioinformatics datasets which exhibit such large levels of class imbalance.
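Random undersampling of the majority class to a chosen minority fraction such as 35:65 can be sketched as follows (a generic implementation, not the paper's code):

```python
import numpy as np

def undersample(X, y, minority_frac=0.35, rng=None):
    """Randomly undersample the majority class so that the minority
    class makes up `minority_frac` of the result (e.g. 0.35 -> 35:65).
    Assumes binary labels; this is a generic sketch.
    """
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    n_min = counts.min()
    # Number of majority instances to keep for the target ratio.
    n_maj_keep = int(round(n_min * (1 - minority_frac) / minority_frac))
    maj_idx = np.flatnonzero(y == majority)
    keep = rng.choice(maj_idx, size=min(n_maj_keep, maj_idx.size),
                      replace=False)
    idx = np.concatenate([np.flatnonzero(y == minority), keep])
    return X[idx], y[idx]
```

Because it only discards rows, this approach needs no synthetic-instance computation and yields a smaller dataset, matching the efficiency argument made above.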
- Book Chapter
20
- 10.1007/978-3-642-23577-1_24
- Jan 1, 2011
The INEX Question Answering track (QA@INEX) aims to evaluate a complex question-answering task using Wikipedia. The set of questions is composed of factoid, precise questions that expect short answers, as well as more complex questions that can be answered by several sentences or by an aggregation of texts from different documents. Long answers have been evaluated based on Kullback Leibler (KL) divergence between n-gram distributions, which allowed summarization systems to participate. Most of them generated a readable extract of sentences from documents ranked highly by a state-of-the-art document retrieval engine. Participants also tested several methods of question disambiguation. Evaluation has been carried out on a pool of real questions from OverBlog and Yahoo! Answers. Results tend to show that the baseline restricted-focus IR system minimizes KL divergence but misses readability, while summarization systems tend to use longer, standalone sentences, thus improving readability but increasing KL divergence.
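Scoring a long answer by KL divergence between n-gram distributions can be sketched with smoothed unigram models (the add-alpha smoothing and shared-vocabulary choices here are assumptions, not the track's exact protocol):

```python
import math
from collections import Counter

def unigram_dist(text, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_ngram(reference, candidate):
    """KL(reference || candidate) over smoothed unigram models built
    on the union vocabulary; lower means the candidate's word usage
    is closer to the reference."""
    vocab = set(reference.lower().split()) | set(candidate.lower().split())
    p = unigram_dist(reference, vocab)
    q = unigram_dist(candidate, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
```

Smoothing keeps the divergence finite even when a candidate omits reference words, which matters for short extracts.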
- Conference Article
15
- 10.1109/icassp.2009.4959929
- Apr 1, 2009
Automatic classification of electroencephalography (EEG) signals, for different types of mental activities, is an active area of research and has many applications such as brain computer interface (BCI) and medical diagnoses. We introduce a simple yet effective way to use Kullback-Leibler (KL) divergence in the classification of raw EEG signals. We show that the k-nearest neighbor (k-NN) algorithm with KL divergence as the distance measure, when used with our feature vectors, gives competitive classification accuracy and consistently outperforms the more commonly used Euclidean k-NN. We also develop and demonstrate the use of a KL-based kernel to classify EEG data using support vector machines (SVMs). Our KL-distance based kernel compares favorably to other well-established kernels such as the linear and radial basis function (RBF) kernels. The EEG data used in our classification experiments was recorded while the subject performed 5 different mental activities: math problem solving, letter composing, 3-D block rotation, counting, and resting (baseline). We present classification results for this data set obtained using raw EEG data with no explicit artifact removal in the pre-processing steps.
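A k-NN classifier with a KL-based distance over normalized feature vectors can be sketched as follows (the EEG feature extraction itself is omitted, and the symmetrization of the divergence is an assumption):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL 'distance' between two nonnegative feature
    vectors, normalized to probability distributions (a generic
    sketch; KL itself is asymmetric, so k-NN typically uses a
    symmetrized form)."""
    p = np.asarray(p, float) + eps
    p = p / p.sum()
    q = np.asarray(q, float) + eps
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training vectors closest to x
    under the symmetric KL distance."""
    d = [sym_kl(x, t) for t in train_X]
    nearest = np.argsort(d)[:k]
    vals, counts = np.unique(np.asarray(train_y)[nearest],
                             return_counts=True)
    return vals[np.argmax(counts)]
```

The epsilon term guards against zero entries, for which the log-ratio would be undefined.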
- Research Article
21
- 10.1186/1471-2105-10-s4-s7
- Apr 1, 2009
- BMC Bioinformatics
Background: In many cases biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery. Data analysis without removing outliers could lead to wrong results and provide misleading information.
Results: We propose a new outlier detection method based on Kullback-Leibler (KL) divergence. The original concept of KL divergence was designed as a measure of distance between two distributions. Stemming from that, we extend it to biological sample outlier detection by forming sample sets composed of nearest neighbors. KL divergence is defined between two sample sets with and without the test sample. To handle the non-linearity of sample distribution, original data is mapped into a higher feature space. We address the singularity problem due to small sample size during KL divergence calculation. Kernel functions are applied to avoid direct use of mapping functions. The performance of the proposed method is demonstrated on a synthetic data set, two public microarray data sets, and a mass spectrometry data set for liver cancer study. Comparative studies with the Mahalanobis distance based method and one-class support vector machine (SVM) are reported, showing that the proposed method performs better in finding outliers.
Conclusion: Our idea was derived from the Markov blanket algorithm, a feature selection method based on KL divergence. That is, while the Markov blanket algorithm removes redundant and irrelevant features, our proposed method detects outliers. Compared to other algorithms, our proposed method shows better or comparable performance for small sample and high-dimensional biological data. This indicates that the proposed method can be used to detect outliers in biological data sets.
- Research Article
15
- 10.1016/j.eswa.2021.116028
- Oct 9, 2021
- Expert Systems with Applications
Unstructured borderline self-organizing map: Learning highly imbalanced, high-dimensional datasets for fault detection
- Research Article
27
- 10.1109/tifs.2021.3092050
- Jan 1, 2021
- IEEE Transactions on Information Forensics and Security
In recent years, the threat of profiling attacks using deep learning has emerged. Successful attacks have been demonstrated against various types of cryptographic modules. However, the application of deep learning to side-channel attacks (SCAs) is often not adequately assessed because the labels that are widely used in SCAs, such as the Hamming weight (HW) and Hamming distance (HD), follow an imbalanced distribution. This study analyzes and solves the problems caused by dataset imbalance during training and inference. First, we state the reasons for the negative effect of data imbalance in classification for deep-learning-based SCAs and introduce the Kullback-Leibler (KL) divergence as a metric to measure this effect. Using the KL divergence, we demonstrate through analysis how the recently reported cross-entropy ratio loss function can solve the problem of imbalanced data. We further propose a method to solve dataset imbalance at the inference phase, which utilizes a likelihood function based on the key value instead of the HW/HD. The proposed method can be easily applied in deep-learning-based SCAs because it only needs an extra multiplication of the inverted binomial coefficients and inference results (i.e., the output probabilities) from the conventionally trained model. The proposed solution corresponds to data-augmentation techniques at the training phase, and furthermore, it better estimates the keys because the probability distributions of the training and test data are preserved. We demonstrate the validity of our analysis and the effectiveness of our solution through extensive experiments on two public databases.
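The proposed inference-phase correction divides the model's Hamming-weight class probabilities by the binomial coefficients counting how many 8-bit values share each HW. A sketch for a single intermediate byte (variable names are illustrative):

```python
import numpy as np
from math import comb

def hw_to_key_likelihood(p_hw):
    """Convert a model's 9-way Hamming-weight probabilities for an
    8-bit value into per-value likelihoods by dividing out the number
    of values sharing each HW (the inverted-binomial-coefficient
    correction described in the abstract, sketched generically).
    """
    p_hw = np.asarray(p_hw, float)
    hw = np.array([bin(v).count("1") for v in range(256)])
    # Each value v gets its HW-class probability split evenly over the
    # C(8, hw) values in that class.
    lik = p_hw[hw] / np.array([comb(8, h) for h in hw])
    return lik / lik.sum()
```

A sanity check: if the model outputs exactly the imbalanced prior over HW classes (class size over 256), the corrected likelihood is uniform over all 256 values, i.e. the imbalance has been removed.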
- Research Article
- 10.15408/p46m2f12
- May 31, 2025
- Applied Information System and Management (AISM)
This study presents a novel approach to improving repeat buyer classification on e-commerce platforms by integrating Kullback-Leibler (KL) divergence with logistic regression and focused feature engineering techniques. Repeat buyers are a critical segment for driving long-term revenue and customer retention, yet identifying them accurately poses challenges due to class imbalance and the complexity of consumer behavior. Unlike traditional methods, this research uses KL divergence in a new way to help select important features and evaluate the model, making it more interpretable and more effective at classifying repeat buyers. Using a real-world dataset from Indonesian e-commerce with 1,000 records, divided into 80% for training and 20% for testing, the study uses logistic regression along with techniques like SMOTE for oversampling, class weighting, and regularization to address data imbalance and overfitting. Model performance is assessed using accuracy, precision, recall, F1-score, and KL divergence. Experimental results indicate that the KL-enhanced logistic regression model significantly outperforms the baseline, especially in balancing precision and recall for the minority class of repeat buyers. The unique contribution of this work lies in its synergistic use of KL divergence in both the feature engineering and evaluation phases, offering a robust, interpretable, and data-efficient solution. For e-commerce businesses, the findings translate into improved targeting of high-value customers, better personalization of marketing efforts, and more strategic allocation of resources. This research offers practical guidance for enhancing predictive customer analytics and supports data-driven decision-making in digital commerce environments.
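Scoring a feature by the KL divergence between its class-conditional distributions can be sketched with histograms (an illustrative stand-in only; the study's exact feature-engineering procedure is not specified in the abstract, and the binning and smoothing choices here are assumptions):

```python
import numpy as np

def kl_feature_score(x, y, bins=10, eps=1e-9):
    """Score one feature by the KL divergence between its histogram
    in the positive class and in the negative class.  Larger scores
    mean the feature separates the classes more strongly.
    """
    # Shared bin edges so both class histograms are comparable.
    edges = np.histogram_bin_edges(x, bins=bins)
    p, _ = np.histogram(x[y == 1], bins=edges)
    q, _ = np.histogram(x[y == 0], bins=edges)
    # Smooth and normalize to valid probability vectors.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```

Ranking features by such a score and keeping the top-scoring ones is one simple way a KL-based criterion can drive feature selection for an imbalanced classification task.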