Abstract

Learning from high-dimensional imbalanced data is a challenging research problem in machine learning, because of the curse of dimensionality and the learning bias that results from class imbalance. Existing work generally applies dimension reduction first to reduce the dimensionality of the features, and then handles the class imbalance problem with traditional imbalanced learning techniques. However, dimensionality reduction may discard useful information, and it cannot effectively address hubness, which is an important aspect of the curse of dimensionality. In this paper, we present a hubness-aware cluster-based ensemble algorithm, HUSBoost, for learning from high-dimensional imbalanced data. For hubs induced by high dimensionality, HUSBoost introduces discount factors to slow the excessive growth of their weights, so as to alleviate the negative impact of "bad" hubs on the classification decisions of component classifiers. To address the class imbalance problem, HUSBoost uses a cluster-based majority undersampling method to correct the imbalanced class distribution. Specifically, k-hubs clustering is used to divide the majority samples into multiple clusters, and representative majority samples are then selected from each cluster to form a balanced class distribution. Experimental results on sixteen high-dimensional imbalanced data sets show the effectiveness of HUSBoost.
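
The abstract does not say how hubs are identified. As a rough illustration only, the sketch below uses the k-occurrence count that is common in the hubness literature: the number of times a point appears in the k-nearest-neighbor lists of other points, with a "bad" occurrence counted when the two labels differ. The function name `k_occurrences`, the Euclidean distance, and the toy data are assumptions made for this example, not details of HUSBoost, and the discount factors themselves are not reproduced here.

```python
from math import dist


def k_occurrences(points, labels, k):
    """Count total and "bad" k-occurrences for every point.

    total[j]: how many other points have j among their k nearest neighbors.
    bad[j]:   the subset of those occurrences where the labels differ.
    Hypothetical helper for illustration; Euclidean distance is assumed.
    """
    n = len(points)
    total = [0] * n
    bad = [0] * n
    for i in range(n):
        # Rank every other point by its distance to point i.
        others = [j for j in range(n) if j != i]
        neighbours = sorted(others, key=lambda j: dist(points[i], points[j]))[:k]
        for j in neighbours:
            total[j] += 1
            if labels[j] != labels[i]:
                bad[j] += 1
    return total, bad


if __name__ == "__main__":
    # Toy one-dimensional layout embedded in 2-D; labels are arbitrary.
    pts = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (10.0, 0.0)]
    lbl = [0, 0, 1, 1]
    total, bad = k_occurrences(pts, lbl, k=1)
    print("total k-occurrences:", total)
    print("bad k-occurrences:  ", bad)
```

Points with an unusually large total count are the hubs; those whose count is dominated by label mismatches are the "bad" hubs that the discount factors are meant to restrain.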

