Most existing techniques for handling imbalanced data may be invalid in the presence of missing data, since they assume that the data are complete. To bridge this gap, a novel synthetic minority oversampling technique (SMOTE), namely the Non-negative latent factor analysis-incorporated and Switching triple-weight-SMOTE (NSS), is proposed. The main idea of NSS is four-fold: 1) a Lagrange non-negative matrix factorization (LNMF) method is put forward to impute the missing values with guaranteed non-negativity, consistent with the original distribution owing to its consideration of global feature information; 2) by mapping the imputed, complete data into an empirical feature space (EFS), a more separable dataset is obtained that rigidly preserves the geometrical structure of the original data while efficiently reducing redundant features, thereby enhancing model generalization and computational efficiency; 3) after fuzzy <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$c$</tex-math> </inline-formula> -means (FCM) clustering, the inter-cluster distance, the capacity of each minority cluster, and its sparsity are comprehensively taken into account to develop a triple-weight assignment strategy, which allocates an appropriate number of synthetic samples to each cluster; 4) a switching oversampling strategy is provided to handle clusters with different distributions (i.e., either Gaussian or uniform). Moreover, a posterior check is used to verify the correctness of the synthetic samples.
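The LNMF imputation step described in 1) is not specified in detail here; as a rough illustration of the underlying idea, the following is a minimal sketch of missing-value imputation via masked non-negative matrix factorization with standard multiplicative updates restricted to the observed entries. The function name, rank, and iteration count are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def nmf_impute(X, rank=2, iters=500, eps=1e-9):
    """Impute NaNs in a non-negative matrix via masked NMF.

    Multiplicative updates are restricted to observed entries, so the
    reconstruction (and hence each imputed value) stays non-negative.
    This is a generic sketch, not the paper's LNMF method.
    """
    M = ~np.isnan(X)                 # mask of observed entries
    Xf = np.where(M, X, 0.0)         # zero-fill NaNs for arithmetic
    rng = np.random.default_rng(0)
    n, m = X.shape
    W = rng.random((n, rank)) + eps  # positive initialization
    H = rng.random((rank, m)) + eps
    for _ in range(iters):
        R = W @ H
        W *= ((Xf * M) @ H.T) / (((R * M) @ H.T) + eps)
        R = W @ H
        H *= (W.T @ (Xf * M)) / ((W.T @ (R * M)) + eps)
    Xhat = W @ H                     # non-negative reconstruction
    return np.where(M, X, Xhat)      # keep observed values as-is
```

Because both factors remain non-negative under multiplicative updates, the imputed entries are guaranteed non-negative, mirroring the non-negativity guarantee claimed for LNMF.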
Finally, experiments on a real dataset and <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$12$</tex-math> </inline-formula> public datasets show that the proposed NSS outperforms <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$11$</tex-math> </inline-formula> other state-of-the-art methods. <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">Note to Practitioners</italic> —Data classification is an important task that has been successfully applied in many domains, including but not limited to medicine, finance, and manufacturing. However, a classifier faces two major challenges when handling real-world data: imbalanced classes and missing values. More specifically, model performance is likely to degrade because of the missing information and the bias of classifiers toward the majority class. To surmount this problem, a natural idea is to use the LNMF model to obtain the desired recovery. Then, on the imputed, complete data, the empirical-feature-space-based switching triple-weight-SMOTE is applied to synthesize safe and correct samples (i.e., samples lying solidly in the minority-class region) to achieve balance. This working principle yields the novel NSS strategy, which has two obvious merits: 1) the imputation guarantees similarity to the original dataset; and 2) new synthetic data are generated safely, with adequate consideration of the information and distribution of the dataset. Thus, the proposed NSS can greatly improve the classification accuracy on real-world datasets.
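For readers unfamiliar with the SMOTE family underlying this work, the following is a minimal sketch of classic SMOTE-style interpolation: each synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbors, so it stays inside the minority region. The function name and parameters are illustrative assumptions; NSS's triple-weight allocation and switching strategy are not reproduced here.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style
    interpolation between minority points and their k nearest
    minority neighbors. Generic sketch, not the NSS algorithm."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise distances among minority samples
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-matches
    nbrs = np.argsort(d, axis=1)[:, :k]  # k nearest minority neighbors
    out = []
    for _ in range(n_new):
        i = rng.integers(n)              # pick a minority sample
        j = nbrs[i, rng.integers(k)]     # pick one of its neighbors
        lam = rng.random()               # interpolation coefficient
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)
```

Interpolating only between minority samples keeps synthetic points within the minority region's convex hull, which is the "safe" generation property the abstract refers to; NSS additionally weights how many samples each minority cluster receives and switches the generation rule by cluster distribution.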