Abstract

Generating low-rank approximations of kernel matrices that arise in nonlinear machine learning techniques can significantly alleviate memory and computational burdens. A compelling approach centers on finding a concise set of exemplars, or landmarks, to reduce the number of similarity-measure evaluations from quadratic to linear in the data size. A key challenge, however, is to regulate the tradeoff between the quality of the landmarks and resource consumption. Despite the volume of research in this area, current understanding is limited regarding how landmark selection techniques perform on class-imbalanced data sets, which are becoming increasingly prevalent in many applications. Hence, this paper provides a comprehensive empirical investigation on several real-world imbalanced data sets, including scientific data, evaluating the quality of approximate low-rank decompositions and examining their influence on the accuracy of downstream tasks. Furthermore, we present a new landmark selection technique, Distance-based Importance Sampling and Clustering (DISC), which computes relative importance scores to improve accuracy-efficiency tradeoffs over existing approaches ranging from probabilistic sampling to clustering methods. The proposed landmark selection method follows a coarse-to-fine strategy to capture the intrinsic structure of complex data sets, allowing us to substantially reduce the computational complexity and memory footprint with minimal loss in accuracy.
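As a concrete illustration of this pipeline, the sketch below selects landmarks with a simplified coarse-to-fine procedure (importance sampling of a candidate pool followed by a few k-means refinement iterations) and then forms the standard landmark-based Nyström approximation K ≈ C W⁺ Cᵀ. The distance-to-mean importance score, the oversampling factor, and all hyperparameters are illustrative assumptions; they stand in for, but are not, the exact DISC scoring rule, which is not specified in this excerpt.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel between rows of A and rows of B (Eq. (2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma**2)

def select_landmarks(X, m, oversample=4, iters=10, rng=None):
    """Coarse-to-fine landmark selection (illustrative stand-in for DISC).

    Coarse stage: draw m*oversample candidates with probability proportional
    to a distance-based importance score (here: distance to the data mean, an
    assumed placeholder score).  Fine stage: refine the candidates with a few
    k-means (Lloyd) iterations and return the m centers as landmarks.
    """
    rng = np.random.default_rng(rng)
    scores = np.linalg.norm(X - X.mean(axis=0), axis=1) + 1e-12
    probs = scores / scores.sum()
    cand_idx = rng.choice(len(X), size=min(m * oversample, len(X)),
                          replace=False, p=probs)
    candidates = X[cand_idx]
    centers = candidates[rng.choice(len(candidates), size=m, replace=False)]
    for _ in range(iters):
        # Assign each candidate to its nearest center, then recompute centers.
        d = np.linalg.norm(candidates[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(m):
            members = candidates[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

def nystrom_approximation(X, landmarks, sigma=1.0):
    """Landmark-based (Nystrom) low-rank approximation K ~ C W^+ C^T."""
    C = rbf_kernel(X, landmarks, sigma)          # n x m
    W = rbf_kernel(landmarks, landmarks, sigma)  # m x m
    return C @ np.linalg.pinv(W) @ C.T

# Example usage on synthetic data (placeholder values).
X = np.random.default_rng(0).normal(size=(500, 10))
L = select_landmarks(X, m=20, rng=0)
K_approx = nystrom_approximation(X, L, sigma=np.sqrt(10))
```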

Highlights

  • We consider the problem of producing accurate and efficient low-rank approximations of kernel matrices that appear in a wide range of nonlinear machine learning techniques [1]–[3]

  • Experimental results on imbalanced data: the three real-world imbalanced data sets that we consider are listed in Table 1, which reports the data size n, the numbers of negative and positive samples n− and n+, and the input space dimension p

  • The values of the hyperparameter r for the three data sets are given in Table 1; r is chosen so that the best rank-r approximation of the resulting kernel matrix K, obtained via eigenvalue decomposition, captures most of its spectral energy (see the sketch below)
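
To make this criterion concrete, the following NumPy sketch eigendecomposes a Gaussian kernel matrix built from synthetic data and picks the smallest rank r whose leading eigenvalues capture a target fraction of the spectral energy, measured here as the share of the trace. The 95% target, the bandwidth, and the synthetic data are illustrative assumptions rather than the values reported in Table 1.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))                      # toy data, one sample per row
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-np.maximum(sq, 0.0) / 8.0)             # Gaussian kernel matrix, sigma^2 = 8

# Best rank-r approximation via eigenvalue decomposition.
evals, evecs = np.linalg.eigh(K)
evals, evecs = evals[::-1], evecs[:, ::-1]         # sort eigenpairs in descending order
ratios = np.cumsum(evals) / np.sum(evals)          # captured share of the trace
r = int(np.searchsorted(ratios, 0.95) + 1)         # smallest r capturing 95% energy
K_r = (evecs[:, :r] * evals[:r]) @ evecs[:, :r].T  # best rank-r approximation of K
print(f"r = {r}, captured energy = {ratios[r - 1]:.3f}, "
      f"relative error = {np.linalg.norm(K - K_r) / np.linalg.norm(K):.3e}")
```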



Introduction

We consider the problem of producing accurate and efficient low-rank approximations of kernel matrices that appear in a wide range of nonlinear machine learning techniques [1]–[3]. Well-known examples include support vector machines [8], kernel ridge regression [9], [10], kernel k-means clustering [11], and kernel principal component analysis [12], [13]. A kernel matrix K ∈ Rn×n is a positive semidefinite matrix that contains pairwise similarity measures between the n samples of a given data set X = {x1, …, xn}:

Kij = κ(xi, xj) = ⟨Φ(xi), Φ(xj)⟩, i, j = 1, …, n, (1)

where κ : X × X → R denotes a kernel function that satisfies Mercer's condition [4], [5], and Φ : x ↦ Φ(x) represents the kernel-induced feature map for each data sample x ∈ X. A popular choice for measuring the similarity between pairs of samples is the Gaussian radial basis function kernel, also known as the squared exponential kernel [6], [7]:

κ(xi, xj) = exp(−ρ²(xi, xj)), ρ(xi, xj) := ‖xi − xj‖2/σ, (2)

where σ > 0 is the kernel bandwidth parameter.
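
As a minimal illustration of equations (1)–(2), the snippet below builds the Gaussian kernel matrix with NumPy for a toy data matrix whose rows are the samples; the bandwidth σ and the data are placeholders, and the final checks simply confirm that K is symmetric positive semidefinite, as required by Mercer's condition.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Kernel matrix with Kij = exp(-||xi - xj||_2^2 / sigma^2), per Eqs. (1)-(2)."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / sigma**2)

X = np.random.default_rng(0).normal(size=(5, 3))   # 5 toy samples in R^3
K = gaussian_kernel_matrix(X, sigma=1.5)           # placeholder bandwidth

# K is symmetric positive semidefinite: all eigenvalues are (numerically) nonnegative.
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() >= -1e-10
```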

