Abstract

Dimensionality reduction in machine learning mitigates the undesired properties of high-dimensional spaces to facilitate classification, compression, and visualization of high-dimensional data. During the last decade, researchers proposed many new non-linear techniques for dimensionality reduction based on the assumption that the data lie on or near a complex low-dimensional manifold embedded in the high-dimensional space. These techniques aim to identify and extract the manifold from the high-dimensional space. Isomap is one of the most widely used low-dimensional embedding methods; it combines geodesic distances on a weighted graph with classical scaling (metric multidimensional scaling). Isomap chooses the k-nearest neighbors based on distance only, which causes bridges between folds of the manifold and topological instability. In this paper, we propose a new algorithm for finding the nearest neighbors that reduces the number of short-circuit errors and thus improves topological stability. We assume that any point on the manifold and its nearest neighbors form a vector subspace, and that a vector orthogonal to that subspace is orthogonal to all vectors spanning it. In the proposed algorithm, the point on the manifold and its nearest neighbors (i.e., the number of nearest neighbors selected to match the required subspace dimension) are therefore used to find a basis of the subspace and a vector orthogonal to it, which belongs to the orthogonal complementary subspace. Candidate neighbor points are then added to the set of nearest neighbors based on both their distance and the angle between each candidate point and the vector orthogonal to the subspace. The new algorithm is tested on low-dimensional (3D) synthesized datasets and on high-dimensional real datasets. Its superior performance in choosing the nearest neighbors is confirmed by visually inspecting its ability to find the correct 2D representation of the synthesized datasets at different values of $k$. Moreover, on the high-dimensional real datasets, the new algorithm yields a lower residual variance than the standard Isomap. Finally, we investigated using the new algorithm to find the nearest neighbors for the Locally Linear Embedding (LLE) algorithm; we tested this LLE variant on synthesized and high-dimensional real datasets, and the results were promising.
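To make the neighbor-selection step concrete, here is a minimal NumPy sketch of the idea described above, under our own assumptions: the function name select_neighbors, the tolerance angle_tol_deg, and the use of the d closest candidates to seed the subspace are illustrative choices, not details taken from the paper.

```python
import numpy as np

def select_neighbors(X, i, d, k_max, angle_tol_deg=10.0):
    """Illustrative sketch of angle-based neighbor selection.

    X: (n, D) data matrix; i: index of the query point;
    d: assumed dimension of the local subspace (d < D);
    k_max: maximum number of neighbors to return;
    angle_tol_deg: assumed tolerance around 90 degrees.
    """
    dists = np.linalg.norm(X - X[i], axis=1)
    order = np.argsort(dists)[1:]      # candidates by distance, self excluded

    # The d closest points together with x_i define the local subspace.
    seed = order[:d]
    A = (X[seed] - X[i]).T             # (D, d): spanning vectors as columns
    U, _, _ = np.linalg.svd(A, full_matrices=True)
    normal = U[:, d]                   # unit vector in the orthogonal complement

    # Accept further candidates only if the angle between (x_c - x_i) and
    # the normal is close to 90 degrees, i.e. the candidate lies near the
    # local subspace rather than on a nearby fold of the manifold.
    neighbors = list(seed)
    for c in order[d:]:
        v = X[c] - X[i]
        cos_a = abs(np.dot(v, normal)) / np.linalg.norm(v)
        angle = np.degrees(np.arccos(np.clip(cos_a, 0.0, 1.0)))
        if abs(angle - 90.0) <= angle_tol_deg:
            neighbors.append(c)
        if len(neighbors) == k_max:
            break
    return neighbors
```

On a Swiss-roll-like dataset, such a test rejects points that are Euclidean-close but sit on an adjacent fold, which is exactly the short-circuit case the paper targets.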

Highlights

  • Dimensionality reduction is the process of transforming a high-dimensional data-set into a meaningfully reduced dimensionality

  • The ideal goal of the dimension reduction is discovering the minimum number of parameters needed to account for the observed properties of the data, which is called the intrinsic dimensionality of data [1]

  • The results show a good low-dimensional representation at k = 3 and k = 10, and a perfect representation at k = 5, which in our view is better than the t-Distributed Stochastic Neighbor Embedding (t-SNE) representation


Summary

INTRODUCTION

Dimensionality reduction is the process of transforming a high-dimensional data-set into a meaningfully reduced dimensionality. Isomap estimates the geodesic distances between all pairs of points on the manifold $M$. Isomap's ability to estimate the intrinsic geometry of a data manifold depends on choosing the correct neighbors for each point. Defining the connectivity of each data point via its nearest Euclidean neighbors in the high-dimensional space is vulnerable to short-circuit errors [8]. These errors can occur when folds of the manifold on which the data points lie come too close together, so that the distances between points on different folds become smaller than the true neighborhood distance. The proposed algorithm therefore also checks the angle between each candidate neighbor and the vector orthogonal to the local subspace; this angle should be close to 90 degrees for the candidate to be accepted as a neighbor.
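For context, here is a minimal sketch of the standard Isomap pipeline that such a neighbor rule would plug into, assuming SciPy and scikit-learn are available; the neighbor graph below still uses plain k-nearest neighbors, which is the step the proposed algorithm replaces.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap_embedding(X, n_neighbors=5, n_components=2):
    """Standard Isomap: k-NN graph -> graph geodesics -> classical scaling.
    Assumes the k-NN graph is connected (otherwise distances are infinite)."""
    # 1. Weighted k-nearest-neighbor graph with Euclidean edge weights.
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    # 2. Estimate geodesic distances as shortest paths on the graph.
    D = shortest_path(G, method='D', directed=False)
    # 3. Classical scaling (metric MDS): double-center the squared
    #    distances and embed with the top eigenvectors.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:n_components]
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))
```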

THEORETICAL BACKGROUND AND METHODS
EXPERIMENTAL WORK
CONCLUSION
