An Investigation on the Impact of Non-Uniform Random Sampling Techniques for t-SNE

Matej Buljan,Rafael M Martins,Jonas Nordqvist

doi:10.1109/sweds51247.2020.9275586

Abstract

t-Distributed Statistical Neighbor Embedding (t-SNE) is a dimensionality reduction technique that has gained much popularity for its increased capability of creating low-dimensional embeddings that preserve well-separated clusters from high-dimensional spaces. Despite its strengths, the running times for t-SNE are usually high and do not scale well with the size of datasets, which limits its applicability to scenarios that involve, for example, Big Data and interactive visualization. Downsampling the dataset into more manageable sizes is a possible straightforward workaround, but it is not clear from the literature how much the quality of the embedding suffers from the downsampling, and whether uniform random sampling is indeed the best possible solution. In this paper, we report on a thorough series of experiments performed to draw conclusions about the quality of embeddings from running t-SNE on samples of data using different sampling techniques: uniform random sampling, random walk sampling, our proposed affinity-based random walk sampling, and the so-called hubness sampling. Throughout our testing, the affinity-based variant of random walk sampling distinguished itself as a promising alternative to uniform random sampling.

Full Text