Abstract

Dimensionality reduction algorithms are a common way to create a visual summary of high dimensional data that makes patterns and trends easier to identify. Algorithms that render data as two or three dimensional plots are popular options, all the more so with the rise of clustering and manifold learning. Many tools, both linear and nonlinear, already exist for visualizing high dimensional data; three of the most popular are PCA, t-SNE, and UMAP. PCA has low memory requirements and is efficient in low dimensions; t-SNE captures much of the local structure of high dimensional data while also revealing global features such as the presence of clusters; and UMAP places no computational restrictions on the embedding dimension. Despite these respective advantages, all three tools have noticeable drawbacks. t-SNE and UMAP both have hyperparameters that must be tuned before their visualizations are of any value. PCA cannot recover nonlinear structure, so applying it to data can cause significant loss of global structure. These drawbacks motivate the development of new (mostly nonlinear) tools for visualizing high dimensional data.

The reason we want to visualize high dimensional data in the first place is that humans cannot see in more than three dimensions. Reducing the dimension of such data lets us not only view it, but also notice patterns and detect anomalous data points more easily. Manifold learning is one approach to obtaining a simplified low dimensional version of higher dimensional data: this machine learning technique supports visualization by describing such datasets as low dimensional manifolds embedded in a higher dimensional space.

Clustering is a machine learning approach that groups individual data points together in a way that provides insight. It simplifies a large high dimensional dataset by presenting clusters, organized groups of data points, rather than every point individually. Hierarchical clustering applies this principle by first organizing the dataset into one large cluster and then recursively dividing the current cluster(s) until a specific criterion is met, yielding the optimal “level” of the process, that is, the clusters that best represent the dataset. Clustering algorithms are usually more effective in lower dimensions because of the “curse of dimensionality”, the issues that arise when analyzing high dimensional data but not in lower dimensions. For this reason, if we want to cluster high dimensional data effectively, it is natural to reduce its dimension first.
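As a concrete illustration of the contrast drawn above, the following minimal sketch embeds the same dataset with all three tools. It assumes scikit-learn and the third-party umap-learn package are installed; the dataset and hyperparameter values are illustrative choices, not taken from this work.

```python
# Minimal sketch: embed one 64-dimensional dataset with PCA, t-SNE, and UMAP.
# Assumes scikit-learn and umap-learn; values below are illustrative only.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

X, y = load_digits(return_X_y=True)  # 64-dimensional digit images

# PCA: linear and memory-light; no hyperparameters beyond the target dimension.
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: nonlinear, preserves local structure; 'perplexity' must be tuned.
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)

# UMAP: nonlinear; 'n_neighbors' and 'min_dist' must be tuned, and the
# embedding dimension is not restricted to 2 or 3.
X_umap = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1).fit_transform(X)
```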
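The divisive hierarchical scheme described above, starting from one all-encompassing cluster and splitting recursively until a criterion is met, can be sketched with bisecting k-means as the splitting rule. This is one possible realization rather than the paper's method; it assumes scikit-learn 1.1 or later and uses a fixed cluster count as a stand-in stopping criterion.

```python
# Minimal sketch of top-down (divisive) hierarchical clustering via
# bisecting k-means. Assumes scikit-learn >= 1.1; the stopping criterion
# here (a fixed n_clusters) stands in for an application-specific one.
from sklearn.cluster import BisectingKMeans
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

# Start from one cluster containing everything, then repeatedly split the
# largest remaining cluster in two until n_clusters groups remain.
model = BisectingKMeans(n_clusters=10, bisecting_strategy="largest_cluster",
                        random_state=0)
labels = model.fit_predict(X)
```

In practice one might run such a clustering on a low dimensional embedding rather than on the raw data, which is precisely the motivation the abstract outlines.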
