Abstract

Experimenters today routinely quantify millions or even billions of features (measurements) per sample to address critical biological questions, in the hope that machine learning tools can make correct data-driven judgments (for example, whether a certain ailment is present). An efficient analysis requires a low-dimensional representation that preserves the discriminating features of data whose size and complexity span orders of magnitude. While several methods can handle millions of variables and still carry strong empirical and theoretical guarantees, few are clearly interpretable. This research presents an evaluation of supervised dimensionality reduction for large-scale data. We provide a methodology for extending Principal Component Analysis (PCA) by incorporating class moment estimates into the low-dimensional projections. Linear Optimal Low-Rank (LOLR) projection, the cheapest variant, incorporates the class-conditional means. Using both experimental and simulated benchmark data, we show that LOLR projections and their extensions improve data representations for subsequent classification while retaining computational flexibility and reliability. In terms of accuracy, LOLR outperforms other modular linear dimension reduction methods that require much longer computation times on conventional computers. LOLR scales to brain imaging datasets with more than 150 million features and to genome sequencing datasets with more than half a million features.
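The core idea described above, augmenting the top principal directions with the class-conditional mean difference before projecting, can be sketched as follows. This is a minimal two-class illustration under our own assumptions, not the authors' implementation; the function name `lol_project` and all details are hypothetical.

```python
import numpy as np

def lol_project(X, y, d):
    """Hypothetical sketch of a LOLR-style projection: combine the
    class-mean difference with the top d-1 principal directions,
    then orthonormalize the resulting basis (two-class case)."""
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])
    delta = (mu[0] - mu[1]).reshape(-1, 1)   # class-conditional mean difference
    Xc = X - X.mean(axis=0)                  # center data for PCA
    # top right singular vectors of the centered data = principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = np.hstack([delta, Vt[: d - 1].T])    # means first, then PCA directions
    Q, _ = np.linalg.qr(A)                   # orthonormalize the combined basis
    return X @ Q[:, :d]                      # low-dimensional representation
```

Because the mean-difference direction is placed first, the projection retains class-discriminating information that unsupervised PCA may discard when the class separation lies along a low-variance direction.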

Highlights

  • The science and technology of predicting statistical relationships from labeled training data, known as supervised learning, has enabled a broad range of fundamental and applied discoveries, from detecting biomarkers in omics data to identifying objects in images

  • We provide a methodology for extending Principal Component Analysis (PCA) by incorporating class moment estimates into low-dimensional projections

  • To project the data into a low-dimensional space, we investigate a variety of approaches, including PCA, reduced-rank LDA (rrLDA), Partial Least Squares (PLS), Randomized Projections (RP), and Canonical Correlation Analysis (CCA)



Introduction

The science and technology of predicting statistical relationships from labeled training data, known as supervised learning, has enabled a broad range of fundamental and applied discoveries, from detecting biomarkers in omics data to identifying objects in images. Classification is a kind of supervised learning in which a classifier estimates the "class" of a new input (for instance, predicting sex from a magnetic resonance imaging scan). Fisher's Linear Discriminant Analysis (LDA) is one of the most fundamental approaches to classification, and it offers a number of highly desirable qualities. It is founded on straightforward geometric reasoning: when the input is Gaussian, the class means and covariances contain all of the information, and the best classifier employs both of them. Theorems ensure that, under the Gaussian assumption, when the sample size n is large and the dimensionality p is small, LDA [1] is the optimal classifier for the problem. The methods used to implement it are highly efficient.
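The geometric reasoning above, classifying with both the class means and the shared covariance, can be made concrete with a minimal two-class Fisher LDA sketch. This is our own illustrative code under the equal-covariance, equal-priors Gaussian assumption; the function name `lda_fit_predict` is hypothetical.

```python
import numpy as np

def lda_fit_predict(X, y, X_new):
    """Minimal two-class Fisher LDA sketch (hypothetical helper):
    classify along the discriminant direction w = Sigma^{-1} (mu1 - mu0),
    using the midpoint between class means as the threshold."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    X0c, X1c = X[y == 0] - mu0, X[y == 1] - mu1
    # pooled within-class covariance estimate
    Sigma = (X0c.T @ X0c + X1c.T @ X1c) / (len(X) - 2)
    w = np.linalg.solve(Sigma, mu1 - mu0)    # discriminant direction
    c = w @ (mu0 + mu1) / 2                  # midpoint threshold (equal priors)
    return (X_new @ w > c).astype(int)       # predicted class labels
```

Note that estimating and inverting Sigma is exactly what becomes unstable when the dimensionality p exceeds the sample size n, which motivates the supervised dimensionality reduction studied in this work.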

