Abstract

To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions with strong statistical theoretical guarantees. We introduce an approach to extending principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, Linear Optimal Low-rank Projection (LOL), incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that LOL and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of more than 150 million features, and several genomics datasets with more than 500,000 features, LOL outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while only requiring a few minutes on a standard desktop computer.
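
Concretely, the simplest version is easy to sketch. The following Python snippet is our own illustration of the idea, not the authors' released implementation: it stacks the class-conditional mean difference ahead of the top principal directions of the class-centered data and orthonormalizes the result (the function name `lol_projection` and all implementation details are assumptions made for exposition).

```python
import numpy as np

def lol_projection(X, y, d):
    """Sketch of a LOL-style projection for a two-class problem.

    Stacks the class-conditional mean difference ahead of the top
    principal directions of the class-centered data, then
    orthonormalizes, giving a (p, d) projection matrix.
    """
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    delta = means[classes[0]] - means[classes[1]]      # class-mean difference direction
    Xc = X.astype(float).copy()
    for c in classes:                                  # center each sample by its class mean
        Xc[y == c] -= means[c]
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # principal directions of centered data
    A = np.column_stack([delta, Vt[: d - 1].T])        # mean difference first, then top d-1 PCs
    Q, _ = np.linalg.qr(A)                             # orthonormalize the columns
    return Q                                           # embed with X @ Q

# Toy usage: 100 samples in 1,000 dimensions reduced to 10.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 1000)),
               rng.normal(0.3, 1.0, (50, 1000))])
y = np.array([0] * 50 + [1] * 50)
Z = X @ lol_projection(X, y, d=10)
```

Because the construction needs only class means and a truncated SVD, it avoids estimating or inverting a full p-by-p covariance matrix, which is what makes it feasible at millions of dimensions.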

Highlights

  • To solve key biomedical problems, experimentalists routinely measure millions or billions of features per sample, with the hope that data science techniques will be able to build accurate data-driven inferences

  • Theorems guarantee that when the sample size n is large and the dimensionality p is relatively small, Linear Discriminant Analysis (LDA) converges to the optimal classifier under the Gaussian assumption

  • We consider a number of different methods, including Principal Components Analysis (PCA), reduced-rank LDA (rrLDA), partial least squares (PLS), random projections (RP), and canonical correlation analysis (CCA), to project the data onto a low-dimensional space (see the sketch after this list)
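
As a point of reference, several of these baselines can be run with scikit-learn. The sketch below is illustrative only: the synthetic data, dimensions, and parameter choices are our own assumptions, and rrLDA is omitted for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA, PLSRegression
from sklearn.random_projection import GaussianRandomProjection

# Synthetic two-class data: 100 samples, 50 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 50)),
               rng.normal(0.3, 1.0, (50, 50))])
y = np.array([0] * 50 + [1] * 50)
Y = np.eye(2)[y]          # one-hot labels for the supervised methods
d = 5                     # target dimensionality

Z_pca = PCA(n_components=d).fit_transform(X)                              # unsupervised
Z_rp  = GaussianRandomProjection(n_components=d,
                                 random_state=0).fit_transform(X)         # unsupervised
Z_pls = PLSRegression(n_components=d).fit(X, Y).transform(X)              # supervised
Z_cca = CCA(n_components=1).fit(X, y.reshape(-1, 1)
                                      .astype(float)).transform(X)        # supervised

for name, Z in [("PCA", Z_pca), ("RP", Z_rp), ("PLS", Z_pls), ("CCA", Z_cca)]:
    print(name, Z.shape)
```

Note that CCA with a single binary label yields at most one projection direction, which is one reason supervised baselines of this kind can struggle to produce rich low-dimensional representations.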

Introduction

To solve key biomedical problems, experimentalists routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. The sample sizes, however, have not experienced a concomitant increase. This "large p, small n" problem is a non-starter for many classical statistical approaches, because they were designed with a "small p, large n" situation in mind. Linear Discriminant Analysis (LDA) has a number of highly desirable properties for a classifier. It is based on simple geometric reasoning: when the data are Gaussian, all the information is in the means and variances, so the optimal classifier uses both the means and the variances. Theorems guarantee that when the sample size n is large and the dimensionality p is relatively small, LDA converges to the optimal classifier under the Gaussian assumption. When p exceeds n, however, the sample covariance estimate that LDA relies on is singular, and the method breaks down. This motivates our approach: we prove, and substantiate with both synthetic and real data benchmarks, that Linear Optimal Low-rank Projection (LOL) and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability.
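
To make the geometric reasoning concrete, here is a minimal NumPy sketch of two-class LDA (our own illustration, not code from the paper). It classifies a point by projecting onto the direction given by the inverse covariance applied to the mean difference. Note that when p ≥ n the pooled covariance S is singular, so the solve fails; that failure is exactly the "large p, small n" breakdown described above.

```python
import numpy as np

def lda_predict(X_train, y_train, X_test):
    """Two-class LDA: Gaussian classes sharing one covariance, so the
    optimal decision rule is linear in x."""
    mu0 = X_train[y_train == 0].mean(axis=0)
    mu1 = X_train[y_train == 1].mean(axis=0)
    Xc = np.vstack([X_train[y_train == 0] - mu0,   # pool the class-centered
                    X_train[y_train == 1] - mu1])  # samples
    S = Xc.T @ Xc / (len(X_train) - 2)             # shared covariance estimate
    w = np.linalg.solve(S, mu1 - mu0)              # direction S^{-1}(mu1 - mu0);
                                                   # fails when S is singular (p >= n)
    c = w @ (mu0 + mu1) / 2.0                      # midpoint threshold (equal priors)
    return (X_test @ w > c).astype(int)
```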
