The seminal work of Cohen and Peng [10] (STOC 2015) introduced Lewis weight sampling to the theoretical computer science community, yielding fast row sampling algorithms for approximating \(d\)-dimensional subspaces of \(\ell_{p}\) up to \((1+\varepsilon)\) relative error. Prior works have extended this important primitive to other settings, such as the online coreset and sliding window models [4] (FOCS 2020). However, these results hold only for \(p\in\{1,2\}\), and the results for \(p=1\) require a suboptimal \(\tilde{O}(d^{2}/\varepsilon^{2})\) sample complexity. In this work, we design the first nearly optimal \(\ell_{p}\) subspace embeddings for all \(p\in(0,\infty)\) in the online coreset and sliding window models. In both models, our algorithms store \(\tilde{O}(d/\varepsilon^{2})\) rows for \(p\in(0,2)\) and \(\tilde{O}(d^{p/2}/\varepsilon^{2})\) rows for \(p\in(2,\infty)\). This answers a substantial generalization of the main open question of [4], gives the first results for all \(p\notin\{1,2\}\), and achieves nearly optimal sample complexities for all \(p\). Towards our result, we give the first analysis of “one-shot” Lewis weight sampling, in which rows are sampled proportionally to their Lewis weights, showing that it achieves a sample complexity of \(\tilde{O}(d^{p/2}/\varepsilon^{2})\) rows for \(p>2\). Previously, this sampling scheme was only known to achieve a sample complexity of \(\tilde{O}(d^{p/2}/\varepsilon^{5})\) [10], whereas the bound of \(\tilde{O}(d^{p/2}/\varepsilon^{2})\) was known only for a more sophisticated recursive sampling algorithm [20, 32]. Since the recursive sampling strategy cannot be implemented in an online setting, an analysis of one-shot Lewis weight sampling is necessary. Perhaps surprisingly, our analysis crucially uses a novel connection to online numerical linear algebra, even for offline Lewis weight sampling. As an application, we obtain the first online coreset algorithms for \((1+\varepsilon)\) approximation of important generalized linear models, such as logistic regression and \(p\)-probit regression. Our upper bounds are parameterized by a complexity parameter \(\mu\) introduced by [31], and we also provide the first lower bounds, showing that a linear dependence on \(\mu\) is necessary.
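For context, a brief sketch of the sampling primitive, using standard definitions from the Lewis weight sampling literature rather than this paper's exact notation or sampling rates: for \(A \in \mathbb{R}^{n\times d}\) with rows \(a_1,\dots,a_n\), the \(\ell_p\) Lewis weights \(w_1,\dots,w_n\) are the unique weights satisfying
\[
w_i = \Big( a_i^{\top} \big( A^{\top} W^{1-2/p} A \big)^{-1} a_i \Big)^{p/2}, \qquad W = \operatorname{diag}(w_1,\dots,w_n),
\]
and one-shot Lewis weight sampling keeps each row \(i\) independently with probability \(p_i = \min\{1, \beta w_i\}\), rescaling every kept row by \(p_i^{-1/p}\) so that \(\|SAx\|_p^p\) is an unbiased estimator of \(\|Ax\|_p^p\) for each fixed \(x\). Since \(\sum_i w_i = d\), the oversampling factor \(\beta\) can be chosen so that the expected number of stored rows matches the sample complexities stated above.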