Slope Heuristics Research Articles

Tree tensor networks, or tree-based tensor formats, are prominent model classes for the approximation of high-dimensional functions in computational and data science. They correspond to sum-product neural networks with a sparse connectivity associated with a dimension tree and widths given by a tuple of tensor ranks. The approximation power of these models has been proved to be (near to) optimal for classical smoothness classes. However, in an empirical risk minimization framework with a limited number of observations, the dimension tree and ranks should be selected carefully to balance estimation and approximation errors. We propose and analyze a complexity-based model selection method for tree tensor networks in an empirical risk minimization framework and we analyze its performance over a wide range of smoothness classes. Given a family of model classes associated with different trees, ranks, tensor product feature spaces and sparsity patterns for sparse tensor networks, a model is selected (à la Barron, Birgé, Massart) by minimizing a penalized empirical risk, with a penalty depending on the complexity of the model class and derived from estimates of the metric entropy of tree tensor networks. This choice of penalty yields a risk bound for the selected predictor. In a least-squares setting, after deriving fast rates of convergence of the risk, we show that our strategy is (near to) minimax adaptive to a wide range of smoothness classes including Sobolev or Besov spaces (with isotropic, anisotropic or mixed dominating smoothness) and analytic functions. We discuss the role of sparsity of the tensor network for obtaining optimal performance in several regimes. In practice, the amplitude of the penalty is calibrated with a slope heuristics method. Numerical experiments in a least-squares regression setting illustrate the performance of the strategy.
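The calibration step mentioned in this abstract can be illustrated with a generic sketch of the data-driven slope estimation variant of the slope heuristics. It is not the authors' implementation. It assumes that each candidate model comes with a scalar complexity and an empirical risk, that the empirical risk of the most complex models decreases roughly linearly in the complexity, and that the final penalty is twice the minimal penalty given by that slope. The function name slope_heuristic_select and the parameter frac_largest are hypothetical.

```python
import numpy as np

def slope_heuristic_select(complexities, empirical_risks, frac_largest=0.5):
    """Pick a model by penalized empirical risk, with the penalty amplitude
    calibrated from the data (slope estimation; illustrative sketch only).

    Returns (index of the selected model, estimated slope kappa).
    """
    c = np.asarray(complexities, dtype=float)
    r = np.asarray(empirical_risks, dtype=float)

    # Use only the most complex models to estimate the slope (assumption:
    # their empirical risk is approximately linear in the complexity).
    order = np.argsort(c)
    n_fit = max(2, int(np.ceil(frac_largest * len(c))))
    idx = order[-n_fit:]

    # Least-squares line through (complexity, empirical risk).
    slope, _ = np.polyfit(c[idx], r[idx], 1)
    kappa = -slope  # amplitude of the minimal penalty

    # Final penalty: twice the minimal penalty.
    penalized = r + 2.0 * kappa * c
    return int(np.argmin(penalized)), kappa


# Toy example: a bias term that vanishes at complexity 5, minus a linear term.
c = list(range(1, 21))
r = [max(5 - x, 0) - 0.1 * x for x in c]
best, kappa = slope_heuristic_select(c, r)
print(c[best], kappa)
```

In the toy example the linear part of the risk curve has slope -0.1, so the calibrated penalty is 0.2 times the complexity and the criterion is minimized at the complexity where the bias term vanishes.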

Although there is no shortage of clustering algorithms proposed in the literature, the question of the most relevant strategy for clustering compositional data (i.e. data whose rows belong to the simplex) remains largely unexplored in cases where the observed value is equal or close to zero for one or more samples. This work is motivated by the analysis of two applications, both focused on the categorization of compositional profiles: (1) identifying groups of co-expressed genes from high-throughput RNA sequencing data, in which a given gene may be completely silent in one or more experimental conditions; and (2) finding patterns in the usage of stations over the course of one week in the Velib' bicycle sharing system in Paris, France. For both of these applications, we make use of appropriately chosen data transformations, including the Centered Log Ratio and a novel extension called the Log Centered Log Ratio, in conjunction with the K-means algorithm. We use a non-asymptotic penalized criterion, whose penalty is calibrated with the slope heuristics, to select the number of clusters. Finally, we illustrate the performance of this clustering strategy, which is implemented in the Bioconductor package coseq, on both the gene expression and bicycle sharing system data.
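As a small illustration of the standard Centered Log Ratio transformation named in this abstract, the sketch below maps each compositional profile to its log-components minus their row mean. It is not taken from the coseq package, and the Log Centered Log Ratio extension is not reproduced because the abstract does not define it. The function name clr is hypothetical, and raising an error on zero entries is an assumption made here to show why zeros need special handling.

```python
import numpy as np

def clr(profiles):
    """Centered Log Ratio transform of compositional rows (illustrative sketch).

    Each row is expected to have strictly positive entries; the transform is
    log(x_j) minus the mean of log(x) over the row.
    """
    p = np.asarray(profiles, dtype=float)
    if np.any(p <= 0):
        # log(0) is undefined, which is the difficulty the abstract points to.
        raise ValueError("CLR is undefined when a component is zero or negative")
    log_p = np.log(p)
    return log_p - log_p.mean(axis=1, keepdims=True)


# Two toy profiles on the simplex (rows sum to 1).
x = np.array([[0.25, 0.25, 0.50],
              [0.10, 0.60, 0.30]])
z = clr(x)
print(z.sum(axis=1))  # each transformed row sums to (numerically) zero
```

The transformed rows could then be passed to any K-means implementation; the choice of the number of clusters is a separate step.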

Articles published on Slope Heuristics

Learning with tree tensor networks: Complexity estimates and model selection

Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data

Slope heuristics and V-Fold model selection in heteroscedastic regression using strongly localized bases

Clustering and variable selection for categorical multivariate data

Model Selection for Simplicial Approximation

Slope heuristics: overview and implementation

Data-driven penalty calibration: A case study for Gaussian mixture model selection
