Repulsion, chaos, and equilibrium in mixture models

Abstract Mixture models are commonly used in applications with heterogeneity and overdispersion in the population, as they allow the identification of subpopulations. In the Bayesian framework, this entails the specification of suitable prior distributions for the weights and locations of the mixture. Despite their popularity, the flexibility of these models often does not translate into interpretable clusters. To overcome this issue, repulsive mixture models have recently been proposed. The basic idea is to include a repulsive term in the distribution of the atoms of the mixture, favouring mixture locations that are far apart. This approach induces well-separated clusters, aiding the interpretation of the results. However, these models are usually hard to handle because their normalizing constants are unknown. We exploit results from equilibrium statistical mechanics, where the molecular chaos hypothesis implies that nearby particles spread out over time. In particular, building on the connection between random matrix theory and statistical mechanics, we propose a novel class of repulsive prior distributions based on Gibbs measures associated with the joint distributions of eigenvalues of random matrices. The proposed framework greatly simplifies computations because the normalizing constant is available in closed form. We investigate the theoretical properties and clustering performance of the proposed distributions.
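
As a concrete illustration of the eigenvalue-based construction described above, the sketch below evaluates the joint eigenvalue density of a Gaussian beta-ensemble, whose pairwise |x_i - x_j|^beta factor is the repulsive term and whose normalizing constant follows in closed form from the Mehta integral. This is a minimal sketch of the general idea under that assumed form, not the paper's exact prior; the function name log_gbe_prior is illustrative.

```python
import numpy as np
from scipy.special import gammaln

def log_gbe_prior(locations, beta=2.0):
    """Log density of mixture locations under a Gaussian beta-ensemble,
    i.e. the joint eigenvalue law of a Gaussian random matrix:
        p(x) = (1/Z) * prod_{i<j} |x_i - x_j|^beta * exp(-sum_i x_i^2 / 2).
    The pairwise |x_i - x_j|^beta factor is the repulsive component."""
    x = np.asarray(locations, dtype=float)
    n = len(x)
    diffs = np.abs(x[:, None] - x[None, :])[np.triu_indices(n, k=1)]
    log_repulsion = beta * np.sum(np.log(diffs))
    log_base = -0.5 * np.sum(x**2)
    # Normalizing constant in closed form via the Mehta integral:
    #   Z = (2*pi)^(n/2) * prod_{j=1..n} Gamma(1 + j*beta/2) / Gamma(1 + beta/2)
    log_Z = 0.5 * n * np.log(2 * np.pi) + np.sum(
        gammaln(1 + np.arange(1, n + 1) * beta / 2) - gammaln(1 + beta / 2)
    )
    return log_repulsion + log_base - log_Z

# Repulsion in action: well-separated atoms receive higher prior density
print(log_gbe_prior([-3.0, 0.0, 3.0]))   # spread-out locations
print(log_gbe_prior([-0.1, 0.0, 0.1]))   # nearly coincident locations
```

Running the two calls shows that spread-out location configurations get a markedly higher log prior density than nearly coincident ones, which is exactly the separation effect the abstract describes.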

Corrected generalized cross-validation for finite ensembles of penalized estimators

Abstract Generalized cross-validation (GCV) is a widely used method for estimating the squared out-of-sample prediction risk; it applies a scalar, multiplicative degrees-of-freedom adjustment to the squared training error. In this paper, we examine the consistency of GCV for estimating the prediction risk of arbitrary ensembles of penalized least-squares estimators. We show that GCV is inconsistent for any finite ensemble of size greater than one. To repair this shortcoming, we identify a correction involving an additional scalar, additive term based on the degrees-of-freedom-adjusted training errors of each ensemble component. The proposed estimator (termed CGCV) maintains the computational advantages of GCV and requires neither sample splitting, model refitting, nor out-of-bag risk estimation. The estimator stems from a finer inspection of the ensemble risk decomposition and two intermediate risk estimators for the components in this decomposition. We provide a non-asymptotic analysis of CGCV and the two intermediate risk estimators for ensembles of convex penalized estimators under Gaussian features and a linear response model. Furthermore, in the special case of ridge regression, we extend the analysis to general feature and response distributions using random matrix theory, which establishes model-free uniform consistency of CGCV.
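
For context, the sketch below computes the classical multiplicative GCV quantity that CGCV corrects, for a single ridge estimator whose degrees of freedom are the trace of the smoother matrix. The additive correction term itself is specified in the paper and not reproduced here, so this hedged sketch stops at the baseline being corrected; gcv_ridge is an illustrative name.

```python
import numpy as np

def gcv_ridge(X, y, lam):
    """Generalized cross-validation for ridge regression:
        GCV = (||y - y_hat||^2 / n) / (1 - df/n)^2,
    a multiplicative degrees-of-freedom adjustment of the training error,
    with df = tr(S) for the smoother S = X (X'X + lam I)^{-1} X'."""
    n, p = X.shape
    G = X.T @ X + lam * np.eye(p)
    beta_hat = np.linalg.solve(G, X.T @ y)
    y_hat = X @ beta_hat
    df = np.trace(X @ np.linalg.solve(G, X.T))   # effective degrees of freedom
    train_err = np.mean((y - y_hat) ** 2)
    return train_err / (1.0 - df / n) ** 2

# Toy check on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
y = X @ rng.standard_normal(20) + rng.standard_normal(200)
print(gcv_ridge(X, y, lam=1.0))
```

Per the abstract, applying this multiplicative adjustment to the averaged predictor of an ensemble of size greater than one is inconsistent; CGCV repairs it with an additive term built from each component's degrees-of-freedom-adjusted training error.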

Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values

Abstract Many testing problems are readily amenable to randomized tests, such as those employing data splitting. However, despite their usefulness in principle, randomized tests have obvious drawbacks. Firstly, two analyses of the same dataset may lead to different results. Secondly, the test typically loses power because it does not fully utilize the entire sample. As a remedy to these drawbacks, we study how to combine the test statistics or p-values resulting from multiple random realizations, such as random data splits. We develop rank-transformed subsampling as a general method for delivering large-sample inference about the combined statistic or p-value under mild assumptions. We apply our methodology to a wide range of problems, including testing unimodality in high-dimensional data, testing goodness-of-fit of parametric quantile regression models, testing no direct effect in a sequentially randomized trial, and calibrating cross-fit double machine learning confidence intervals. In contrast to existing p-value aggregation schemes, which can be highly conservative, our method enjoys Type I error control that asymptotically approaches the nominal level. Moreover, compared to ordinary subsampling, we show that our rank transform can remove the first-order bias in approximating the null under alternatives and greatly improve power.
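
To make the setting concrete, the sketch below generates p-values from repeated random subsamples of a toy two-sample problem and combines them with the classical twice-the-median rule, a simple aggregation scheme of the conservative kind the abstract contrasts with. The rank-transformed subsampling combiner itself is more involved and is not reproduced here; all function names are illustrative.

```python
import numpy as np
from scipy import stats

def split_pvalues(x, y, n_splits=50, seed=0):
    """Toy randomized test: a two-sample t-test recomputed on many
    random half-samples, yielding one p-value per random split."""
    rng = np.random.default_rng(seed)
    pvals = []
    for _ in range(n_splits):
        ix = rng.permutation(len(x))[: len(x) // 2]
        iy = rng.permutation(len(y))[: len(y) // 2]
        pvals.append(stats.ttest_ind(x[ix], y[iy]).pvalue)
    return np.array(pvals)

def median_aggregate(pvals):
    """Classical conservative rule: twice the median p-value, capped
    at 1. Valid under arbitrary dependence across splits, but can
    lose substantial power, which motivates the rank-transform idea."""
    return min(1.0, 2.0 * np.median(pvals))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 100)
y = rng.normal(0.5, 1.0, 100)
print(median_aggregate(split_pvalues(x, y)))
```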

Extended fiducial inference: toward an automated process of statistical inference

Abstract While fiducial inference is widely considered R.A. Fisher's big blunder, the goal he initially set for it—'inferring the uncertainty of model parameters on the basis of observations'—has been continually pursued by many statisticians. To this end, we develop a new statistical inference method called extended fiducial inference (EFI). The new method achieves the goal of fiducial inference by leveraging advanced statistical computing techniques while remaining scalable for big data. EFI involves jointly imputing the random errors realized in the observations using stochastic gradient Markov chain Monte Carlo and estimating the inverse function using a sparse deep neural network (DNN). The consistency of the sparse DNN estimator ensures that the uncertainty embedded in the observations is properly propagated to the model parameters through the estimated inverse function, thereby validating downstream statistical inference. Compared to frequentist and Bayesian methods, EFI offers significant advantages in parameter estimation and hypothesis testing: it provides higher fidelity in parameter estimation, especially when outliers are present in the observations, and it eliminates the need for theoretical reference distributions in hypothesis testing, thereby automating the statistical inference process. EFI also provides an innovative framework for semisupervised learning.
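
EFI's actual machinery (stochastic gradient MCMC imputation of errors plus a learned sparse-DNN inverse map) does not fit in a snippet, but the underlying fiducial inversion idea can be sketched in a toy linear model with known noise scale, where the inverse function mapping (y, eps) to theta is available in closed form: impute plausible error realizations and push each through the inverse to obtain parameter draws. This is only a hedged toy illustration, not the paper's algorithm.

```python
import numpy as np

# Toy fiducial inversion for the linear model y = X @ theta + eps,
# eps ~ N(0, sigma^2 I) with sigma known. EFI generalizes this idea:
# it imputes eps by stochastic gradient MCMC and *learns* the inverse
# map with a sparse DNN instead of relying on a closed form.
rng = np.random.default_rng(0)
n, p, sigma = 100, 3, 1.0
X = rng.standard_normal((n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + sigma * rng.standard_normal(n)

XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)   # closed-form inverse map

# Impute error realizations and invert each one:
#   theta(eps) = (X'X)^{-1} X' (y - eps)
fiducial_draws = np.array([
    XtX_inv_Xt @ (y - sigma * rng.standard_normal(n))
    for _ in range(2000)
])

# Fiducial point estimates and 95% uncertainty intervals per coordinate
print(fiducial_draws.mean(axis=0))
print(np.percentile(fiducial_draws, [2.5, 97.5], axis=0))
```

In this toy case the draws recover the classical distribution N(theta_hat, sigma^2 (X'X)^{-1}), so the imputed-error uncertainty is propagated to the parameters exactly as the abstract describes for the general case.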

Graphical criteria for the identification of marginal causal effects in continuous-time survival and event-history analyses

Abstract We consider continuous-time survival and event-history settings, where our aim is to graphically represent causal structures that allow us to characterize when a causal parameter is identified from observational data. This causal parameter is formalized as the effect on an outcome event of a (possibly hypothetical) intervention on the intensity of a treatment process. To establish identifiability, we propose novel graphical rules indicating whether the observed information is sufficient to obtain the desired causal effect by suitable reweighting. This requires a different type of graph than in discrete time. We formally define causal semantics for the corresponding dynamic graphs, which represent local independence models for multivariate counting processes. Importantly, our work highlights that causal inference from censored data relies on subtle structural assumptions on the censoring process beyond independent censoring; these can be verified graphically. Put together, our results are the first to establish graphical rules for nonparametric causal identifiability of event processes in continuous time at this level of generality, without relying on particular parametric survival models. We conclude with a data example on human papillomavirus (HPV) testing for cervical cancer screening, in which the assumptions are illustrated graphically and the desired effect is estimated by reweighted cumulative incidence curves.
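
As a hedged sketch of the final estimation step mentioned above, the code below implements a weighted one-sample Aalen–Johansen cumulative incidence estimator in plain numpy. The weights stand in for whatever reweighting the paper's identification formula prescribes and are passed in as given; weighted_cuminc is an illustrative name.

```python
import numpy as np

def weighted_cuminc(times, events, weights, grid):
    """Weighted Aalen-Johansen cumulative incidence in a competing-risks
    setup: events = 0 (censored), 1 (event of interest), 2 (competing).
    `weights` stand in for the reweighting implied by an identification
    formula (e.g. inverse intensity/censoring weights)."""
    order = np.argsort(times)
    t, e, w = times[order], events[order], weights[order]
    surv, cuminc = 1.0, 0.0
    out, j = [], 0
    at_risk = w.sum()
    for g in grid:
        while j < len(t) and t[j] <= g:
            if at_risk > 0 and e[j] != 0:
                haz = w[j] / at_risk          # weighted cause-specific hazard
                if e[j] == 1:
                    cuminc += surv * haz      # increment for cause of interest
                surv *= 1.0 - haz             # overall survival update
            at_risk -= w[j]                   # remove subject from risk set
            j += 1
        out.append(cuminc)
    return np.array(out)

rng = np.random.default_rng(0)
n = 500
times = rng.exponential(1.0, n)
events = rng.choice([0, 1, 2], size=n, p=[0.2, 0.5, 0.3])
weights = np.ones(n)                          # unit weights = unadjusted curve
print(weighted_cuminc(times, events, weights, grid=np.linspace(0.1, 2.0, 5)))
```

With unit weights this reduces to the standard unadjusted cumulative incidence curve; plugging in the identification-based weights yields the reweighted curves used for the causal effect in the HPV example.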
