Pearson Chi-Squared Conditional Randomization Test
Conditional independence (CI) testing arises naturally in many scientific problems and application domains. The goal of this problem is to investigate the conditional independence between a response variable $Y$ and another variable $X$, while controlling for the effect of a high-dimensional confounding variable $Z$. In this paper, we introduce a novel test, called the “Pearson Chi-squared Conditional Randomization” (PCR) test, which uses distributional information about the covariates $X, Z$ and constructs randomizations to test conditional independence. PCR leverages the i.i.d. structure of the observations to obtain high-resolution p-values with a very small number of conditional randomizations. We also provide a power analysis of the PCR test, which captures the effect of the test's parameters, the sample size, and the distance of the alternative from the set of null distributions, measured in terms of a notion called “conditional relative density”. In addition, we propose two extensions of the PCR test, with important practical implications: $(i)$ parameter-free PCR, which uses a Bonferroni correction to choose a tuning parameter of the test; $(ii)$ robust PCR, which avoids size inflation when the conditional law $P_{X|Z}$ is estimated with slight error.
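The entries below all build on the same conditional randomization recipe: resample $X$ from its (assumed known) conditional law given $Z$ many times, recompute a test statistic on each resample, and rank the observed statistic among them. A minimal generic sketch (the Gaussian toy model, the function names, and the correlation statistic are all illustrative assumptions, not the PCR test itself):

```python
import numpy as np

def crt_pvalue(x, y, z, sample_x_given_z, statistic, n_resamples=200, rng=None):
    """Generic conditional randomization test (CRT) p-value.

    Draws fresh copies of X from the (assumed known) conditional law
    X | Z and compares the observed statistic with the resampled ones.
    The "+1" correction makes the p-value exactly valid in finite samples.
    """
    rng = np.random.default_rng(rng)
    t_obs = statistic(x, y, z)
    t_null = np.array([statistic(sample_x_given_z(z, rng), y, z)
                       for _ in range(n_resamples)])
    return (1 + np.sum(t_null >= t_obs)) / (1 + n_resamples)

# Toy null model (all names here are illustrative): X = Z @ beta + noise,
# while Y depends on Z only, so Y is conditionally independent of X given Z.
rng = np.random.default_rng(0)
n, p = 200, 5
z = rng.normal(size=(n, p))
beta = rng.normal(size=p)
x = z @ beta + rng.normal(size=n)
y = z[:, 0] + rng.normal(size=n)

sample = lambda z_, r: z_ @ beta + r.normal(size=len(z_))
stat = lambda x_, y_, z_: abs(np.corrcoef(x_, y_)[0, 1])
pval = crt_pvalue(x, y, z, sample, stat, rng=1)
print(pval)
```

Because the sampler matches the true law of $X \mid Z$ here, the returned p-value is exactly valid regardless of the statistic chosen.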
- Research Article
- 10.1515/jci-2018-0004
- Jan 18, 2019
- Journal of Causal Inference
A benefit of randomized experiments is that covariate distributions of treatment and control groups are balanced on average, resulting in simple unbiased estimators for treatment effects. However, it is possible that a particular randomization yields covariate imbalances that researchers want to address in the analysis stage through adjustment or other methods. Here we present a randomization test that conditions on covariate balance by only considering treatment assignments that are similar to the observed one in terms of covariate balance. Previous conditional randomization tests have only allowed for categorical covariates, while our randomization test allows for any type of covariate. Through extensive simulation studies, we find that our conditional randomization test is more powerful than unconditional randomization tests and other conditional tests. Furthermore, we find that our conditional randomization test is valid (1) unconditionally across levels of covariate balance, and (2) conditional on particular levels of covariate balance. Meanwhile, unconditional randomization tests are valid for (1) but not (2). Finally, we find that our conditional randomization test is similar to a randomization test that uses a model-adjusted test statistic.
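The balance-conditioning idea in this abstract can be sketched as rejection sampling over permuted assignments, keeping only those whose Mahalanobis covariate imbalance is no worse than the observed one (a toy sketch under assumed names and a simplified acceptance rule; the paper defines the conditioning set more carefully):

```python
import numpy as np

def balanced_randomization_pvalue(y, w, covariates, statistic,
                                  n_draws=300, rng=None):
    """Randomization test that conditions on covariate balance (sketch).

    Only permuted treatment assignments whose Mahalanobis imbalance is
    no worse than the observed imbalance enter the reference set.
    """
    rng = np.random.default_rng(rng)
    s_inv = np.linalg.pinv(np.cov(covariates.T))
    def imbalance(w_):
        diff = covariates[w_ == 1].mean(0) - covariates[w_ == 0].mean(0)
        return diff @ s_inv @ diff
    b_obs = imbalance(w)
    t_obs = statistic(y, w)
    t_null, attempts = [], 0
    while len(t_null) < n_draws and attempts < 100 * n_draws:
        attempts += 1
        w_ = rng.permutation(w)
        if imbalance(w_) <= b_obs:              # condition on balance
            t_null.append(statistic(y, w_))
    t_null = np.array(t_null)
    return (1 + np.sum(np.abs(t_null) >= abs(t_obs))) / (1 + len(t_null))

# Toy experiment with no treatment effect (illustrative data)
rng = np.random.default_rng(0)
n = 60
covariates = rng.normal(size=(n, 2))
w = rng.permutation(np.r_[np.ones(n // 2), np.zeros(n // 2)].astype(int))
y = covariates[:, 0] + rng.normal(size=n)
diff_in_means = lambda y_, w_: y_[w_ == 1].mean() - y_[w_ == 0].mean()
pval = balanced_randomization_pvalue(y, w, covariates, diff_in_means, rng=1)
print(pval)
```

Because the Mahalanobis criterion handles continuous covariates, no discretization is needed, which is the advantage over categorical-only conditioning highlighted above.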
- Research Article
- 10.1093/biomet/asab039
- Jul 8, 2021
- Biometrika
We consider the problem of conditional independence testing: given a response $Y$ and covariates $(X, Z)$, we test the null hypothesis that $Y$ is independent of $X$ given $Z$. The conditional randomization test was recently proposed as a way to use distributional information about $X \mid Z$ to exactly and nonasymptotically control Type-I error using any test statistic in any dimensionality without assuming anything about $Y \mid (X, Z)$. This flexibility, in principle, allows one to derive powerful test statistics from complex prediction algorithms while maintaining statistical validity. Yet the direct use of such advanced test statistics in the conditional randomization test is prohibitively computationally expensive, especially with multiple testing, due to the requirement to recompute the test statistic many times on resampled data. We propose the distilled conditional randomization test, a novel approach to using state-of-the-art machine learning algorithms in the conditional randomization test while drastically reducing the number of times those algorithms need to be run, thereby taking advantage of their power and the conditional randomization test's statistical guarantees without suffering the usual computational expense. In addition to distillation, we propose a number of other tricks, like screening and recycling computations, to further speed up the conditional randomization test without sacrificing its high power and exact validity. Indeed, we show in simulations that all our proposals combined lead to a test that has similar power to the most powerful existing conditional randomization test implementations, but requires orders of magnitude less computation, making it a practical tool even for large datasets. We demonstrate these benefits on a breast cancer dataset by identifying biomarkers related to cancer stage.
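The distillation trick can be illustrated in a toy version: fit the model of $Y$ on $Z$ once (here a simple ridge solve, an assumption for illustration; the paper allows arbitrary machine learning models), then recompute only a cheap residual statistic per resample:

```python
import numpy as np

def dcrt_pvalue(x, y, z, sample_x_given_z, n_resamples=300, ridge=1.0, rng=None):
    """Sketch of the distillation idea: the (potentially expensive) model
    of Y on Z is fit ONCE; each resample then only recomputes a cheap
    residual statistic, instead of refitting per resample."""
    rng = np.random.default_rng(rng)
    zt = np.c_[np.ones(len(y)), z]                       # add intercept
    coef = np.linalg.solve(zt.T @ zt + ridge * np.eye(zt.shape[1]), zt.T @ y)
    resid = y - zt @ coef                                # distilled Y | Z
    cheap_stat = lambda x_: abs(np.dot(x_ - x_.mean(), resid))
    t_obs = cheap_stat(x)
    t_null = np.array([cheap_stat(sample_x_given_z(z, rng))
                       for _ in range(n_resamples)])
    return (1 + np.sum(t_null >= t_obs)) / (1 + n_resamples)

# Illustrative alternative: Y depends on X even after conditioning on Z
rng = np.random.default_rng(0)
n, p = 300, 4
z = rng.normal(size=(n, p))
x = z.sum(axis=1) + rng.normal(size=n)
y = z[:, 0] + 0.5 * x + rng.normal(size=n)
sample = lambda z_, r: z_.sum(axis=1) + r.normal(size=len(z_))
pval = dcrt_pvalue(x, y, z, sample, rng=1)
print(pval)
```

The expensive step runs once; each of the 300 resamples costs only an inner product, which is the source of the orders-of-magnitude speedup described above.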
- Research Article
- 10.1007/s11749-023-00861-2
- May 2, 2023
- TEST
Testing whether a variable of interest affects the outcome is one of the most fundamental problems in statistics and is often the main scientific question of interest. To tackle this problem, the conditional randomization test (CRT) is widely used to test the independence of variable(s) of interest (X) with an outcome (Y) holding other variable(s) (Z) fixed. The CRT uses the “Model-X” inference framework, which relies solely on the iid sampling of (X, Z) to produce exact finite-sample p-values that can be constructed using any test statistic. We propose a new method, the adaptive randomization test (ART), that tackles the same independence problem while allowing the data to be adaptively sampled. Like the CRT, the ART relies solely on knowing the (adaptive) sampling distribution of (X, Z). Although the ART allows practitioners to flexibly design and analyze adaptive experiments, the method itself does not guarantee a powerful adaptive sampling procedure. For this reason, we show substantial power gains obtained from adaptive sampling compared to the typical iid sampling procedure in a multi-arm bandit setting and an application in conjoint analysis. We believe the proposed adaptive procedure succeeds because it takes arms that may initially look like signals due to random chance, stabilizes them closer to null, and samples more from apparent signal arms and less from apparent null arms.
- Research Article
- 10.1002/pst.1556
- Feb 14, 2013
- Pharmaceutical Statistics
Proschan, Brittain, and Kammerman made a very interesting observation that for some examples of unequal allocation minimization, the mean of the unconditional randomization distribution is shifted away from 0. Kuznetsova and Tymofyeyev linked this phenomenon to the variations in the allocation ratio from allocation to allocation in the examples considered by Proschan et al. and advocated the use of unequal allocation procedures that preserve the allocation ratio at every step. In this paper, we show that the shift phenomenon extends to very common settings: using a conditional randomization test in a study with equal allocation. This phenomenon has the same cause: variations in the allocation ratio among the allocation sequences in the conditional reference set, which had not previously been noted. We consider two kinds of conditional randomization tests. The first kind is the often used randomization test that conditions on the treatment group totals; we describe the variations in the conditional allocation ratio with this test on examples of permuted block randomization and biased coin randomization. The second kind is the randomization test proposed by Zheng and Zelen for a multicenter trial with permuted block central allocation that conditions on the within-center treatment totals. On the basis of the sequence of conditional allocation ratios, we derive the value of the shift in the conditional randomization distribution for a specific vector of responses and the expected value of the shift when responses are independent identically distributed random variables. We discuss the asymptotic behavior of the shift for the two types of tests.
- Research Article
- 10.1002/sim.8418
- Dec 17, 2019
- Statistics in Medicine
We examine the use of randomization-based inference for analyzing multiarmed randomized clinical trials, including the application of conditional randomization tests to multiple comparisons. The view is taken that the linkage of the statistical test to the experimental design (randomization procedure) should be recognized. A selected collection of randomization procedures generalized to multiarmed treatment allocation is summarized, and generalizations are developed for two randomization procedures that heretofore were designed for only two treatments. We explain the process of computing the randomization test and conditional randomization test via Monte Carlo simulation, developing an efficient algorithm that makes multiple comparisons feasible where a standard algorithm would not, demonstrate the preservation of the type I error rate, and explore the relationship of statistical power to the randomization procedure in the presence of a time trend and outliers. We distinguish between the interpretation of the p-value in the randomization test and in the population test and verify that the randomization test can be approximated by the population test on some occasions. Data from two multiarmed clinical trials from the literature are reanalyzed to illustrate the methodology.
- Research Article
- 10.1515/jci-2015-0018
- Mar 1, 2016
- Journal of Causal Inference
We consider the conditional randomization test as a way to account for covariate imbalance in randomized experiments. The test accounts for covariate imbalance by comparing the observed test statistic to the null distribution of the test statistic conditional on the observed covariate imbalance. We prove that the conditional randomization test has the correct significance level and introduce original notation to describe covariate balance more formally. Through simulation, we verify that conditional randomization tests behave like more traditional forms of covariate adjustment but have the added benefit of having the correct conditional significance level. Finally, we apply the approach to a randomized product marketing experiment where covariate information was collected after randomization.
- Research Article
- 10.1017/pan.2023.41
- Feb 8, 2024
- Political Analysis
Conjoint analysis is a popular experimental design used to measure multidimensional preferences. Many researchers focus on estimating the average marginal effects of each factor while averaging over the other factors. Although this allows for straightforward design-based estimation, the results critically depend on the ways in which factors interact with one another. An alternative model-based approach can compute various quantities of interest, but requires correct model specifications, a challenging task for conjoint analysis with many factors. We propose a new hypothesis testing approach based on the conditional randomization test (CRT) to answer the most fundamental question of conjoint analysis: Does a factor of interest matter in any way given the other factors? Although it only provides a formal test of these binary questions, the CRT is solely based on the randomization of factors, and hence requires no modeling assumption. This means that the CRT can provide a powerful and assumption-free statistical test by enabling the use of any test statistic, including those based on complex machine learning algorithms. We also show how to test commonly used regularity assumptions. Finally, we apply the proposed methodology to conjoint analysis of immigration preferences. The proposed methodology is implemented in the open-source R package CRTConjoint, available through the Comprehensive R Archive Network: https://cran.r-project.org/web/packages/CRTConjoint/index.html.
- Research Article
- 10.1080/10618600.2021.1923520
- Apr 30, 2021
- Journal of Computational and Graphical Statistics
We propose the holdout randomization test (HRT), an approach to feature selection using black box predictive models. The HRT is a specialized version of the conditional randomization test (CRT) (Candes et al., 2018) that uses data splitting for feasible computation. The HRT works with any predictive model and produces a valid p-value for each feature. To make the HRT more practical, we propose a set of extensions to maximize power and speed up computation. In simulations, these extensions lead to greater power than a competing knockoffs-based approach, without sacrificing control of the error rate. We apply the HRT to two case studies from the scientific literature where heuristics were originally used to select important features for predictive models. The results illustrate how such heuristics can be misleading relative to principled methods like the HRT. Code is available at https://github.com/tansey/hrt.
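A toy sketch of the holdout idea, with plain least squares standing in for the black-box predictive model (the function name, model choice, and data are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def hrt_pvalue(X, y, j, sample_xj, n_resamples=200, rng=None):
    """Holdout randomization test sketch for feature j.

    Fit a predictive model (plain least squares here) on a training
    split, then compare held-out loss under the observed X_j against
    losses with X_j resampled from its conditional law.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    tr, te = np.arange(n // 2), np.arange(n // 2, n)
    coef, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)   # fit ONCE
    loss = lambda X_: np.mean((y[te] - X_ @ coef) ** 2)
    t_obs = loss(X[te])
    t_null = []
    for _ in range(n_resamples):
        X_res = X[te].copy()
        X_res[:, j] = sample_xj(X[te], rng)   # resample feature j on holdout
        t_null.append(loss(X_res))
    # Small p-value when resampling feature j degrades held-out loss,
    # i.e. when almost no resample achieves a loss as small as observed.
    return (1 + np.sum(np.array(t_null) <= t_obs)) / (1 + n_resamples)

# Illustrative data where feature j=0 carries no signal
rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = X[:, 1] + X[:, 2] + rng.normal(size=n)
sample_x0 = lambda X_, r: r.normal(size=len(X_))   # X_0 independent of the rest
pval = hrt_pvalue(X, y, 0, sample_x0, rng=1)
print(pval)
```

Data splitting is what makes this feasible: the model is trained once, and only cheap loss evaluations are repeated across resamples.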
- Research Article
- 10.1609/aaai.v39i21.34354
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
Conditional independence (CI) testing is a fundamental task in modern statistics and machine learning. The conditional randomization test (CRT) was recently introduced to test whether two random variables, X and Y, are conditionally independent given a potentially high-dimensional set of random variables, Z. The CRT operates exceptionally well under the assumption that the conditional distribution X|Z is known. However, since this distribution is typically unknown in practice, accurately approximating it becomes crucial. In this paper, we propose using conditional diffusion models (CDMs) to learn the distribution of X|Z. Theoretically and empirically, it is shown that CDMs closely approximate the true conditional distribution. Furthermore, CDMs offer a more accurate approximation of X|Z compared to GANs, potentially leading to a CRT that performs better than those based on GANs. To accommodate complex dependency structures, we utilize a computationally efficient classifier-based conditional mutual information (CMI) estimator as our test statistic. The proposed testing procedure performs effectively without requiring assumptions about specific distribution forms or feature dependencies, and is capable of handling mixed-type conditioning sets that include both continuous and discrete variables. Theoretical analysis shows that our proposed test achieves a valid control of the type I error. A series of experiments on synthetic data demonstrates that our new test effectively controls both type-I and type-II errors, even in high dimensional scenarios.
- Research Article
- 10.1609/aaai.v37i7.26039
- Jun 26, 2023
- Proceedings of the AAAI Conference on Artificial Intelligence
The conditional randomization test (CRT) was recently proposed to test whether two random variables X and Y are conditionally independent given random variables Z. The CRT assumes that the conditional distribution of X given Z is known under the null hypothesis, and this distribution is then compared with the distribution of the observed samples of the original data. The aim of this paper is to develop a novel alternative to the CRT by using nearest-neighbor sampling, without assuming the exact form of the distribution of X given Z. Specifically, we utilize the computationally efficient 1-nearest-neighbor to approximate the conditional distribution that encodes the null hypothesis. Then, theoretically, we show that the distribution of the generated samples is very close to the true conditional distribution in terms of total variation distance. Furthermore, we take the classifier-based conditional mutual information estimator as our test statistic. The test statistic, an empirical information-theoretic quantity, captures the conditional-dependence structure well. We show that our proposed test is computationally very fast, while controlling type I and II errors quite well. Finally, we demonstrate the efficiency of our proposed test in both synthetic and real data analyses.
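The 1-nearest-neighbor sampling step can be sketched as follows (a simplified illustration with assumed names; the paper's procedure and theoretical guarantees are more refined):

```python
import numpy as np

def sample_x_1nn(x, z):
    """1-nearest-neighbor resampling sketch: each observation i receives
    the X value of the observation whose Z is closest to Z_i (excluding i
    itself), approximating a draw from X | Z without a parametric model."""
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # never pick yourself
    return x[d.argmin(axis=1)]

# Illustration with a smooth relationship X = f(Z) + noise
rng = np.random.default_rng(0)
n = 500
z = rng.uniform(size=(n, 1))
x = np.sin(4 * z[:, 0]) + 0.1 * rng.normal(size=n)
x_tilde = sample_x_1nn(x, z)
# When Z is dense, nearest-neighbor surrogates stay close to the
# conditional mean, so the mean squared gap is roughly twice the noise variance.
print(np.mean((x - x_tilde) ** 2))
```

The surrogates `x_tilde` can then be fed into any CRT-style comparison in place of parametric draws from X | Z.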
- Research Article
- 10.1214/22-ejs2085
- Jan 1, 2022
- Electronic Journal of Statistics
For testing conditional independence (CI) of a response Y and a predictor X given covariates Z, the model-X (MX) framework has been the subject of active methodological research, especially in the context of MX knockoffs and their application to genome-wide association studies. In this paper, we study the power of MX CI tests, yielding quantitative insights into the role of machine learning and providing evidence in favor of using likelihood-based statistics in practice. Focusing on the conditional randomization test (CRT), we find that its conditional mode of inference allows us to reformulate it as testing a point null hypothesis involving the conditional distribution of X. The Neyman-Pearson lemma implies that a likelihood-based statistic yields the most powerful CRT against a point alternative. We obtain a related optimality result for MX knockoffs. Switching to an asymptotic framework with arbitrarily growing covariate dimension, we derive an expression for the power of the CRT against local semiparametric alternatives in terms of the prediction error of the machine learning algorithm on which its test statistic is based. Finally, we exhibit a resampling-free test with uniform asymptotic Type-I error control under the assumption that only the first two moments of X given Z are known.
- Research Article
- 10.1177/0049124112437535
- Feb 1, 2012
- Sociological Methods & Research
Interference between units may pose a threat to unbiased causal inference in randomized controlled experiments. Although the assumption of no interference is often necessary for causal inference, few options are available for testing this assumption. This article presents an ex post method for detecting interference between units in randomized experiments. With a test statistic of the analyst’s choice, a conditional randomization test allows for the calculation of the exact significance level of the causal dependence of outcomes on the treatment status of other units. The robustness of the method is demonstrated through simulation studies. Moreover, using this method, interference between units is detected in a field experiment designed to assess the effect of mailings on voter turnout in a U.S. primary election.
- Research Article
- 10.1093/biomet/asab052
- Nov 2, 2021
- Biometrika
In many scientific applications, researchers aim to relate a response variable $Y$ to a set of potential explanatory variables $X = (X_1,\dots,X_p)$, and start by trying to identify variables that contribute to this relationship. In statistical terms, this goal can be understood as trying to identify those $X_j$ on which $Y$ is conditionally dependent. Sometimes it is of value to simultaneously test for each $j$, which is more commonly known as variable selection. The conditional randomization test, CRT, and model-X knockoffs are two recently proposed methods that respectively perform conditional independence testing and variable selection by computing, for each $X_j$, any test statistic on the data and assessing that test statistic’s significance, by comparing it with test statistics computed on synthetic variables generated using knowledge of the distribution of $X$. The main contribution of this article is the analysis of the power of these methods in a high-dimensional linear model, where the ratio of the dimension $p$ to the sample size $n$ converges to a positive constant. We give explicit expressions for the asymptotic power of the CRT, variable selection with CRT $p$-values, and model-X knockoffs, each with a test statistic based on the marginal covariance, the least squares coefficient or the lasso. One useful application of our analysis is direct theoretical comparison of the asymptotic powers of variable selection with CRT $p$-values and model-X knockoffs; in the instances with independent covariates that we consider, the CRT provably dominates knockoffs. We also analyse the power gain from using unlabelled data in the CRT when limited knowledge of the distribution of $X$ is available, as well as the power of the CRT when samples are collected retrospectively.
- Research Article
- 10.1080/07474940802240969
- Aug 22, 2008
- Sequential Analysis
For the generalized biased coin class of randomization procedures, Smythe (1988) proved asymptotic normality of the conditional linear rank test. Clinical trialists often undertake interim analysis to determine whether to stop the trial early for a substantial treatment effect. In this article, we set up an interim analysis using a conditional randomization test. The joint asymptotic distribution of the interim test statistic and the final test statistic is explored. We also define the concept of conditional information under a randomization model.
- Research Article
- 10.1214/11-aos941
- Feb 1, 2012
- The Annals of Statistics
Sequential monitoring in clinical trials is often employed to allow for early stopping and other interim decisions, while maintaining the type I error rate. However, sequential monitoring is typically described only in the context of a population model. We describe a computational method to implement sequential monitoring in a randomization-based context. In particular, we discuss a new technique for the computation of approximate conditional tests following restricted randomization procedures and then apply this technique to approximate the joint distribution of sequentially computed conditional randomization tests. We also describe the computation of a randomization-based analog of the information fraction. We apply these techniques to a restricted randomization procedure, Efron's [Biometrika 58 (1971) 403–417] biased coin design. These techniques require derivation of certain conditional probabilities and conditional covariances of the randomization procedure. We employ combinatoric techniques to derive these for the biased coin design.