- Research Article
- 10.1111/sjos.70023
- Oct 2, 2025
- Scandinavian Journal of Statistics
- Kai Z Teh + 2 more
Abstract Most constraint‐based causal learning algorithms provably return the correct causal graph under certain correctness conditions, such as faithfulness. By representing any constraint‐based causal learning algorithm using the notion of a property, we provide a general framework to obtain and study correctness conditions for these algorithms. From the framework, we provide exact correctness conditions for the PC algorithm, which are then related to the correctness conditions of some other existing causal discovery algorithms. The framework also suggests a paradigm for designing causal learning algorithms, in which the correctness conditions of an algorithm can be fixed before the algorithm itself is designed, with the following implications. We show that the sparsest Markov representation condition is the weakest correctness condition for algorithms that output ancestral graphs or directed acyclic graphs satisfying any existing notions of minimality. We also reason that Pearl‐minimality is necessary for meaningful causal learning but not sufficient to relax the faithfulness condition and, as such, has to be strengthened, such as by including background knowledge, for causal learning beyond faithfulness.
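The skeleton phase of the PC algorithm discussed in this abstract can be sketched with a conditional-independence oracle standing in for statistical tests (a minimal illustration under assumed names; the orientation rules, and the correctness conditions the paper studies, are beyond this sketch):

```python
from itertools import combinations

def pc_skeleton(nodes, ci_oracle):
    """Skeleton phase of the PC algorithm: start from the complete
    undirected graph and delete an edge x-y whenever some conditioning
    set S, drawn from the current neighbours of x, makes x and y
    conditionally independent according to the oracle."""
    adj = {v: set(nodes) - {v} for v in nodes}
    size = 0                             # conditioning-set size
    while any(len(adj[x] - {y}) >= size for x in nodes for y in adj[x]):
        for x in nodes:
            for y in list(adj[x]):
                for S in combinations(sorted(adj[x] - {y}), size):
                    if ci_oracle(x, y, set(S)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        break
        size += 1
    return adj
```

For a chain A → B → C, the only independence is A ⟂ C given B, and the recovered skeleton is A–B–C.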
- Research Article
- 10.1111/sjos.70024
- Oct 1, 2025
- Scandinavian Journal of Statistics
- Bo Hu + 2 more
Abstract Definitive screening designs (DSDs) are a novel class of three‐level designs that yield efficient estimates of main effects without aliasing with second‐order effects. While orthogonal and pairwise blocking schemes have been proposed for DSDs, their theoretical properties remain partially unexplored, resulting in computational challenges in searching for optimal blocked DSDs. In this paper, we obtain several theoretical insights into optimal blocked DSDs under the linear‐plus‐quadratic effects model. Our results not only demonstrate the optimality of some existing blocked DSDs but also substantially alleviate the complexities involved in identifying optimal orthogonal blocked DSDs.
- Research Article
- 10.1111/sjos.70021
- Sep 25, 2025
- Scandinavian Journal of Statistics
- Joni Virta + 2 more
Abstract A general approach for extending estimators to matrix‐ and tensor‐valued data is proposed. The extension is based on using random projections to project out dimensions of a tensor and then computing a multivariate estimator for each projection. The mean of the obtained set of estimates is used as the final, joint estimate. In some basic cases, the resulting estimator can be given a closed form, and particular ones are shown to coincide with existing methodology. We derive sufficient conditions for the consistency and limiting normality of the resulting estimators under weak assumptions. In particular, limiting normality is retained as soon as the number of projections grows super‐linearly in the sample size, and consistency is achieved regardless of the growth rate. Comparisons with competing methods show that the extensions prove useful in extracting components for classification and yield an efficient estimator for sufficient dimension reduction.
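The projection-averaging recipe this abstract describes can be illustrated in a few lines (a sketch under simplifying assumptions: matrix-valued data, Gaussian projection directions, and a covariance estimator as the multivariate base method; this is not the authors' exact construction):

```python
import numpy as np

def projection_average(X, base_estimator, n_proj=100, seed=0):
    """X has shape (n, p, q). For each random unit vector u in R^q,
    reduce every matrix observation to the p-vector X_i @ u, apply the
    multivariate base_estimator to the projected sample, and return
    the average of the per-projection estimates."""
    rng = np.random.default_rng(seed)
    n, p, q = X.shape
    estimates = []
    for _ in range(n_proj):
        u = rng.standard_normal(q)
        u /= np.linalg.norm(u)
        estimates.append(base_estimator(X @ u))   # projected (n, p) sample
    return np.mean(estimates, axis=0)
```

For matrix-normal data with identity column covariance, each projected sample has the row covariance as its true covariance, so averaging projected covariance estimates recovers it.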
- Research Article
- 10.1111/sjos.70022
- Sep 24, 2025
- Scandinavian Journal of Statistics
- Miaomiao Su + 1 more
Abstract Subsampling is an effective approach to alleviate the computational burden associated with large‐scale datasets. Nevertheless, existing subsampling estimators incur a substantial loss in estimation efficiency compared to estimators based on the full dataset. Specifically, the convergence rate of existing subsampling estimators is typically of order n^{-1/2} rather than N^{-1/2}, where n and N denote the subsample and full data sizes, respectively. This paper proposes a subsampled one‐step (SOS) method to mitigate the estimation efficiency loss through a one‐step update based on the asymptotic expansions of the subsampling and full‐data estimators. The resulting SOS estimator is computationally efficient and converges faster than the n^{-1/2} rate of existing subsampling estimators. We establish the asymptotic distribution of the SOS estimator, which can be non‐normal in general, and construct confidence intervals on top of the asymptotic distribution. Furthermore, we prove that the SOS estimator is asymptotically normal and equivalent to the full data‐based estimator under a suitable condition on the relative sizes of the subsample and the full data. Simulation studies and real data analyses were conducted to demonstrate the finite sample performance of the SOS estimator. Numerical results suggest that the SOS estimator is almost as computationally efficient as the uniform subsampling estimator while achieving estimation efficiency similar to the full data‐based estimator.
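The one-step idea behind the SOS estimator can be sketched for logistic regression (a simplified illustration of the generic recipe, not the paper's exact construction or its inference procedure): fit on a uniform subsample, then take a single Newton step on the full data, which costs only one pass rather than a full iterative fit.

```python
import numpy as np

def logit_grad_hess(theta, X, y):
    """Average gradient and Hessian of the logistic log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    grad = X.T @ (y - p) / len(y)
    hess = -(X * (p * (1 - p))[:, None]).T @ X / len(y)
    return grad, hess

def sos_estimate(X, y, subsample_size, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: iterate Newton to convergence on a uniform subsample.
    idx = rng.choice(len(y), subsample_size, replace=False)
    theta = np.zeros(X.shape[1])
    for _ in range(25):
        g, H = logit_grad_hess(theta, X[idx], y[idx])
        theta -= np.linalg.solve(H, g)
    # Step 2: a single Newton step on the FULL data -- one pass over
    # the data, far cheaper than fitting the full model iteratively.
    g, H = logit_grad_hess(theta, X, y)
    return theta - np.linalg.solve(H, g)
```

Starting from a root-n-consistent subsample fit, one full-data Newton step inherits (asymptotically) the accuracy of the full-data estimator.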
- Research Article
- 10.1111/sjos.70020
- Sep 20, 2025
- Scandinavian Journal of Statistics
- Emily Berg + 2 more
Abstract Non‐probability samples are prevalent in various fields, such as biomedical studies, educational research, and business investigations, owing to the escalating challenges associated with declining response rates and the cost‐effectiveness and convenience of utilizing such samples. However, relying on naive estimates derived from non‐probability samples, without adequate adjustments, may introduce bias into study outcomes. Addressing this concern, data integration methodologies, which amalgamate information from both probability and non‐probability samples, have demonstrated effectiveness in mitigating selection bias. Nonetheless, the efficacy of these methods hinges upon the assumptions underlying the models. This paper introduces innovative and robust data integration approaches, notably a semi‐parametric quantile regression‐based mass‐imputation (Mass‐Imp.) approach and a doubly‐robust approach that integrates a nonparametric estimator of the participation probability for non‐probability samples. Our proposed methodologies exhibit greater robustness compared to existing parametric approaches, particularly concerning model misspecification and outliers. Theoretical results are established, including variance estimators for our proposed estimators. Through comprehensive simulation studies and real‐world applications, our findings demonstrate the promising performance of the proposed estimators in facilitating valid statistical inference. This research contributes to the advancement of robust methodologies for handling non‐probability samples, thereby enhancing the reliability and validity of research outcomes across diverse domains.
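The mass-imputation idea this abstract builds on can be sketched with ordinary least squares standing in for the paper's quantile-regression model (all names and the selection mechanism below are illustrative): fit an outcome model on the non-probability sample, which observes the outcome, then average model predictions over the probability sample with its design weights.

```python
import numpy as np

def mass_imputation_mean(XB, yB, XA, dA):
    """Fit y ~ x on the non-probability sample B (which observes y),
    predict y for the probability sample A, and average the predictions
    with A's design weights dA."""
    Xb = np.column_stack([np.ones(len(XB)), XB])
    beta = np.linalg.lstsq(Xb, yB, rcond=None)[0]
    Xa = np.column_stack([np.ones(len(XA)), XA])
    return np.sum(dA * (Xa @ beta)) / np.sum(dA)
```

When selection into the non-probability sample depends only on x, the fitted model is still consistent, so the imputed weighted mean removes the selection bias that the naive sample mean suffers from.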
- Research Article
- 10.1111/sjos.70018
- Sep 16, 2025
- Scandinavian Journal of Statistics
- Anna Bonnet + 3 more
Abstract Classic estimation methods for Hawkes processes rely on the assumption that observed event times are indeed a realization of a Hawkes process, without considering any perturbation of the model. In practice, observations are often altered by some noise, and so we consider, in this work, the observations to be the indistinguishable union of event times coming from a Hawkes process and from an independent Poisson process. Since standard inference methods are either unworkable or numerically prohibitive, we propose an estimation procedure based on the spectral analysis of second‐order properties of the process. Novel results include sufficient conditions for identifiability of the model. Although we mainly focus on the exponential scenario, other types of kernels are also investigated. We propose a new estimator based on maximizing the spectral log‐likelihood that, besides being free from knowing the source of each observed time, is shown to perform accurately in estimating both processes.
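The second-order quantities the spectral approach relies on have closed forms in the exponential case. A sketch of the Bartlett spectral density of a stationary Hawkes process with kernel h(t) = α·exp(−βt) (α < β), superposed with an independent Poisson process of rate λ₀ (standard formulas stated here under our own notation, not necessarily the paper's):

```python
import numpy as np

def noisy_hawkes_spectrum(omega, mu, alpha, beta, lam0):
    """Bartlett spectral density of (Hawkes + independent Poisson).
    mu: baseline rate; alpha, beta: exponential-kernel parameters
    (stationarity requires alpha < beta); lam0: Poisson noise rate.
    Spectra of independent point processes add."""
    m = mu / (1 - alpha / beta)                  # mean Hawkes intensity
    f_hawkes = (m / (2 * np.pi)) * (beta**2 + omega**2) \
               / ((beta - alpha)**2 + omega**2)
    return lam0 / (2 * np.pi) + f_hawkes         # flat Poisson component
```

At high frequencies the density flattens to (m + λ₀)/(2π), the spectrum of a Poisson process with the superposition's total rate; the low-frequency excess carries the information about α and β, and matching such a density to an empirical periodogram via the spectral log-likelihood is the estimation route the abstract describes.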
- Research Article
- 10.1111/sjos.70019
- Sep 16, 2025
- Scandinavian Journal of Statistics
- Kai Xu
Abstract The topic of maximum mean discrepancy has been prominent in statistical analyses of multivariate two‐sample goodness of fit. Despite its usefulness, one major bottleneck is the testing process: Because the null distribution of maximum mean discrepancy depends on the underlying distribution, it typically requires a permutation test to estimate the null and compute the p‐value, which is very costly for a large amount of data. To overcome the difficulty, in this article, we propose combined probability tests based on the equality of the means of a characteristic kernel, which arises in maximum mean discrepancy analysis of multivariate data. The combined probability tests are shown to be asymptotically distribution‐free and therefore have well‐known critical values. We further show that the proposed tests are consistent against all fixed alternatives under the existence of the second moment of the characteristic kernel. A local power analysis provides strong support for the new approach by establishing the nontrivial power of our tests over square root‐n neighborhoods. We illustrate the advantages of the proposed method via simulation studies and a gene expression dataset analysis.
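The permutation bottleneck the abstract describes is easy to see in code (a standard quadratic-time MMD² with a Gaussian kernel; the bandwidth and permutation count are illustrative, and this is the costly baseline that the proposed distribution-free tests avoid):

```python
import numpy as np

def mmd2(X, Y, bandwidth=1.0):
    """Biased quadratic-time estimate of MMD^2 with a Gaussian kernel."""
    Z = np.vstack([X, Y])
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * bandwidth ** 2))
    n = len(X)
    return K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean()

def permutation_pvalue(X, Y, n_perm=200, rng=None):
    """The costly step: re-split the pooled sample n_perm times and
    recompute the O((n+m)^2) statistic for each split."""
    if rng is None:
        rng = np.random.default_rng(0)
    obs = mmd2(X, Y)
    Z = np.vstack([X, Y])
    n = len(X)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(Z))
        count += mmd2(Z[perm[:n]], Z[perm[n:]]) >= obs
    return (1 + count) / (1 + n_perm)
```

Every call pays hundreds of quadratic-time statistic evaluations; a test with known asymptotic critical values needs only one.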
- Research Article
- 10.1111/sjos.70017
- Sep 12, 2025
- Scandinavian Journal of Statistics
- Ariane Marandon + 3 more
Abstract The clustering task consists in partitioning elements of a sample into homogeneous groups. Most datasets contain individuals that are ambiguous and intrinsically difficult to attribute to one or another cluster. However, in practical applications, misclassifying individuals is potentially disastrous and should be avoided. To keep the misclassification rate small, one can decide to classify only a part of the sample. In the supervised setting, this approach is well known and referred to as classification with an abstention option. In this paper, the approach is revisited in an unsupervised mixture‐model framework. The purpose is to develop a method that guarantees the false selection rate (FSR) does not exceed a predefined nominal level α. We propose a plug‐in procedure and provide a theoretical analysis, quantifying the deviation of the FSR from the target with explicit remainder terms. Bootstrap versions of the procedure are shown to improve the performance in numerical experiments.
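The plug-in idea can be sketched in the easy case of a known mixture (a toy univariate version with hypothetical parameters; the paper's procedure estimates the mixture and comes with theoretical guarantees and bootstrap refinements): rank points by the posterior probability of their MAP label and keep the largest prefix whose estimated misclassification rate stays below the target level.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def select_with_fsr_control(x, mus, sigmas, weights, alpha=0.05):
    """Plug-in sketch: posterior component probabilities under a known
    univariate Gaussian mixture; rank points by the posterior of their
    MAP label and keep the largest prefix whose running mean of
    (1 - max posterior) -- an estimate of the FSR -- stays below alpha."""
    dens = np.stack([w * gaussian_pdf(x, m, s)
                     for w, m, s in zip(weights, mus, sigmas)])
    post = dens / dens.sum(axis=0)
    conf = post.max(axis=0)                      # MAP-label probability
    order = np.argsort(-conf)                    # most confident first
    risk = np.cumsum(1 - conf[order]) / np.arange(1, len(x) + 1)
    k = int(np.sum(risk <= alpha))               # risk is non-decreasing
    selected = np.zeros(len(x), dtype=bool)
    selected[order[:k]] = True
    return selected, post.argmax(axis=0)
```

Ambiguous points near the decision boundary have posterior confidence close to 1/2, so they are the first to be abstained on.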
- Research Article
- 10.1111/sjos.70015
- Sep 4, 2025
- Scandinavian Journal of Statistics
- Adrien Corenflos + 2 more
Abstract Monte Carlo methods—such as Markov chain Monte Carlo (MCMC) and piecewise deterministic Markov process (PDMP) samplers—provide asymptotically exact estimators of expectations under a target distribution. There is growing interest in alternatives to this asymptotic regime, in particular in constructing estimators that are exact in the limit of an infinite number of computing processors, rather than in the limit of an infinite number of Markov iterations. Notably, coupled MCMC estimators remove the non‐asymptotic bias, resulting in MCMC estimators that can be embarrassingly parallelized. In this work, we extend these estimators to the continuous‐time context and derive couplings for the bouncy, the boomerang, and the coordinate samplers. Some preliminary empirical results are included that demonstrate the reasonable scaling of our method with the dimension of the target.
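Couplings of the kind the authors derive are assembled from standard ingredients. The most basic one, a maximal coupling of two distributions p and q, can be sketched as follows (a generic building block, not the paper's PDMP couplings):

```python
import numpy as np

def maximal_coupling(rng, p_sample, p_logpdf, q_sample, q_logpdf):
    """Draw (X, Y) with X ~ p and Y ~ q such that P(X = Y) equals
    1 - TV(p, q), the largest meeting probability any coupling allows."""
    X = p_sample(rng)
    # Accept X as a common draw with probability min(1, q(X)/p(X)).
    if np.log(rng.uniform()) + p_logpdf(X) <= q_logpdf(X):
        return X, X                              # the two draws meet
    while True:
        # Otherwise sample Y from the part of q not shared with p.
        Y = q_sample(rng)
        if np.log(rng.uniform()) + q_logpdf(Y) > p_logpdf(Y):
            return X, Y
```

Once two chains driven by such a coupling meet, they stay together, which is what makes the bias-removal (and hence embarrassing parallelization) of coupled estimators possible.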
- Research Article
- 10.1111/sjos.70016
- Sep 2, 2025
- Scandinavian Journal of Statistics
- Rémi Boutin + 2 more
Abstract Numerical interactions leading to users sharing textual content published by others are naturally represented by a network where the individuals are associated with the nodes and the exchanged texts with the edges. To understand those heterogeneous and complex data structures, clustering nodes into homogeneous groups as well as rendering a comprehensible visualization of the data is mandatory. To address both issues, we introduce Deep‐LPTM, a model‐based clustering strategy relying on a variational graph auto‐encoder approach and a probabilistic model to characterize the discussion topics. Deep‐LPTM makes it possible to build a joint representation of the nodes and the edges in two embedding spaces. The parameters are inferred using a variational inference algorithm. We also introduce IC2L, a model selection criterion specifically designed to choose models with relevant clustering and visualization properties. An extensive benchmark study on synthetic data is provided. In particular, we find that Deep‐LPTM better recovers the partitions of the nodes than the state‐of‐the‐art ETSBM and STBM. Finally, the Enron email corpus is analyzed and visualizations of the results are presented, with meaningful highlights of the graph structure.