The radius of statistical efficiency
- Research Article
- 10.9734/arjom/2025/v21i121021
- Dec 6, 2025
- Asian Research Journal of Mathematics
The paper proposes an additive optional randomized response technique model that improves upon the Gjestvang and Singh (2009) model by balancing respondents' privacy protection against statistical estimation efficiency, a key trade-off in survey design involving sensitive variables. The proposed model yields an unbiased estimator of the population mean under both simple random sampling and probability proportional to size sampling schemes. For all values of the scrambling parameters and sensitivity level, the proposed model recorded a high gain in efficiency, and its relative efficiency under both sampling schemes is greater than one. As the sensitivity level increases, the relative gain in efficiency decreases, which agrees with theoretical expectations. Nevertheless, even at a high sensitivity level of W = 0.9, the proposed model maintained acceptable efficiency and unbiasedness. The weighted privacy-efficiency measure established that the proposed model outperformed the Gjestvang and Singh (2009) model.
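To make the idea of an additive optional scrambling design concrete, here is a minimal sketch. It is not the paper's model or the Gjestvang and Singh (2009) model: it assumes a generic design in which each respondent adds Gaussian scrambling noise S with known mean with probability W and answers truthfully otherwise, and it treats W as known. The function names, the Gaussian population, and all parameter values are hypothetical choices made for illustration.

```python
import random
import statistics


def additive_scrambled_reports(true_values, sensitivity, noise_mean, noise_sd, rng):
    """Return one report per respondent.

    With probability `sensitivity` (W) the respondent reports Y + S, where S is
    Gaussian scrambling noise; otherwise the true value Y is reported.
    (Illustrative assumption, not the paper's exact design.)
    """
    reports = []
    for y in true_values:
        if rng.random() < sensitivity:
            reports.append(y + rng.gauss(noise_mean, noise_sd))
        else:
            reports.append(y)
    return reports


def estimate_mean(reports, sensitivity, noise_mean):
    """Under this design E[Z] = mu + W * E[S], so subtract W * E[S]."""
    return statistics.fmean(reports) - sensitivity * noise_mean


rng = random.Random(0)
# Hypothetical sensitive variable: 20,000 values around 50.
population = [rng.gauss(50.0, 10.0) for _ in range(20000)]
reports = additive_scrambled_reports(
    population, sensitivity=0.9, noise_mean=5.0, noise_sd=3.0, rng=rng
)

true_mean = statistics.fmean(population)
naive = statistics.fmean(reports)  # biased upward by about W * E[S]
estimate = estimate_mean(reports, sensitivity=0.9, noise_mean=5.0)
print(f"true mean {true_mean:.2f}, naive {naive:.2f}, corrected {estimate:.2f}")
```

In this sketch the uncorrected mean of the reports is shifted by roughly W times the noise mean, while the corrected estimator recovers the population mean. The extra variance contributed by the scrambling noise is the efficiency cost of the privacy protection.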
- Conference Article
12
- 10.1109/icdm.2014.23
- Dec 1, 2014
Log-linear analysis is the primary statistical approach to discovering conditional dependencies between the variables of a dataset. A good log-linear analysis method requires both high precision and statistical efficiency. High precision means that the risk of false discoveries should be kept very low. Statistical efficiency means that the method should discover actual associations with as few samples as possible. Classical approaches to log-linear analysis make use of χ² tests to control this balance between quality and complexity. We present an information-theoretic approach to log-linear analysis. We show that our approach 1) requires significantly fewer samples to discover the true associations than statistical approaches -- statistical efficiency -- 2) controls for the risk of false discoveries as well as statistical approaches -- high precision -- and 3) can perform the discovery on datasets with hundreds of variables on a standard desktop computer -- computational efficiency.
- Research Article
1
- 10.1080/10543406.2024.2373436
- Jul 24, 2024
- Journal of Biopharmaceutical Statistics
In a prospective observational study (POS) designed to assess the average causal effect of a treatment (e.g. Drug A) compared to a comparator (e.g. Drug B) in the treatment population, enrolling all patients who are assigned to the treatments of interest for follow-up has a potentially large negative impact on the statistical efficiency and bias of the analysis of the outcomes and on the cost of the study. “Up-front matching” is an innovative enrollment method for selecting patients for long-term follow-up among those who have already been assigned to treatment or comparator; it uses frequency matching and hence avoids the restrictions of individual matching that other methods have used. To achieve potential statistical and logistical efficiencies in the POS, up-front matching defines a target population based on a retrospective database, which then enables selecting populations of patients for follow-up that have desirable statistical properties. In particular, with respect to the baseline covariates used to select patients for follow-up, the resulting populations of enrolled patients look as if the population of treatment patients had been randomized to treatment or comparator. The method is illustrated in detail for a study designed to assess the effect of injectable antipsychotics versus oral antipsychotics.
- Research Article
10
- 10.1016/j.conctc.2018.01.002
- Jan 10, 2018
- Contemporary Clinical Trials Communications
Investigating the impact of design characteristics on statistical efficiency within discrete choice experiments: A systematic survey
- Research Article
11
- 10.1093/jrsssb/qkad059
- Jun 10, 2023
- Journal of the Royal Statistical Society Series B: Statistical Methodology
Distributed computing is critically important for modern statistical analysis. Herein, we develop a distributed quasi-Newton (DQN) framework with excellent statistical, computation, and communication efficiency. In the DQN method, no Hessian matrix inversion or communication is needed. This considerably reduces the computation and communication complexity of the proposed method. Notably, related existing methods only analyse numerical convergence and require a diverging number of iterations to converge. However, we investigate the statistical properties of the DQN method and theoretically demonstrate that the resulting estimator is statistically efficient over a small number of iterations under mild conditions. Extensive numerical analyses demonstrate the finite sample performance.
- Research Article
7
- 10.1002/sta4.194
- Jan 1, 2018
- Stat
Conventional likelihood‐based model inference methods, such as maximum likelihood estimation and Bayesian inference, are computationally expensive for many spatial models with large data sets. As an alternative inference tool, composite likelihood (CL) methods have gained considerable attention in recent years because of their simplicity and sound asymptotic properties. However, CL estimators often result in substantial loss in statistical efficiency with respect to maximum likelihood estimation. In this paper, we propose a new weight function to construct CL for the inference of spatial Gaussian process models. This weight function approximates the optimal weight derived from the theory of estimating equations. It combines block‐diagonal approximation and the tapering strategy to facilitate computations. Gains in statistical and computational efficiency over existing CL methods are illustrated through simulation studies. © 2018 John Wiley & Sons, Ltd.
- Research Article
2
- 10.1007/s13137-020-00163-x
- Sep 21, 2020
- GEM - International Journal on Geomathematics
Bayesian modelling and computational inference by Markov chain Monte Carlo (MCMC) is a principled framework for large-scale uncertainty quantification, though it is limited in practice by computational cost when implemented in the simplest form, which requires simulating an accurate computer model at each iteration of the MCMC. The delayed acceptance Metropolis–Hastings MCMC leverages a reduced model for the forward map to lower the compute cost per iteration, though this necessarily reduces statistical efficiency, which can, without care, lead to no reduction in the computational cost of computing estimates to a desired accuracy. Randomizing the reduced model for the forward map can dramatically improve computational efficiency by maintaining the low cost per iteration while avoiding appreciable loss of statistical efficiency. Randomized maps are constructed by a posteriori adaptive tuning of a randomized and locally-corrected deterministic reduced model. Equivalently, the approximated posterior distribution may be viewed as induced by a modified likelihood function for use with the reduced map, with parameters tuned to optimize the quality of the approximation to the correct posterior distribution. Conditions for adaptive MCMC algorithms allow practical approximations and algorithms that have guaranteed ergodicity for the target distribution. Good statistical and computational efficiencies are demonstrated in examples of calibration of large-scale numerical models of geothermal reservoirs and electrical capacitance tomography.
- Research Article
1829
- 10.1002/(sici)1097-0193(1999)8:2/3<109::aid-hbm7>3.0.co;2-w
- Jan 1, 1999
- Human Brain Mapping
An important challenge in the design and analysis of event-related or single-trial functional magnetic resonance imaging (fMRI) experiments is to optimize statistical efficiency, i.e., the accuracy with which the event-related hemodynamic response to different stimuli can be estimated for a given amount of imaging time. Several studies have suggested that using a fixed inter-stimulus-interval (ISI) of at least 15 sec results in optimal statistical efficiency or power and that using shorter ISIs results in a severe loss of power. In contrast, recent studies have demonstrated the feasibility of using ISIs as short as 500 ms while still maintaining considerable efficiency or power. Here, we attempt to resolve this apparent contradiction by a quantitative analysis of the relative efficiency afforded by different event-related experimental designs. This analysis shows that statistical efficiency falls off dramatically as the ISI gets sufficiently short, if the ISI is kept fixed for all trials. However, if the ISI is properly jittered or randomized from trial to trial, the efficiency improves monotonically with decreasing mean ISI. Importantly, the efficiency afforded by such variable ISI designs can be more than 10 times greater than that which can be achieved by fixed ISI designs. These results further demonstrate the feasibility of using identical experimental designs with fMRI and electro-/magnetoencephalography (EEG/MEG) without sacrificing statistical power or efficiency of either technique, thereby facilitating comparison and integration across imaging modalities.
- Research Article
1
- 10.1287/ijoc.2024.0554
- Jul 15, 2025
- INFORMS Journal on Computing
Federated learning is a novel framework for distributed learning, which aims to break isolated data islands, as well as protect data privacy. To further prevent privacy leakage by specially crafted attacks, differential privacy is often integrated. Although differential privacy effectively secures sensitive information, it can reduce the statistical efficiency of the resulting estimators. This leads to a trade-off relationship between statistical efficiency and privacy protection. To theoretically understand this relationship, we start with the classic linear regression model and a noise-adding federated gradient descent algorithm. Its numerical convergence properties and asymptotic properties are rigorously studied. This results in fruitful insights into the trade-off relationship between statistical efficiency and privacy protection. Guided by these theoretical understandings, we further develop a Polyak-Ruppert-type averaged estimator, which can achieve good statistical efficiency with guaranteed privacy protection. Extensive simulation studies are presented to corroborate our theoretical results. Finally, we illustrate the application of our proposed method on an enterprise community data set. History: Accepted by Ram Ramesh, Area Editor for Data Science and Machine Learning. Funding: Financial support from the National Natural Science Foundation of China [Grants 12401386, 72371241, 72495123, and 12271012], the Ministry of Education Project of Key Research Institute of Humanities and Social Sciences [Grant 22JJD910001], the Postdoctoral Fellowship Program of China Postdoctoral Science Foundation [Grant GZB20230070], and the Beijing Municipal Social Science Foundation [Grant 24GLC033] is gratefully acknowledged. 
Supplemental Material: The software that supports the findings of this study is available within the paper and its Supplemental Information ( https://pubsonline.informs.org/doi/suppl/10.1287/ijoc.2024.0554 ) as well as from the IJOC GitHub software repository ( https://github.com/INFORMSJoC/2024.0554 ). The complete IJOC Software and Data Repository is available at https://informsjoc.github.io/ .
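The trade-off described in this abstract can be illustrated with a small simulation. The sketch below is a hedged illustration, not the authors' algorithm or their released software: it assumes a linear regression split across five hypothetical clients, has each client add Gaussian noise to its least-squares gradient before the server averages them, and then forms a Polyak-Ruppert-style average of the iterates after a burn-in. The noise scale is arbitrary and is not calibrated to any differential privacy budget. All function names and parameter values are assumptions.

```python
import random


def local_gradient(beta, xs, ys):
    """Average least-squares gradient on one client's data."""
    p = len(beta)
    grad = [0.0] * p
    for x, y in zip(xs, ys):
        residual = sum(b * xj for b, xj in zip(beta, x)) - y
        for j in range(p):
            grad[j] += residual * x[j]
    return [g / len(xs) for g in grad]


def noisy_federated_gd(clients, p, step, noise_sd, iterations, burn_in, rng):
    """Gradient descent on noise-perturbed client gradients.

    Returns the last iterate and the average of the iterates after `burn_in`
    (a Polyak-Ruppert-style averaged estimator).
    """
    beta = [0.0] * p
    running_sum = [0.0] * p
    for t in range(iterations):
        avg_grad = [0.0] * p
        for xs, ys in clients:
            g = local_gradient(beta, xs, ys)
            for j in range(p):
                # Each client perturbs its gradient before sharing it.
                avg_grad[j] += (g[j] + rng.gauss(0.0, noise_sd)) / len(clients)
        beta = [b - step * g for b, g in zip(beta, avg_grad)]
        if t >= burn_in:
            running_sum = [s + b for s, b in zip(running_sum, beta)]
    averaged = [s / (iterations - burn_in) for s in running_sum]
    return beta, averaged


rng = random.Random(1)
true_beta = [2.0, -1.0]  # hypothetical ground truth
clients = []
for _ in range(5):
    xs = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    ys = [sum(b * xj for b, xj in zip(true_beta, x)) + rng.gauss(0, 1) for x in xs]
    clients.append((xs, ys))

last_iterate, averaged = noisy_federated_gd(
    clients, p=2, step=0.1, noise_sd=0.5, iterations=300, burn_in=100, rng=rng
)
print("last iterate:", [round(b, 3) for b in last_iterate])
print("averaged    :", [round(b, 3) for b in averaged])
```

The design point is that the last iterate keeps fluctuating with the injected noise, while averaging the iterates smooths much of that fluctuation out, which is the intuition behind using an averaged estimator to recover statistical efficiency.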
- Research Article
20
- 10.5555/1756006.1859933
- Mar 1, 2010
- Journal of Machine Learning Research
We consider the problem of high-dimensional variable selection: given n noisy observations of a k-sparse vector β* ∈ ℝ^p, estimate the subset of non-zero entries of β*. A significant body of work has studied behavior of ℓ1-relaxations when applied to random measurement matrices that are dense (e.g., Gaussian, Bernoulli). In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction γ of non-zero entries, and the statistical efficiency, as measured by the minimal number of observations n required for correct variable selection with probability converging to one. Our main result is to prove that it is possible to let the fraction of non-zero entries γ → 0 at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row, while retaining the same statistical efficiency as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions.
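As a rough illustration of what a sparsified measurement ensemble can look like, the sketch below draws a matrix in which each entry is non-zero with probability γ. The specific construction is an assumption made here, not taken from the abstract: non-zero entries are Gaussian with variance 1/γ so that every entry keeps unit variance regardless of γ. The function name and dimensions are hypothetical.

```python
import random


def sparsified_gaussian_matrix(n_rows, n_cols, gamma, rng):
    """Draw a matrix whose entries are zero with probability 1 - gamma.

    Non-zero entries are N(0, 1/gamma), an assumed scaling that keeps the
    variance of each entry equal to one.
    """
    scale = (1.0 / gamma) ** 0.5
    return [
        [rng.gauss(0.0, scale) if rng.random() < gamma else 0.0 for _ in range(n_cols)]
        for _ in range(n_rows)
    ]


rng = random.Random(2)
gamma = 0.1
n, p = 200, 500
X = sparsified_gaussian_matrix(n, p, gamma, rng)

nonzero_fraction = sum(v != 0.0 for row in X for v in row) / (n * p)
entry_variance = sum(v * v for row in X for v in row) / (n * p)
print(f"non-zero fraction {nonzero_fraction:.3f}, entry variance {entry_variance:.3f}")
```

With γ = 0.1, about 90% of the entries are zero, so storing and multiplying by X is cheaper, while the second moment of each entry matches that of a dense standard Gaussian matrix.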
- Research Article
38
- 10.1109/tpami.2020.3044997
- Dec 15, 2020
- IEEE Transactions on Pattern Analysis and Machine Intelligence
In this paper, we propose a general framework termed centroid estimation with guaranteed efficiency (CEGE) for weakly supervised learning (WSL) with incomplete, inexact, and inaccurate supervision. The core of our framework is to devise an unbiased and statistically efficient risk estimator that is applicable to various forms of weak supervision. Specifically, by decomposing the loss function (e.g., the squared loss and hinge loss) into a label-independent term and a label-dependent term, we discover that only the latter is influenced by the weak supervision and is related to the centroid of the entire dataset. Therefore, by constructing two auxiliary pseudo-labeled datasets with synthesized labels, we derive an unbiased estimate of the centroid from each of the two auxiliary datasets. These two estimates are further linearly combined with a properly chosen coefficient, which makes the final combined estimate not only unbiased but also statistically efficient. This is better than some existing methods that only care about the unbiasedness of estimation but ignore the statistical efficiency. The good statistical efficiency of the derived estimator is guaranteed, as we theoretically prove that it attains the minimum variance when estimating the centroid. Extensive experiments on a large number of benchmark datasets demonstrate that our CEGE generally obtains better performance than the existing approaches related to typical WSL problems including semi-supervised learning, positive-unlabeled learning, multiple instance learning, and label noise learning.
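The step of linearly combining two unbiased estimates can be illustrated with the standard scalar minimum-variance calculation. The symbols below are assumptions introduced here, not the paper's notation: two unbiased estimators of the same quantity with variances σ₁² and σ₂² and covariance σ₁₂. The paper's own coefficient, for vector-valued centroids, may differ in form.

```latex
% Two unbiased estimators \hat{c}_1, \hat{c}_2 of the same quantity c (assumed notation).
\hat{c}_\lambda = \lambda\,\hat{c}_1 + (1-\lambda)\,\hat{c}_2,
\qquad \mathbb{E}[\hat{c}_\lambda] = c \quad \text{for every } \lambda .

% Variance of the combination:
\operatorname{Var}(\hat{c}_\lambda)
  = \lambda^2 \sigma_1^2 + (1-\lambda)^2 \sigma_2^2 + 2\lambda(1-\lambda)\,\sigma_{12} .

% Setting the derivative with respect to \lambda to zero:
2\lambda\sigma_1^2 - 2(1-\lambda)\sigma_2^2 + 2(1-2\lambda)\,\sigma_{12} = 0
\;\Longrightarrow\;
\lambda^\star = \frac{\sigma_2^2 - \sigma_{12}}{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}} .
```

Any choice of λ keeps the combination unbiased, so the coefficient is free to be used purely for variance reduction. With uncorrelated estimators (σ₁₂ = 0) the weight reduces to inverse-variance weighting.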
- Conference Article
3
- 10.1109/isit.2008.4595379
- Jul 1, 2008
We consider the problem of estimating the support of a vector β* ∈ ℝ^p based on observations contaminated by noise. A significant body of work has studied behavior of ℓ1-relaxations when applied to measurement matrices drawn from standard dense ensembles (e.g., Gaussian, Bernoulli). In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction γ of non-zero entries, and the statistical efficiency, as measured by the minimal number of observations n required for exact support recovery with probability converging to one. Our main result is to prove that it is possible to let γ → 0 at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row while retaining the same statistical efficiency (sample size n) as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions.
- Research Article
76
- 10.1016/j.neuroimage.2021.118786
- Dec 11, 2021
- NeuroImage
Hyperbolic trade-off: The importance of balancing trial and subject sample sizes in neuroimaging
- Research Article
13
- 10.1016/j.jneumeth.2013.02.017
- Mar 7, 2013
- Journal of Neuroscience Methods
In vivo evaluation of the effect of stimulus distribution on FIR statistical efficiency in event-related fMRI
- Conference Article
41
- 10.1109/ipdps.2016.107
- May 1, 2016
Stochastic Gradient Descent (SGD) is a popular optimization method used to train a variety of machine learning models. Most work on SGD to date has concentrated on improving its statistical efficiency, in terms of rate of convergence to the optimal solution. At the same time, as parallelism of modern CPUs continues to increase through progressively higher core counts, it is imperative to understand the parallel hardware efficiency of SGD, which is often at odds with its statistical efficiency. In this paper, we explore several modern parallelization methods of SGD on a shared memory system, in the context of sparse and convex optimization problems. Specifically, we develop optimized parallel implementations of several SGD algorithms, and show that their parallel efficiency is severely limited by inter-core communication. We propose a new, scalable, communication-avoiding, many-core friendly implementation of SGD, called HogBatch, which exposes parallelism on several levels, minimizes the impact on statistical efficiency, and, as a result, significantly outperforms the other methods. On a variety of datasets, HogBatch demonstrates near linear scalability on a system with 14 cores, and delivers up to a 20X speedup over previous methods.