Articles published on Biased Coefficient Estimates (last 50 years)
- Research Article
- 10.54103/2282-0930/29431
- Sep 8, 2025
- Epidemiology, Biostatistics, and Public Health
- Samuele Minari + 3 more
INTRODUCTION: Variable selection is a common step in clinical research, where large datasets often include many, potentially highly correlated, variables. The main objective is to identify the most relevant predictors for an outcome, thereby enhancing model interpretability, simplicity, and predictive performance [1]. However, data-driven variable selection also carries several underappreciated risks. These include the potential exclusion of important predictors, inclusion of irrelevant ones, biased coefficient estimates, underestimated standard errors, invalid confidence intervals, and overall model instability [2]. Simulation studies are a valuable approach for evaluating statistical methods, provided they are carefully designed. Yet many such studies exhibit bias in favor of the newly proposed methods [3]. To address this, we developed a neutral comparison simulation study to fairly evaluate the performance of several variable selection techniques. OBJECTIVE: To systematically evaluate and compare different variable selection methods across multiple simulated scenarios. METHODS: To improve the design and reporting of our simulation study, we followed the ADEMP structure [4], which involves specifying the aim (A), the data-generating process (D), the estimand or target of inference (E), the analytical methods (M), and the criteria used to evaluate performance (P). We designed different simulation scenarios by varying the number of observations, the total number of variables, and the number of true predictors. Predictor correlations were modeled to decay exponentially with increasing distance between variables, and effect sizes for true predictors were varied [5, 6]. Noise was introduced into the correlation structures to better mimic real-world data. We focused on a binary classification setting, evaluating each method on two key outcomes: model selection accuracy (i.e., whether the true model is selected) and predictive performance. Five variable selection methods were compared: stepwise logistic regression, LASSO logistic regression, Elastic Net logistic regression, a Random Forest classifier with OOB-error-based backward elimination [7], and a Genetic Algorithm (GA) [8, 9]. Performance metrics included the Area Under the Curve (AUC), the number of variables selected, and the True Positive Rate (TPR). All analyses were performed using Python 3.12. RESULTS: We ran 1,000 Monte Carlo simulations per scenario, varying key factors such as sample size, number of predictors, true signal strength, and correlation strength. Elastic Net consistently achieved the highest mean AUC and TPR, particularly in high-dimensional or strong-signal settings (e.g., Scenarios 5–8), showing robust performance across conditions. Random Forest and Genetic Algorithm performed comparably in some scenarios but incurred substantially higher computational costs. LASSO achieved competitive AUC with significantly lower runtime, though it tended to underselect in weaker signal scenarios. Stepwise selection, while the fastest method, had the lowest overall predictive performance and true positive rates (Table 1). CONCLUSION: Among the five evaluated methods, Elastic Net provided the best trade-off between predictive performance and model stability, particularly in realistic, high-dimensional settings. Our results reinforce the importance of carefully considering the variable selection method in the context of the data structure and research goals.
This neutral comparison contributes to evidence-based guidance for method selection in clinical research and similar applied settings.
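To make the comparison above concrete, here is a minimal sketch of the general workflow (illustrative only, not the authors' Python 3.12 code; the data-generating values, penalties, and metrics are assumptions): simulate correlated predictors with a small set of true signals, fit penalized logistic regressions, and score AUC, TPR, and the number of selected variables.

```python
# Illustrative sketch (not the authors' code): compare LASSO and Elastic Net
# variable selection on simulated correlated predictors with a binary outcome.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p, k_true = 500, 50, 5                      # observations, predictors, true signals

# correlation that decays exponentially with distance between variables
rho = 0.5
cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

beta = np.zeros(p)
beta[:k_true] = 1.0                            # true predictors
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta))))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "lasso": LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, max_iter=5000),
    "elastic_net": LogisticRegressionCV(penalty="elasticnet", solver="saga",
                                        l1_ratios=[0.5], Cs=10, max_iter=5000),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    selected = np.flatnonzero(np.abs(m.coef_.ravel()) > 1e-8)
    tpr = np.isin(np.arange(k_true), selected).mean()      # true positive rate
    auc = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC={auc:.3f}, TPR={tpr:.2f}, #selected={selected.size}")
```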
- Research Article
- 10.1037/met0000792
- Aug 7, 2025
- Psychological methods
- Timothy R Konold + 1 more
The manifest aggregation of scores from both persons and items in multilevel modeling has been previously shown to result in biased estimates of predictor-outcome regression coefficients when used in the context of formative variables, that is, variables in which the level two (L2) aggregate is formed from level one scores but where individuals are still expected to vary within clusters. Solutions to this problem have been offered in the form of partially and fully latent variable modeling specifications. The current study revisits this issue in the context of reflective variables, situations in which level one scores are obtained for the sole purpose of measuring an L2 variable. Under design specifications consistent with an L2 reflective variable, the current study uses population formula-based computations as well as Monte Carlo simulations to show conditions under which researchers may use a manifest aggregation of scores from persons and items for evaluating L2 relationships, thereby overcoming model convergence and identification challenges related to using latent variable modeling. We also highlight instances in which latent aggregations should be preferred.
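A small illustration of the manifest-aggregation setup for a reflective L2 construct, under assumed simulation settings (cluster counts, variance components, and the true slope below are not from the study): person-by-item scores are averaged within each cluster and the L2 outcome is regressed on that aggregate; with enough persons and items the aggregate's slope approaches the true L2 coefficient.

```python
# Illustrative sketch (assumed setup, not the authors' design): manifest aggregation
# of person-by-item scores for an L2 reflective construct, and the resulting slope.
import numpy as np

rng = np.random.default_rng(1)
J, n_persons, n_items = 200, 25, 8
b_true = 0.5

eta = rng.normal(size=J)                          # true L2 construct
y = b_true * eta + rng.normal(scale=0.5, size=J)  # L2 outcome

# person-by-item scores: construct + person noise + item noise + residual
scores = (eta[:, None, None]
          + rng.normal(scale=1.0, size=(J, n_persons, 1))   # person deviations
          + rng.normal(scale=0.3, size=(J, 1, n_items))     # item deviations
          + rng.normal(scale=0.5, size=(J, n_persons, n_items)))

manifest = scores.mean(axis=(1, 2))               # manifest L2 aggregate
slope = np.polyfit(manifest, y, 1)[0]
print(f"true slope {b_true:.2f}, manifest-aggregate slope {slope:.2f}")
# With many persons and items per cluster the aggregate is nearly error-free,
# so the manifest slope is only mildly attenuated relative to the true coefficient.
```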
- Research Article
- 10.1080/00273171.2024.2307034
- Jan 18, 2024
- Multivariate Behavioral Research
- Mark H C Lai + 2 more
With clustered data, such as where students are nested within schools or employees are nested within organizations, it is often of interest to estimate and compare associations among variables separately for each level. While researchers routinely estimate between-cluster effects using the sample cluster means of a predictor, previous research has shown that such practice leads to biased estimates of coefficients at the between level, and recent research has recommended the use of latent cluster means with the multilevel structural equation modeling framework. However, the latent cluster mean approach may not always be the best choice as it (a) relies on the assumption that the population cluster sizes are close to infinite, (b) requires a relatively large number of clusters, and (c) is currently only implemented in specialized software such as Mplus. In this paper, we show how using empirical Bayes estimates of the cluster means can also lead to consistent estimates of between-level coefficients, and illustrate how the empirical Bayes estimate can incorporate finite population corrections when information on population cluster sizes is available. Through a series of Monte Carlo simulation studies, we show that the empirical Bayes cluster-mean (EBM) approach performs similarly to the latent cluster mean approach for estimating the between-cluster coefficients in most conditions when the infinite-population assumption holds, and applying the finite population correction provides reasonable point and interval estimates when the population is finite. The performance of EBM can be further improved with restricted maximum likelihood estimation and likelihood-based confidence intervals. We also provide an R function that implements the empirical Bayes cluster-mean approach, and illustrate it using data from the classic High School and Beyond Study.
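The shrinkage idea behind empirical Bayes cluster means can be sketched as follows. This is not the authors' R function; it is a rough Python illustration with crude ANOVA-style variance-component estimates, where each observed cluster mean is pulled toward the grand mean by its reliability λ_j = τ²/(τ² + σ²/n_j).

```python
# Sketch of the empirical Bayes (shrunken) cluster mean idea; variance components
# are estimated crudely here, not with the authors' R implementation.
import numpy as np

def eb_cluster_means(x, cluster):
    """Shrink observed cluster means of x toward the grand mean."""
    _, idx = np.unique(cluster, return_inverse=True)
    n_j = np.bincount(idx)
    xbar_j = np.bincount(idx, weights=x) / n_j
    grand = x.mean()
    sigma2 = np.mean((x - xbar_j[idx]) ** 2)                      # within-cluster variance
    tau2 = max(np.var(xbar_j) - sigma2 * np.mean(1 / n_j), 1e-8)  # between-cluster variance
    lam = tau2 / (tau2 + sigma2 / n_j)                            # reliability of each mean
    return grand + lam * (xbar_j - grand)

# Example: EB means could then replace raw cluster means in a between-level regression
rng = np.random.default_rng(2)
cluster = np.repeat(np.arange(100), 10)
x = rng.normal(size=cluster.size) + rng.normal(size=100)[cluster]
print(eb_cluster_means(x, cluster)[:5])
```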
- Research Article
- 10.34133/hds.0196
- Jan 1, 2024
- Health data science
- Siqi Li + 13 more
Background: Federated learning (FL) holds promise for safeguarding data privacy in healthcare collaborations. While the term "FL" was originally coined by the engineering community, the statistical field has also developed privacy-preserving algorithms, though these are less recognized. Our goal was to bridge this gap with the first comprehensive comparison of FL frameworks from both domains. Methods: We assessed 7 FL frameworks, encompassing both engineering-based and statistical FL algorithms, and compared them against local and centralized modeling of logistic regression and least absolute shrinkage and selection operator (Lasso). Our evaluation utilized both simulated data and real-world emergency department data, focusing on comparing both estimated model coefficients and the performance of model predictions. Results: The findings reveal that statistical FL algorithms produce much less biased estimates of model coefficients. Conversely, engineering-based methods can yield models with slightly better prediction performance, occasionally outperforming both centralized and statistical FL models. Conclusion: This study underscores the relative strengths and weaknesses of both types of methods, providing recommendations for their selection based on distinct study characteristics. Furthermore, we emphasize the critical need to raise awareness of and integrate these methods into future applications of FL within the healthcare domain.
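As a toy contrast between the two families of methods discussed above, the sketch below compares a one-shot, sample-size-weighted average of locally fitted logistic regression coefficients with a centralized fit on the pooled data. It is not any specific framework evaluated in the paper, and the simulated "sites" are an assumption.

```python
# Toy sketch (not a specific framework from the paper): one-shot, sample-size-weighted
# averaging of locally fitted logistic regression coefficients vs. a centralized fit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
sites = np.array_split(np.arange(X.shape[0]), 3)             # three hypothetical "hospitals"

local_coefs, weights = [], []
for idx in sites:
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    local_coefs.append(m.coef_.ravel())
    weights.append(len(idx))

fed_avg = np.average(local_coefs, axis=0, weights=weights)    # simple coefficient averaging
central = LogisticRegression(max_iter=1000).fit(X, y).coef_.ravel()
print(np.round(fed_avg - central, 3))  # discrepancy between averaged and pooled estimates
```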
- Research Article
- 10.1016/j.ins.2023.119893
- Nov 13, 2023
- Information Sciences
- Georgios Charizanos + 2 more
This article proposes a new fuzzy logistic regression framework with high classification performance against imbalance and separation while keeping the interpretability of classical logistic regression. Separation and imbalance are two core problems in logistic regression, which can result in biased coefficient estimates and inaccurate predictions. Existing research on fuzzy logistic regression primarily focuses on developing possibilistic models instead of using a logit link function that converts log-odds ratios to probabilities. At the same time, little consideration is given to issues of separation and imbalance. Our study aims to address these challenges by proposing new methods of fuzzifying binary variables and classifying subjects based on a comparison against a fuzzy threshold. We use combinations of fuzzy and crisp predictors, output, and coefficients to understand which combinations perform better under imbalance and separation. Numerical experiments with synthetic and real datasets are conducted to demonstrate the usefulness and superiority of the proposed framework. Seven crisp machine learning models are implemented for benchmarking in the numerical experiments. The proposed framework shows consistently strong performance results across datasets with imbalance or separation and performs equally well when such issues are absent. Meanwhile, the considered machine learning methods are significantly impacted by the imbalanced datasets.
- Research Article
- 10.1016/j.tej.2023.107287
- Jun 1, 2023
- The Electricity Journal
- Xiaodong Du + 2 more
Wholesale price dynamics in the evolving Texas power grid
- Research Article
- 10.1051/e3sconf/202340902010
- Jan 1, 2023
- E3S Web of Conferences
- Syed Ejaz Ahmed + 2 more
This paper focuses on estimating the Self-Exciting Threshold Autoregressive (SETAR) type time-series model under right-censored data. As is known, the SETAR model is used when the underlying function of the relationship between the time series itself ($Y_t$) and its $p$ lags $(Y_{t-j})_{j=1}^{p}$ violates the linearity assumption, and this function is composed of multiple behaviors called regimes. This paper addresses the right-censored dependent time-series problem, which has a serious negative effect on estimation performance. Right-censored time series lead to biased coefficient estimates and unreliable predictions. The main contribution of this paper is solving the censoring problem for the SETAR model using three different techniques: kNN imputation, representing imputation-based approaches; Kaplan-Meier weights, applied within weighted least squares; and synthetic data transformation, which incorporates the effect of censoring into the modeling process by transforming the dataset. These solutions are then combined with the SETAR-type model estimation process. To observe the behavior of the nonlinear estimators in practice, a simulation study and a real data example are carried out. The Covid-19 dataset collected in China is used as real data. Results show that although all three estimators perform satisfactorily, the SETAR model estimated using the kNN imputation technique outperforms the other two.
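For readers unfamiliar with SETAR estimation, the sketch below fits a basic two-regime SETAR(1) model by grid-searching the threshold and fitting OLS within each regime. The censoring corrections central to the paper (kNN imputation, Kaplan-Meier weights, synthetic data transformation) are not reproduced; the simulated series and threshold grid are assumptions.

```python
# Minimal two-regime SETAR(1) fit via threshold grid search plus per-regime OLS;
# the paper's censoring corrections are not shown here.
import numpy as np

def fit_setar1(y, thresholds):
    y_lag, y_cur = y[:-1], y[1:]
    best = None
    for r in thresholds:
        low, high = y_lag <= r, y_lag > r
        if low.sum() < 10 or high.sum() < 10:
            continue                                  # require enough points per regime
        sse, coefs = 0.0, {}
        for name, mask in (("low", low), ("high", high)):
            A = np.column_stack([np.ones(mask.sum()), y_lag[mask]])
            b, *_ = np.linalg.lstsq(A, y_cur[mask], rcond=None)
            sse += np.sum((y_cur[mask] - A @ b) ** 2)
            coefs[name] = b
        if best is None or sse < best[0]:
            best = (sse, r, coefs)
    return best

rng = np.random.default_rng(3)
y = np.zeros(500)
for t in range(1, 500):                               # simulate a two-regime AR(1)
    phi = 0.8 if y[t - 1] <= 0 else -0.3
    y[t] = phi * y[t - 1] + rng.normal(scale=0.5)

sse, r_hat, coefs = fit_setar1(y, np.quantile(y, np.linspace(0.15, 0.85, 30)))
print("estimated threshold:", round(r_hat, 3), "regime coefficients:", coefs)
```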
- Research Article
- 10.1111/2041-210x.14022
- Nov 10, 2022
- Methods in Ecology and Evolution
- Liang Xu + 3 more
Abstract Estimating the strength of interactions among species in natural communities has always been a challenge for empirical ecologists. Sessile organisms, like plants or corals, often occur in metacommunities where they compete only with their immediate neighbours but disperse propagules over a wider area. To estimate the strength of competitive interactions, ecologists often count abundances in cells on a spatial grid for at least two time‐points. These data are then analysed using regression, by modelling the change in population size as a function of local densities, using cells as independent data‐points: a technique known as space‐for‐time substitution. These analyses generate estimates of competition coefficients; however, the method ignores dispersal among cells. To determine the impact of ignoring dispersal, we derived the bias that would arise when we apply regression methods to a metacommunity in which a fraction of seeds disperse beyond their natal cells but this dispersal is ignored in the model-fitting process. We present results from a range of population models that make different assumptions about the nature of competition and assess the performance of our bias formulae by analysing data from simulated metacommunities. We reveal that estimates of competition coefficients are biased when dispersal is not properly accounted for, and that the resulting bias is often correlated with abundance, with rare species suffering the greatest overestimation. We also provide a standardized metric of competition that allows the bias to be calculated for a broad range of other population models. Our study suggests that regression methods that ignore dispersal produce biased estimates of competition coefficients when using space‐for‐time substitution. Our analytical bias formula allows empirical ecologists to potentially correct for biases, but it requires either tailored experiments in controlled conditions or an estimate of the average dispersal rate in a natural community, so may be challenging to apply to real datasets.
- Research Article
- 10.1080/02664763.2022.2138838
- Oct 27, 2022
- Journal of Applied Statistics
- Cindy Feng + 1 more
In many epidemiological and environmental health studies, accurately assessing the effects of multiple exposures on a health outcome is often of interest. However, the problem is challenging in the presence of multicollinearity, which can lead to biased estimates of regression coefficients and inflated variance estimates. Selecting one exposure variable as a surrogate for multiple highly correlated exposure variables is often suggested in the literature as a solution to the multicollinearity problem. However, this may lead to loss of information, since exposure variables that are highly correlated tend to have not only common but also additional effects on the outcome variable. In this study, a two-stage latent factor regression method is proposed. The key idea is to regress the dependent variable not only on the common latent factor(s) of the explanatory variables, but also on the residual terms from the factor analysis as additional explanatory variables. The proposed method is compared to traditional latent factor regression and principal component regression in terms of how they handle multicollinearity. Two case studies are presented. Simulation studies are performed to assess their performance in terms of epidemiological interpretability and the stability of parameter estimates.
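A rough Python sketch of the two-stage idea described above (illustrative only, not the authors' implementation): extract a common factor from correlated exposures, then regress the outcome on the factor score plus the exposures' residuals from the factor model. One residual column is dropped here to avoid exact collinearity with the estimated factor score, which is itself a linear combination of the exposures; the authors' exact handling may differ.

```python
# Sketch of two-stage latent factor regression: factor score + exposure residuals.
import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n, p = 400, 4
common = rng.normal(size=n)
X = common[:, None] + rng.normal(scale=0.4, size=(n, p))      # highly correlated exposures
y = 0.5 * common + 0.3 * (X[:, 0] - common) + rng.normal(scale=0.5, size=n)

Xs = StandardScaler().fit_transform(X)
f = FactorAnalysis(n_components=1).fit_transform(Xs)          # stage 1: common factor score

# residuals of each exposure after removing the estimated common factor
resid = Xs - f @ np.linalg.lstsq(f, Xs, rcond=None)[0]

# stage 2: outcome on factor score plus residuals (last residual column dropped
# because the factor score is a linear combination of the standardized exposures)
design = sm.add_constant(np.column_stack([f, resid[:, :-1]]))
print(sm.OLS(y, design).fit().params.round(3))
```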
- Research Article
- 10.1080/10705511.2022.2125397
- Oct 25, 2022
- Structural Equation Modeling: A Multidisciplinary Journal
- Mark H C Lai + 4 more
In path analysis, using composite scores without adjustment for measurement unreliability and violations of factorial invariance across groups leads to biased estimates of path coefficients. Although joint modeling of measurement and structural models can theoretically yield consistent structural association estimates, estimating a model with many variables is often impractical in small samples. A viable alternative is two-stage path analysis (2S-PA), where researchers first obtain factor scores and the corresponding individual-specific reliability coefficients, and then use those factor scores to analyze structural associations while accounting for their unreliability. The current paper extends 2S-PA to also account for partial invariance. Two simulation studies show that 2S-PA outperforms joint modeling in terms of model convergence, the efficiency of structural parameter estimation, and confidence interval coverage, especially in small samples and with categorical indicators. We illustrate 2S-PA by reanalyzing data from a multiethnic study that predicts drinking problems using college-related alcohol beliefs.
- Research Article
- 10.1186/s12874-022-01641-6
- Jun 9, 2022
- BMC Medical Research Methodology
- Angelika Geroldinger + 3 more
Background: In binary logistic regression, data are ‘separable’ if there exists a linear combination of explanatory variables which perfectly predicts the observed outcome, leading to non-existence of some of the maximum likelihood coefficient estimates. A popular solution to obtain finite estimates even with separable data is Firth’s logistic regression (FL), which was originally proposed to reduce the bias in coefficient estimates. The question of convergence becomes more involved when analyzing clustered data as frequently encountered in clinical research, e.g. data collected in several study centers or when individuals contribute multiple observations, using marginal logistic regression models fitted by generalized estimating equations (GEE). From our experience we suspect that separable data are a sufficient, but not a necessary, condition for non-convergence of GEE. Thus, we expect that generalizations of approaches that can handle separable uncorrelated data may reduce but not fully remove the non-convergence issues of GEE. Methods: We investigate one recently proposed and two new extensions of FL to GEE. With ‘penalized GEE’ the GEE are treated as score equations, i.e. as derivatives of a log-likelihood set to zero, which are then modified as in FL. We introduce two approaches motivated by the equivalence of FL and maximum likelihood estimation with iteratively augmented data. Specifically, we consider fully iterated and single-step versions of this ‘augmented GEE’ approach. We compare the three approaches with respect to convergence behavior, practical applicability and performance using simulated data and a real data example. Results: Our simulations indicate that all three extensions of FL to GEE substantially improve convergence compared to ordinary GEE, while showing a similar or even better performance in terms of accuracy of coefficient estimates and predictions. Penalized GEE often slightly outperforms the augmented GEE approaches, but this comes at the cost of a higher implementation burden. Conclusions: When fitting marginal logistic regression models using GEE on sparse data, we recommend applying penalized GEE if one has access to a suitable software implementation, and single-step augmented GEE otherwise.
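The FL building block that these GEE extensions generalize can be sketched for independent data via Fisher scoring with the hat-diagonal score correction; this is not the paper's penalized or augmented GEE implementation, and the crude step-size cap and example data are assumptions.

```python
# Minimal (non-GEE) Firth logistic regression: Fisher scoring with the modified score
# U*(b) = X'(y - pi + h*(0.5 - pi)), where h are hat-matrix diagonals.
import numpy as np

def firth_logit(X, y, n_iter=50, tol=1e-8):
    X = np.column_stack([np.ones(len(y)), X])              # add intercept
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = pi * (1.0 - pi)
        XtWX_inv = np.linalg.inv(X.T @ (W[:, None] * X))
        # hat diagonals of W^{1/2} X (X'WX)^{-1} X' W^{1/2}
        h = np.einsum("ij,jk,ik->i", X, XtWX_inv, X) * W
        U = X.T @ (y - pi + h * (0.5 - pi))                 # Firth-modified score
        step = XtWX_inv @ U
        if np.abs(step).max() > 5:                          # crude step-size cap
            step *= 5 / np.abs(step).max()
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Example with completely separated data, where ordinary ML estimates diverge
rng = np.random.default_rng(5)
x = rng.normal(size=40)
y = (x > 0).astype(float)                 # outcome perfectly determined by sign(x)
print(firth_logit(x[:, None], y))         # finite estimates despite separation
```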
- Research Article
- 10.1111/issr.12287
- Jan 1, 2022
- International Social Security Review
- Tero Lähderanta + 3 more
Abstract Using unique administrative register data, we investigate old‐age retirement under the statutory pension scheme in Finland. The analysis is based on multi‐outcome modelling of pensions and working lives together with a range of explanatory variables. An adaptive multi‐outcome LAD‐lasso regression method is applied to obtain estimates of earnings and socioeconomic factors affecting old‐age retirement and to decide which of these variables should be included in our model. The proposed statistical technique produces robust and less biased regression coefficient estimates in the context of skewed outcome distributions and an excess number of zeros in some of the explanatory variables. The results underline the importance of late life course earnings and employment to the final amount of pension and reveal differences in pension outcomes across socioeconomic groups. We conclude that adaptive LAD‐lasso regression is a promising statistical technique that could be usefully employed in studying various topics in the pension industry.
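As a rough single-outcome analogue of the LAD-lasso idea, L1-penalized median regression can be fitted with scikit-learn's QuantileRegressor; the adaptive penalty weights and multi-outcome structure used in the paper are not reproduced, and the simulated data and penalty value are assumptions.

```python
# Rough LAD-lasso analogue: L1-penalized median (LAD) regression on heavy-tailed data.
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(6)
n, p = 300, 8
X = rng.normal(size=(n, p))
beta = np.array([1.5, 0.0, 0.0, -1.0, 0.0, 0.0, 0.5, 0.0])
y = X @ beta + rng.standard_t(df=2, size=n)         # skewed / heavy-tailed errors

lad_lasso = QuantileRegressor(quantile=0.5, alpha=0.05, solver="highs").fit(X, y)
print(np.round(lad_lasso.coef_, 3))                 # sparse, outlier-robust estimates
```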
- Research Article
- 10.1080/13658816.2021.1988088
- Oct 18, 2021
- International Journal of Geographical Information Science
- Wangshu Mu + 1 more
ABSTRACT Distance is one of the most important concepts in geography and spatial analysis. Since distance calculation is straightforward for points, measuring distances for non-point objects often involves abstracting them into their representative points. For example, a polygon is often abstracted into its centroid, and the distance from/to the polygon is then measured using the centroid. Despite the wide use of representative points to measure distances of non-point objects, a recent study has shown that such a practice might be problematic and lead to biased coefficient estimates in regression analysis. The study proposed a new polygon-to-point distance metric, along with two computation algorithms. However, the efficiency of these distance calculation algorithms is low. This research provides three new methods, including the random point-based method, polygon partitioning method, and axis-aligned minimum areal bounding box-based (MABB-based) method, to compute the new distance metric. Tests are provided to compare the accuracy and computational efficiency of the new algorithms. The test results show that each of the three new methods has its advantages: the random point-based method is easy to implement, the polygon partitioning method is most accurate, and the MABB-based method is computationally efficient.
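The random point-based idea can be illustrated with shapely: approximate a polygon-to-point distance by averaging distances from points sampled uniformly inside the polygon, and compare that with the usual centroid shortcut. This is not the authors' algorithm or code; the polygon, target point, and sample size are assumptions.

```python
# Sketch of the random point-based distance estimate vs. the centroid shortcut.
import numpy as np
from shapely.geometry import Point, Polygon

rng = np.random.default_rng(7)
poly = Polygon([(0, 0), (4, 0), (4, 1), (0.5, 1), (0.5, 3), (0, 3)])  # L-shaped polygon
target = Point(6, 3)

def random_points_in(polygon, k):
    """Rejection-sample k uniform points inside the polygon's bounding box."""
    minx, miny, maxx, maxy = polygon.bounds
    pts = []
    while len(pts) < k:
        p = Point(rng.uniform(minx, maxx), rng.uniform(miny, maxy))
        if polygon.contains(p):
            pts.append(p)
    return pts

avg_dist = np.mean([p.distance(target) for p in random_points_in(poly, 2000)])
print("centroid distance:", round(poly.centroid.distance(target), 3))
print("random-point average distance:", round(avg_dist, 3))
```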
- Research Article
- 10.2308/ajpt-2021-004
- Aug 30, 2021
- Auditing: A Journal of Practice & Theory
- James R Moon + 3 more
SUMMARY Ex ante misstatement risk confounds most settings relying on misstatements as a measure of audit quality, but researchers continue to debate how to effectively control for this construct. In this study, we consider a recent approach that involves controlling for prior period misstatements (“Lagged Misstatements”). Using a controlled simulation and a basic archival analysis, we show that a lagged misstatement control can significantly bias coefficient estimates. We demonstrate this bias using audit fees as a variable of interest but also show the same issue manifests for other measures that respond to the restatement of misstated financial statements (i.e., internal control material weaknesses and auditor changes). We conclude by discussing alternative approaches for controlling for ex ante misstatement risk and providing guidance for future research. Data Availability: All data used are publicly available from sources cited in the text. JEL Classifications: M40; M41; M42.
- Research Article
- 10.1080/1351847x.2021.1900888
- Mar 19, 2021
- The European Journal of Finance
- Shima Amini + 3 more
We show that expected returns on US stocks and all major global stock market indices have a particular form of non-linear dependence on previous returns. The expected sign of returns tends to reverse after large price movements and trends tend to continue after small movements. The observed market properties are consistent with various models of investor behaviour and can be captured by a simple polynomial model. We further discuss a number of important implications of our findings. Incorrectly fitting a simple linear model to the data leads to a substantial bias in coefficient estimates. We show through the polynomial model that well-known short-term technical trading rules may be substantially driven by the non-linear behaviour observed. The behaviour also has implications for the appropriate calculation of important risk measures such as value at risk.
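To illustrate the kind of non-linearity described (momentum after small moves, reversal after large ones), the sketch below fits an odd-polynomial term in lagged returns alongside a plain linear term on simulated data; the paper's exact polynomial specification and data differ, and all numbers here are assumptions.

```python
# Illustrative fit of a simple odd-polynomial model in lagged returns vs. a linear term.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
prev = rng.normal(scale=1.0, size=2000)             # previous-day returns, in percent
# momentum after small moves, reversal after large moves
nxt = 0.2 * prev - 0.05 * prev**3 + rng.normal(scale=1.0, size=2000)

linear = sm.OLS(nxt, sm.add_constant(prev)).fit()
poly = sm.OLS(nxt, sm.add_constant(np.column_stack([prev, prev**3]))).fit()
print("linear slope only:", round(linear.params[1], 3))
print("polynomial terms (lag, lag^3):", np.round(poly.params[1:], 3))
# Forcing a linear model onto cubic dynamics distorts the estimated persistence.
```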
- Research Article
- 10.2139/ssrn.3857752
- Jan 1, 2021
- SSRN Electronic Journal
- James Moon + 3 more
On Controlling for Misstatement Risk
- Research Article
- 10.1016/j.trf.2020.11.002
- Dec 7, 2020
- Transportation Research Part F: Traffic Psychology and Behaviour
- Yunchang Zhang + 1 more
Investigating temporal variations in pedestrian crossing behavior at semi-controlled crosswalks: A Bayesian multilevel modeling approach
- Research Article
- 10.1186/s12874-020-01080-1
- Jul 25, 2020
- BMC Medical Research Methodology
- Shangzhi Hong + 1 more
Background: Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data, especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it remains unclear how they perform for non-normally distributed data or when there are non-linear relationships or interactions. Methods: To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performance of the RF-based imputation methods missForest and CALIBERrfimpute was evaluated in comparison with predictive mean matching (PMM). Results: Both missForest and CALIBERrfimpute have high predictive accuracy, but missForest can produce severely biased regression coefficient estimates and downwardly biased confidence interval coverage, especially for highly skewed variables in nonlinear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than PMM for logistic regression relationships with interaction. Conclusions: RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.
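A small Python sketch of RF-based iterative imputation in the spirit of the methods compared above: scikit-learn's IterativeImputer with a RandomForestRegressor estimator is broadly analogous to missForest, though it is not the R packages (missForest, CALIBERrfimpute) evaluated in the paper, and the skewed data and outcome-dependent missingness pattern below are assumptions.

```python
# RF-based iterative imputation on a skewed covariate with outcome-dependent MAR,
# followed by a downstream regression on the completed data.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(9)
n = 500
x1 = rng.lognormal(sigma=1.0, size=n)                 # highly skewed covariate
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

data = np.column_stack([y, x1, x2])
mask = rng.random(n) < 0.3 * (y > np.median(y))       # x1 missing more often when y is high
data_miss = data.copy()
data_miss[mask, 1] = np.nan

rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
completed = rf_imputer.fit_transform(data_miss)

beta_full = np.linalg.lstsq(np.column_stack([np.ones(n), data[:, 1:]]), y, rcond=None)[0]
beta_imp = np.linalg.lstsq(np.column_stack([np.ones(n), completed[:, 1:]]), y, rcond=None)[0]
print("full-data coefficients:", np.round(beta_full, 3))
print("after RF imputation:  ", np.round(beta_imp, 3))
```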
- Research Article
- 10.1186/s12893-020-00816-6
- Jul 13, 2020
- BMC Surgery
- Atefeh Talebi + 7 more
Background: Gastric cancer (GC) is considered the fifth most common type of cancer and the third leading cause of cancer-associated death worldwide. The aim of this historical cohort study was to evaluate survival predictors for patients with GC using Cox proportional hazards, extended Cox, and gamma-frailty models. Methods: This historical cohort study was performed using the records of 1,695 individuals with GC referred to three medical centers in Iran from 2001 to 2018. First, the most significant prognostic factors for survival were selected; Cox proportional hazards, extended Cox, and gamma-frailty models were then applied to evaluate the effects of these risk factors, and the models were compared using the Akaike information criterion. Results: Patient age, body mass index (BMI), tumor size, type of treatment and grade of the tumor increased the hazard rate (HR) of GC patients in both the Cox and frailty models (P < 0.05). Tumor size and BMI were treated as time-varying variables in the extended Cox model. Moreover, the frailty model showed that at least one unmeasured factor, genetic or environmental, is present in the model (P < 0.05). Conclusions: Several prognostic factors, including age, tumor size, tumor grade, type of treatment and BMI, were regarded as indispensable predictors in patients with GC. The frailty model revealed that there are unknown or latent factors, genetic and environmental, resulting in biased estimates of the regression coefficients.
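For orientation, a toy Cox proportional hazards fit with lifelines using covariates of the kind described above (age, BMI, tumor size); the extended Cox and gamma-frailty models compared in the paper are not reproduced here, and all data and parameter values below are simulated assumptions.

```python
# Toy Cox proportional hazards fit; not the study's data or model comparisons.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(10)
n = 400
df = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "bmi": rng.normal(25, 4, n),
    "tumor_size": rng.gamma(2.0, 2.0, n),
})
risk = 0.03 * (df["age"] - 60) + 0.1 * df["tumor_size"]
t_event = rng.exponential(scale=np.exp(-risk.to_numpy()) * 24)  # latent event times (months)
t_cens = rng.exponential(scale=30, size=n)                      # independent censoring times
df["time"] = np.minimum(t_event, t_cens)
df["event"] = (t_event <= t_cens).astype(int)

cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
aic = -2 * cph.log_likelihood_ + 2 * len(cph.params_)   # partial-likelihood AIC for comparison
print(cph.params_.round(3), "\npartial-likelihood AIC:", round(aic, 1))
```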
- Research Article
- 10.1017/s0007123419000097
- May 13, 2020
- British Journal of Political Science
- Martin Elff + 3 more
Abstract Quantitative comparative social scientists have long worried about the performance of multilevel models when the number of upper-level units is small. Adding to these concerns, an influential Monte Carlo study by Stegmueller (2013) suggests that standard maximum-likelihood (ML) methods yield biased point estimates and severely anti-conservative inference with few upper-level units. In this article, the authors seek to rectify this negative assessment. First, they show that ML estimators of coefficients are unbiased in linear multilevel models. The apparent bias in coefficient estimates found by Stegmueller can be attributed to Monte Carlo error and a flaw in the design of his simulation study. Secondly, they demonstrate how inferential problems can be overcome by using restricted ML estimators for variance parameters and a t-distribution with appropriate degrees of freedom for statistical inference. Thus, accurate multilevel analysis is possible within the framework that most practitioners are familiar with, even if there are only a few upper-level units.
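The ML-versus-REML contrast central to this argument can be sketched with statsmodels' MixedLM on a random-intercept model with few clusters; the simulated data are assumptions, and the degrees-of-freedom adjustment for the t-distribution recommended in the paper is not computed here.

```python
# Fit a random-intercept model with few upper-level units using ML and REML.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n_groups, n_per = 10, 30                                   # few upper-level units
g = np.repeat(np.arange(n_groups), n_per)
u = rng.normal(scale=0.5, size=n_groups)                   # random intercepts
x = rng.normal(size=g.size)
y = 1.0 + 0.4 * x + u[g] + rng.normal(size=g.size)
df = pd.DataFrame({"y": y, "x": x, "g": g})

model = smf.mixedlm("y ~ x", data=df, groups="g")
ml_fit = model.fit(reml=False)
reml_fit = model.fit(reml=True)                            # REML for variance parameters
print("ML random-intercept variance:  ", round(float(ml_fit.cov_re.iloc[0, 0]), 3))
print("REML random-intercept variance:", round(float(reml_fit.cov_re.iloc[0, 0]), 3))
print("fixed slope (REML):", round(reml_fit.params["x"], 3))
```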