A propensity score approach in the impact evaluation on scientific production in Brazilian biodiversity research: the BIOTA Program
Evaluation has become a regular practice in the management of science, technology and innovation (ST&I) programs, and several methods have been developed to identify the results and impacts of programs of this kind. Most evaluations that adopt such an approach conclude that the interventions concerned, in this case ST&I programs, had a positive impact relative to the baseline, but they do not control for effects that might have improved the indicators even in the absence of the intervention, such as improvements in the socio-economic context. The quasi-experimental approach therefore arises as an appropriate way to identify the real contribution of a given intervention. This paper describes and discusses the use of propensity scores (PS) in quasi-experiments as a methodology for evaluating the impact of research programs on scientific production, presenting a case study of the BIOTA Program run by FAPESP, the São Paulo Research Foundation (Brazil). Fundamentals of quasi-experiments and causal inference are presented, stressing the need to control for biases due to the lack of randomization, along with a brief introduction to the PS estimation and weighting technique used to correct for observed bias. The application of the PS methodology is compared with the traditional multivariate analysis usually employed.
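As background for the PS estimation and weighting technique this abstract introduces, a minimal sketch on simulated data (an illustration only, not the BIOTA evaluation itself): a logistic model estimates each unit's probability of receiving the intervention, and inverse-probability weights remove the observed confounding that a naive group comparison leaves in.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
x = rng.normal(size=n)                        # observed confounder
t = rng.binomial(1, 1 / (1 + np.exp(-x)))     # treatment depends on x
y = 2.0 * x + rng.normal(size=n)              # outcome; true treatment effect is zero

# Estimate the propensity score with a logistic model (Newton-Raphson fit)
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    e = 1 / (1 + np.exp(-X @ beta))
    beta += np.linalg.solve(X.T @ ((e * (1 - e))[:, None] * X), X.T @ (t - e))
e = 1 / (1 + np.exp(-X @ beta))

# Inverse-probability weights for the average treatment effect
w = t / e + (1 - t) / (1 - e)

naive = y[t == 1].mean() - y[t == 0].mean()               # confounded contrast
iptw = (np.average(y[t == 1], weights=w[t == 1])
        - np.average(y[t == 0], weights=w[t == 0]))       # weighted contrast
```

Because assignment here depends only on x, the naive difference is biased away from the true null effect while the weighted contrast is approximately zero; a real application would also check covariate balance and overlap before trusting the weights.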
- Research Article
16
- 10.1176/appi.ps.61.2.137
- Feb 1, 2010
- Psychiatric Services
Effectiveness and Outcomes of Assisted Outpatient Treatment in New York State
- Research Article
78
- 10.1176/ps.2010.61.2.137
- Feb 1, 2010
- Psychiatric Services
Outpatient commitment has been heralded as a necessary intervention that improves psychiatric outcomes and quality of life, and it has been criticized on the grounds that effective treatment must be voluntary and that outpatient commitment has negative unintended consequences. Because few methodologically strong data exist, this study evaluated New York State's outpatient commitment program with the objective of augmenting the existing literature. A total of 76 individuals recently mandated to outpatient commitment and 108 individuals (comparison group) recently discharged from psychiatric hospitals in the Bronx and Queens who were attending the same outpatient facilities as the group mandated to outpatient commitment were followed for one year and compared in regard to psychotic symptoms, suicide risk, serious violence perpetration, quality of life, illness-related social functioning, and perceived coercion and stigma. Propensity score matching and generalized estimating equations were used to achieve the strongest causal inference possible without an experimental design. Serious violence perpetration and suicide risk were lower and illness-related social functioning was higher (p<.05 for all) in the outpatient commitment group than in the comparison group. Psychotic symptoms and quality of life did not differ significantly between the two groups. Potential unintended consequences were not evident: the outpatient commitment group reported marginally less (p<.10) stigma and coercion than the comparison group. Outpatient commitment in New York State affects many lives; therefore, it is reassuring that negative consequences were not observed. Rather, people's lives seem modestly improved by outpatient commitment. However, because outpatient commitment included treatment and other enhancements, these findings should be interpreted in terms of the overall impact of outpatient commitment, not of legal coercion per se. 
As such, the results do not support the expansion of coercion in psychiatric treatment.
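The propensity score matching step described above pairs each committed individual with a comparison individual who had a similar estimated probability of being mandated to treatment. A minimal greedy 1:1 nearest-neighbor matcher on already-estimated scores (simulated here; this is a generic sketch, not the study's actual procedure) looks like:

```python
import numpy as np

def greedy_match(ps_treated, ps_control, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on the propensity score,
    without replacement. Returns (treated_index, control_index) pairs."""
    available = list(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        if not available:
            break
        j = min(available, key=lambda k: abs(ps_control[k] - p))
        if abs(ps_control[j] - p) <= caliper:   # reject poor-quality matches
            pairs.append((i, j))
            available.remove(j)                 # each control used at most once
    return pairs

rng = np.random.default_rng(1)
ps_t = rng.uniform(0.2, 0.8, size=76)    # hypothetical scores, 76 committed
ps_c = rng.uniform(0.1, 0.9, size=108)   # hypothetical scores, 108 comparison
pairs = greedy_match(ps_t, ps_c)
```

The caliper guards against pairing individuals with dissimilar scores; the study then fit generalized estimating equations on the matched sample.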
- Discussion
- 10.1016/j.amjmed.2016.06.050
- Oct 19, 2016
- The American Journal of Medicine
The Reply
- Research Article
10
- 10.1177/17407745211028588
- Jul 16, 2021
- Clinical Trials
Subgroup analyses are frequently conducted in randomized clinical trials to assess evidence of a heterogeneous treatment effect across patient subpopulations. Although randomization balances covariates within subgroups in expectation, chance imbalance may be amplified in small subgroups and adversely impact the precision of subgroup analyses. Covariate adjustment in the overall analysis of a randomized clinical trial is often conducted, via either analysis of covariance or propensity score weighting, but covariate adjustment for subgroup analysis has rarely been discussed. In this article, we develop propensity score weighting methodology for covariate adjustment to improve the precision and power of subgroup analyses in randomized clinical trials. We extend the propensity score weighting methodology to subgroup analyses by fitting a logistic regression propensity model with pre-specified covariate-subgroup interactions. We show that, by construction, overlap weighting exactly balances the covariates with interaction terms in each subgroup. Extensive simulations were performed to compare the operating characteristics of the unadjusted estimator, different propensity score weighting estimators, and the analysis of covariance estimator. We apply these methods to the Heart Failure: A Controlled Trial Investigating Outcomes of Exercise Training trial to evaluate the effect of exercise training on the 6-min walk test in several pre-specified subgroups. Standard errors of the adjusted estimators are smaller than those of the unadjusted estimator. The propensity score weighting estimator is as efficient as analysis of covariance, and is often more efficient when the subgroup sample size is small (e.g. <125) and/or when the outcome model is misspecified. The weighting estimators with the full-interaction propensity model consistently outperform the standard main-effect propensity model.
Propensity score weighting is a transparent and objective method to adjust chance imbalance of important covariates in subgroup analyses of randomized clinical trials. It is crucial to include the full covariate-subgroup interactions in the propensity score model.
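The exact-balance property this abstract relies on is easy to verify numerically: when the propensity score comes from a logistic model, overlap weights (1 − e(x) for treated units, e(x) for controls) balance every covariate included in that model exactly, not just in expectation. A small sketch on simulated data (an assumed toy setup, not the HF-ACTION analysis):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 800
x = rng.normal(size=n)                           # baseline covariate
s = rng.binomial(1, 0.5, n)                      # pre-specified subgroup indicator
X = np.column_stack([np.ones(n), x, s, x * s])   # full covariate-subgroup interaction
t = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * x - 0.3 * s))))

# Logistic propensity model fitted by Newton-Raphson
beta = np.zeros(X.shape[1])
for _ in range(25):
    e = 1 / (1 + np.exp(-X @ beta))
    beta += np.linalg.solve(X.T @ ((e * (1 - e))[:, None] * X), X.T @ (t - e))
e = 1 / (1 + np.exp(-X @ beta))

# Overlap weights: 1 - e(x) for treated, e(x) for controls
w = np.where(t == 1, 1 - e, e)

# Weighted covariate means agree between arms to machine precision
m1 = (w * t) @ X / (w * t).sum()
m0 = (w * (1 - t)) @ X / (w * (1 - t)).sum()
```

Because x, s, and the interaction x·s are all balanced exactly, x is balanced within each subgroup as well, which is precisely why the paper stresses including the full covariate-subgroup interactions in the propensity model.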
- Abstract
1
- 10.1016/j.jaci.2010.12.327
- Feb 1, 2011
- Journal of Allergy and Clinical Immunology
Risk of Asthma in Former Late Preterm Infants: A Propensity Score Approach
- Abstract
1
- 10.1182/blood-2023-178154
- Nov 2, 2023
- Blood
First Line Therapy Evaluation Using Propensity Score Approach in Newly Diagnosed Advanced Classical Hodgkin Lymphoma Patients from Prospective Real-World Realysa Cohort and Phase 3 AHL2011 Trial
- Front Matter
19
- 10.1016/j.jclinepi.2013.05.012
- Jul 9, 2013
- Journal of Clinical Epidemiology
Methods for Comparative Effectiveness Research/Patient-Centered Outcomes Research: From Efficacy to Effectiveness
- Research Article
- 10.1007/s10654-025-01341-7
- Jan 12, 2026
- European Journal of Epidemiology
Machine learning (ML) algorithms are increasingly used to estimate propensity scores with the expectation of improving causal inference. However, the validity of data-driven ML-based approaches for confounder selection and adjustment remains unclear. In this study, we emulated the device-stratified secondary analysis of the PARADIGM-HF trial among U.S. veterans with heart failure and implanted cardiac devices from 2016 to 2020. We benchmarked observational estimates from three propensity score approaches against the trial results: (1) logistic regression with pre-specified confounders, (2) generalized boosted models (GBM) using the same pre-specified confounders, and (3) GBM with expanded covariates and automated feature selection. The logistic regression-based propensity score approach yielded estimates closest to the trial (HR = 0.93, 95% CI 0.61-1.42; 23-month RR = 0.86, 95% CI 0.57-1.24 vs. trial HR = 0.81, 95% CI 0.61-1.06). Despite better predictive performance, GBM with pre-specified confounders showed no improvement over the logistic regression approach (HR = 0.97, 95% CI 0.68-1.37; RR = 0.96, 95% CI 0.89-1.98). Moreover, GBM with expanded covariates and data-driven automated feature selection substantially increased bias (HR = 0.61, 95% CI 0.30-1.23; RR = 0.69, 95% CI 0.36-1.04). Our findings suggest that ML-based propensity score methods do not inherently improve causal estimation, possibly due to residual confounding from omitted or partially adjusted variables, and may introduce overadjustment bias when combined with automated feature selection. These results underscore the importance of careful confounder specification and causal reasoning over algorithmic complexity in causal inference.
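As a concrete illustration of the second approach benchmarked above, a gradient-boosted propensity model can be fitted with scikit-learn. This is a hypothetical sketch on simulated data; the study's actual GBM specification, tuning, and covariate set are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 5))                       # stand-in pre-specified confounders
logit = 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] * X[:, 3]
t = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Boosted trees capture non-linearities and interactions without specifying them
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                 learning_rate=0.05, random_state=0)
gbm.fit(X, t)
e = gbm.predict_proba(X)[:, 1]                    # estimated propensity scores

# Stabilized inverse-probability-of-treatment weights
w = np.where(t == 1, t.mean() / e, (1 - t.mean()) / (1 - e))
```

Better predictive fit of e(x) does not by itself yield better causal estimates, which is exactly the caution the study raises; balance and overlap diagnostics still matter regardless of the estimation algorithm.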
- Research Article
52
- 10.1002/bimj.201800132
- May 14, 2019
- Biometrical Journal
Propensity score matching (PSM) and propensity score weighting (PSW) are popular tools to estimate causal effects in observational studies. We address two open issues: how to estimate propensity scores and how to assess covariate balance. Using simulations, we compare the performance of PSM and PSW based on logistic regression and machine learning algorithms (CART; bagging; boosting; random forest; neural networks; naive Bayes). Additionally, we consider several measures of covariate balance (the Absolute Standardized Average Mean (ASAM), with and without interactions; measures based on quantile-quantile plots; the ratio between the variances of the propensity scores; the area under the curve (AUC)) and assess their ability to predict the bias of the PSM and PSW estimators. We also investigate the importance of tuning machine learning parameters in the context of propensity score methods. Two simulation designs are employed. In the first, the generating processes are inspired by birth register data used to assess the effect of labor induction on the occurrence of caesarean section. The second exploits more general generating mechanisms. Overall, among the different techniques, random forests performed the best, especially in PSW. Logistic regression and neural networks also showed an excellent performance, similar to that of random forests. As for covariate balance, the simplest and most commonly used metric, the ASAM, showed a strong correlation with the bias of the causal effect estimators. Our findings suggest that researchers should aim at obtaining an ASAM lower than 10% for as many variables as possible. In the empirical study we found that labor induction had a small and not statistically significant impact on caesarean section.
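The balance diagnostic the authors recommend, keeping the absolute standardized mean difference below 10% for as many covariates as possible, is straightforward to compute. A generic sketch (not the paper's code), using the true propensity score of a simulated design so the weighting step needs no model fit:

```python
import numpy as np

def abs_std_mean_diff(x, t, w=None):
    """Absolute standardized mean difference (in %) of covariate x
    between treated (t == 1) and control (t == 0), optionally weighted."""
    if w is None:
        w = np.ones_like(x, dtype=float)
    m1 = np.average(x[t == 1], weights=w[t == 1])
    m0 = np.average(x[t == 0], weights=w[t == 0])
    v1 = np.average((x[t == 1] - m1) ** 2, weights=w[t == 1])
    v0 = np.average((x[t == 0] - m0) ** 2, weights=w[t == 0])
    return 100 * abs(m1 - m0) / np.sqrt((v1 + v0) / 2)

rng = np.random.default_rng(11)
n = 10000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                   # true propensity score
t = rng.binomial(1, e)
w = np.where(t == 1, 1 / e, 1 / (1 - e))   # inverse-probability weights

before = abs_std_mean_diff(x, t)           # imbalanced by construction
after = abs_std_mean_diff(x, t, w)         # should fall well below 10%
```

Averaging this quantity over all covariates gives an ASAM-style summary; the 10% rule of thumb is applied per covariate.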
- Research Article
3
- 10.1186/s12874-025-02508-2
- Mar 7, 2025
- BMC Medical Research Methodology
Purpose: This paper extends current propensity score weighting methods for causal inference to better understand disparities in healthcare access across multiple racial groups. By treating each racial group as a distinct entity (or “treatment”) in the causal inference framework, we can assess and evaluate heterogeneity in healthcare outcomes across various racial or ethnic categories. Furthermore, we leverage modern propensity score weighting techniques to address the challenges inherent to multiple group evaluations, such as violations of the positivity assumption, and compare the performance of different propensity score weights.
Methods: We use generalized propensity score methods to assess racial disparities across 4 specific racial or ethnic groups: Whites, Hispanics, Asians, and Blacks. We first calculate weights that standardize the participants’ characteristics and then compare their weighted outcomes. We consider four distinct measures (i.e., causal estimands) and estimation methods: the conventional average treatment effect on the treated (ATT), ATT trimming, ATT truncation, and the overlap weighted ATT (OWATT). These estimands are applied under a multi-valued “treatment” framework, where the said “treatment” is defined by non-manipulable racial or ethnic group memberships. Using data from the Medical Expenditure Panel Survey (MEPS), we assess disparities in healthcare expenditures across the 4 racial and ethnic groups.
Results: We found significant disparities in healthcare expenditure between White participants and all the other racial or ethnic groups when using OWATT and ATT truncation. Conventional ATT and ATT trimming could indicate a non-significant difference due to larger variance estimates. Moreover, the conventional ATT was found to be the least efficient estimation method, even when its variance was estimated via non-parametric bootstrapping. Overall, the OWATT emerges as a promising estimation method; it retains the available information from all samples, avoids the subjectivity inherent in choosing thresholds (as its competitors require), and judiciously mitigates pernicious inferential effects (such as inflated variance estimates) caused by extreme propensity score weights.
Conclusion: We found that generalized propensity score weighting (GPSW) methods are valuable quantitative tools to standardize and compare the characteristics, as well as the outcomes, of non-manipulable groups. This helps assess disparities across multiple racial and ethnic groups, as demonstrated in this study. These methods offer flexible, semi-parametric analysis of the primary causal parameters of interest (such as racial disparities), with straightforward and intuitive interpretations. In addition, when the positivity assumption is violated, OWATT serves as an excellent alternative due to its greater efficiency, evidenced by relatively smaller variance. More importantly, OWATT uses the entire dataset by assigning weights to all participants, regardless of their propensity score values. This feature circumvents the need to specify user-defined thresholds, as required in ATT trimming or truncation, and retains as much data information as possible, leading to more reliable estimation results.
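Generalized overlap weights of the kind used above extend the binary case: with group-specific propensities e_g(x), each unit receives weight h(x)/e_g(x), where h(x) = [Σ_k 1/e_k(x)]⁻¹. A hedged numerical sketch using known softmax propensities standing in for a fitted multinomial model (a toy setup, not the MEPS analysis):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6000
x = rng.normal(size=n)                         # characteristic to standardize

# True generalized propensity scores for 3 groups via a softmax model
scores = np.column_stack([0.0 * x, 0.8 * x, -0.8 * x])
e = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
g = np.array([rng.choice(3, p=row) for row in e])

# Generalized overlap weights: h(x) / e_g(x), with h(x) = 1 / sum_k 1/e_k(x)
h = 1 / (1 / e).sum(axis=1)
w = h / e[np.arange(n), g]

# After weighting, each group's mean of x targets the same overlap population
means_raw = np.array([x[g == k].mean() for k in range(3)])
means_w = np.array([np.average(x[g == k], weights=w[g == k]) for k in range(3)])
```

The raw group means differ because group membership depends on x; the overlap-weighted means nearly coincide, which is the standardization the multi-group comparisons rest on, without any trimming or truncation threshold.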
- Research Article
33
- 10.1016/j.leaqua.2023.101678
- Feb 27, 2023
- The Leadership Quarterly
When treatment cannot be manipulated, propensity score analysis provides a useful way to make causal claims under the assumption of no unobserved confounders. However, it is still rarely utilised in leadership and applied psychology research. The purpose of this paper is threefold. First, it explains and discusses the application and key assumptions of the method, with a particular focus on propensity score weighting. This approach is readily implementable, since weighted regression is available in most statistical software. Moreover, the approach can offer “double robust” protection against misspecification of either the propensity score or the outcome model by including confounding variables in both models. A second aim is to discuss how propensity score analysis (and propensity score weighting, specifically) has been conducted in recent management studies and to examine future challenges. Finally, we present an advanced application of the approach to illustrate how it can be employed to estimate the causal impact of leadership succession on performance using data from Italian football. The case also exemplifies how to extend the standard single-treatment analysis to estimate the separate impact of changes in different managerial characteristics between the old and the new manager.
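The “double robust” idea mentioned above comes from fitting a weighted outcome regression that also includes the confounders: the effect estimate is consistent if either the propensity model or the outcome model is correctly specified. A minimal weighted-least-squares sketch on simulated data (hypothetical; not the football application):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 3000
x = rng.normal(size=n)                              # confounder
e = 1 / (1 + np.exp(-x))                            # propensity (known here)
t = rng.binomial(1, e)
y = 1.0 + 1.0 * t + 2.0 * x + rng.normal(size=n)    # true treatment effect = 1.0

w = np.where(t == 1, 1 / e, 1 / (1 - e))            # inverse-probability weights

# Weighted regression of y on treatment AND the confounder:
# multiply rows by sqrt(w), then solve ordinary least squares
X = np.column_stack([np.ones(n), t, x])
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
effect = coef[1]                                    # estimated treatment effect
```

In standard statistical software this is simply a regression with case weights, which is why the authors call the approach readily implementable.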
- Research Article
106
- 10.1016/j.juro.2006.10.040
- Feb 10, 2007
- Journal of Urology
Long-Term Survival in Men With High Grade Prostate Cancer: A Comparison Between Conservative Treatment, Radiation Therapy and Radical Prostatectomy—A Propensity Scoring Approach
- Research Article
- 10.1101/2025.06.16.25329708
- Sep 26, 2025
- medRxiv
Machine learning (ML) algorithms are increasingly used to estimate propensity score with expectation of improving causal inference. However, the validity of ML-based approaches for confounder selection and adjustment remains unclear. In this study, we emulated the device-stratified secondary analysis of the PARADIGM-HF trial among U.S. veterans with heart failure and implanted cardiac devices from 2016 to 2020. We benchmarked observational estimates from three propensity score approaches against the trial results: (1) logistic regression with pre-specified confounders, (2) generalized boosted models (GBM) using the same pre-specified confounders, and (3) GBM with expanded covariates and automated feature selection. Logistic regression-based propensity score approach yielded estimates closest to the trial (HR = 0.93, 95% CI 0.61-1.42; 23-month RR = 0.86, 95% CI 0.57-1.24 vs. trial HR = 0.81, 95% CI 0.61-1.06). Despite better predictive performance, GBM with pre-specified confounders showed no improvement over the logistic regression approach (HR = 0.97, 95% CI 0.68-1.37; RR = 0.96, 95% CI 0.89-1.98). Notably, GBM with expanded covariates and data-driven automated feature selection substantially increased bias (HR = 0.61, 95% CI 0.30-1.23; RR = 0.69, 95% CI 0.36-1.04). Our findings suggest that ML-based propensity score methods do not inherently improve causal estimation, possibly due to residual confounding from omitted or partially adjusted variables, and may introduce overadjustment bias when combined with automated feature selection, underscoring the importance of careful confounder specification and causal reasoning over algorithmic complexity in causal inference.
- Research Article
24
- 10.1007/s13142-015-0361-9
- Nov 20, 2015
- Translational Behavioral Medicine
There is a demand for evidence on the effectiveness of research investments in promoting novice researchers' scientific productivity and the production of research with new initiatives and innovations. We used a mixed-methods approach to evaluate the funding effect of the New Investigator Fund (NIF) by comparing scientific productivity between award recipients and non-recipients. We reviewed NIF grant applications submitted from 2004 to 2013. Scientific productivity was assessed by confirming publication of the NIF-submitted application. Online databases were searched, independently and in duplicate, to locate the publications. Applicants' perceptions and experiences were collected through a short survey and categorized into specified themes. Multivariable logistic regression was performed, and odds ratios (OR) with 95% confidence intervals (CI) are reported. Of 296 applicants, 163 (55%) were awarded. Gender, affiliation, and field of expertise did not affect funding decisions. More physicians with graduate education (32.0%) and applicants with a doctorate degree (21.5%) were awarded than applicants without postgraduate education (9.8%). Basic science research (28.8%), randomized controlled trials (24.5%), and feasibility/pilot trials (13.3%) were awarded more than observational designs (p < 0.001). Adjusting for applicant and application factors, awardees were threefold more likely than non-awardees to publish the NIF-submitted application (OR = 3.4, 95% CI 1.9-5.9). The survey response rate was 90.5%, but only 58% commented on their perceptions, successes, and challenges of the submission process. These findings suggest that research investments as small as seed funding are effective for the scientific productivity and professional growth of novice investigators and the production of research with new initiatives and innovations. Further efforts are recommended to enhance support for small grant funding programs.
- Research Article
33
- 10.1016/j.jpedsurg.2018.06.003
- Jun 7, 2018
- Journal of Pediatric Surgery
Outcomes of infants with congenital diaphragmatic hernia treated with venovenous versus venoarterial extracorporeal membrane oxygenation: A propensity score approach