Abstract

• A propensity analysis is a statistical approach that attempts to reduce selection bias and known confounding in an observational study.
• Integration of propensity scores into the design and analysis of an observational study helps to mitigate confounding by indication and improve internal validity.
• Propensity scores estimate the probability that an individual would have received a particular treatment based on observed baseline characteristics.
• The quality of the resulting data is dependent on the adequacy of the propensity score model and the analysis method.
• The propensity score can be used in multiple ways, including matching, stratification, inverse probability of treatment weighting, or covariate adjustment in regression.
• Propensity scores are not a substitute for randomization. Randomization is the only approach that guarantees balanced distributions of known and unknown confounders between treatment groups, allowing for causal statements regarding the treatment effect.

You are interested in conducting a study to assess the association between radiation therapy technique (whole breast irradiation [WBI] and partial breast irradiation [PBI]) and the risk of local recurrence for patients with ductal carcinoma in situ (DCIS) of the breast. Because of the excellent prognosis of this subgroup of patients and the subsequent rarity of recurrences, conducting a well-powered randomized controlled trial would entail resources beyond your reach. Instead, you decide to conduct an observational study using a large retrospective database from your institution. Your primary concern is that owing to the lack of randomization, patients chosen for each treatment approach may be inherently different from each other. What statistical approaches are available to help mitigate the impact of this limitation?

Randomized controlled trials remain the gold standard for estimating the causal effect of a treatment on patient outcomes.
Random assignment to treatment arms maximizes the likelihood that the treatment groups are composed of patients with similar baseline characteristics (both known and unknown), enabling the assumption that the difference in outcomes is due to the study intervention as opposed to another factor. If patients in the 2 treatment arms had different baseline characteristics related to prognosis, this discrepancy might inadvertently influence the conclusion of the trial. This phenomenon, called “confounding,” is the presence of a known or unknown variable that is associated with both the intervention and the outcome, such that a perceived but spurious relationship between the 2 is actually due to the confounder.

Although randomization is the ideal approach to reduce the influence of confounding variables, it is not always feasible, so various statistical approaches have been developed to minimize the bias present in nonrandomized studies. A common approach is to use a multivariable regression model to adjust for the effect of known confounders. However, multivariable models can be limited in terms of statistical power, especially when there is a high ratio of predictors to events; in this situation, termed “overfitting,” overloading the model with too many variables can lead to inaccuracy and instability in the effect estimates. This issue can be particularly problematic in cases such as our vignette, where we have many known confounders but expect a small number of breast recurrences, a combination that is relatively frequent in retrospective radiation therapy studies. Therefore, statisticians have recently favored an alternate approach to account for systematic differences in baseline characteristics between treatment groups in observational studies. This approach entails use of a “propensity score” and was developed by 2 statisticians, Rosenbaum and Rubin, in the 1980s [1: Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41-55].

The propensity score is defined as the probability that an individual would have been allocated to a particular treatment group as a function of observed baseline characteristics [1]. These conditional probabilities are most commonly estimated using multivariable logistic regression in which the treatment group is the dependent variable and the baseline characteristics are the independent variables. In other words, instead of directly modeling the outcome of interest (eg, breast local recurrence), one first models the likelihood of receiving a given treatment (eg, WBI), using as predictors the baseline characteristics that may influence treatment choice, outcome, or both. The choice of variables is paramount because the validity of a propensity score analysis depends on adequate specification of the propensity score model. As the old saying goes, “garbage in, garbage out.” The regression model then produces, for each patient, the predicted probability of receiving a given treatment; these probabilities, which range from 0 to 1, are termed the propensity scores. In the DCIS case vignette, the propensity score would be obtained using logistic regression to model the probability of receiving WBI as a function of margin status, grade, age, and other potential confounders known to affect the risk of local recurrence. It is important to note that the baseline characteristics used in the logistic model need not be statistically significant and may be correlated with one another.

Once the propensity score has been obtained for each individual in the cohort, 4 methods are commonly used to incorporate the scores into study design and data analysis.
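The estimation step described above, a logistic regression of treatment on baseline characteristics, can be sketched in a few lines. This is a minimal illustration rather than a recommended implementation: the cohort is simulated, the covariates (positive margin, high grade, standardized age) and all coefficients are invented, the function names are hypothetical, and the model is fitted by plain gradient ascent instead of a statistical package.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_predictor(beta, row):
    """Intercept plus the weighted sum of one patient's covariates."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


def fit_propensity_model(X, treated, learning_rate=0.5, n_iter=1500):
    """Fit a logistic regression of treatment (1 = WBI, 0 = PBI) on baseline
    covariates by gradient ascent on the mean log-likelihood."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)  # beta[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for row, z in zip(X, treated):
            err = z - sigmoid(linear_predictor(beta, row))
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        beta = [b + learning_rate * g / n for b, g in zip(beta, grad)]
    return beta


def propensity_scores(X, beta):
    """Predicted probability of receiving WBI for each patient."""
    return [sigmoid(linear_predictor(beta, row)) for row in X]


def simulate_cohort(n=300, seed=1):
    """Hypothetical cohort: positive margin (0/1), high grade (0/1), and age
    in standard deviation units. All coefficients are invented."""
    rng = random.Random(seed)
    X, treated = [], []
    for _ in range(n):
        margin = 1.0 if rng.random() < 0.3 else 0.0
        grade = 1.0 if rng.random() < 0.4 else 0.0
        age = rng.gauss(0.0, 1.0)
        p_wbi = sigmoid(-0.5 + 1.5 * margin + 0.8 * grade - 0.7 * age)
        X.append([margin, grade, age])
        treated.append(1 if rng.random() < p_wbi else 0)
    return X, treated


X, treated = simulate_cohort()
beta = fit_propensity_model(X, treated)
scores = propensity_scores(X, beta)
print("first 3 propensity scores:", [round(s, 3) for s in scores[:3]])
```

Note that the recurrence outcome does not appear anywhere in this step: the model predicts only which treatment a patient received.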
These 4 methods are matching, stratification, inverse probability of treatment weighting, and covariate adjustment [2: Haukoos JS, Lewis RJ. The propensity score. JAMA. 2015;314:1637-1638] [3: Thavaneswaran A, Lix L. Propensity score matching in observational studies. Manitoba Centre for Health Policy, 2008].

Propensity score matching is the most straightforward and rigorous approach to reducing bias using propensity scores. This method entails pairing patients in one treatment group with patients in the other treatment group who have similar propensity scores. Although 1:1 matching is the most common implementation, more than 1 control patient may be used in many-to-one (M:1) matching to obtain a matched set. The primary method for selecting the matches is called “nearest neighbor matching,” in which each patient is paired with the patient in the other group whose propensity score is closest, provided that the difference falls within a prespecified threshold (often called a caliper). Matching may be done without replacement or with replacement; in the latter case, a control patient may be included in more than 1 matched pair or set. Matching has the added benefit of producing a table of observed baseline characteristics by treatment group (similar to a “Table 1” in a randomized study). The reader can look at the standardized difference of means or proportions to assess whether the distribution of baseline characteristics is similar between treatment groups in a matched sample. Although no conventional criterion of imbalance has been established, statisticians have proposed that a standardized difference between treatment groups of <0.1 (ie, less than 10% of the pooled standard deviation) suggests that the mean or proportion of a baseline characteristic is similar [4: Normand ST, Landrum MB, Guadagnoli E, et al. Validating recommendations for coronary angiography following acute myocardial infarction in the elderly: A matched analysis using propensity scores. J Clin Epidemiol. 2001;54:387-398]. In addition, the entire distribution of baseline characteristics between treatment groups should be compared using graphical methods such as boxplots and quantile-quantile plots [5: Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28:3083-3107]. The critical caveat of propensity score matching is that unmatched patients are not analyzed, which may decrease statistical power or even introduce bias.

To use propensity score matching to compare local recurrence risk after WBI versus PBI for DCIS, we would first develop a logistic regression model to create the propensity score, assigning every patient in the cohort a probability from 0 to 1. We would then match each WBI patient with 1 unique PBI patient with a similar propensity score (eg, patient X treated with WBI had a score of 0.830 and patient Y treated with PBI had a score of 0.835). Unmatched PBI and WBI patients are excluded from the analysis, and one then compares the recurrence outcomes between the 2 groups using standard methods for survival analysis, such as a cumulative incidence analysis incorporating competing risks.

Stratification uses an approach similar to matching, but the study population is divided into mutually exclusive subsets based on the propensity score. A common approach is to divide patients into 5 groups of equal size at the quintiles of the propensity score distribution. Within each stratum, treatment groups have propensity scores in the same range and thus, in principle, a similar distribution of observed baseline characteristics. The treatment difference is first estimated within each stratum by directly comparing outcomes between the treatment groups in that stratum.
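The quintile stratification described above can be sketched as follows. This is a simplified illustration under stated assumptions: propensity scores are taken as already estimated, the outcome is a binary recurrence indicator compared by a difference in proportions (the vignette would instead use competing risks regression), the data are invented, and the function names are hypothetical. The last helper combines the stratum-specific differences by a size-weighted average.

```python
def assign_quintiles(scores):
    """Stratum index (0 to 4) per patient, cutting the ranked propensity
    scores into 5 groups of equal, or nearly equal, size."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    strata = [0] * n
    for rank, i in enumerate(order):
        strata[i] = rank * 5 // n
    return strata


def stratum_differences(scores, treated, outcome):
    """Per stratum: (patient count, event proportion in treated minus event
    proportion in controls). The difference is None when a stratum lacks
    one of the two treatment groups."""
    strata = assign_quintiles(scores)
    results = []
    for s in range(5):
        t = [y for k, z, y in zip(strata, treated, outcome) if k == s and z == 1]
        c = [y for k, z, y in zip(strata, treated, outcome) if k == s and z == 0]
        diff = sum(t) / len(t) - sum(c) / len(c) if t and c else None
        results.append((len(t) + len(c), diff))
    return results


def pooled_difference(results):
    """Average of the stratum-specific differences, weighted by stratum size.
    Strata without a usable comparison are skipped."""
    usable = [(n, d) for n, d in results if d is not None]
    total = sum(n for n, _ in usable)
    if total == 0:
        raise ValueError("no stratum contains both treatment groups")
    return sum(n * d for n, d in usable) / total


# Invented data: 20 patients, 1 = WBI and 0 = PBI, outcome 1 = recurrence.
scores = [(i + 0.5) / 20 for i in range(20)]
treated = [i % 2 for i in range(20)]
outcome = [1 if i % 4 == 1 else 0 for i in range(20)]

results = stratum_differences(scores, treated, outcome)
print("per-stratum (n, difference):", results)
print("pooled difference:", pooled_difference(results))
```

A stratum that contains only one treatment group yields no comparison, which is one practical sign of poor overlap in propensity scores.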
The stratum-specific effects are then pooled across all strata as a weighted average to estimate an overall treatment difference. The stratum-specific weights represent the proportion of all patients in the corresponding stratum, and thus the overall treatment difference is a simple average of the stratum-specific estimates if the strata are of equal size. A key benefit of stratification is that it allows for the inclusion of all cases, but because the grouping by propensity score is less granular, it may not account for strong confounding to the same extent as the matched approach. In addition, performance tends to be poor when outcome events are few, particularly relative to a large number of strata.

Returning to the DCIS vignette, a stratification approach would begin with a direct comparison of treatment groups within each stratum using competing risks regression. To estimate the overall difference in local recurrence risk after WBI versus PBI, the stratum-specific hazard ratios are then averaged across all strata. If the strata are of unequal sizes, the average treatment effect is a weighted estimate using weights corresponding to the patient numbers in each stratum.

Inverse probability of treatment weighting entails allocating a weight to each patient in the cohort. This weight is the reciprocal of the probability of receiving the treatment that the patient actually received. That probability is the propensity score for patients who received the experimental treatment, whereas it is (1 – propensity score) for controls. Essentially, the weighting increases the contribution of patients who received a treatment that they were unlikely to receive based on their propensity score. The goal of weighting is to generate a synthetic sample with virtually balanced covariates so that treatment assignment is independent of the measured confounders.
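The weight definition above can be sketched directly. This is an illustrative sketch only: propensity scores are assumed to be already estimated, the outcome is simplified to a binary recurrence indicator compared by weighted proportions rather than by the competing risks regression of the vignette, and the function names and numbers are hypothetical.

```python
def iptw_weights(scores, treated):
    """Reciprocal of the probability of the treatment actually received:
    1 / ps for treated patients and 1 / (1 - ps) for controls."""
    weights = []
    for ps, z in zip(scores, treated):
        if not 0.0 < ps < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 / ps if z == 1 else 1.0 / (1.0 - ps))
    return weights


def weighted_mean(values, weights):
    """Weighted average of the values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_risk_difference(scores, treated, outcome):
    """Weighted event proportion in treated minus that in controls."""
    w = iptw_weights(scores, treated)
    y_t = [y for y, z in zip(outcome, treated) if z == 1]
    w_t = [wi for wi, z in zip(w, treated) if z == 1]
    y_c = [y for y, z in zip(outcome, treated) if z == 0]
    w_c = [wi for wi, z in zip(w, treated) if z == 0]
    return weighted_mean(y_t, w_t) - weighted_mean(y_c, w_c)


# Invented example: 1 = WBI, 0 = PBI; outcome 1 = recurrence.
scores = [0.8, 0.8, 0.2, 0.5]
treated = [1, 0, 1, 0]
outcome = [1, 0, 0, 1]

for ps, z, w in zip(scores, treated, iptw_weights(scores, treated)):
    print(f"propensity score {ps:.1f}, treated {z}, weight {w:.2f}")
print("weighted risk difference:",
      round(weighted_risk_difference(scores, treated, outcome), 3))
```

In this example the second and third patients each received the treatment that their propensity score made unlikely, so they carry the largest weights.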
Patients are then analyzed using these weights to obtain a weighted estimate of the average treatment effect. This approach allows for the inclusion of all cases, with the notable limitation that the weighted analysis may be unstable or inaccurate when weights are extreme, which occurs when propensity scores are close to 0 or 1 and some patients therefore had a very low probability of receiving the treatment they actually received. In our case vignette, the weights are the reciprocal of the propensity score for WBI patients and the reciprocal of (1 – propensity score) for PBI patients. Using these weights, we would fit a competing risks regression model relating local recurrence to a treatment indicator (1 = WBI, 0 = PBI) to estimate the hazard ratio as a measure of the treatment difference between WBI and PBI.

Covariate adjustment incorporates the propensity score as a predictor in a conventional regression model with treatment group as an indicator variable. This method is the most commonly used implementation of the propensity score [3] because it allows for straightforward inclusion of all patients in the analysis. However, the pros and cons of covariate adjustment are debated [6: Austin PC. The relative ability of different propensity score methods to balance measured covariates between treated and untreated subjects in observational studies. Med Decis Making. 2009;29:661-677] [7: Garrido MM, Kelley AS, Paris J, et al. Methods for constructing and assessing propensity scores. Health Serv Res. 2014;49:1701-1720] [8: Elze MC, Gregson J, Baber U, et al. Comparison of propensity score methods and covariate adjustment: Evaluation in 4 cardiovascular studies. J Am Coll Cardiol. 2017;69:345-357]. A multivariable model may be used to adjust for additional variables expected to be potential confounders of the study outcome. Thus, a particular variable may be included in both the original propensity model and the multivariable regression model to adjust for residual confounding. To overcome insufficient covariate balance, baseline characteristics in the propensity score model may be incorporated as predictors in the multivariable regression model along with the propensity score and treatment indicator [1] [9: D’Agostino RB Jr. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat Med. 1998;17:2265-2281] [10: Kang JDY, Schafer JL. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci. 2007;22:523-539]. The approach is described as doubly robust because it protects against misspecification of either the propensity score model or the multivariable regression model, but it reintroduces the challenges of multivariable modeling, including overfitting.

If we were to apply covariate adjustment to the DCIS vignette, we would fit a competing risks regression model relating local recurrence to a treatment indicator (1 = WBI, 0 = PBI) and the propensity score. A doubly robust approach that also includes baseline characteristics in the regression model is unlikely to be feasible because the total number of local recurrences is small.

Propensity scores provide a way to design and analyze an observational study that estimates the independent association between a treatment group and an outcome variable in settings where randomization is not feasible [3]. Such an approach has the statistical properties to partially address the impact of confounding by indication and thus improve the internal validity of the study. Referring to the case vignette, suppose that WBI was truly superior to PBI in efficacy but that patients with positive DCIS margins were more likely to be offered WBI. Because the rate of recurrence may be higher in patients with positive margins, the excess recurrences in the WBI group could be erroneously attributed to inferiority of WBI. There would therefore be an underestimation of the true benefit of WBI (biasing the treatment effect toward the null hypothesis of no difference). This confounding effect could be partially addressed by propensity score matching, which takes margin status into account so that covariate balance is achieved between treatment groups at baseline [2] [3]. Thus, propensity score matching would improve estimation of the causal treatment effect in an observational study by mimicking some of the statistical properties of a randomized controlled trial. In particular, in observational studies with few events, propensity score approaches provide greater statistical power than traditional multivariable regression, allowing for more effective control of confounding variables without the concerns about overfitting or model convergence that arise in conventional multivariable analysis.
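The matching procedure and the balance check referred to above can be sketched as follows. This is one simple variant under stated assumptions, not a definitive implementation: greedy 1:1 nearest neighbor matching without replacement, treated patients visited in random order, a caliper of 0.05 on the propensity score chosen arbitrarily, and a standardized difference computed for a continuous covariate. The scores, ages, and function names are hypothetical.

```python
import math
import random
import statistics


def nearest_neighbor_match(scores, treated, caliper=0.05, seed=0):
    """Greedy 1:1 nearest neighbor matching without replacement.

    Treated patients are visited in random order; each is paired with the
    unmatched control whose propensity score is closest, provided the
    absolute difference is within the caliper. Returns (treated, control)
    index pairs; patients left unpaired are dropped from the analysis.
    """
    rng = random.Random(seed)
    treated_idx = [i for i, z in enumerate(treated) if z == 1]
    available = [i for i, z in enumerate(treated) if z == 0]
    rng.shuffle(treated_idx)
    pairs = []
    for t in treated_idx:
        if not available:
            break
        best = min(available, key=lambda c: (abs(scores[c] - scores[t]), c))
        if abs(scores[best] - scores[t]) <= caliper:
            pairs.append((t, best))
            available.remove(best)
    return pairs


def standardized_difference(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(group_a) + statistics.variance(group_b)) / 2.0
    )
    if pooled_sd == 0.0:
        return 0.0
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical data: 1 = WBI, 0 = PBI; ages are invented.
scores = [0.830, 0.835, 0.30, 0.90, 0.31, 0.60]
treated = [1, 0, 1, 0, 0, 1]
age = [62.0, 63.0, 48.0, 71.0, 50.0, 55.0]

pairs = nearest_neighbor_match(scores, treated)
matched_wbi_age = [age[t] for t, _ in pairs]
matched_pbi_age = [age[c] for _, c in pairs]
print("matched pairs:", sorted(pairs))
print("standardized difference in age:",
      round(standardized_difference(matched_wbi_age, matched_pbi_age), 3))
```

The WBI patient with a score of 0.60 and the PBI patient with a score of 0.90 have no partner within the caliper, so both are excluded, which illustrates how matching trades sample size for comparability.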

Although there are distinct advantages to using propensity scores, there are also limitations. It is important to note that propensity score methods work best in large data sets in which one can obtain a reasonable spread of baseline characteristics between the treatment groups. In addition, the range of propensity scores needs to overlap substantially between treatment groups to enable matching or regression modeling. In the example of our case vignette, if all the patients receiving PBI were >70 years old, but all the patients receiving WBI were <70 years old, balancing age between the 2 groups would be extremely challenging.

Furthermore, the quality of the propensity score analysis is dependent on adequate specification of the propensity score model. The propensity score will only adjust for the impact of observed confounders that are included as predictors in the logistic regression used to generate it. There will be no adjustment for the impact of baseline characteristics that are not included in the propensity score model, including unknown or unmeasured covariates [3]. In our case vignette, suppose that the grade of the DCIS lesion was documented in only 50% of cases. This lack of data would compromise the adequacy of the propensity score model. Alternatively, suppose that a yet-to-be-identified molecular prognostic marker was more prevalent among patients receiving PBI. This marker would not be accounted for in any way in the propensity score model. These examples illustrate that although the use of propensity scores can help to balance known, observed confounders, randomization is the only approach that guarantees balanced distributions of known and unknown confounders between treatment groups.
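The overlap requirement mentioned above can be checked before any matching or modeling. The sketch below uses one simple convention, which is an assumption rather than the only option: the region of common support is taken as the interval between the larger of the 2 group minimums and the smaller of the 2 group maximums. The function names and data are hypothetical.

```python
def common_support(scores, treated):
    """Interval of propensity scores covered by both treatment groups,
    or None when the two ranges do not overlap at all."""
    t = [ps for ps, z in zip(scores, treated) if z == 1]
    c = [ps for ps, z in zip(scores, treated) if z == 0]
    low = max(min(t), min(c))
    high = min(max(t), max(c))
    return (low, high) if low <= high else None


def fraction_in_support(scores, treated):
    """Share of all patients whose score lies inside the common support."""
    support = common_support(scores, treated)
    if support is None:
        return 0.0
    low, high = support
    return sum(1 for ps in scores if low <= ps <= high) / len(scores)


# Invented scores: 1 = WBI, 0 = PBI.
scores = [0.2, 0.5, 0.7, 0.4, 0.6, 0.9]
treated = [0, 0, 0, 1, 1, 1]
print("common support:", common_support(scores, treated))
print("fraction of patients inside it:",
      round(fraction_in_support(scores, treated), 2))
```

A small fraction, or no common support at all, corresponds to the age example in the text, where the 2 groups share no comparable patients.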

In the case vignette, the use of a propensity score would allow the design and analysis of an observational study that reduces the effects of confounding in the estimation of the effect of whole breast irradiation versus partial breast irradiation on local control for patients with DCIS of the breast. This example demonstrates how a propensity score is a useful instrument to reduce the effects of confounding between treatment groups in observational studies. A thorough understanding of the propensity score model and the potential shortcomings of each approach is required to accurately interpret results. Propensity score methods should not be misinterpreted as a statistical panacea for addressing confounding; however, used with care, they may provide investigators with a valuable tool to optimize the design and analysis of an observational study.

Dos and Don’ts
• Do consider the use of propensity scores when trying to achieve “balance” of baseline characteristics between treatment groups to reduce confounding effects in an observational study with a large data set when there are few outcome events of interest.
• Don’t assume that propensity scores will completely remove the effect of bias in estimating treatment effects: they do not. They can only account for known, measured variables.
• Do assess the propensity score method chosen with an understanding of the caveats of each approach.
• Don’t believe that a propensity score analysis can always be performed; there must be adequate numbers of patients with an overlapping distribution of known baseline characteristics for a propensity score analysis.

Glossary of Terms
• Bias: A measured result that differs from the true association between an independent variable (intervention) and a dependent variable (outcome).
• Confounding: The presence of a known or unknown factor that is associated with both the independent variable and the dependent variable, leading to a bias in the measured results.
• Overfitting: The inclusion of a high number of predictive variables (in comparison to the number of events) in a multivariable regression model, leading to inaccuracy and instability in the effect estimates.
• Nearest Neighbor Matching: A method used for propensity score matching that entails minimizing the absolute difference between the estimated propensity scores of each matched pair. A randomly selected individual in treatment group A is matched with an individual in treatment group B who has the propensity score closest in value. This process is repeated until all participants who can be matched within a prespecified range are accounted for.
• Confounding by Indication: A biased association between an independent variable and a dependent variable in observational studies in which patients are allocated to a particular treatment group based on their presentation of baseline characteristics.

The authors would like to thank Dr David Sher for his guidance and support.
