Some Old and Some New Statistical Tools for Outcomes Research

Sharon-Lise T. Normand, PhD
From the Department of Health Care Policy, Harvard Medical School, and Department of Biostatistics, Harvard School of Public Health, Boston, Mass.

Originally published 19 Aug 2008. https://doi.org/10.1161/CIRCULATIONAHA.108.766907. Circulation. 2008;118:872–884.

Outcomes research “seeks to understand the end results of particular health care practices and interventions”1 to inform the development of clinical practice guidelines, evaluate the quality of medical care, and foster effective interventions to improve the quality of care.2 Although randomized trial designs have been used to assess quality of care and to identify effective interventions in the real world,3,4 the empirical basis of outcomes research largely rests on data collected in the observational setting (eg, in the routine setting of everyday practice).

With more emphasis placed on increasing the value of health care in terms of lives saved and morbidity avoided, outcomes researchers are making unprecedented demands of observational databases. This is evidenced by the increasing number of, and participation in, national registries. These include the National Cardiac Data Registry of the American College of Cardiology; the Implantable Cardioverter Defibrillator Registry launched jointly by the American College of Cardiology and the Heart Rhythm Society; the National Cardiac Surgery Database of the Society of Thoracic Surgeons; and the Interagency Registry of Mechanically Assisted Circulatory Support Devices funded by the National Heart, Lung, and Blood Institute, the Centers for Medicare and Medicaid Services, and others. Empirical analyses of these databases require statistical tools that can handle the complexity of the data: observational, sometimes hierarchical, often with multiple outcomes, and always with some missing data.

The purpose of the present article is to review key statistical methods important to outcomes research and to introduce newer methodology. The article describes 4 methodological issues commonly present when observational data are analyzed; summarizes the primary assumptions associated with strategies to handle these common problems; demonstrates methods to assess the plausibility of the assumptions associated with each strategy; and illustrates these concepts using examples from cardiovascular outcomes research. Although the present report does not provide a comprehensive summary of statistical approaches to data analysis, it is intended to provide a clear understanding of the assumptions associated with some common methodological tools and of the strategies used to assess their plausibility. If these 2 goals are achieved, then both the rigor of and the scientific findings from outcomes research will be strengthened substantially.

Four Common Statistical Problems

An observational study is an empirical investigation in which the objective is to understand causal effects.
Specifically, an observational investigation “concerns treatments, interventions, or policies, and the effects they cause.”5 Because subjects are not randomized to treatments, several potential sources of bias exist that threaten the validity of findings. The problems induced by lack of randomization are not new, nor are the analytical strategies used to strengthen conclusions from observational studies. Nonetheless, more appropriate use and consistent reporting of these analytical strategies should be adopted in cardiology outcomes research. A series of 3 papers in the “Education and Debate” section of the British Medical Journal describes several practical questions researchers should ask when reading results from observational studies.6–8

Another often-ignored problem involves the structure of the data. Data are frequently clustered or “hierarchical” in nature, and this structure induces specific relationships among the data. Virtually all data have some hierarchical structure (eg, patients are clustered in hospitals), and ignoring this structure will often lead to erroneous conclusions. Missing data are yet another challenge to researchers, occurring in both randomized and observational studies. Despite the availability of statistical tools to handle missing data, unsupported methods continue to be used. Finally, an emerging issue relates to the increasing use of multiple outcomes and multiple informants in outcomes research.9 Investigators collect and assess multiple outcomes to comprehensively assess a treatment or policy effect but continue to use ad hoc pooling strategies to reach their conclusions.

Absence of Randomization

Although sometimes not stated explicitly, the most common goal in outcomes research involves the establishment of causation. For example, do drug-eluting stents (DES) cause excess mortality compared with bare-metal stents (BMS)? Does early catheterization in unstable angina patients lead to better in-hospital outcomes? Does invasive cardiac management increase survival after acute myocardial infarction (AMI)? Causal inference focuses on what would happen to a specific individual under different treatment options. In contrast, predictive inference focuses on the comparison of outcomes between groups of individuals who have received different treatments. Causal inference can be thought of as a special case of predictive inference in which subjects who could have received either treatment are identified and used to infer treatment effects (eg, what would have happened to a patient’s survival had the patient received a different treatment than the one observed?).

Specific features of a randomized clinical trial10 permit researchers to conclude whether a treatment or intervention is efficacious. First, the experimenter determines the assignment of treatments to patients using a known mechanism. This mechanism is the randomization allocation probability determined by the experimenter and implemented with standard software. Treatment allocation may correspond to equal allocation between treatment arms for all trial patients or to equal allocation between treatment arms within important patient groups, such as diabetic and nondiabetic patients. Allocation probabilities can be fixed, or they can be adaptive procedures11 that change as the study progresses.
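As a concrete illustration of a fixed allocation mechanism, the sketch below simulates 1:1 assignment within strata using permuted blocks. The stratum labels, block size, and seed are hypothetical choices for illustration, not taken from any study discussed here.

```python
import random

def permuted_block_assignments(n_patients, block_size=4, seed=2008):
    """Assign patients 1:1 to arms 'A' and 'B' in permuted blocks."""
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_patients:
        block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
        rng.shuffle(block)  # random order within each block keeps arms balanced
        assignments.extend(block)
    return assignments[:n_patients]

# Equal allocation within important patient groups (hypothetical strata).
for stratum, n in {"diabetic": 6, "nondiabetic": 10}.items():
    print(stratum, permuted_block_assignments(n))
```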
The outcome of the random assignment (eg, subject is assigned to treatment A) has several key properties, 2 of which are as follows: (1) it is predictive of the treatment taken, so that if we were to model the probability of treatment received as a function of treatment assignment, the odds ratio of the treatment-assignment variable would be large; and (2) treatment assignment is not related to outcome once we take into account the treatment received. This implies that it is the treatment received that causes a change in patient outcome and not the treatment assigned. A variable with these properties is also referred to as an “instrumental variable.”12,13

A second characteristic of a randomized trial is a surprisingly simple fact: every subject who meets the study inclusion criteria has a chance of receiving the treatment. This implies that the probability that a trial participant receives the study treatment is always greater than zero. This is due both to study inclusion/exclusion criteria that are developed to define the target population and to experimental control over who receives the study treatment. This seemingly trivial point is often ignored in observational studies.

The third feature is that, in theory, no unmeasured or measured variables (denoted confounders) are present that relate to both treatment assignment and outcome. Statistically, this implies that the “potential” outcome and treatment assignment are independent given the patient covariates. This means that because participants have been allocated randomly to treatment groups, and each participant had a chance of receiving treatment, the only difference between the treatment groups is treatment assignment. The standard estimate of the treatment effect is the intention-to-treat estimate, in which the average outcome of those assigned to the comparison treatment is subtracted from the average outcome of those assigned to treatment. The intention-to-treat estimate is valid only under the assumption of full treatment compliance and no missing data,14 and very few studies meet these criteria. Randomized studies have additional important features, such as blinding, that are not discussed here.

No Unmeasured Confounders

Observational studies generally fail to meet most of the assumptions required to support a causal conclusion (Table 1). If no unmeasured variables exist that confound the relationship between treatment assignment and the outcome, the assignment mechanism is said to be “ignorable.” The basis for this term is that the investigator can “ignore” the treatment assignment as long as the observed confounders are used to adjust outcome comparisons. For causal inferences in the ignorable setting, approaches to data analysis fall into 3 general categories. The most common is a regression model in which the outcome is regressed on the confounders and the treatment received, and adjusted outcomes are estimated. Numerous statistical packages are available for this purpose. A second approach is matching or stratification of subjects, by which categories formed by the set of confounders are created that contain both treated and comparison subjects. Outcome differences within each category are computed and then combined to form an overall estimated effect. The third approach is a combination of the first 2, in which the set of confounders is reduced to a single balancing score, often a propensity score,15 and outcomes are examined within groups defined by the score.
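The third approach can be sketched in a few lines. The following is a minimal illustration, not a complete analysis: the data frame, variable names, and simulated values are hypothetical, the propensity score is estimated by logistic regression, and patients are then grouped into score quintiles within which treated and comparison outcomes could be contrasted.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one row per patient, columns are measured confounders.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(65, 10, 500),
    "diabetes": rng.integers(0, 2, 500),
    "ef": rng.normal(50, 12, 500),          # ejection fraction
    "treated": rng.integers(0, 2, 500),     # 1 = received treatment
})

# Step 1: estimate P(treatment | confounders), the propensity score.
X, t = df[["age", "diabetes", "ef"]], df["treated"]
df["pscore"] = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]

# Step 2: form strata (here, quintiles) of the estimated score; outcome
# differences would then be computed within each stratum and combined.
df["stratum"] = pd.qcut(df["pscore"], q=5, labels=False)
print(df.groupby(["stratum", "treated"]).size())
```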
Table 1. Strengthening Conclusions to Support Estimates of Treatment or Intervention Effects When Assuming All “Important” Confounders Are Measured

How were treatments assigned?
- Clinical trial: Experimenter determines allocation using a known mechanism.
- Observational study: Researcher has no control over treatment assignment, and the assignment mechanism is unknown.
- Assessment in the observational setting: (1) Estimate the treatment-assignment mechanism via a propensity score model: P(treatment=k vs not k, given confounders). (2) Estimate the treatment-assignment mechanism using the variation predicted by an instrumental variable.

Does everyone have a chance of receiving the intervention or treatment?
- Clinical trial: Yes (by design).
- Observational study: No.
- Assessment in the observational setting: Determine whether there are subjects with estimated propensity scores very close to 0 or to 1; eliminate these subjects from further analyses.

Are all patients comparable between treatment or intervention arms?
- Clinical trial: Almost always.
- Observational study: Sometimes.
- Assessment in the observational setting: (1) Examine standardized differences of covariates between arms (Figure 1). (2) Examine the overlap in the distribution of estimated propensity scores between treatment arms (Figures 2 and 3).

What is the effect of the treatment or intervention on the outcome?
- Clinical trial: Intention-to-treat estimate: the difference between the mean outcome among those assigned to treatment and those assigned to the comparison group.*
- Observational study: The difference between the mean outcome among those who received the treatment and those in the comparison group, in the subset of patients who are similar on the basis of the measured confounders.
- Assessment in the observational setting: Conduct a sensitivity analysis to determine how sensitive conclusions are to unmeasured confounders.

Confounders are variables related to the probability of treatment assignment and to the outcomes of interest. *Assuming full compliance and no missing data.

Each approach is associated with specific assumptions. In addition to the usual distributional and independence assumptions associated with regression modeling, 2 additional assumptions need assessment: (1) sufficient overlap of the measured confounders to permit sensible estimation of treatment effects and (2) similar distributions of confounders, so that conclusions do not depend on the distributional assumptions made by the regression model. In fact, regression modeling can perform poorly when the variances of the confounders between treatment groups are unequal, which is very often the case in observational studies. Matching or stratifying on the confounders is sensible but becomes difficult when the number of confounders is large: with 20 confounders, each assuming 2 categories, there are approximately 1 million (2^20) matching categories. The combination of approaches offers a practical alternative and facilitates assessment of the plausibility of statistical assumptions.

Illustrative Example: Long-Term Clinical Outcomes After DES and BMS in Massachusetts

Mauri and colleagues16 compared all-cause mortality, revascularization, and myocardial infarction rates between 11 516 patients undergoing DES implantation and 6210 patients undergoing BMS implantation between April 1, 2003, and September 30, 2004. Patients were not randomized to stent type and differed on several observed confounders. To compare average differences between the groups in each confounder on a common scale, percent standardized differences in mean values (the difference in mean values divided by the pooled SD for the DES and BMS groups) were computed for a number of important confounders (Figure 1).
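The percent standardized difference just described is straightforward to compute. A minimal sketch follows, using one common convention for the pooled SD (the square root of the average of the two group variances); the arrays of ejection-fraction values are simulated stand-ins for the registry data.

```python
import numpy as np

def standardized_difference(x_treated, x_control):
    """Percent standardized difference in means between two groups."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2.0)
    return 100.0 * (x_treated.mean() - x_control.mean()) / pooled_sd

# Hypothetical ejection-fraction values for the DES and BMS groups.
rng = np.random.default_rng(1)
ef_des = rng.normal(52, 11, 1000)
ef_bms = rng.normal(48, 13, 1000)
print(f"{standardized_difference(ef_des, ef_bms):.1f}%")
```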
Large differences were observed for commercial health insurance, acute coronary syndrome (ACS), status of the procedure, and ejection fraction.

Figure 1. Percent standardized differences between DES and BMS patients for selected prestent characteristics stratified by type of characteristic. Open circles (○) denote mean standardized differences for 17 726 patients; filled squares (▪) denote mean standardized differences for 3752 matched pairs. HMO indicates health maintenance organization; NYHA, New York Heart Association; CCS, Canadian Classification System; EF, ejection fraction; LCX, left circumflex; LM, left main; and RCA, right coronary artery.

To assess comparability of the distributions of the observed confounders, plots of the relative frequency of values for each confounder (a density estimate) for the BMS and DES groups were constructed. Figure 2 presents density estimates for 3 (of a total of 65) confounders and for the log-odds of the estimated propensity scores. ACS for more than 1 day (yes versus no) was transformed by subtracting the mean value of the entire cohort and dividing by its SD, so that positive values indicated a higher than average chance of having ACS for more than 1 day and negative values indicated the opposite. The ACS distributions had different means but the same shapes (Figure 2A); the days-on-market distributions had different means and different skews for the 2 groups (Figure 2B); the age distributions appeared comparable (Figure 2C). When all confounders were considered simultaneously, the propensity score distributions (Figure 2D) had different means and different skews for the DES and BMS groups. Figure 2D also demonstrates a lack of overlap in the left tail between the 2 groups, which suggests that there may have been no DES patients “comparable” to BMS patients at these propensity scores. Figure 3 suggests greater comparability of the distributions when a matched sample of 3752 BMS and DES pairs, created from the estimated propensity scores, was used. For this subset of patients, the assumptions of comparability on measured confounders were met.

Figure 2. Distribution of selected covariates for DES and BMS patients (17 726 patients). The x axis is the value of the covariate, and the y axis is a density estimate that reflects the proportion of the DES (dashed lines) and BMS (solid lines) groups having the particular values of the covariate.

Figure 3. Distribution of selected covariates for DES and BMS patients (3752 matched pairs). The x axis is the value of the covariate, and the y axis is a density estimate that reflects the proportion of the DES (dashed lines) and BMS (solid lines) groups having the particular values of the covariate.
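The overlap check illustrated in Figures 2 and 3 can be summarized numerically as well as graphically. In the sketch below, simulated propensity scores stand in for the estimated ones; kernel density estimates of the log-odds of the score are compared between arms, and the shared area under the two densities is reported as a rough overlap summary (this particular summary is an illustrative choice, not a statistic used in the article).

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical estimated propensity scores for the two stent groups.
rng = np.random.default_rng(3)
ps_des = rng.beta(6, 3, 2000)
ps_bms = rng.beta(4, 5, 2000)

def log_odds(p):
    return np.log(p / (1.0 - p))

grid = np.linspace(-4.0, 4.0, 400)
dens_des = gaussian_kde(log_odds(ps_des))(grid)
dens_bms = gaussian_kde(log_odds(ps_bms))(grid)

# Shared area under the two densities: 1.0 means identical distributions,
# values near 0 mean little overlap (few or no comparable patients).
overlap = np.minimum(dens_des, dens_bms).sum() * (grid[1] - grid[0])
print(f"density overlap: {overlap:.2f}")
```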
Unmeasured Confounders

If insufficient measured confounders are present to adequately capture the selection of treatments and outcomes, then the investigator cannot ignore the treatment-assignment mechanism. In this case, the treatment assignment is said to be “nonignorable.” Determination of whether the treatment assignment is ignorable is based on the clinical problem and the richness of the measured variables. One method available to researchers when treatment assignment is nonignorable is the use of instrumental variables. The estimated treatment effect is loosely calculated as the difference in mean outcomes between treatment groups divided by the difference in treatment assignment predicted by the instrument between the 2 groups (a numeric sketch of this calculation appears after the Stukel example below). An estimate that uses an instrument that is weakly associated with treatment (ie, one that does not predict treatment assignment well) may give misleading results. Several assumptions need to be satisfied when an instrumental variables analysis is used (Table 2), and there are different approaches to estimating the treatment effect.17 Instrumental variables analysis is implemented in several popular software packages, such as Stata’s ivreg and ivprobit commands (Stata Corp, College Station, Tex) and SAS Proc SYSLIN (SAS Institute Inc, Cary, NC).

Table 2. Assumptions Needed When Conducting an Instrumental Variables Analysis

(Examples applied to the study by Ryan et al18: instrument=weekend vs weekday hospital presentation; treatment=early catheterization; outcome=death, myocardial infarction, stroke, shock, or congestive heart failure. Confounders are variables related to the probability of treatment assignment and to the outcomes of interest.)

Assumption: The instrument is not related to unmeasured confounders (cannot be tested empirically).
- Example: Investigators used detailed data collected as part of a registry that used common definitions. Means of observed covariates and use of in-hospital medications (β-blockers, etc) appeared similar between weekend-presenting and weekday-presenting patients.

Assumption: (1) The relationship between the instrument and treatment received is nonzero (a powerful predictor of treatment received), and (2) the instrument is related to outcome only through treatment.
- Example: (1) Median time to catheterization was 23 h for weekday patients and 46 h for weekend patients. Early catheterization was regressed on weekday presentation, patient characteristics, and hospital characteristics; weekday presentation was statistically predictive of early catheterization. (2) There was no statistically significant relationship between weekday presentation and in-hospital outcome after adjustment for patient characteristics, hospital characteristics, and early catheterization.

Assumption: There are no patients who would receive the treatment if the instrument were false but who would not receive the treatment if the instrument were positive.
- Example: Need to assume that there are no patients who would receive early catheterization if they presented on a weekend but who would not receive early catheterization if they presented on a weekday. This appears reasonable.

Assumption: For patients whose treatment received would not have been changed by the instrument, there is no effect of the instrument on the outcome.
- Example: Investigators would need to argue that for patients who would never undergo early catheterization, weekend presentation has no effect on in-hospital outcomes. Similarly, for patients who would always undergo early catheterization, weekend presentation has no effect on in-hospital outcomes. This assumption appears plausible.

Assumption: Treatment assignment for 1 patient does not affect the outcome of another patient.
- Example: This assumption would be violated if the likelihood of a patient undergoing early catheterization were related to the survival of another patient. It is reasonable for 2 patients presenting at 2 different hospitals. It could be violated for 2 patients treated within the same hospital if the mortality experience of patients treated in the hospital influenced the likelihood of early angiography for other patients within the hospital (which appears unlikely).
Illustrative Example: Early Versus Late Catheterization in Unstable Angina or ST-Elevation Myocardial Infarction Patients

Ryan et al18 compared the effects of early versus late use of catheterization on death, reinfarction, stroke, cardiogenic shock, and congestive heart failure using an observational cohort of patients treated at 310 US hospitals. The authors used weekday (7:01 am Sunday through 4:59 pm Friday) versus weekend (5 pm Friday through 7 am Sunday) presentation as an instrumental variable for early catheterization. They observed 45 548 patients who presented on a weekday and 10 804 who presented on a weekend. The median times to catheterization and to percutaneous coronary intervention were 23.4 and 22.6 hours, respectively, for the weekday group and 46.3 and 44.5 hours for the weekend group. This observation supports the assumption that weekday presentation is predictive of who receives early catheterization. Table 2 provides justification for each of the instrumental variable assumptions in this particular example.

Illustrative Example: Effects of Invasive Cardiac Management on Survival After AMI

The report by Stukel and colleagues19 provides an excellent application of propensity scores and instrumental variable analysis to determine the effects of invasive cardiac management on survival after AMI. The authors implemented a propensity-based matched analysis by first modeling receipt of cardiac catheterization as a function of patient, hospital, and area-level covariates, then selecting pairs of patients with similar probabilities of undergoing catheterization in which 1 member of the pair received the procedure and the other did not, and finally estimating the relative risk of mortality in these matched pairs using Cox regression. The authors found a 50% relative reduction in mortality risk for catheterized patients using Cox regression. They next used regional cardiac catheterization rates as instrumental variables to conduct an instrumental variable analysis. The mean cardiac catheterization rates varied from 42.8% to 65% across regions, with corresponding 4-year mortality rates of 43.1% to 38.9% (authors’ Table 4). A crude instrumental variable estimate is thus (43.1−38.9)/(65−42.8)=18.9% absolute mortality reduction and is interpreted as, “If we increased the cardiac catheterization rates by 22%, we would observe an 18.9% reduction in 4-year mortality.” Using the Stata function ivreg, the authors calculated an estimate of a 16% relative, or 9.7% absolute (authors’ Table 5), mortality reduction at 4 years.
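The crude instrumental variable estimate quoted above is a Wald-type calculation that can be reproduced directly. In the sketch below, the mortality and catheterization rates are the two extremes quoted from the authors’ Table 4; reducing the regions to two groups is a simplification for illustration.

```python
def wald_iv_estimate(y1, y0, t1, t0):
    """Crude IV (Wald) estimate: the difference in mean outcomes divided by
    the difference in treatment rates between two instrument-defined groups."""
    return (y1 - y0) / (t1 - t0)

# Group 1: highest-catheterization regions; group 0: lowest (Stukel et al).
effect = wald_iv_estimate(y1=0.389, y0=0.431, t1=0.650, t0=0.428)
print(f"{effect:.3f}")  # -0.189: an 18.9-point absolute reduction in 4-year mortality
```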
How do we interpret these results? Both estimates indicate a benefit of invasive cardiac management, but the size of the benefit differs (50% versus 16% relative reduction). Three explanations are possible: both estimates are wrong, only 1 estimate is wrong, or they are estimating different treatment effects. If both estimators estimate the average treatment effect, then, provided no residual confounding was present and the treatment benefit was constant across patient groups, the estimates should agree. The authors used a linear regression model for their instrumental variable analysis, which assumes the treatment effect is additive. This implies that the instrumental variable estimate of the average treatment effect is measured on the absolute, not the relative, scale. The 4-year absolute mortality benefits are not comparable: 19.1% for the propensity-based matched pairs (authors’ Table 1; the difference in mortality rates between matched pairs reported in the last row: 55.4%−36.3%) and 9.7% for the instrumental variable analysis (authors’ Table 5). If the treatment effect were constant across different risk groups of patients, such a difference would imply that the propensity score estimate is biased. Is there evidence of a nonconstant treatment effect? It appears so: the authors reported a predicted absolute 1-year mortality benefit ranging from 3.3% in the lowest propensity score decile to 0.8% in the highest decile. The authors’ Table 4 also suggests a nonconstant treatment effect across regions. Because the propensity-score–matched estimate and the instrumental variable estimate use different subsets of the original sample, there is no guarantee that these subsets are the same. Thus, there is no reason to expect the estimates to be the same; both could be correct but targeted to 2 different subpopulations.

Sensitivity to Unmeasured Confounders

Sensitivity analyses are another underused and powerful tool. Rosenbaum20 proposed an elegant approach to quantify how study conclusions would be changed by unmeasured confounding. The key idea involves the creation of a confounding measure that quantifies the degree of unmeasured confounding and the use of plausible values of the measure to calculate new P values that quantify how study conclusions would change. The confounding measure is the ratio of the odds that 2 patients with identical observed confounders receive the treatment. When this OR is 1, the study is free of unmeasured confounders; when it is larger than 1 (say, 2), then 2 patients who appear similar on the measured confounders could differ in their odds of receiving treatment by as much as a factor of 2. Rosenbaum developed several formulas for bounds on P values for common test statistics, such as the Wilcoxon signed rank statistic and the McNemar test statistic. If the study findings remain statistically significant for several plausible values of the OR, the investigator may conclude that the study is insensitive to hidden confounders.

Three comments are in order. First, there will always be a value of the OR at which the P value changes from statistically significant to not statistically significant; at this value, the investigator would conclude that unobserved confounders could explain the observed association between the treatment and the outcome. Second, the sensitivity analysis requires that the investigator specify a priori plausible values for the confounding measure (the OR). The ORs selected should be problem-specific and will depend on the type and number of measured confounders already included in the model. For example, if only age and sex are used to adjust for treatment selection and risk of reinfarction, then large values of the OR are plausible. On the other hand, if in addition to demographic variables the investigator includes signs and symptoms on presentation, cardiovascular history, preprocedure variables, and contraindications to medications, then large values of the OR are not as plausible. Third, the fact that the study results are insensitive does not mean that no unmeasured confounders exist.
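For matched pairs with a binary outcome, the Rosenbaum-style bound for the McNemar test statistic reduces to a binomial tail probability, with worst-case event probability Γ/(1+Γ) among the discordant pairs. A minimal sketch follows; the pair counts are hypothetical, and a full analysis would follow Rosenbaum’s formulas directly.

```python
from scipy.stats import binom

def mcnemar_upper_pvalue(n_discordant, n_treated_worse, gamma):
    """Upper bound on the one-sided McNemar P value when matched patients
    may differ in their odds of treatment by as much as a factor gamma."""
    p_worst = gamma / (1.0 + gamma)
    # P(at least n_treated_worse 'treated-worse' pairs under the worst case)
    return binom.sf(n_treated_worse - 1, n_discordant, p_worst)

# Hypothetical study: 100 discordant pairs, 65 with the event in the
# treated member. Recompute the bound for several plausible ORs.
for gamma in (1.0, 1.5, 2.0):
    print(f"OR={gamma}: P<={mcnemar_upper_pvalue(100, 65, gamma):.4f}")
```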
Illustrative Example: Validation of Catheterization Guidelines for AMI Patients

An example of sensitivity analyses can be found in a study validating coronary angiography guidelines for AMI patients.21 The authors used propensity-score matching with 105 clinical variables obtained in ≈20 000 Medicare beneficiaries. Their goal was to estimate the benefit of coronary angiography for patients for whom the procedure was classified as clinically necessary, clinically appropriate, or of uncertain clinical benefit. They found an absolute 3-year survival benefit of 17.6% (95% confidence interval 15.1% to 20.1%) in patients undergoing clinically necessary angiography and a smaller benefit (8.8%; 95% confidence interval 6.8% to 10.7%) in patients for whom the benefit was uncertain. The variation in survival benefits suggests a nonconstant treatment effect.

The authors determined that to eliminate the survival benefit in patients for whom the procedure was judged necessary, an unmeasured confounder not related to the 105 observed confounders already included in the model would have to increase the odds of angiography by a factor of more than 2. The authors also compared 2-day mortality for the matched pairs, reasoning that any clinically meaningful difference would indicate the presence of residual confounding. They observed a small benefit of 1.5% (95% confidence interval 1.0% to 2.0%) in patients for whom the procedure was necessary. Finally, the authors found a benefit regardless of the hospital’s capability to perform coronary angiography.

Clustered Data

Data are clustered when some units are nested completely within other units. Models to deal with these types of data go by many names: hierarchical models, multilevel models, random-effects models, mixed models, random coefficient models, subject-specific models, and empirical Bayes models. Common outcomes research examples include patients nested within hospitals, patients nested within health plans, and patients nested within surgeons. A common example of nesting involves longitudinal data, for which the measurement occasion is nested within the subject, as would be the case when one measures health status on 4 occasions after a cardiac event: baseline, 30 days, 6 months, and 1 year. Clustered data are not unique to health outcomes research; for example, educational researchers often deal with data collected from students who are nested within classrooms or within teachers. Finally, the number of levels of clustering can exceed 2, as is the case when longitudinal measures are taken for patients treated within hospitals: here, 3 levels are present, with occasion nested in patient nested in hospital.
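As a sketch of how a 2-level hierarchy might be fit in practice, the model below uses a hospital-level random intercept for a continuous patient outcome via statsmodels’ MixedLM. The data are simulated and the variable names hypothetical; none of the cited studies used this exact specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated hierarchy: patients (level 1) nested within hospitals (level 2).
rng = np.random.default_rng(42)
n_hospitals, n_per_hospital = 30, 40
hospital = np.repeat(np.arange(n_hospitals), n_per_hospital)
hospital_effect = rng.normal(0.0, 2.0, n_hospitals)[hospital]
age = rng.normal(65.0, 10.0, hospital.size)
health_status = 50.0 - 0.2 * age + hospital_effect + rng.normal(0.0, 5.0, hospital.size)

df = pd.DataFrame({"hospital": hospital, "age": age, "hs": health_status})

# Random-intercept model: hs ~ age, with a hospital random effect that
# captures the within-hospital correlation ignored by ordinary regression.
fit = smf.mixedlm("hs ~ age", data=df, groups=df["hospital"]).fit()
print(fit.summary())
```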
