Background: Machine learning is an attractive tool for identifying patient subgroups that exhibit heterogeneous treatment effects (HTE) of an intervention. Whether HTE identified using machine learning methods generalize across clinical studies is unclear. Prior work using a causal forest machine learning approach identified hemoglobin glycation index (HGI), body mass index (BMI), and age as determinants of HTE of intensive versus standard glycemic control on all-cause mortality in the ACCORD randomized trial. We examined whether the same three variables defined HTE in the VA Diabetes Trial (VADT), an independent randomized trial of intensive versus standard glycemic control in patients with type 2 diabetes.
Methods: The VADT study randomized participants to intensive (median achieved hemoglobin A1c [HbA1c] of 6.9%) or standard (median achieved HbA1c of 8.4%) glycemic control. In this secondary analysis comparing results with the ACCORD trial, we included the 1789 of 1791 VADT participants who had non-missing baseline data for fasting glucose, HbA1c, BMI, age, and the laboratory values and comorbidities included in the prior machine learning analysis of the ACCORD trial; median follow-up was 5.6 years. We examined the absolute risk difference in all-cause mortality in the four subgroups that showed HTE for mortality in the ACCORD trial: group 1, HGI < 0.44, BMI < 30 kg/m², age < 61 years; group 2, HGI < 0.44, BMI < 30 kg/m², age ≥ 61 years; group 3, HGI < 0.44, BMI ≥ 30 kg/m²; and group 4, HGI ≥ 0.44. We repeated the analysis after using inverse odds of sampling weights to standardize VADT study participants to the ACCORD study participants, using individual-level data from both studies.
Results: The absolute all-cause mortality difference between intensive and standard glycemic control among all VADT participants included in this study was 0.8%. The absolute mortality difference did not vary significantly across subgroups: 2.2% (95% CI -6.0, 10) in group 1 (n=140), 3.4% (-7.3, 14.2) in group 2 (n=192), 2.4% (-2.9, 7.7) in group 3 (n=517), and -0.8% (-4.8, 3.1) in group 4 (n=940), where negative values indicate lower mortality in the intensive glycemic control arm. The absolute mortality differences for the corresponding subgroups in the ACCORD trial were -2.3% (-4.3, -0.2), 0.7% (-1.6, 3.1), 0.9% (0.4, 2.1), and 3.7% (1.5, 6.0) in groups 1, 2, 3, and 4, respectively. The pattern of absolute mortality differences across groups 1 through 4 in the ACCORD study was not reproduced in the VADT study, even after the VADT study population was standardized to the characteristics of the ACCORD study population.
Conclusions: Machine learning methods to define HTE subgroups from randomized trials may not be generalizable across study samples and are likely sensitive to variability in the intervention and differences in study populations. Additional work is needed to define consistent HTE that can guide individualized diabetes treatment.
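The subgroup rule and the absolute risk difference described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' analysis code: the function names are hypothetical, the Wald (normal-approximation) interval is an assumption because the abstract does not state how its confidence intervals were computed, and the counts in the usage lines are made up rather than taken from either trial. The inverse odds of sampling weighting step is not shown.

```python
import math

# Cut-offs reported in the abstract for the ACCORD-derived subgroups.
HGI_CUTOFF = 0.44
BMI_CUTOFF = 30.0  # kg/m^2
AGE_CUTOFF = 61    # years


def assign_subgroup(hgi, bmi, age):
    """Return the subgroup (1-4) for one participant (hypothetical helper)."""
    if hgi >= HGI_CUTOFF:
        return 4
    if bmi >= BMI_CUTOFF:
        return 3
    return 2 if age >= AGE_CUTOFF else 1


def risk_difference(deaths_int, n_int, deaths_std, n_std, z=1.96):
    """Absolute risk difference (intensive minus standard) with a Wald CI.

    Negative values mean lower mortality in the intensive arm, matching
    the sign convention of the abstract. The Wald interval is an assumed
    choice, not necessarily the method used in the study.
    """
    p_int = deaths_int / n_int
    p_std = deaths_std / n_std
    rd = p_int - p_std
    se = math.sqrt(p_int * (1 - p_int) / n_int + p_std * (1 - p_std) / n_std)
    return rd, rd - z * se, rd + z * se


# Usage with made-up inputs (not trial data).
group = assign_subgroup(hgi=0.30, bmi=27.5, age=58)
rd, lower, upper = risk_difference(10, 100, 8, 100)
print(group, f"{rd:.3f} ({lower:.3f}, {upper:.3f})")
```

Checking the HGI threshold first mirrors the way group 4 is defined by HGI alone, so BMI and age are consulted only for participants with HGI below 0.44.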