Background: Stratification is frequently used to ensure balance between arms in randomized clinical trials. Analysis of the primary endpoint in trials using stratified randomization typically uses a "stratified analysis," in which the primary endpoint is analyzed separately in each subgroup defined by the stratification factors, and those separate analyses are then combined in a weighted fashion based on the size of the subgroups. In our experience designing trials, increasing attention has been paid to stratification factors, with larger numbers of stratification factors being proposed. However, the problems with performing some statistical tests at small sample sizes are well recognized. We performed computational studies to characterize the power and type-1 error of a clinical trial under development with varying numbers of stratification factors.

Patients and Methods: We simulated data per the design assumptions of a randomized Phase 2 acute myeloid leukemia trial in development (design initially based on unstratified calculations). The endpoint was overall survival, and we assumed 1:1 randomization, exponential survival (median 7 vs 12 months in the two arms, corresponding to a hazard ratio [HR] of 0.58), 84 total patients, and a one-sided alpha of 15% based on a log-rank test at 70 deaths. The simulations were then repeated with a progressively increasing number of stratification factors; the HR within each stratification factor subgroup was constant and equal to the original design HR (0.58). Power and type-1 error were estimated using 10,000 replications. All stratification factors had two levels with a 50/50 distribution. Dynamic balancing was used for randomization, with 75% weighted allocation to the arm with better balance.

Results: The design with no stratification factors (first row of table) corresponds to the properties initially used to determine the sample size for the trial.
With this Phase 2 sample size (n=84 total), a stratified analysis with 1-2 stratification factors maintains power and type-1 error. With 4 or 6 stratification factors, stratified analyses have decreased power, with type-1 error still controlled. Stratifying by a subset of factors (1 or 2 of the 4 or 6) decreases power in this setting by a small amount (1-2%). The power and type-1 error of unstratified analyses are minimally impacted in the setting of stratified randomization.

Conclusion: In the setting of a modestly sized randomized Phase 2 trial, stratified analyses accounting for 4 or 6 stratification factors can lead to unacceptable decreases in power. Unstratified analyses, or analyses accounting for 1 or 2 of the stratification factors, led to small (1-2%) decreases in power. Stratified randomization can ensure balance between arms for covariates. When the covariates used in stratified randomization are prognostic (and, with more than one stratification factor, independently prognostic), a stratified analysis will often be associated with an increase in power over an unstratified analysis. That increase in power is not guaranteed, as shown in this example. When the number of stratification factors leads to subgroups with small numbers of patients and/or events, stratified analyses can decrease power. Performing a stratified analysis accounting for a subset of the covariates can improve power compared with an analysis accounting for all stratification factors, though the power is still lower than in a design with fewer stratification factors.

Support: NIH/NCI grants CA180888 and CA180819.

Figure 1
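
The simulation design described in Patients and Methods can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses simple 1:1 randomization with fixed arm totals rather than dynamic balancing, treats the stratification factors as non-prognostic, and assumes all patients enter at time zero with the analysis triggered at the 70th death. None of these details are given in the abstract. The function names (`logrank_parts`, `stratified_logrank_z`, `estimate_rejection_rate`) are hypothetical, and the output should not be expected to reproduce the reported figures.

```python
import numpy as np
from statistics import NormalDist

MEDIAN_CTRL, MEDIAN_EXP = 7.0, 12.0  # months, from the design assumptions
HR = MEDIAN_CTRL / MEDIAN_EXP        # exponential survival: HR = ratio of medians (about 0.58)
N_PATIENTS, N_DEATHS, ALPHA = 84, 70, 0.15


def logrank_parts(time, event, arm):
    """Return (O - E, V) for the experimental arm within one stratum."""
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n = at_risk.sum()                             # patients at risk
        n1 = (at_risk & (arm == 1)).sum()             # of those, experimental arm
        d = ((time == t) & event).sum()               # deaths at time t
        d1 = ((time == t) & event & (arm == 1)).sum()
        o_minus_e += d1 - d * n1 / n
        if n > 1:                                     # hypergeometric variance
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e, var


def stratified_logrank_z(time, event, arm, stratum):
    """Sum O - E and V over strata, then form a single Z statistic."""
    parts = [logrank_parts(time[stratum == s], event[stratum == s], arm[stratum == s])
             for s in np.unique(stratum)]
    num = sum(p[0] for p in parts)
    den = sum(p[1] for p in parts)
    return num / np.sqrt(den) if den > 0 else 0.0


def estimate_rejection_rate(n_factors, hr=HR, n_rep=1000, seed=0):
    """Fraction of simulated trials in which the one-sided test rejects.

    With hr=HR this estimates power; with hr=1.0 it estimates type-1 error.
    """
    rng = np.random.default_rng(seed)
    z_crit = NormalDist().inv_cdf(1 - ALPHA)
    base_arm = np.repeat([0, 1], N_PATIENTS // 2)
    rejections = 0
    for _ in range(n_rep):
        # ASSUMPTION: simple randomization with fixed totals, not dynamic balancing.
        arm = rng.permutation(base_arm)
        # Binary factors with a 50/50 distribution, combined into one stratum label.
        factors = rng.integers(0, 2, size=(N_PATIENTS, n_factors))
        stratum = factors @ (2 ** np.arange(n_factors))
        # ASSUMPTION: factors are non-prognostic, so only the arm affects survival.
        median = np.where(arm == 1, MEDIAN_CTRL / hr, MEDIAN_CTRL)
        time = rng.exponential(scale=median / np.log(2))
        # ASSUMPTION: simultaneous entry; censor everyone at the 70th death.
        cutoff = np.sort(time)[N_DEATHS - 1]
        event = time <= cutoff
        time = np.minimum(time, cutoff)
        z = stratified_logrank_z(time, event, arm, stratum)
        rejections += z <= -z_crit  # fewer deaths than expected favors the experimental arm
    return rejections / n_rep
```

Calling `estimate_rejection_rate(n_factors=k)` for k = 0, 1, 2, 4, 6 gives a rough power curve, and the same calls with `hr=1.0` give the corresponding type-1 error. With 6 binary factors there are 64 possible strata for 84 patients, so many strata contain only one or two patients and contribute little or nothing to the summed statistic, which is the mechanism behind the power loss described above.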