Abstract

In orthodontics, as in other fields, the simplest trial involves a single outcome measure, a comparison of 2 treatments, and no subgroup analyses. However, this is not usually the case; often, several outcomes are considered, and several subgroup analyses are performed. A recent article recorded subgroup comparisons in articles published by major dental specialty journals and found that over 57% of the included studies had conducted more than 5 multiple comparisons, and 26% of the studies had over 20 multiple comparisons [1: Pandis N, Polychronopoulou A, Makou M, Madianos P, Eliades T. Characteristics of research published in 6 major clinical dental specialty journals. J Evid Based Dent Pract 2011;11:75-83].

What are subgroup analyses, and why do we do them? Participants in clinical trials might vary in baseline characteristics such as age, sex, and susceptibility to the outcome of interest, and their responses to the interventions could depend on these characteristics. Subgroup analyses can be undertaken to identify treatment effects in patient groups that share certain baseline characteristics, such as sex, age, socioeconomic status, oral hygiene level, and cooperation during orthodontic treatment.

For example, let us assume we are assessing the effectiveness of a functional appliance vs headgear for Class II correction in adolescent boys and girls. One way to analyze the data would be to perform an overall test between the 2 treatment groups under the assumption that the effects are the same regardless of sex. However, it could be argued that the response might differ between boys and girls; therefore, those 2 subgroups should be analyzed separately. It is logical to be interested in exploring intervention effects in particular subgroups.
However, some potential problems with this approach should be considered.

Problems with subgroup analyses

1. In previous articles, I discussed sample-size calculations in some detail, and readers can refer to them to review the concepts. Sample-size calculations are usually based on the assumptions of the analysis of the main outcome, not on subgroup analyses; therefore, subgroup analyses have low power, since each includes only a portion of the required sample. If there is biological evidence, or evidence from previous research, that the response to treatment might differ between boys and girls, this should be considered at the design stage, and the sample-size calculation should be performed at that time to account for the subgroup comparisons.

2. Subgroup tests are usually based on small samples and are likely to give results of questionable validity. When we conduct a statistical test and accept a type I error rate, we accept some chance of a false-positive result (ie, a statistically significant treatment effect when in reality no effect exists). The Figure displays the probability of at least one false-positive result as a function of the number of tests at an alpha level of 0.05. It was derived by plotting the formula 1 − (1 − α)^n, where α is the alpha level and n is the number of comparisons, assuming that the tests are independent. It is evident that as the number of analyses increases, so does the probability of observing some false-positive results (the multiplicity problem).

3. The baseline balance of the treatment groups achieved with randomization might be lost in subgroup analyses, and selection bias may arise. Subgroup comparison tests can be manipulated, and "interesting" results might be overemphasized, thus creating false impressions of treatment effectiveness.
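The growth of the false-positive probability can be computed directly from the formula above. A minimal sketch in Python (illustrative code, not from the original article):

```python
# Probability of at least one false-positive result among n independent
# tests, each conducted at significance level alpha (family-wise error rate).
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 10, 20):
    print(f"{n:>2} tests: {familywise_error_rate(n):.3f}")
# →  1 tests: 0.050
#    5 tests: 0.226
#   10 tests: 0.401
#   20 tests: 0.642
```

With only 5 subgroup tests there is already roughly a 1-in-4 chance of at least one spurious "significant" finding, and with 20 tests the chance exceeds 60%.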
The greatest problem with subgroup analyses is that one might find no overall significant effect but then carry out exploratory subgroup analyses that were not specified in advance until a significant effect is found. It has been said that if you torture your data long enough, they will confess to anything [2: Wang D, Bakhai A. Clinical trials: a practical guide to design, analysis and reporting. London, United Kingdom: Remedica; 2006].

With the problems highlighted above, the question becomes: how should we handle subgroup testing? One approach is to lower the threshold for declaring statistical significance according to the number of intended comparisons. If 10 subgroup comparisons are planned, then according to the Bonferroni correction the new threshold will be 0.05/10 = 0.005. In other words, we make it more difficult to observe a statistically significant result by lowering the threshold from 0.05 to 0.005.2

Another approach for handling subgroup analyses is to perform an interaction test. Briefly, we have an interaction if the effect of an intervention is modified depending on the level of another predictor (interaction and confounding will be described in more detail in a future article). In the Class II correction example, we have an interaction if the effect of the functional appliance compared with the headgear differs between male and female subjects. In other words, the variable sex modifies the effect of the intervention (functional appliance or headgear) depending on its level (male or female). We would not have an interaction if the effect of the intervention was the same in both sexes.
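Returning to the Bonferroni correction described above, a short sketch in Python (the p values are hypothetical, chosen only to illustrate the adjusted threshold):

```python
# Bonferroni correction: divide the family-wise alpha by the number of
# planned comparisons to obtain the per-comparison threshold.
def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    return alpha / n_comparisons

threshold = bonferroni_threshold(0.05, 10)   # 0.05 / 10 = 0.005, as in the text
subgroup_p_values = [0.030, 0.004, 0.200]    # hypothetical subgroup results
significant = [p for p in subgroup_p_values if p < threshold]
print(f"threshold = {threshold:.3f}, significant p values: {significant}")
# → threshold = 0.005, significant p values: [0.004]
```

Note that a p value of 0.030, "significant" at the conventional 0.05 level, no longer clears the corrected threshold.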
Tests for interaction help to guard against spurious findings, and they are the most appropriate statistical method for evaluating effects in subgroup analyses.

Example

We will use a hypothetical example to illustrate the problem of overinterpretation of subgroup analyses. In a trial, we are assessing failures of lingual retainers bonded with conventional acid etching vs a self-etching primer. Table I shows the results of this trial overall and by age group. Overall, the results are statistically significant. In the younger age group (12-14 years), for a difference of 9.4% between acid etching and self-etching primer, we have a nonsignificant finding (P = 0.33), whereas in the older age group (15-18 years), for a similar difference (8.0%), we have a statistically significant finding (P = 0.03). The reason for this apparent anomaly is the smaller size of the younger group; this influences the P value, as I explained in a previous article in this column.

Table I. Effect of etching method on lingual retainer failures, per age group and overall

Age group (y) | Acid etching (n = 240) | Self-etching primer (n = 232) | Risk difference (95% CI) | P value*
12-14         | 10/40 (25.0%)          | 5/32 (15.6%)                  | 9.4% (−9.0%, 27.0%)      | 0.33
15-18         | 38/200 (19.0%)         | 22/200 (11.0%)                | 8.0% (1.0%, 15.0%)       | 0.03
Overall       | 48/240 (20.0%)         | 27/232 (11.6%)                | 8.4% (1.8%, 14.9%)       | 0.01

CI, confidence interval. *From a 2-sample test for proportions.

An interaction test in our example assesses whether the effect of the etching method differs between the age groups and prevents the above anomaly. A formal interaction test requires a statistical comparison, but we can informally assess the interaction by looking at the "Risk difference" column of Table I.
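The subgroup P values in Table I, and a formal version of this interaction check, can be reproduced with a 2-sample test for proportions. A sketch in Python using only the standard library (these are standard z-test formulas, not code from the article):

```python
from math import erf, sqrt

def two_sided_p(z: float) -> float:
    """Two-sided p value from a standard normal statistic."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def proportions_ztest(x1, n1, x2, n2):
    """2-sample z-test for proportions; returns (risk difference, p value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return p1 - p2, two_sided_p((p1 - p2) / se)

def interaction_ztest(x1, n1, x2, n2, x3, n3, x4, n4):
    """z-test for the difference between two risk differences
    (subgroup A: x1/n1 vs x2/n2; subgroup B: x3/n3 vs x4/n4)."""
    def rd_and_se(a, na, b, nb):
        pa, pb = a / na, b / nb
        return pa - pb, sqrt(pa * (1 - pa) / na + pb * (1 - pb) / nb)
    rd1, se1 = rd_and_se(x1, n1, x2, n2)
    rd2, se2 = rd_and_se(x3, n3, x4, n4)
    z = (rd1 - rd2) / sqrt(se1 ** 2 + se2 ** 2)
    return rd1 - rd2, two_sided_p(z)

# Table I data: failures/total for acid etching vs self-etching primer
rd_young, p_young = proportions_ztest(10, 40, 5, 32)     # 12-14 years
rd_old, p_old = proportions_ztest(38, 200, 22, 200)      # 15-18 years
print(f"12-14: RD = {rd_young:.1%}, P = {p_young:.2f}")  # → RD = 9.4%, P = 0.33
print(f"15-18: RD = {rd_old:.1%}, P = {p_old:.2f}")
rd_diff, p_int = interaction_ztest(10, 40, 5, 32, 38, 200, 22, 200)
print(f"interaction: difference in RDs = {rd_diff:.1%}, P = {p_int:.2f}")
```

The interaction P value is far from significance, matching the informal reading of the "Risk difference" column: the two subgroup effects are essentially the same size.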
The interaction test compares the risk difference for lingual retainer failure between acid etching and self-etching primer in the younger group (9.4%) with that in the older group (8.0%). It is obvious that there is no difference of clinical importance (9.4% − 8.0% = 1.4%); therefore, any claim of a differential effect depending on age is not supported by our data. In the 12-to-14-year age group, the P value of 0.33 is not really telling us that there is no difference in lingual retainer failures between acid etching and self-etching primer but, rather, that we have no evidence, since this group was too small to provide such evidence. Again, as I discussed in previous articles, interpretation based solely on P values can be misleading.

Sun et al recently updated the criteria for evaluating the credibility of subgroup analyses; these are shown in Table II [3: Sun X, Briel M, Walter SD, Guyatt GH. Is a subgroup effect believable? Updating criteria to evaluate the credibility of subgroup analyses. BMJ 2010;340:850-854]. The greater the extent to which those criteria are satisfied, the more plausible the subgroup effect.

Key points
•Keep the emphasis on the overall result.
•Prespecify subgroups of interest.
•Limit the number of subgroups.
•Subgroup analyses should be exploratory.
•Subgroup analyses have low power.
•Do not overinterpret subgroup findings: subgroup claims are likely to be exaggerated.

Table II. Criteria for evaluating the credibility of subgroup analyses (after Sun et al3)

Design
 Is the subgroup variable a characteristic measured at baseline or after randomization?
 Is the suggested effect from within rather than between studies?
 Was the test prespecified?
 Was the direction of the subgroup effect prespecified?
 Was the subgroup effect one of a small number of tests conducted?
Analysis
 Is the interaction test significant?
 Is the significant subgroup test independent?
Context
 Is the size of the subgroup effect large?
 Is the interaction consistent across studies?
 Is the interaction consistent across closely related outcomes in the study?
 Is there evidence for a biological rationale for the hypothesized interaction?

The next article will discuss multiple treatments and multiple outcomes.
