Abstract

A frequent goal of clinical research is to evaluate the relationship between risk factors and disease or to assess the efficacy and safety of therapeutic interventions. Because there is variability between individuals in the response to treatment or in the effect of risk factors on disease status, the only way to understand these relationships with certainty would be to evaluate every person in the population of interest, which is clearly an impossible task. The impossibility of the task means that a study population must be selected by sampling from the population of interest. Once the response is evaluated in a representative sample, principles of statistics are used to help us understand what our observations can tell us about the result we would obtain if we could do the impossible and evaluate everyone. Many readers of and contributors to Ophthalmology are clinicians who do not feel well versed in these statistical principles. In writing this, I hope, as a statistically oriented clinician who often tries to bridge the gap between clinicians and statisticians, to offer some brief guidance on basic but important issues in a manner easily understood by all.

The fact that conclusions must be based on results obtained from a sample rather than the entire population means that mistakes will be made from time to time. Because of individual variability, measurement error, and the luck of the draw in selecting subjects, every study’s results differ to some extent from the elusive truth. If the difference between a study’s results and the truth is large, we might erroneously conclude that an ineffective treatment is effective and subject patients to a useless therapy, wasting precious resources and causing adverse events without any health benefit. This type of error (referred to as α or type I error) is what we try to reduce by requiring that a study’s results be statistically significant. In medical research, we traditionally require that the probability of type I error be ≤5% (a P value ≤ 0.05) before accepting the validity of a result. But even in studies that meet this requirement, some uncertainty always remains as to the difference between the reported results and the truth. Thus, when studying a treatment that has an important potential benefit but is particularly expensive or has significant side effects, additional statistical stringency may be warranted before deeming it suitable for widespread use. An example of the other type of error that can be made (referred to as β or type II error) is to conclude erroneously that an effective treatment is ineffective. The consequence of such an error would be to withhold a potentially beneficial treatment from the public. Both of these types of errors arise from substantial deviations between a study’s results and the truth.

Improvement in our ability to estimate the effect of a treatment or risk factor on disease is gained by increasing the number of subjects included in the study sample. Although the truth can never be known with certainty, the confidence that our estimate closely approximates the truth increases when the sample is large. Because conducting research can be time consuming and costly, sample size calculations are often performed before initiating a clinical study to determine how many subjects will be required to obtain adequate precision in estimating the treatment effect or other outcome of interest. Sample size calculations require that certain assumptions be made.
As an example, imagine a study is planned to evaluate the effect of drug A on intraocular pressure (IOP) and to determine whether its effect differs from that of drug B. To calculate how many subjects should be enrolled, one must specify several things. First, the level of tolerance for both type I and type II errors must be determined. For type I error, this means specifying the P value that will be considered statistically significant. For type II error, this means specifying the power of the study, that is, the acceptable risk of failing to conclude that drug A is superior to drug B when in reality it is more effective. For most studies, a power of 80% or 90% is chosen, giving a 20% or 10% chance, respectively, of type II error in detecting a particular difference of clinical relevance. Next, the variability in IOP in each of the 2 treatment groups must be specified. Because the calculation is made during the planning phase of the study, the standard deviation of IOP in each group is not yet known and must be estimated from other sources of information about IOP variability. Finally, the magnitude of the difference in effect between drug A and drug B that is large enough to be considered clinically relevant must be specified. A sample size calculation for a study with a binary (success/failure) outcome rather than a numeric outcome would be similar, but it requires that the proportion of successes in each treatment group be specified. This too is not known in advance and requires making a logical guess about the study outcome based upon past experience with the treatments under investigation, as well as deciding how large a difference is clinically relevant.

If the number of available study subjects is already known to be limited, a variation of the sample size calculation can be performed in which the number of subjects in each group is specified and the power of the study to detect a clinically relevant difference in treatment effect is calculated. As with sample size calculations, power calculations also require assumptions about the variability in the outcome measure in the study groups and the difference that is clinically meaningful.

Because sample size and power calculations are based on predictions of what the study results will be, they are particularly useful as tools to assist in planning. If it is concluded that the sample size required to perform a study is so large as to make it impractical, or that the number of available study subjects is not sufficient to provide adequate power, the design might be altered or the study abandoned altogether before devoting resources to a futile cause. Such calculations are, however, highly dependent upon their underlying assumptions. If, after the data are collected, it is discovered that these assumptions were inaccurate, the power of the study may differ substantially from what was believed at the outset.
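To make these planning calculations concrete, the sketch below works through the drug A versus drug B example in Python using the usual normal-approximation formulas. The specific inputs (a 3 mmHg standard deviation of IOP, a 2 mmHg clinically relevant difference, and assumed success proportions of 60% versus 40% for the binary case) are hypothetical assumptions chosen only for illustration, not values taken from the editorial.

```python
from scipy.stats import norm

def n_per_group_means(delta, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group needed to detect a true mean difference
    `delta` between two groups with common standard deviation `sd`
    (two-sided test, normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value controlling type I error
    z_beta = norm.ppf(power)           # quantile corresponding to the chosen power
    return 2 * ((z_alpha + z_beta) * sd / delta) ** 2

def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate subjects per group for a binary (success/failure) outcome
    with assumed true success proportions p1 and p2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

def approx_power(n_per_group, delta, sd, alpha=0.05):
    """Approximate power to detect `delta` when the number of available
    subjects per group is already fixed."""
    z_alpha = norm.ppf(1 - alpha / 2)
    se = sd * (2 / n_per_group) ** 0.5  # standard error of the difference in means
    return norm.cdf(abs(delta) / se - z_alpha)

# Hypothetical planning assumptions for the drug A vs. drug B IOP example.
print(n_per_group_means(delta=2.0, sd=3.0, power=0.90))       # ~47.3 -> enroll 48 per group
print(n_per_group_proportions(p1=0.60, p2=0.40, power=0.80))  # ~94.2 -> enroll 95 per group
print(approx_power(n_per_group=30, delta=2.0, sd=3.0))        # ~0.73 with only 30 per group
```

These are the normal-approximation formulas commonly quoted in planning texts; a t-distribution-based routine (for example, TTestIndPower().solve_power in statsmodels) would give slightly larger, more conservative sample sizes for small studies.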
Consequently, once a study has been completed and the data are in hand, sample size and power calculations based on such assumptions are not of any further value. Instead, the actual observations made during the study can be used to determine how successfully the difference in outcome between the study groups has been clarified. There are 2 strategies commonly used in statistics to evaluate such differences: hypothesis testing and confidence intervals (CIs). With hypothesis testing, we use statistical methods to help us decide whether the difference in outcome between study groups is likely to be simply random or whether the difference is real. In our example of glaucoma therapy, the hypothesis would be “the effects of drug A and drug B on intraocular pressure are the same.” A statistical test is used to determine the strength of the evidence for or against this hypothesis. If the P value is small (<0.05, or some other value that we have selected as representing statistical significance), then we must reject the hypothesis and conclude that drug A and drug B have different effects on IOP. In cases in which statistical significance is not achieved, a post hoc power calculation based upon the actual data (rather than the assumptions made before data collection) and a particular clinically meaningful treatment effect can also be useful to estimate the probability that a type II error has occurred. If this probability is large, then additional data are likely to be needed to clarify whether the treatment is efficacious.

Although hypothesis testing is useful in determining how likely it is that differences between study groups are real, on its own it does not fully clarify the magnitude of the difference between groups. Comparing the mean value or another indicator of the outcome in each group is helpful, but to understand clearly what the data imply about the magnitude of the difference, another statistical tool is particularly useful. Computation of a CI allows the determination of the range of plausible values of the truth. Although we cannot know the true answer with certainty, it is possible to compute a CI to identify the degree of precision of our estimate. Most commonly, 95% CIs are computed, telling us that there is a 95% probability that the true answer falls within this range. As the sample size of a study increases, the CI becomes narrower, and thus our ability to estimate the truth precisely increases.

In cases where hypothesis testing has failed to demonstrate statistically significant results, a CI can provide additional important information that complements a post hoc power calculation in understanding how likely it is that a clinically meaningful type II error has occurred. For example, if our study of drugs A and B in lowering IOP fails to detect a difference between the effects of the two drugs (hypothesis test with P > 0.05), can we really say there is no difference? By computing a 95% CI for the difference in IOP reduction between the study groups, we can see whether we have ruled out clinically significant values as being plausible. If the CI is narrow, perhaps spanning ±1 mmHg, then we might reasonably conclude that the effects of the drugs on IOP are similar. On the other hand, if the CI is wider, perhaps spanning ±3 mmHg, we should not conclude that the effects are similar. Although our hypothesis test offered no evidence that the effects are different, we also have not ruled out the possibility that the true difference may be quite large. When the CI is wide, the probability value determined from a post hoc power calculation should confirm that a larger study needs to be performed, with a sample size sufficient to estimate the outcome more precisely. Because CIs provide important information that complements the results of hypothesis testing and also provide insight into the power of a study by identifying the range of plausible values for the truth, authors should make every effort to include them with their results.
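Continuing the same hypothetical IOP example, the sketch below shows how these quantities might be computed once the data are in hand: a two-sample t test of the hypothesis of equal effects, a 95% CI for the difference in mean IOP reduction, and an approximate post hoc power for a clinically meaningful 2 mmHg difference. The simulated measurements and the 2 mmHg threshold are illustrative assumptions, not data from any real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical IOP reductions (mmHg) observed in each treatment group.
drug_a = rng.normal(loc=6.0, scale=3.0, size=40)
drug_b = rng.normal(loc=5.0, scale=3.0, size=40)

# Hypothesis test: are the mean IOP reductions the same?
t_stat, p_value = stats.ttest_ind(drug_a, drug_b)

# 95% CI for the difference in mean reduction (pooled-variance form).
n_a, n_b = len(drug_a), len(drug_b)
diff = drug_a.mean() - drug_b.mean()
sp2 = ((n_a - 1) * drug_a.var(ddof=1) + (n_b - 1) * drug_b.var(ddof=1)) / (n_a + n_b - 2)
se = np.sqrt(sp2 * (1 / n_a + 1 / n_b))
t_crit = stats.t.ppf(0.975, df=n_a + n_b - 2)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

# Approximate post hoc power to detect a clinically meaningful 2 mmHg difference,
# using the observed variability and sample sizes (normal approximation).
delta = 2.0
z_alpha = stats.norm.ppf(0.975)
post_hoc_power = stats.norm.cdf(delta / se - z_alpha)

print(f"P value: {p_value:.3f}")
print(f"Difference in mean IOP reduction: {diff:.2f} mmHg "
      f"(95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"Approximate post hoc power for a {delta} mmHg difference: {post_hoc_power:.2f}")
```

If the resulting 95% CI excludes the 2 mmHg difference regarded as clinically meaningful, or the post hoc power for that difference is high, a nonsignificant result is reasonably reassuring; a wide CI together with low power suggests the study was simply too small to settle the question.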
Power calculations based on assumptions about the variability in the outcome should be included in study proposals or in the “Methods” of manuscripts describing a study’s design, because they serve as a guide to the practicality of completing a study successfully, but they are not the best way to evaluate whether type II error is likely to have occurred. When statistical significance is not achieved, post hoc power calculations based upon the actual results interpreted in conjunction with CIs provide a better way to evaluate and report whether a completed study was sufficiently large to provide a definitive answer to the question at hand.