Abstract

Quantitative research relies on assessing statistical significance, most commonly expressed in the form of a probability value (p-value). Researchers naturally desire that their studies confirm their research hypotheses; however, many studies provide results that are not statistically significant. The literature provides many examples of erroneous reporting and misguided presentation and description of such results (Parsons, Price, Hiskens, Achten, & Costa, 2012), with many non-significant results not reported at all. While there are issues with the separation of results into the binary categories of ‘significant’ and ‘non-significant’ with no shades of grey (Sterne & Davey Smith, 2001), this method of reporting is commonly accepted as the norm. In this editorial, we provide an explanation of non-significant results, present some potential issues with interpretation, offer guidance to improve the reporting of data, and argue why it is important to report all findings of well-designed studies, significant or not, both for effective dissemination and to minimize publication bias. Under-reporting of negative results biases meta-analyses and wastes resources when others repeat studies that have already been conducted. There is also an ethical obligation to report studies using human subjects, who have volunteered themselves, usually at some risk, to benefit others (Mlinaric, Horvat, & Supak Smolcic, 2017).

The p-value is the most widely used indicator when reporting statistical significance, with the decision level for statistical significance typically set at .05. The p-value quantifies the evidence against the null hypothesis, where the null hypothesis is the opposite of the study hypothesis, usually the absence of an effect. For example, if the study hypothesis is that working overtime increases anxiety for health workers, the null hypothesis would be that overtime work does not increase anxiety. The null and study hypotheses divide the possibilities between them: if one is true, the other must be false, and hence the p-value also provides information about the study hypothesis. The smaller the p-value, the less likely it is that there is no effect, and hence the more likely it is that the observed difference represents a true effect. The decision to use a p-value cut-off (or alpha level) of .05 is arbitrary, and decreasing it to .01 requires a much larger sample size to conduct a study with equal power (Greenland et al., 2016). There is a growing trend to report the level of evidence for the effect rather than the p-value. Irrespective of whether a p-value is explicitly reported, the presentation and interpretation of results which do not support the expected effect are important.

The 95% confidence interval (CI) provides an alternative to the p-value when reporting results (Fethney, 2010). The 95% CI demonstrates the precision of the estimate, indicating how large or small the effect might be at this level of confidence. For the example considered here, the 95% CI will show the range of likely increases in anxiety scores indicated by the study. For statistically significant results, the CI will not include a null effect. The clinical significance of the effect size and of the range demonstrated by the 95% CI should then be considered: is the effect large enough to warrant changing our approach to staff working overtime, or likely to have significant mental health consequences for those working overtime? A similar approach should be applied to 95% CIs for non-significant results.
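To make these quantities concrete, the following is a minimal sketch of how the mean difference, its 95% CI, and the associated p-value might be computed for the overtime and anxiety example. The anxiety scores, group sizes, and the choice of a Welch t-test are illustrative assumptions, not details taken from any particular study.

```python
import numpy as np
from scipy import stats

# Hypothetical anxiety scores (higher = more anxious); illustrative only.
overtime = np.array([62.0, 55, 70, 66, 58, 73, 61, 68, 64, 59])
no_overtime = np.array([57.0, 52, 63, 60, 55, 49, 58, 61, 54, 56])

# Observed effect: the difference in mean anxiety between the two groups.
diff = overtime.mean() - no_overtime.mean()

# Welch t-test (does not assume equal variances) provides the p-value.
res = stats.ttest_ind(overtime, no_overtime, equal_var=False)

# 95% CI for the mean difference, using the Welch standard error and
# Welch-Satterthwaite degrees of freedom.
v1 = overtime.var(ddof=1) / len(overtime)
v2 = no_overtime.var(ddof=1) / len(no_overtime)
se = np.sqrt(v1 + v2)
df = (v1 + v2) ** 2 / (v1 ** 2 / (len(overtime) - 1) + v2 ** 2 / (len(no_overtime) - 1))
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

# A CI that excludes 0 corresponds to p < .05; either way, the interval
# shows how large or small the true effect could plausibly be.
print(f"mean difference = {diff:.1f}, "
      f"95% CI [{ci_low:.1f}, {ci_high:.1f}], p = {res.pvalue:.3f}")
```

The same logic applies to other effect measures (odds ratios, correlations, and so on): the estimate, its interval, and the p-value should be reported together.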
While a non-significant study may have shown a small effect, the range of possible effects indicated by the 95% CI will include positive, negative, and null effects. For the case considered here, the possible effects will be increased anxiety, decreased anxiety, and no difference in anxiety for those who work overtime compared to those who do not. The clinical significance of the range of effect sizes indicated by the 95% CI should also be considered. For instance, if the upper bound of the 95% CI is small, we have good evidence that, even if there is an effect, it is not likely to have clinical relevance.

The p-value relates to the null hypothesis for the population from which the subjects were sampled, not for the sampled subjects themselves. The p-value simply indicates “the degree to which the data conform to the pattern predicted by the test hypothesis and all the other assumptions used in the statistical model” (Greenland et al., 2016, p. 340). If the health workers in the study who worked overtime had higher mean anxiety, then this is a statement of fact about the descriptive statistics (e.g., the mean values) and does not require a p-value. However, the real value of the study, and the purpose of most quantitative research, is to demonstrate that the population from which the health workers were sampled shows this same effect. This is a statement of inference, indicating that from this study we can infer that other health workers would have the same or a similar effect. Our best guess about the effect for the population is the effect we found for our sample, but it is only a guess, and we need to represent the likelihood that we might be mistaken, which is what the p-value provides. This process of inference requires inferential statistics: some statistical test with an associated p-value. If our study hypothesis was merely that the health workers in Ward Y of Hospital X working overtime have higher anxiety, then the existence of a positive mean difference satisfies this hypothesis and no inferential process or p-value is required; however, such a study is unlikely to be of much interest to the readers of this journal. The range of statistical tests available for inference in health science research is vast (Zellner, Boerst, & Tabb, 2007), and the appropriate choice depends on the data, study design, and other considerations regarding the research hypotheses.

If the statistical test results in p < .05, we can say, by the rules of this statistical convention, that the study passed the threshold criteria that allow us to assert the inference, and so we can state that the study demonstrates that overtime increases anxiety for health workers in general. The more complex issue is how to report results that do not pass this threshold for inference. Consider the case where we observe higher mean anxiety for overtime workers compared with those who do not work overtime, but our inferential test provides p = .1. Many studies would report that ‘there was no effect’ or perhaps ‘those working overtime did not have higher anxiety’. Both statements are incorrect: we did observe an effect for our study subjects, but we cannot (by the rules of the game) make an inferential statement about health workers in general. Since p = .1, it is still much more likely that those working overtime have higher anxiety than that there is no effect, and hence the conclusions must be stated carefully.
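One way to see what such a p-value actually measures is a permutation test: under the null hypothesis of no effect, the group labels are interchangeable, so the p-value can be approximated by how often randomly relabelled data produce a difference at least as large as the one observed. The sketch below uses hypothetical anxiety scores and is illustrative only; the editorial does not prescribe any particular test.

```python
import numpy as np

rng = np.random.default_rng(42)

# The same kind of hypothetical anxiety scores as above; illustrative only.
overtime = np.array([62.0, 55, 70, 66, 58, 73, 61, 68, 64, 59])
no_overtime = np.array([57.0, 52, 63, 60, 55, 49, 58, 61, 54, 56])

observed = overtime.mean() - no_overtime.mean()
pooled = np.concatenate([overtime, no_overtime])
n = len(overtime)

# Shuffle the group labels many times and count how often chance alone yields
# a difference in means at least as extreme as the observed one (two-sided).
n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    if abs(pooled[:n].mean() - pooled[n:].mean()) >= abs(observed):
        count += 1

p_value = count / n_perm
print(f"observed difference = {observed:.1f}, permutation p = {p_value:.3f}")
# A small p-value means the observed pattern is hard to reproduce when the
# null hypothesis (no effect) and the other modelling assumptions hold.
```

Whatever test is used, the resulting p-value speaks to the plausibility of the data under the null hypothesis and the model assumptions, not to the presence or absence of an effect in the sample at hand.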
While p-values of .045 and .055 provide similar evidence (Gelman & Stern, 2006), it is important that the researcher does not use a p-value close to, but higher than, .05 as support for the research hypothesis by referring to it as a ‘trend towards significance’ or ‘almost significant’ (Hewitt, Mitchell, & Torgerson, 2008). A non-significant result may arise because the true effect is small, because there is large variation in the sample, or because the sample size is too small to detect the effect. We should consider all these possibilities, along with the likelihood that there is no effect. The researcher has little control over the effect size and sample variation, although it is essential that these are reported for non-significant results. There is a relationship between these three factors, with the sample size needing to be larger if there is more variation for a given effect size. This forms the basis of sample size calculations, which determine the number of subjects required for a study when the effect size and variation are known or estimated (i.e., power analysis). Selecting higher power increases certainty, but at the cost of needing substantially more subjects, thus increasing the time and cost of the study.

When considering non-significant results, sample size is particularly important for subgroup analyses, which have smaller numbers than the overall study. For instance, a well-powered study may have shown a significant increase in anxiety overall for 100 subjects, but non-significant increases for the smaller female (N = 80) and male (N = 20) subgroups. Researchers may erroneously report that ‘there was an increase in anxiety overall (p = .04) but no increase for males or females (p > .05)’. This statement is not only misleading but plainly untrue, as it must be the differences within the male and/or female subgroups that produce the observed difference overall. The non-significance found for one, or both, gender subgroups can only be due to the smaller numbers available for the subgroup analyses. Another common case is finding similar mean differences for the male and female subgroups, but where the effect for females is statistically significant while the effect for the smaller male subgroup is not. Since the mean differences are similar, the non-significance observed for males is a result of the subgroup being underpowered, and it would be foolhardy to claim that ‘there was a difference for female subjects but not for males’. If we were interested in testing for gender differences in anxiety levels, a more appropriate design would be to recruit equal numbers of men and women and ensure that there was enough power to detect these differences. In randomized controlled trials, blocking is used for important confounding variables so that there are enough subjects to conduct a priori subgroup analyses.

A more appropriate way to report non-significant results is to report the observed differences (the effect size) along with the p-value, and then carefully highlight which results were predicted to be different. This considered approach avoids confusing effect size with significance, as providing a p-value without an effect size lacks meaning (Visentin & Hunt, 2017). Often additional variables are measured, so it is important to indicate which form part of the hypothesis testing and which are more exploratory.
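As a sketch of the sample size and power considerations described above, the following uses the power calculations for an independent-samples t-test available in the Python statsmodels library. The standardized effect size (Cohen's d = 0.5), the 80% power target, and the subgroup size are assumptions chosen for illustration, not figures from the editorial.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Subjects needed per group to detect a medium effect (d = 0.5)
# with 80% power at a two-sided alpha of .05.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required sample size per group: {n_per_group:.0f}")

# Power actually achieved for the same effect in a small subgroup of
# 10 per group: far below 80%, so a non-significant subgroup result
# on its own says little about the absence of an effect.
subgroup_power = analysis.power(effect_size=0.5, nobs1=10, alpha=0.05)
print(f"power with 10 per group: {subgroup_power:.2f}")
```

Running the same calculation for subgroups a fraction of the size of the overall sample makes plain why significance can be found overall yet lost within underpowered subgroups.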
A broader issue with non-significant results is publication bias. Null results are inherently less interesting than results which validate research hypotheses and are hence less likely to be reported. The literature will more frequently report studies with significant differences than studies with null effects, a pattern often referred to as the ‘file drawer phenomenon’ (Miller-Halegoua, 2017). Systematic reviews and meta-analyses drawing on this literature will then be more likely to conclude that there are overall effects than would be the case if all studies were published. Although rejection may be more common, researchers should endeavour to submit null results, and editors and reviewers should consider the importance of publishing these studies.

More generally, the great majority of studies report a range of both significant and non-significant results, and improving this reporting can enhance the evidence available to clinicians. In tables, all results, whether significant or not, should be reported with full descriptive statistics (e.g., sample size, mean, and standard deviation) and with effect sizes (mean differences, odds ratios, etc.) and their associated confidence intervals. Any future review and/or meta-analysis can then include these data, weighed alongside other evidence, to generalize to similar populations using similar designs (Mlinaric et al., 2017).

The reporting of non-significant results is problematic, with many pitfalls for the unwary, as discussed above. A good understanding of the statistical approach, together with clinical knowledge, should inform the research objectives and study design. While researchers will always hope that the outcomes of their study will support their hypotheses, an informed and considered approach to statistical communication will improve the understanding of studies where null hypotheses must be retained. It is only after full disclosure of a study's design and findings, including non-significant results, that we can be earnest regarding the importance of our findings.

Conflict of interest: None declared.

Author contributions: All authors have agreed on the final version and meet at least one of the following criteria recommended by the ICMJE (http://www.icmje.org/recommendations/): substantial contributions to conception and design, acquisition of data, or analysis and interpretation of data; drafting the article or revising it critically for important intellectual content.
