The analysis of clinical studies: Comparison of means, part I
- Research Article
- 10.1016/j.ajodo.2015.03.015
- Jun 1, 2015
- American Journal of Orthodontics and Dentofacial Orthopedics
Inference from a sample mean--Part 1.
- Research Article
18
- 10.1002/pst.244
- Nov 29, 2006
- Pharmaceutical Statistics
The power of a clinical trial is partly dependent upon its sample size. With continuous data, the sample size needed to attain a desired power is a function of the within-group standard deviation. An estimate of this standard deviation can be obtained during the trial itself based upon interim data; the estimate is then used to re-estimate the sample size. Gould and Shih proposed a method, based on the EM algorithm, which they claim produces a maximum likelihood estimate of the within-group standard deviation while preserving the blind, and that the estimate is quite satisfactory. However, others have claimed that the method can produce non-unique and/or severe underestimates of the true within-group standard deviation. Here the method is thoroughly examined to resolve the conflicting claims and, via simulation, to assess its validity and the properties of its estimates. The results show that the apparent non-uniqueness of the method's estimate is due to an apparently innocuous alteration that Gould and Shih made to the EM algorithm. When this alteration is removed, the method is valid in that it produces the maximum likelihood estimate of the within-group standard deviation (and also of the within-group means). However, the estimate is negatively biased and has a large standard deviation. The simulations show that with a standardized difference of 1 or less, which is typical in most clinical trials, the standard deviation from the combined samples ignoring the groups is a better estimator, despite its obvious positive bias.
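The dependence of sample size on the within-group standard deviation can be made concrete with a short sketch. It uses the standard normal-approximation formula for a two-sided, two-sample comparison of means, which is an assumption here and not something the abstract specifies; the function names and all numeric inputs are hypothetical. The second function illustrates why the one-sample standard deviation (ignoring the groups) is positively biased under equal allocation.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(sd, delta, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-sample z-test of means.

    Assumed textbook formula: n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta)^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def one_sample_sd(sd, delta):
    """Population SD of the combined sample when the groups are ignored.

    With equal allocation, the total variance is sd^2 + (delta / 2)^2,
    so this value always exceeds the within-group sd when delta != 0.
    """
    return sqrt(sd ** 2 + (delta / 2) ** 2)


# Hypothetical planning values: SD 10, clinically relevant difference 5.
planned = n_per_group(sd=10.0, delta=5.0)
# If interim data suggest the SD is closer to 12, the required n grows.
revised = n_per_group(sd=12.0, delta=5.0)
# Standardized difference of 1: the combined SD overstates the true SD of 1.
inflated = one_sample_sd(sd=1.0, delta=1.0)
```

Because the required sample size scales with the square of the standard deviation, even a modest error in the interim estimate changes the re-estimated sample size noticeably.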
- Research Article
5
- 10.1080/02533839.1999.9670448
- Jan 1, 1999
- Journal of the Chinese Institute of Engineers
Since the 1930s statistical methods applied to industrial process control have received much attention. Many techniques of statistical process control, such as control chart and process capability analysis, require knowledge of process standard deviation. Therefore, estimation of the process standard deviation plays an important role in statistical process control. In this article, we reviewed four widely used estimators of standard deviation and derived the efficiency of each estimator. Then, these four estimators of standard deviation were compared based on their efficiencies. From the comparative studies, we suggest that if the sample size is greater than or equal to 6, the second, third and fourth estimators may be used to estimate process standard deviation. However, if the sample size is less than or equal to 5, then the third and fourth estimators should be preferred.
- Discussion
17
- 10.1213/ane.0000000000004839
- Aug 1, 2020
- Anesthesia & Analgesia
KEY POINT: Analysis of variance (ANOVA) is used to test for the equality of the mean values of a continuous outcome between groups.
In this issue of Anesthesia & Analgesia, Sabourdin et al1 report results of a study on the effect of propofol concentrations on pupillary diameter. Patients were randomized to 3 groups based on a targeted propofol concentration. These authors used several analysis techniques, including a comparison of pupillary diameters between the groups with a 1-way analysis of variance (ANOVA). ANOVA is a family of statistical methods to compare the mean values of different groups. The term “analysis of variance” is based on the principle that ANOVA partitions the total observed variability (variance) in the outcome variable into distinct components, as described in more detail below. The simplest ANOVA method is 1-way ANOVA, which involves one categorical independent (predictor) variable (typically, the study group variable) and one continuous dependent (outcome) variable. It extends the 2-sample unpaired t test2 to >2 groups, and tests the null hypothesis that all the population means are equal. In a 1-way ANOVA, the between-group variance is compared to the within-group variance, and the ratio of the 2 is summarized in a so-called F statistic. Assuming the null hypothesis is true, the variability attributable to between-group differences should be relatively low, and most of the observed variability should be attributable to within-group differences. Thus, a large F statistic suggests that the null hypothesis is implausible. If the F statistic is larger than a certain threshold, which depends on the α level and the degrees of freedom, the associated P value is lower than α, the null hypothesis is rejected, and the authors can claim an overall between-group difference. While ANOVA tells us whether group mean values are significantly different, it does not tell us which specific groups differ from each other. 
A variety of post hoc tests are available to address this question, such as the Tukey, Bonferroni, or Dunnett test. These post hoc tests are conceptually similar to performing multiple pairwise t tests, but they adjust for inflation of the type I error risk due to multiple testing.3 The exception is the Fisher least significant difference (LSD) test, which does not adjust for multiplicity. The choice depends on which groups are being compared (eg, each group with each other versus with a control group) and on how conservative one wishes to be on the multiplicity adjustment. Valid inferences from a 1-way ANOVA rely on several assumptions being met: the observations are independent of each other (both within and between groups); the dependent variable is approximately normally distributed in each group; and the variances in each group are approximately equal. For study designs in which >1 independent variable is of interest, different ANOVA methods are available. A factorial ANOVA allows for >1 categorical independent (predictor) variable. Two-way ANOVA is an example that allows testing for the effects of 2 categorical variables on the outcome (eg, treatment group and patient sex), as well as for the interaction of the 2. Analysis of covariance (ANCOVA) is yet another method that allows adjusting for a continuous covariate (eg, patient age).4 For nonindependent data (quite commonly, data repetitively measured over time in the same subjects), repeated-measures ANOVA methods are available and more appropriate.5
Figure. Figure 1 from Sabourdin et al1 and an excerpt from their Results section. The Figure shows the individual pupillary diameters (circles) and means (solid horizontal line) for each randomized propofol group. The Figure includes a dashed line because the authors also tested for a linear relationship between the targeted propofol concentration and pupillary diameter. The 1-way ANOVA F statistic is 15.9, with 2 (numerator) and 37 (denominator) degrees of freedom. The F statistic is a ratio (see text). This F statistic corresponds to a P value of <0.001, allowing the null hypothesis of equal group means to be rejected. ANOVA indicates analysis of variance; Cet, effect site concentration.
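The partition of variability described above can be sketched in a few lines. This is a minimal illustration of the 1-way ANOVA F ratio using only the Python standard library; the function name and the toy data are hypothetical and are not the Sabourdin et al measurements. With k groups and N observations the degrees of freedom are k − 1 and N − k, so the reported 2 and 37 are consistent with 3 groups and 40 patients.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy data (hypothetical): three groups of three observations each.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df1, df2 = one_way_anova(toy_groups)
```

Converting the F statistic to a P value requires the F distribution, which the standard library does not provide; a statistics package would normally be used for that step.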
- Front Matter
9
- 10.1080/01494929.2016.1199196
- Jun 8, 2016
- Marriage & Family Review
Occasionally, scientific reports have omitted information on standard deviations, making estimates of effect sizes very difficult to impossible. In such situations, several scholars have recommended obtaining an estimate of the standard deviation of a distribution by dividing its range (highest value minus lowest value) by four. However, there appears to be little evidence to confirm the validity of this approach. Articles from 2012 to 2015 in the journal Marriage & Family Review were surveyed to find instances where demographic variables (age, education, duration of relationship, number of children) were reported with both standard deviations and ranges. Ratios between range and standard deviation were calculated by several rules of thumb or more complex formulas and compared with the actual ratios obtained. Results indicated that dividing by five generally provided a more accurate estimate of the actual standard deviation. However, accuracy in predicting the true ratio between range and standard deviation was substantially related to the position of the mean within the range of scores: larger divisors were needed as the mean approached either the minimum or the maximum value of the demographic variable (skew). Other recent formulae for estimating the standard deviation were also evaluated, but the skew-based approach appeared to be more accurate than the others. Further investigation in other samples is needed, because the skew-based approach was derived from observation of the data here and might not replicate in different sets of data.
- Research Article
26
- 10.1158/1078-0432.ccr-06-2533
- Feb 1, 2007
- Clinical Cancer Research
Although the Code of Federal Regulations (21 CFR 312.21) defines phase II studies as “controlled clinical studies,” the vast majority of phase II oncology trials have been single-arm investigations. Granted, a liberal definition of the word “controlled” would allow the use of historical
- Research Article
- 10.1590/s0071-12761956000100009
- Jan 1, 1956
- Anais da Escola Superior de Agricultura Luiz de Queiroz
This paper deals with the estimation of milk production by means of weekly, biweekly, bimonthly observations and also by the method known as 6-5-8, where one observation is taken at the 6th week of lactation, another at the 5th month and a third one at the 8th month. The data studied were obtained from 72 lactations of the Holstein Friesian breed at the Escola Superior de Agricultura Luiz de Queiroz (Piracicaba, S. Paulo, Brazil), with 6 calvings in each month of the year and also 12 first calvings, 12 second calvings, and so on, up to the sixth. The authors criticize the use of the maximum error found in papers dealing with this subject, and also the use of the mean deviation. The former is completely superseded and inadvisable, and the latter, although equivalent to a certain extent to the usual standard deviation, has only 87.6% of its efficiency, according to KENDALL (9, pp. 130-131; 10, pp. 6-7). The data obtained were compared with the actual production, obtained by daily control, and the deviations observed were studied. Their means and standard deviations are given in Table IV. In spite of BOX's recent results (11) showing that with equal numbers in all classes a certain inequality of variances is not important, the authors separated the methods before carrying out the analysis of variance, thus avoiding putting together methods with very different standard deviations. We compared the first three methods to begin with (Table VI). Then we carried out the analysis with the first four methods (Table VII). Finally we compared the last two methods (Table VIII). These analyses of variance compare the arithmetic means of the deviations by the methods studied, and this is equivalent to comparing their biases. So we conclude that season of calving and order of calving do not affect the biases, and the methods themselves do not differ from this viewpoint, with the exception of method 6-5-8. 
Another method of attack, maybe preferable, would be to compare the estimates of the biases with their expected mean under the null hypothesis (zero) by the t-test. We have:
1) Weekly control: $t = \frac{\bar{x} - 0}{s(\bar{x})} = \frac{8.59 - 0}{s(\bar{x})} = 1.56$
2) Biweekly control: $t = \frac{11.20 - 0}{6.21} = 1.80$
3) Monthly control: $t = \frac{7.17 - 0}{9.48} = 0.76$
4) Bimonthly control: $t = \frac{-4.66 - 0}{17.56} = -0.26$
5) Method 6-5-8: $t = \frac{144.89 - 0}{22.41} = 6.46$***
The three asterisks above denote significance at the 0.1% level of probability. In this way we should conclude that the weekly, biweekly, monthly and bimonthly methods of control may be assumed to be unbiased. The 6-5-8 method is proved to be positively biased, and here the bias equals 5.9% of the mean milk production. The precision of the methods studied may be judged by their standard deviations, or by intervals covering, with a certain probability (95% for example), the deviation $x$ corresponding to an estimate obtained by one of the methods studied. The difference $x - \bar{x}$, where $\bar{x}$ is the mean of the 72 deviations obtained for each method, has a t distribution with mean zero and estimated standard deviation $s(x - \bar{x}) = \sqrt{1 + 1/72}\, s = 1.007\, s$. Since the limit of t for the 5% probability level with 71 degrees of freedom is 1.99, the interval to be considered is given by $\bar{x} \pm 1.99 \times 1.007\, s = \bar{x} \pm 2.00\, s$. The intervals thus calculated are given in Table IX.
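The t statistics and the interval multiplier quoted above can be checked with a few lines of arithmetic. The sketch below uses only the means and standard errors as printed in the abstract; the weekly method is left out because its standard error does not appear in the text. The variable names are illustrative.

```python
from math import sqrt

# (mean deviation, standard error of the mean) as printed in the abstract.
reported = {
    "biweekly": (11.20, 6.21),
    "monthly": (7.17, 9.48),
    "bimonthly": (-4.66, 17.56),
    "6-5-8": (144.89, 22.41),
}

# One-sample t statistic against a null bias of zero: t = (mean - 0) / se.
t_values = {name: (m - 0) / se for name, (m, se) in reported.items()}

# Interval for a single future deviation x around the mean of n = 72
# deviations: half-width = t_crit * sqrt(1 + 1/n) * s.
n = 72
t_crit = 1.99  # 5% two-sided limit for 71 degrees of freedom, as quoted
multiplier = t_crit * sqrt(1 + 1 / n)

for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
print(f"interval half-width = {multiplier:.3f} * s")
```

The printed values agree with the abstract to within rounding in the last digit, and the multiplier comes out very close to the quoted 2.00.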
- Research Article
7
- 10.1016/j.amj.2009.03.002
- May 1, 2009
- Air Medical Journal
Hypothesis Testing
- Research Article
121
- 10.1074/jbc.m605620200
- Feb 1, 2007
- Journal of Biological Chemistry
Extracellular nucleotides, released in response to mechanical or inflammatory stimuli, signal through P2 receptors in many cell types, including osteoblasts. P2X7 receptors are ATP-gated cation channels that can induce formation of large membrane pores. Disruption of the gene encoding the P2X7 receptor leads to decreased periosteal bone formation and insensitivity of the skeleton to mechanical stimulation. Our purpose was to investigate signaling pathways coupled to P2X7 activation in osteoblasts. Live cell imaging showed that ATP or 2 ',3 '-O-(4-benzoylbenzoyl)-ATP (BzATP), but not UTP, UDP, or 2-methylthio-ADP, induced dynamic membrane blebbing in calvarial osteoblasts. Blebbing was observed in calvarial cells from wildtype but not P2X7 knock-out mice. P2X7 receptors coupled to activation of phospholipase D and A2, inhibition of which suppressed BzATP-induced blebbing. Activation of these phospholipases leads to production of lysophosphatidic acid (LPA). LPA caused dynamic blebbing in osteoblasts from both wild-type and P2X7 knock-out mice, similar to that induced by BzATP in wildtype cells. However, LPA-induced blebbing was more rapid in onset and was not affected by inhibition of phospholipase D or A2. Blockade or desensitization of LPA receptors suppressed blebbing in response to LPA and BzATP, without affecting P2X7-stimulated pore formation. Thus, LPA functions downstream of P2X7 receptors to induce membrane blebbing. Furthermore, inhibition of Rho-associated kinase abolished blebbing induced by both BzATP and LPA. In summary, we propose a novel signaling axis that links P2X7 receptors through phospholipases to production of LPA and activation of Rho-associated kinase. This pathway may contribute to P2X7-stimulated osteogenesis during skeletal development and mechanotransduction.
- Research Article
47
- 10.1016/j.amj.2009.04.013
- Jun 30, 2009
- Air Medical Journal
Inferential Statistics
- Research Article
- 10.1139/x72-007
- Mar 1, 1972
- Canadian Journal of Forest Research
The assumption of randomness, underlying the use of range as an estimator of the standard deviation in a normal parent population, was deliberately violated in order to assess how restrictive is this assumption in sampling tree diameters and heights. In only four, out of 34 non-random samples, were the estimates of population standard deviation using range significantly lower than the corresponding root-mean-square estimates. These underestimates were reduced by randomizing the collected data.
- Research Article
23
- 10.1016/j.jksus.2015.02.002
- Mar 18, 2015
- Journal of King Saud University - Science
Methodological insights for industrial quality control management: The impact of various estimators of the standard deviation on the process capability index
- Research Article
- 10.46827/ejes.v7i6.3124
- Jun 16, 2020
- European Journal of Education Studies
The purpose of the study was to investigate pre-service teachers’ views of the nature of science (NOS). A descriptive survey design was used for the study. A convenience sampling technique was used to recruit the participants. Participants were 231 level 100 pre-service teachers (119 males and 112 females) from five colleges of education in Ghana. All the colleges of education were under the same mentor university. Participants completed the view of nature of science questionnaire (NOSQ) through online learning platforms. Data were analyzed using descriptive and inferential statistics. The results revealed that, in general, pre-service teachers do not have adequate conceptions of the nature of science. However, pre-service teachers have informed views of some aspects of the nature of science. The results revealed that 56 (24.2%) of the pre-service teachers have a naïve view of NOS, 89 (38.5%) have a transitional view of NOS, and 86 (37.2%) have an informed view of NOS. There was no significant difference in pre-service teachers’ view of NOS between males (M = 3.76, SD = .389) and females (M = 3.79, SD = .376), t(229) = -.707, p = .48. Therefore, we fail to reject the null hypothesis. One-way analysis of variance (ANOVA) showed no significant difference in pre-service teachers’ view of NOS by programme options, F(2, 228) = .783, p = .458.
- Research Article
35
- 10.1161/circulationaha.105.586461
- Aug 21, 2006
- Circulation
In most biomedical research, investigators hypothesize about the relationships of various factors, collect data to test those relationships, and try to draw conclusions about those relationships from the data collected. In many cases, investigators test relationships by comparing the average level of a factor between 2 groups or between 1 group and a standard reference. This framework is as true for understanding the basic role of cardiac myosin binding protein-C phosphorylation in cardiac physiology1 as it is for evaluating non–high-density lipoprotein cholesterol (HDL-C) as a predictor of myocardial infarction in large groups of individuals.2 In this article we describe hypothesis testing, which is the process of drawing conclusions on the basis of statistical testing of collected data, and the specific approach used to test means (or average levels of a collected data element). These concepts are covered in detail in many statistical textbooks at various levels, including Pagano and Gauvreau,3 Zar,4 and Kleinbaum et al.5 The purpose of statistical inference is to draw conclusions about a population on the basis of data obtained from a sample of that population. Hypothesis testing is the process used to evaluate the strength of evidence from the sample and provides a framework for making determinations related to the population, ie, it provides a method for understanding how reliably one can extrapolate observed findings in a sample under study to the larger population from which the sample was drawn. The investigator formulates a specific hypothesis, evaluates data from the sample, and uses these data to decide whether they support the specific hypothesis. The first step in testing hypotheses is the transformation of the research question into a null hypothesis, H0, and an alternative hypothesis, HA.6 The null and alternative hypotheses are concise statements, usually in …
- Research Article
- 10.1002/tea.20448
- Nov 3, 2011
- Journal of Research in Science Teaching
In the wake of an increasing political commitment to evidence-based decision making and evidence-based educational reform that emerged with the No Child Left Behind effort, the question of what counts as evidence has become increasingly important in the field of science education. In current public discussions, academics, politicians, and other stakeholders tend to privilege experimental studies and studies using statistics and large sample sizes. However, some science education studies use a lot of statistics and large sample sizes (e.g., Bodzin, 2011) and yet, as I suggest in this text, are flawed and do not provide (sound) evidence in favor of some treatment or claim. Leaving aside the assertion and consensus of researchers across the quantitative/qualitative spectrum (e.g., the collection of chapters in Ercikan & Roth, 2009), we must ask whether all studies that appear to provide “quantitative” support for a particular effect do in fact provide substantial or strong evidence. As an anonymous reviewer of this contribution has pointed out, the question in its title really has two dimensions: (a) What constitutes valid evidence and (b) what are the limits of the claims that can be constructed when the evidentiary chain from premises to results is perfectly constructed. Both are important in constructing explanations for phenomena of interest to scientists generally and to science educators in particular. I begin by discussing the two issues in the context of the logic of scientific inquiry and statistical inference and then exemplify the issues as these play out in one recent article published in the pages of this journal (Bodzin, 2011). To further concretize my discussion, I also sketch two re-analyses concerned with the weight of the evidence provided by (a) 10 studies of paranormal psychological phenomena (psi) and (b) 855 studies in experimental psychology. 
First, in the logic of science, all explanatory schemas—including those of historical, historical-developmental, or interpretive nature—can be expressed in the following way (e.g., Stegmüller, 1974). Some observed event E (i.e., the evidence) is related to the statements about antecedent conditions and general laws or law-like regularities; together these constitute the premises of the argument made in the research article. The conditions for an explanation to be valid include: (a) the argument that leads from a hypothesized regularity or law to observation has to be correct; (b) there has to be at least one general law or law-like regularity; (c) the hypothesized law/regularity has to include empirical content; and (d) the statements that constitute the law have to be true (based on basic logic, no valid inferences can be made otherwise). In the logic of experimental research, explanations may be of two kind: (a) given the same set of antecedent conditions, a first hypothesized law would lead to observed event E1 whereas a second hypothesized law leads to event E2; or (b) given the same law or law-like regularity, the antecedent conditions would lead to observed E1 whereas a second set of antecedent conditions would lead to E2. Frequency-based statistics are used to establish the probability for an event E to be observed p(E|H0) given the null hypothesis.1 This probability gives only indirect evidence, as the researcher has to choose a certain level at which H0 is rejected. 
In the social sciences, this probability tends to be α = 0.05.2 The point of scientific research generally is to eliminate alternative hypotheses or theories so that the remaining one(s) constitute the best available at the time.3 A researcher's claim that some observed event E (i.e., the data collected, which constitutes the evidence) is due to specified antecedents and laws/law-like regularities (a) is strong when there are no other explanations but (b) is less strong, weak, or invalid when there are other explanations.
- A researcher has conducted many studies or tests within the same report. If s/he based the claims deriving from each study/test on the accepted rule of using a type I error rate of α = 0.05 (also false positive or rejecting a true null hypothesis H0), then the accumulation of tests raises the probability of having made at least one type I error.
- Although an experiment suggests that there is a statistically reliable effect with, for example, a probability p < 0.001, the size of the effect may be negligible in practice and therefore not useful for policy makers.
- Related to the preceding point, a science educator looking at the PISA 2009 (OECD, 2010) scores would notice that there was a statistically significant difference between boys (XB = 509) and girls (XG = 495) on the science scale (SDPOOL = 98). This would be taken as evidence for the claim “U.S. boys outperformed their female counterparts.” Yet looking at the associated distributions of scores (Figure 1), we immediately realize that there are many girls that outperform boys.
- There is a large overlap of treated and untreated individuals so that for any given score, the likelihood that the person received treatment is similar to the likelihood that s/he has not received treatment.
- Researchers do not take into account previous studies of the same design testing the same variables or theories. Because previous knowledge thereby is not taken into account, we cannot evaluate what additional evidence each study provides. Including prior knowledge is at the heart of Bayesian statistics (e.g., Howson & Urbach, 1989/2006).
Figure 1. A plot of two population distributions representing science scores of U.S. boys and girls based on the means and standard deviation reported by the 2009 PISA study (OECD, 2010).
There are many other reasons why a valid inference nevertheless is problematic. It is up to the researcher to exhibit and discuss the strength of a study, the validity of its evidence, and which audiences will draw what kind of benefit from the study results. To illustrate how science educators might want to think about the strength of evidence they produce, I provide an exemplary look at a recent study in earth science education (Bodzin, 2011). Bodzin suggests that the purpose of the study was to investigate the “extent [that] a [geospatial information technology (GIT)]-supported curriculum could help students at all ability levels … to understand [land use change (LUC)] concepts and enhance the spatial skills involved with aerial and RS imagery interpretation” (Bodzin, 2011, p. 293). That is, an explicit claim is made about a causal relationship between a curriculum and learning. Following the logic of argumentation outlined above, therefore, the author makes a claim that a particular antecedent (the GIT-supported curriculum) brings about a difference in achievement when the students are observed (tested) before and after the treatment. The study uses a simple pre-test (observation O1)/treatment (X)/post-test (O2) design, which has the structure O1 X O2 (Cook & Campbell, 1979). In this case, the authors of the standard reference book on quasi-experimental design suggest “we should usually not expect hard-headed causal inferences from the simple before-after design when it is used by itself” (p. 103). 
Although the authors suggest that such a design may produce hypotheses worthy of further exploration, they express the hope “that persons considering the use of this design will hesitate before resorting to it” (p. 103). This is so because the difference in test scores (O2 − O1) could be due to maturation or other events in the life of the students (e.g., they learn certain mathematical concepts or concepts in logic). Because Bodzin does not rule out other reasonable alternatives, the design provides weak (little) to no evidence for a treatment effect because there are many other possible causes that could have brought about the differences in achievement between the two observations—even though the statistical tests are significant and even if the effect sizes were large. If we accept for the moment that the study is exploratory, we may ask ourselves whether its evidence has any strength that warrants further study. We then have to choose the form of analysis. Traditionally, there is no question: the statistics would be one based on frequency distributions (e.g., Student's t). Within this frequency-based perspective, the evidence provided by Bodzin's study is not strong even though it might appear as such. A first problem with the results is that the reported means are not independent because each overall means reported in each of Bodzin's Tables 2–4 really is a weighted mean derived from the other pieces of information already available. That is, it is as if the author reported that three individuals had $2, $3, and $4, respectively, and also reported that they owned $9 together or that the mean amount was $3. The additional information is redundant rather than additional evidence; but reporting the redundant information makes it look like there is additional evidence. Statisticians tend to deal with this issue by lowering the degrees of freedom and thereby eliminating redundancy. 
The study therefore violates some basic assumptions for statistical inference that would be part of the second type of validity. Moreover, the overall means in his Table 2 can be calculated from the scales reported in Tables 3 and 4. To draw any useful conclusions from the p values, however, the tests need to be independent. As presented, the study overestimates the evidence in favor of the treatment. A second major problem is the number of t tests conducted: a total of N = 24, which, given the content of bullet 1 above, tremendously increases the possibility of a type I error. That is, the experiment-wise error rate that there is a false positive actually exceeds 1 (24 × α = 24 × 0.05 = 1.2) and, therefore, would be set to p = 1 in statistical packages such as SAS.4 To hold the experiment-wise error rate at α = 0.05, tests could be adjusted using what is known as the Bonferroni procedure (or one of its alternatives).5 In this procedure, every test in an ensemble of N tests is conducted at a revised α-level of αnew = α/N so that the total, experiment-wise error still is less than α = 0.05. That is, instead of a cut-off at p < 0.05, p < 0.01,… the new cut-offs for rejecting the null hypothesis would be at p < 0.0021, p < 0.00042,… and so on. Again, the reported tests are strongly biased in favor of the reported effects because these are conducted at error probabilities 24 times higher than acceptable. Another option would have been to use a MANOVA, that is, a test with multiple (“M-”) dependent measures tested simultaneously (“ANOVA”). Only when this test suggests a significant difference would more conservative, adjusted t-tests be warranted. Even if these problems did not exist, further caution would be required because frequency-based statistics have some fundamental problems, even flaws. 
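The multiplicity arithmetic in this passage can be reproduced with a short sketch. It computes the Bonferroni upper bound quoted in the text (24 × 0.05, capped at 1) and the adjusted per-test cut-offs. For comparison it also computes the family-wise error rate under the assumption that the 24 tests are independent; that assumption is introduced here purely for illustration, since the text argues the reported tests are not independent.

```python
n_tests = 24
alpha = 0.05

# Bonferroni upper bound on the experiment-wise error rate, capped at 1.
bonferroni_bound = min(1.0, n_tests * alpha)

# Probability of at least one false positive if all 24 tests were
# independent and every null hypothesis were true (illustrative assumption).
fwer_independent = 1 - (1 - alpha) ** n_tests

# Bonferroni-adjusted per-test cut-offs for nominal levels 0.05 and 0.01.
adjusted_cutoffs = [level / n_tests for level in (0.05, 0.01)]

print(f"Bonferroni bound: {bonferroni_bound:.2f}")
print(f"FWER under independence: {fwer_independent:.3f}")
print("Adjusted cut-offs:", [f"{c:.5f}" for c in adjusted_cutoffs])
```

The adjusted cut-offs round to the p < 0.0021 and p < 0.00042 thresholds given in the text. Under independence the chance of at least one false positive is about 0.71, lower than the capped bound of 1 but still far above 0.05.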
In a recent article in the Journal of Personality and Social Psychology, a group of authors reanalyzes the results of a set of experimental studies to respond to their rhetorical question “Why psychologists must change the way they analyze their data?” (Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). This study was designed as a critique of a series of studies on the psychological phenomenon of psi, all conducted by the same researcher (Bem, 2011). Here, the “term psi denotes anomalous processes of information or energy transfer that are currently unexplained in terms of known physical or biological mechanisms” (p. 407). It is a descriptive term that includes, among others, telepathy, clairvoyance, precognition, and premonition. The subject is a controversial one that most psychologists reject outright (as the study's author recognizes), even though significant parts of the general population believe in parapsychological phenomena. The series of studies has fulfilled all the criteria required by the logic of science for valid inference. The study suggests that there is overwhelming, cumulative evidence for the existence of certain psi-related phenomena. However, the critique shows that even though nine (of 10) experimental studies conducted by Bem yielded statistically reliable results in favor of rejecting the null hypothesis (H0: there is no psi [precognition]), much of the evidence is only “anecdotal” in favor of either the null hypothesis (there is no psi) or its alternative (there is psi). To provide evidence for their counter claim, Wagenmakers, Wetzels et al. 
(2011) use a simple Bayesian test that uses the unbiased prior possibilities but is not biased against the null hypothesis as is the frequency based statistics Bem, following standard procedure, used in his study.6 Another investigation reanalyzes 855 studies in experimental psychology and suggests that 70% of the studies with 0.01 < p < 0.05 (i.e., a total of 132 studies) provide no more than anecdotal evidence for the effect of interest (Wetzels et al., 2011). That is, in a field that prides itself for the strength of methodological approaches, a large number of studies that appear to support the alternative to the null hypothesis of no (treatment) effect actually provide evidence that is at best anecdotal. Bayesian statistics have been proposed as a way of overcoming many of the inherent problems with frequency-based statistics not in the least because it allows researchers to quantify prior knowledge (e.g., Gurrin, Kurinczuk, & Burton, 2000). Bayesian statistics are of such a nature that they can be used to provide direct and explicit answers to questions that are usually posed by practitioners. This is so because Bayesian statistics asks what the probability p(H0|E) for the null hypothesis H0 given an event E, which simultaneously yields the probability of the alternative hypothesis p(H1|E) = 1 − p(H0|E). That is, Bayesian statistics evaluates the weight of the evidence from a study in support of one or the other hypothesis. 
An easy-to-use indicator of the strength of the evidence from a statistical test is the Bayes factor (Rouder, Speckman, Sun, Morey, & Iverson, 2009).7 Its power derives from the fact that, unlike p values, effect sizes, and confidence intervals, it is not biased in favor of the alternative hypothesis and therefore provides a measure of the quality of the evidence for or against claims.8 Tables that map calculated Bayes factors to qualitative expressions of the strength of evidence use a scale that runs from “decisive” through “very strong,” “strong,” and “substantial” to “anecdotal” for both the null and the alternative hypothesis (Table 1). Thus, a study that is statistically significant nevertheless may provide little more than anecdotal evidence for the hypothesis that there is an effect. If we assumed for the moment that all of Bodzin's tests are independent and calculated the Bayes factor based on the absence of prior knowledge (equal priors for null and alternative hypothesis), we would obtain the results in Table 1. These show that six of the tests conducted provide only anecdotal evidence in favor of the alternative hypothesis and four tests provide anecdotal evidence in favor of the null hypothesis. As the use of one-tailed tests shows, the author appears to have had good reasons to anticipate positive treatment effects. Such prior beliefs may be used to adjust the statistics to account for prior knowledge. As soon as we assume that there is prior knowledge available in favor of larger effect sizes for the treatment, more of the tests become anecdotal evidence against the claims that the treatment applied by Bodzin caused the differences observed. Moreover, if we removed the overall test to avoid statistical dependence, as well as the overall tests for each subscale, then there would be only four decisive tests left, three of which are on the same (UHI) scale (Table 1)! Apart from one other test, the remaining evidence would be anecdotal only. 
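The mapping from a Bayes factor to a qualitative label can be sketched in a few lines of Python. The cut-offs of 3, 10, 30, and 100 used here are an assumption: they follow the scale commonly attributed to Jeffreys, and the table referred to in the text may draw its boundaries differently. The function name `evidence_label` is likewise hypothetical, and the Bayes factor is taken to be expressed in favor of the alternative hypothesis.

```python
# Assumed cut-offs (commonly attributed to Jeffreys); not taken from Table 1.
CUTOFFS = [
    (100.0, "decisive"),
    (30.0, "very strong"),
    (10.0, "strong"),
    (3.0, "substantial"),
]


def evidence_label(bf10: float) -> str:
    """Translate a Bayes factor for H1 over H0 into a qualitative label."""
    if bf10 <= 0:
        raise ValueError("a Bayes factor must be positive")
    if bf10 == 1:
        return "no evidence either way"
    favored = "alternative" if bf10 > 1 else "null"
    # Values below 1 favor the null; invert them to reuse the same scale.
    strength = bf10 if bf10 > 1 else 1.0 / bf10
    for cutoff, label in CUTOFFS:
        if strength > cutoff:
            return f"{label} evidence for the {favored} hypothesis"
    return f"anecdotal evidence for the {favored} hypothesis"


# Illustrative values only, not results from any study discussed here.
for bf in (2.0, 0.2, 150.0):
    print(bf, "->", evidence_label(bf))
```

The scale is symmetric, so the same labels apply to the null hypothesis when the Bayes factor falls below 1; this is what allows a test to count as evidence for the absence of an effect.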
The upshot of this ever-so-brief analysis is that the evidence in favor of a treatment effect in Bodzin's study is rather weak and at best anecdotal, quite apart from the serious threats to the validity of the experiment that derive from the failure to exclude alternative explanations. Even if all this were not problematic, there would still be the question of what the study says to science teachers and policy makers, an issue even for the best-constructed studies such as PISA (OECD, 2010). Thus, as Figure 1 shows, because the overlap between the two distributions is so large (that is, within-group variation [SD = 98] is large compared to between-group variation [XBOYS − XGIRLS = 14]), we do not know whether a particular girl or group of girls can be said to be doing better or worse than boys. Similarly, if Figure 1 were to express the results of an experimental or quasi-experimental study, we would be unable to say whether a particular girl or group of girls had benefited from the treatment, because she or it achieved higher than some boys but lower than other boys (Ercikan & Roth, 2011). Frequency-based statistics therefore come with considerable limitations concerning the weight and interpretability of the evidence collected in a study. As a result, whether frequency-based statistics can provide useful recommendations to practitioners and policy makers depends on the degree to which study findings apply to the relevant individual or subgroup of individuals. Science educators, like scholars in any other science, ought to strive to provide the strongest forms of evidence for the claims they make. For the evidence to be strong, the design of studies needs to rule out alternative explanations to the largest extent possible. This pertains to single (qualitative) case studies as much as to high-powered statistical work using the most advanced mathematical modeling techniques and experimental designs. 
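The size of the overlap described above for Figure 1 can be put into numbers with a short Python sketch. It assumes, purely for illustration, that both score distributions are normal with the same standard deviation (SD = 98) and differ in their means by 14 points; the function names are hypothetical.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def probability_of_superiority(mean_diff: float, sd: float) -> float:
    """Chance that a random member of the higher-scoring group outscores
    a random member of the other group (equal-variance normal model)."""
    d = mean_diff / sd  # standardized mean difference
    return STANDARD_NORMAL.cdf(d / math.sqrt(2))


def overlap_coefficient(mean_diff: float, sd: float) -> float:
    """Shared area under two equal-variance normal curves."""
    d = abs(mean_diff) / sd
    return 2 * STANDARD_NORMAL.cdf(-d / 2)


# Values reported in the text: mean difference of 14, SD of 98.
print(f"probability of superiority: {probability_of_superiority(14, 98):.2f}")
print(f"overlap of the distributions: {overlap_coefficient(14, 98):.2f}")
```

Under these assumptions, a randomly chosen boy outscores a randomly chosen girl only slightly more often than a coin flip would predict (roughly 54% of the time), and the two distributions share about 94% of their area, which is why the group difference says so little about any particular individual.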
Moreover, because there are many problems with traditional statistics, some of them substantial, science educators ought to choose the strongest statistical methods available to them. In the face of a public debate about evidence-based decision making in educational reform and in the face of efforts to make evidence-based reasoning itself a primary educational goal (e.g., Callan et al., 2009), science educators do not want to be the children left behind. We, science educators, owe it to ourselves to work together (authors, peer reviewers) to produce the strongest possible evidence in the construction of explanations.
1 More technically expressed, the p value a study reports is the probability of observing an effect at least as extreme as the one found, given that the null hypothesis is true. The probabilities are given by the appropriate distribution, including the standard normal (z), Student's t, χ2, and F distributions, and correspond to the fraction of the total area under the distribution (i.e., 0.05, 0.01, …) covered by the tail.
2 In a historic-genetic form of explanation used by cultural-historical (activity) theorists frequently cited in the science education literature (e.g., Bakhtin, 1981; Leontyev, 1981; Vygotsky, 1927/1997), general laws are inferred from the observed historically (genetically, developmentally) related sequence of events even though there may be only one case. Here, “the challenge is to systematically interrogate the particular case by constituting it as a ‘particular instance of the possible’ … in order to extract general or invariant properties that can be uncovered only by such interrogation” (Bourdieu, 1992, p. 233).
3 This is so even in singular cases (e.g., criminologists are faced with the question “who done it?” and need to get the right person even though there are no precedents).
4 http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_multtest_sect014.htm
5 The procedure is sometimes critiqued for being too conservative.
6 A calculator for this statistic is available at http://pcl.missouri.edu/bayesfactor. The website also provides access to relevant articles.
7 Technically, the probability of the null hypothesis following the collection of data is given by $p(H_0 \mid E) = \left[1 + \frac{\pi_1}{\pi_0} \cdot \frac{1}{BF}\right]^{-1}$, where E denotes the data, BF is the Bayes factor (here expressed in favor of H0 relative to H1), and π0 and π1 are the prior probabilities of H0 and H1, respectively, with π1 = (1 − π0) (Gonen, Johnson, Lu, & Westfall, 2005).
8 One of the fundamental lessons beginners in statistics learn is that one “cannot prove or provide evidence for the null hypothesis.”