ABSTRACT: Music therapy outcome research frequently involves comparisons between groups on continuous scales of psychological constructs. Describing such differences only through statistical tests reduces the existing complexity and creates an artificial and potentially misleading dichotomy. The interpretation of the magnitude of a difference found on a psychological scale is often not straightforward, but it can be greatly facilitated and improved by using effect sizes. Cohen's (1988) benchmarks allow for intuitive judgements and enable comparisons between different scales. An example from a study of the effects of music therapy on self-esteem shows how the use of effect sizes can change the interpretation of a research result. Effect sizes have important applications in many fields of music therapy research, including primary studies as well as meta-analyses, and the planning of research as well as the reporting of research results. Because of their intuitiveness, they may help to bridge the gap between research and clinical practice.

There are many different types of research designs and strategies applied in music therapy research, ranging from experimental to descriptive forms. Experimental and quasi-experimental research on music therapy outcomes frequently involves a comparison between two groups, such as an experimental group and a control group. Different characteristics can be chosen to represent the outcomes of the groups, including, for example, frequency counts of behaviors and psychological constructs. The latter type of outcome is most commonly measured on a continuous scale. Examples of such scales include severity of psychiatric symptoms, degree of social functioning, and level of self-esteem, among others.
The purpose of outcome research that involves comparisons between groups is to draw conclusions about the clinical effects of a treatment procedure (e.g., a music therapy program) when compared with a different procedure (e.g., no treatment, standard care, verbal therapy, or another music therapy program). Whenever the means of two groups are compared, there are various ways of describing the difference. The first question that can be addressed is the size of the difference between the two groups. This question is related to the clinical relevance of the difference (or the effect of treatment) and can be answered using an effect size (ES) calculation. If the two groups represent random samples drawn from larger populations, the second question that can be asked is whether one can be sure that there is a difference between the populations from which the samples were drawn. This question refers to inferences about the generalizability of results from a representative sample to a larger population, and it can only be answered indirectly with a test of statistical significance, which tells how likely (or unlikely) the sample would be in the absence of a difference between the populations. Researchers in the social sciences have often addressed differences between groups only in terms of inferential statistics, without undertaking a descriptive analysis of the differences in their sample, possibly because they were not aware of the potential for calculating the size of an effect (Cohen, 1988).

Problems with Statistical Tests

There are several problems with the exclusive use of inferential statistics. First, since tests of significance are designed to accept or reject a null hypothesis, they lead to a dichotomous decision, such as between yes and no, or between black and white, suggesting that there is either an effect or no effect. This is inappropriate for research questions where it is relevant to what degree a null hypothesis may be wrong.
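The effect size calculation mentioned above, in its most common form (Cohen's d), divides the difference between the two group means by a pooled standard deviation. A minimal sketch follows; the function name and the self-esteem scores are hypothetical illustrations, not data from any actual study.

```python
import math

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical self-esteem scores for a music therapy group and a control group
d = cohens_d(mean1=34.0, mean2=30.0, sd1=8.0, sd2=8.0, n1=20, n2=20)
print(round(d, 2))  # 0.5, a "medium" effect by Cohen's (1988) benchmarks
```

Because d is expressed in standard deviation units, the same value can be compared across different psychological scales, which is what makes Cohen's benchmarks (0.2 small, 0.5 medium, 0.8 large) usable as intuitive judgements.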
To stay with this scenario, deciding between black and white ignores the many shades of gray that may exist in between. Second, the statistical indices that a test of significance provides are, by nature, not useful (and not intended) as descriptive statistics, because they depend on both effect size and sample size. …
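The dependence of a test statistic on sample size can be made concrete. For a two-sample t test with equal group sizes and pooled variance, the t statistic implied by a given Cohen's d is t = d * sqrt(n/2), where n is the size of each group. The sketch below (function name and numbers are illustrative assumptions) shows how the same fixed effect yields very different t values, and hence p values, as n grows:

```python
import math

def t_from_d(d, n_per_group):
    """Two-sample t statistic implied by Cohen's d (equal group sizes, pooled variance)."""
    return d * math.sqrt(n_per_group / 2)

# The same d = 0.5 (medium) effect produces increasingly "significant" t values
for n in (10, 40, 160):
    print(n, round(t_from_d(0.5, n), 2))  # 1.12, 2.24, 4.47
```

This is why a significance test cannot describe the magnitude of an effect: a trivially small effect becomes "significant" with a large enough sample, while a clinically relevant one can fail to reach significance in a small study.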