In 1988 the Lancet published a very large randomised clinical trial of intravenous streptokinase, oral aspirin, both, or neither for the treatment of suspected acute myocardial infarction [1]. The ISIS-2 trial recruited 17,187 patients from 417 hospitals. The authors concluded that there were benefits both from streptokinase and from aspirin. The paper contained a complex table reporting subgroup analyses, and rather intriguingly, the first analysis was by astrological birth sign. The results suggested that for people born under the star signs Gemini and Libra, aspirin was no better than placebo; for others, aspirin had a strongly beneficial effect. Why, one might wonder, did a highly respected journal publish such arrant nonsense?

The use of patient-reported outcomes (PROs) in clinical trials can lead to problems with 'multiple testing', that is, the testing of multiple hypotheses and the associated problem of how to interpret the resultant P values [2]. Most commonly, this problem arises from the multi-dimensional nature of many PRO instruments. If a separate hypothesis test is carried out for each scale, the overall probability of making a type I error (a false positive result in at least one of the scales) increases with every extra scale tested. One widely used approach to allow for multiple testing is the Bonferroni correction, although this is recognised to be an over-correction in most situations [2], particularly when the tests are correlated, as they tend to be for multiple PROs. Methods such as the Hochberg modification [3] are preferable but are still in general 'conservative' and over-correct [4].

What is less widely recognised is that subgroup analysis is a more insidious form of multiple testing. Unlike multiple outcome measures, the factors that define subgroups are likely to be independent of each other. As a consequence, the false positive rate increases more rapidly with each extra subgroup tested than it does with each extra PRO scale tested.
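The contrast between the two corrections can be sketched in a few lines of pure Python. This is a minimal illustration of the standard Bonferroni and Hochberg step-up procedures, not the implementations used in [3] or [4], and the p-values below are invented for demonstration:

```python
def bonferroni(pvals, alpha=0.05):
    """Bonferroni: reject H_i only when p_i <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def hochberg(pvals, alpha=0.05):
    """Hochberg step-up: order the p-values ascending, find the largest
    rank k with p_(k) <= alpha / (m - k + 1), and reject ranks 1..k."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha / (m - rank + 1):
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

# Hypothetical p-values from four PRO scales:
pvals = [0.010, 0.020, 0.030, 0.045]
print(bonferroni(pvals))  # [True, False, False, False]
print(hochberg(pvals))    # [True, True, True, True]
```

With these p-values Bonferroni rejects only the smallest (0.010 <= 0.05/4), whereas Hochberg rejects all four because the largest, 0.045, is below the unadjusted 0.05 threshold, illustrating why the step-up procedure is less conservative.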
Consider the case of a new treatment that, in truth, is no better than current best practice. If, hypothetically, we were to conduct a series of trials, with recruits randomised to one or the other treatment and a single outcome variable, then by definition we would expect 5% of these trials to claim a difference that is significant at P < 0.05; that is, a 5% false positive rate. If we were to repeat the analyses for subgroups defined by gender, the probability of obtaining a significant result (i.e., P < 0.05) in one or other of the two subgroups is 0.0975, a false positive risk that is nearly double the nominal 5% (calculations described in the Appendix). If we then divide the sample into subgroups of patients above and below the median age, we incur an additional 0.0975 probability of falsely claiming a significant difference. Figure 1 shows that if we successively dichotomise the sample into equal-sized subgroups using four factors that are irrelevant to (independent of) the outcome, the false positive rate exceeds 25% when a nominal P < 0.05 is used for each subgroup (see Appendix). Twelve factors would bring this up to a 50% error rate. One review of the literature reported finding as many as 24 subgroup analyses per trial (median four) [5].

P. M. Fayers
Institute of Applied Health Sciences, University of Aberdeen Medical School, Foresterhill, Aberdeen AB25 2ZD, UK
e-mail: P.Fayers@abdn.ac.uk
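The arithmetic behind these figures is straightforward to reproduce: with k independent tests each conducted at nominal level α, the chance of at least one false positive is 1 − (1 − α)^k. A short sketch (our illustration of the standard formula, not the calculations from the Appendix):

```python
def familywise_error(alpha, k):
    """P(at least one false positive) across k independent tests,
    each carried out at significance level alpha."""
    return 1 - (1 - alpha) ** k

# Two gender subgroups, each tested at P < 0.05:
print(familywise_error(0.05, 2))  # 0.0975

# Four dichotomising factors give eight subgroup tests in total,
# pushing the false positive rate above 25%:
print(familywise_error(0.05, 8))
```

The second call returns roughly 0.34, consistent with the "over 25%" figure quoted above for four independent dichotomising factors.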