Reliable measurements of the intensity of the palaeomagnetic field are notoriously difficult to obtain. The approach generally taken is to produce multiple estimates per rock unit and to assume that those meeting certain minimum standards of technical quality will combine to produce an accurate mean. Here, however, using results from a package of 20th-century basaltic lava flows from Mount Etna in Sicily, we demonstrate that this approach can fail. In this case, applying typical sets of selection criteria actually introduces bias into the measured mean. We demonstrate that this bias is caused by two types of non-ideal behaviour acting in combination: the first arises from the multidomain grains that the samples contain, and the second from differences between the natural and laboratory cooling rates. We discuss means of avoiding these sources of error in future palaeointensity studies performed on ancient rocks. We also develop a new, more general reliability criterion which is effective here and which, we argue, should be applied wherever possible in future palaeointensity studies in conjunction with standard criteria. It requires two distinct types of material that each yield some good-quality palaeointensity measurements, and it uses the range over which their results overlap to constrain the true palaeointensity. Both the application of this criterion and general reliability considerations require that future palaeointensity studies measure many samples per cooling unit, and that these be as diverse as possible in terms of their rock magnetic properties and cooling histories.