Interlaboratory studies in measurement science, including key comparisons, and meta-analyses in several fields, including medicine, serve to intercompare measurement results obtained independently, and typically produce a consensus value for the common measurand that blends the values measured by the participants.Since interlaboratory studies and meta-analyses reveal and quantify differences between measured values, regardless of the underlying causes for such differences, they also provide so-called ‘top–down’ evaluations of measurement uncertainty.Measured values are often substantially over-dispersed by comparison with their individual, stated uncertainties, thus suggesting the existence of yet unrecognized sources of uncertainty (dark uncertainty). We contrast two different approaches to take dark uncertainty into account both in the computation of consensus values and in the evaluation of the associated uncertainty, which have traditionally been preferred by different scientific communities. One inflates the stated uncertainties by a multiplicative factor. The other adds laboratory-specific ‘effects’ to the value of the measurand.After distinguishing what we call recipe-based and model-based approaches to data reductions in interlaboratory studies, we state six guiding principles that should inform such reductions. These principles favor model-based approaches that expose and facilitate the critical assessment of validating assumptions, and give preeminence to substantive criteria to determine which measurement results to include, and which to exclude, as opposed to purely statistical considerations, and also how to weigh them.Following an overview of maximum likelihood methods, three general purpose procedures for data reduction are described in detail, including explanations of how the consensus value and degrees of equivalence are computed, and the associated uncertainty evaluated: the DerSimonian–Laird procedure; a hierarchical Bayesian procedure; and the Linear Pool. These three procedures have been implemented and made widely accessible in a Web-based application (NIST Consensus Builder).We illustrate principles, statistical models, and data reduction procedures in four examples: (i) the measurement of the Newtonian constant of gravitation; (ii) the measurement of the half-lives of radioactive isotopes of caesium and strontium; (iii) the comparison of two alternative treatments for carotid artery stenosis; and (iv) a key comparison where the measurand was the calibration factor of a radio-frequency power sensor.
Read full abstract