The ideal evaluation of diagnostic test performance requires a reference test that is free of errors. For many diseases, however, obtaining such a "gold standard" reference is either impossible or prohibitively expensive, making the estimation of test accuracy in its absence a significant challenge. In this article, we introduce and categorize existing methods for evaluating diagnostic tests without a gold standard, organized by factors such as the type and number of tests and the structure of the observed data. For each method, we provide a comprehensive analysis of its underlying assumptions, model structure, identifiability, estimation techniques, and inference procedures. Using R, we conduct simulations for widely applicable models to validate assumptions, compare models, and assess their reliability. We also present real-world examples, along with the corresponding R code, enabling readers to better understand how to apply these models in practice. Beyond diagnostic medicine, we underscore that the problem of imperfect gold standards arises in other fields as well, drawing parallels to the noisy-label problem in machine learning; by highlighting similarities and differences across these domains, we open pathways for further research. The primary aim of this article is to consolidate existing methods for assessing test accuracy in the absence of a gold standard and to provide practical guidance for researchers seeking to apply them effectively.
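To give a flavor of the kind of analysis the abstract describes, the following is a minimal illustrative sketch in R, not code taken from the article itself: it simulates three conditionally independent binary tests applied to a latent disease status and recovers prevalence, sensitivities, and specificities with a simple EM algorithm, in the spirit of the classic Hui-Walter latent class approach. All parameter values and variable names here are assumptions chosen for illustration.

```r
## Illustrative sketch (assumed setup, not from the article):
## three imperfect binary tests, no gold standard, latent class EM.
set.seed(1)
n    <- 5000
prev <- 0.30                      # assumed true prevalence
sens <- c(0.90, 0.85, 0.80)       # assumed true sensitivities
spec <- c(0.95, 0.90, 0.85)       # assumed true specificities

D <- rbinom(n, 1, prev)           # latent true disease status (unobserved)
Y <- sapply(1:3, function(j)      # observed test results, n x 3 matrix
  rbinom(n, 1, ifelse(D == 1, sens[j], 1 - spec[j])))

## EM under the conditional-independence assumption
p  <- 0.5                         # starting values
se <- rep(0.7, 3)
sp <- rep(0.7, 3)
for (iter in 1:500) {
  # E-step: posterior P(D = 1 | observed test pattern)
  like1 <- p       * exp(Y %*% log(se)     + (1 - Y) %*% log(1 - se))
  like0 <- (1 - p) * exp(Y %*% log(1 - sp) + (1 - Y) %*% log(sp))
  w <- as.vector(like1 / (like1 + like0))
  # M-step: weighted updates of prevalence, sensitivity, specificity
  p  <- mean(w)
  se <- colSums(w * Y) / sum(w)
  sp <- colSums((1 - w) * (1 - Y)) / sum(1 - w)
}
round(c(prev = p, sens = se, spec = sp), 3)
```

With three tests and a single population, this model has seven parameters against seven degrees of freedom in the observed 2x2x2 table, so it is just identifiable; the article's treatment of identifiability, dependence between tests, and inference goes well beyond this sketch.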